Day 15 - Data Science Ethics

Fall 2022

Dr. Jared Joseph

October 12, 2022

Overview

Timeline

  • Define Ethics
  • Data Science Gone Wrong
  • Our Role
  • Case Studies

Goal

Establish a baseline for ethical data science and show the dangers of data science gone wrong.

What is “Data Ethics”?

Ethics Broadly Defined





The framework with which we decide what is right and wrong.

Not Just Social





All data was ultimately created by people, and will have an impact on people.

Data Science Gone Wrong

Weapons of Math Destruction

Algorithms and their feedback loops can cause serious issues:

Going to College
College rankings are created using many non-academic measures (such as opinion polls of university reputation). This forces schools to prioritize things that don’t necessarily help students. For example, cost of attendance is not used in the U.S. News & World Report rankings.

Automating Inequality

Not everyone is subject to the same data systems:

LA Comprehensive Intake
A unified system to catalog data about the homeless was meant to streamline access to services. However, it created a massive repository of highly sensitive information that people were, in effect, forced to reveal. Police could access this data without a warrant, which equates homelessness and poverty with crime.

Algorithms of Oppression

Data systems are only as good as their inputs, and those inputs are always biased:

Search Engines and Society
Google will finish phrases based on previous searches. In 2013, these were some of the autocompletes offered for each of the prompts below:
  • Women cannot: drive, be bishops, be trusted, speak in church
  • Women should not: have rights, vote, work, box
  • Women should: stay at home, be slaves, be in the kitchen, not speak in church
  • Women need to: be put in their places, know their place, be controlled, be disciplined

The Loop - Credit Scores

Systems Trap People

Credit scores are used to screen for rental housing, car loans, jobs, and more.

If you cannot get a job and cannot save, how do you improve your credit score so that you can get a job and save to improve your credit score?
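
The loop above can be sketched as a toy simulation. Every number here (the 620 cutoff, the 5-point monthly change, the 300 to 850 range) is an assumption made up for illustration, not how any real credit bureau scores people.

```python
def simulate_credit_loop(start_score, months=24, job_cutoff=620):
    """Toy feedback loop; all thresholds and step sizes are invented."""
    score = start_score
    history = [score]
    for _ in range(months):
        # Assumption: employers screen applicants on credit score.
        employed = score >= job_cutoff
        if employed:
            # Steady income -> bills paid on time -> score rises.
            score = min(850, score + 5)
        else:
            # No income -> missed payments -> score falls.
            score = max(300, score - 5)
        history.append(score)
    return history


# Two people who start only 20 points apart, on either side of the cutoff.
above = simulate_credit_loop(630)
below = simulate_credit_loop(610)
print(above[-1], below[-1])  # 750 490
```

A small starting gap grows into a large one because the score decides who gets the chance to improve the score.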

Our Role

Data Scientists as Experts

People will look to you and trust your decisions regarding data.

  • Data Sets:
    • What do these numbers mean?
    • Can we trust these numbers?
    • What data should we collect?
    • How do we collect it?
  • Modeling:
    • Does X have an impact on Y?
    • How large is that effect?
    • Is the effect significant?
    • How do we alter the effect?

Reproducible Work is Good Work

Professional Ethics

Reasonable people can disagree on specifics

graph TD
    Stakeholders --- You
    Team --- You
    bp[Best Practice] --- You

    You{You}

    You --- sg[Social Good]
    You --- pc[Professional Community]
    You --- s[Yourself]
    
    linkStyle 0 stroke:white
    linkStyle 1 stroke:white
    linkStyle 2 stroke:white
    linkStyle 3 stroke:white
    linkStyle 4 stroke:white
    linkStyle 5 stroke:white

Data Science Oath

I swear to fulfill, to the best of my ability and judgment, this covenant:

I will respect the hard-won scientific gains of those data scientists in whose steps I walk and gladly share such knowledge as is mine with those who follow.

I will apply, for the benefit of society, all measures which are required, avoiding misrepresentations of data and analysis results. I will remember that there is art to data science as well as science and that consistency, candor, and compassion should outweigh the algorithm’s precision or the interventionist’s influence.

I will not be ashamed to say, “I know not,” nor will I fail to call in my colleagues when the skills of another are needed for solving a problem.

I will respect the privacy of my data subjects, for their data are not disclosed to me that the world may know, so I will tread with care in matters of privacy and security. If it is given to me to do good with my analyses, all thanks. But it may also be within my power to do harm, and this responsibility must be faced with humbleness and awareness of my own limitations.

I will remember that my data are not just numbers without meaning or context, but represent real people and situations, and that my work may lead to unintended societal consequences, such as inequality, poverty, and disparities due to algorithmic bias. My responsibility must consider potential consequences of my extraction of meaning from data and ensure my analyses help make better decisions.

I will perform personalization where appropriate, but I will always look for a path to fair treatment and nondiscrimination.

I will remember that I remain a member of society, with special obligations to all my fellow human beings, those who need help and those who don’t.

If I do not violate this oath, may I enjoy vitality and virtuosity, respected for my contributions and remembered for my leadership thereafter. May I always act to preserve the finest traditions of my calling and may I long experience the joy of helping those who can benefit from my work.

Case Studies

Filling in the Blanks

We can use imputation to fill in missing data from the data we do have. This process, while based on statistics, is still just an educated guess.

Rethnicity
Predict ethnicity by name.
wru
Predict race by name.
gender
Predict gender by name.
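
As a minimal sketch of the idea, here is group-mean imputation in plain Python. The rows, the field names, and the helper `impute_group_mean` are all hypothetical, and this is a much simpler technique than what the packages above use.

```python
from statistics import mean

# Hypothetical survey rows; None marks a missing income value.
rows = [
    {"neighborhood": "A", "income": 40000},
    {"neighborhood": "A", "income": 50000},
    {"neighborhood": "A", "income": None},
    {"neighborhood": "B", "income": 90000},
    {"neighborhood": "B", "income": None},
]


def impute_group_mean(rows, group_key, value_key):
    """Fill missing values with the mean of known values in the same group."""
    known = {}
    for row in rows:
        if row[value_key] is not None:
            known.setdefault(row[group_key], []).append(row[value_key])
    group_means = {group: mean(values) for group, values in known.items()}

    filled = []
    for row in rows:
        new_row = dict(row)
        # Keep a flag so later analysis can tell guesses from observations.
        new_row["imputed"] = row[value_key] is None
        if new_row["imputed"]:
            # Raises KeyError if a group has no known values at all.
            new_row[value_key] = group_means[row[group_key]]
        filled.append(new_row)
    return filled


filled = impute_group_mean(rows, "neighborhood", "income")
for row in filled:
    print(row)
```

The `imputed` flag matters: once a guess is written into the table, nothing else distinguishes it from a real answer.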

Proxy Perils

Q: Should zip code be used as a data point when determining credit scores?

Q: Should race be used as a data point when determining credit scores?

Race and zip code are highly correlated.
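
A small sketch of why this matters, using entirely synthetic data. The zip codes, the groups "X" and "Y", and the approval rule are all invented. The rule never sees group membership, yet the outcomes still split along group lines.

```python
# Synthetic applicants as (zip_code, group) pairs; all values are invented.
# Group X lives mostly in 11111, group Y mostly in 22222.
applicants = (
    [("11111", "X")] * 80 + [("11111", "Y")] * 20
    + [("22222", "X")] * 20 + [("22222", "Y")] * 80
)


def approve(zip_code):
    """A 'group-blind' rule that only looks at zip code."""
    return zip_code == "11111"


def approval_rate(group):
    """Share of applicants in a group that the zip-code rule approves."""
    zips = [zip_code for zip_code, g in applicants if g == group]
    return sum(approve(z) for z in zips) / len(zips)


rate_x = approval_rate("X")
rate_y = approval_rate("Y")
print(rate_x, rate_y)  # 0.8 0.2
```

Dropping the sensitive column does not remove the bias when another column carries the same information.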

Ethics of Efficacy

“We found that a person’s last name was one of the most powerful predictors of whether a person would default on a loan. Why would we exclude one of our most useful metrics?”

Data Protection No Matter the Cost

The Health Insurance Portability and Accountability Act (HIPAA) protects personal health care data from being shared without necessity. There are exceptions for suspected abuse and other serious situations.


In my field of criminology, researchers will sometimes add an arbitrary health question to their interview/survey questions, so that they can have their research materials covered by HIPAA. This prevents law enforcement from being able to subpoena their records.

Unintended Uses

By merging data on campaign financing and the receipt of civic services, we can examine whether political contributions influence how quickly the city responds to requests for help.


This can help us understand how corruption operates within a city.


However, this will involve merging data from individual people, their businesses, and their political activities. All of this data is open, but no one would have a reasonable expectation that this is how their data would be used.
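
A rough sketch of how little it takes to link such records. The names, amounts, and fields below are invented, and `merge_on_name` is a hypothetical helper. A real merge would also need to handle people who share a name.

```python
# Hypothetical open records; every name and number is invented.
contributions = [
    {"name": "Pat Doe", "amount": 500},
    {"name": "Sam Roe", "amount": 0},
]
service_requests = [
    {"name": "Pat Doe", "days_to_response": 2},
    {"name": "Sam Roe", "days_to_response": 9},
    {"name": "Lee Poe", "days_to_response": 7},
]


def merge_on_name(left, right):
    """Inner join two lists of records on the 'name' field."""
    by_name = {row["name"]: row for row in left}
    return [
        {**by_name[row["name"]], **row}
        for row in right
        if row["name"] in by_name
    ]


merged = merge_on_name(contributions, service_requests)
for row in merged:
    print(row)
```

Each source is harmless on its own. The merged rows are what tie a named person's political activity to their dealings with the city.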


Is this ethical?

For Next Time

Topic

Project 1

To-Do

  • Organize into Teams
  • Tell me the teams and I will create private Slack channels for each