Solon Barocas & Andrew Selbst, Big Data’s Disparate Impact (Sept. 14, 2013) (unpublished manuscript).
This paper, which received the IAPP Privacy Paper Award at the Privacy Law Scholars Conference in 2014, argues that data mining, or applying statistical analysis to data sets via computer algorithms to reveal patterns helpful to decision-makers, can unintentionally discriminate against historically disadvantaged groups. Part I provides background on the science and logistics of data mining and shows its potential for discrimination. Id. at 5. Part II discusses Title VII anti-discrimination jurisprudence in the context of data mining. Id. Part III discusses possible remedies to the discriminatory potential of data mining, focusing on the difficulties in crafting and passing legislation to address this issue. Id.
In Part I, the authors identify five mechanisms through which data mining can perpetuate bias. Id. at 6. First, data mining defines the “target variable,” or the outcome of interest, and “class labels,” or the different classes of data that the program must distinguish, in order to convert an amorphous problem into an algorithm that computers can use to parse a data set. Id. at 6-7. Defining the target variable is a subjective process, leaving room for the data miner’s subconscious bias to enter the analysis. Id. at 8. Companies often use data mining in employment decisions, so the target variables and class labels correspond to characteristics of a good employee, for example, higher sales volume, shorter production time, and longer tenure. Id. at 9. The target variable and class labels determine the results of data mining, and if bias enters into those elements (for example, if tenure is used as a class label and members of certain protected classes tend to have shorter tenures, the results will disfavor those classes), then the results of the data mining will also be biased. Id. at 9-10. Second, data mining algorithms learn by example, and in order to teach the algorithms what to look for, data miners use “training data.” Id. at 10. If the training data contain bias, either through subtly mislabeled examples that reflect societal prejudice or through incomplete data collection that over- or underrepresents certain classes, then data mining’s results will be biased. Id. at 10-17. For example, LinkedIn uses a data mining system to recommend employees to companies, and if the algorithm learns that a company disfavors candidates from a certain protected class, it will recommend those candidates less frequently to that company. Id. at 13. Third, data mining involves “feature selection,” where data miners pick input variables to further differentiate the data. Id. at 17.
Feature selection can lead to biased results because data miners may fail to include enough features to make pertinent distinctions among members of a protected class, and the features that are included may contain baked-in prejudices, for example, attendance at a prestigious university (members of certain protected classes graduate from prestigious universities at disproportionately low rates). Id. at 17-20. Fourth, criteria that sort data to achieve the desired outcome (for example, factors that accurately predict which applicant will excel at the job) may also serve as “proxies” for membership in a protected class, teaching the data mining algorithm that there is a relationship between the desired outcome and those who are not members of a protected class. Id. at 20. Fifth, data miners can “mask” intentional discrimination behind the above-listed factors that lead to unintentional discrimination in data mining. Id. at 21-22.
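The mechanisms the authors describe can be made concrete with a small illustration. The sketch below is our own hypothetical, not drawn from the paper: a toy “learning by example” rule is trained on invented historical hiring labels, and a zip code stands in as a proxy for protected-class membership. The learned model simply replays the prejudice embedded in its training data.

```python
# Hypothetical illustration of biased training data plus a proxy feature.
# All data here are invented; the "model" is a toy majority-vote rule, a
# stand-in for a real data mining algorithm that learns by example.

from collections import defaultdict

# Each applicant: (zip_code, historical label). The zip code correlates
# with protected-class membership; the labels reflect past prejudice
# rather than actual job performance.
training_data = [
    ("10001", "hire"), ("10001", "hire"), ("10001", "reject"),
    ("60620", "reject"), ("60620", "reject"), ("60620", "hire"),
]

def train_majority_rule(examples):
    """Learn the majority label for each feature value."""
    votes = defaultdict(lambda: defaultdict(int))
    for feature, label in examples:
        votes[feature][label] += 1
    return {f: max(counts, key=counts.get) for f, counts in votes.items()}

model = train_majority_rule(training_data)

# The learned rule reproduces the historical bias: applicants from zip
# 60620 are rejected regardless of individual merit.
assert model["10001"] == "hire"
assert model["60620"] == "reject"
```

Nothing in the rule mentions a protected class, which is precisely the authors’ point: facially neutral criteria can encode discrimination through the data and proxies they rely on.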
In Part II, the authors explain why current anti-discrimination law, namely Title VII, is not well equipped to deal with these discriminatory features of data mining. Id. at 23. Discrimination in employment is enforceable under Title VII under two theories: (1) disparate treatment, comprising formal disparate treatment of similarly situated people and intent to discriminate, and (2) disparate impact, or hiring policies that are facially neutral but have a disproportionately adverse impact on protected classes. Id. at 24. Under the disparate treatment theory of liability, both masking intentional bias with an algorithm and using race as a class label or feature (channeling “rational racism,” or the belief that race is an accurate proxy for job success) could result in a violation of Title VII. Id. at 24-25. Plaintiffs will also likely find it very difficult to produce evidence of intentional discrimination in masking cases. Id. at 43-45. But masking and rational racism are not the most common ways in which data mining discriminates, so the disparate treatment theory will likely not be helpful in regulating discriminatory data mining. Id. at 30-31. Because it does not require intent, the disparate impact theory has more potential to address discriminatory data mining. Id. at 31. To prove a Title VII violation under this theory, (1) the plaintiff shows that an employment practice causes a disparate impact on a protected class, (2) the defendant may defend against liability by “demonstrat[ing] that the challenged practice is job related for the position in question and consistent with business necessity,” which shifts the burden back to the plaintiff, and (3) the plaintiff can overcome this defense by showing that the defendant could have used a less discriminatory “alternative employment practice.” Id.
Most courts currently apply the more relaxed “job-relatedness” test, rather than the stricter “business necessity” test, under the second step, and defendants will likely be able to use this defense successfully to avoid Title VII liability for data mining that leads to discriminatory hiring. Id. at 33-40. Also, because discriminatory data mining results from existing unconscious bias baked into many different features of the mining program, it may be difficult for plaintiffs to point to an alternative employment practice that accomplishes the same goals but is less discriminatory. Id. at 40-42.
In Part III, the authors discuss difficulties in reforming Title VII to address discriminatory data mining. Id. at 45. First, factors internal to the data mining process present barriers to legal reform. Id. The subjective steps where a data miner’s unconscious bias may creep into the analysis, including defining the target variable and feature selection, must always involve some form of judgment, which limits reform efforts to eradicate discriminatory effects in data mining. Id. at 46, 50-51. But some groups have had success in reducing discriminatory outcomes by experimenting with many different data mining programs that use different variables and classes. Id. at 47. Reforms could target training data selection by prohibiting the use of training data tainted by prejudice, but this would likely leave little training data to use, as most available data reflect some degree of prejudice. Id. at 47-50. Finally, regulating proxies faces the difficulty of determining how correlated a relevant attribute must be with class membership to be discriminatory, and from a practical standpoint it may be impossible to completely disentangle relevant attributes from membership in protected classes. Id. at 52-54. Second, factors external to the data mining process, such as political and constitutional restraints, present barriers to legal reform. Id. at 46. Because the political climate and many members of the Supreme Court have moved away from supporting the elimination of status-based inequality as a purpose for antidiscrimination law, any reform to data mining that specifies a protected class, especially race, may not survive a constitutional challenge. Id.
Amanda Conley, Anupam Datta, Helen Nissenbaum & Divya Sharma, Sustaining Privacy and Open Justice in the Transition to Online Court Records: A Multidisciplinary Inquiry, 71 Md. L. Rev. 772 (2012).
The debate about moving court records from in-person to online access is about “core societal values,” raising the question of what courts ought to do to control the flow of personal information. Amanda Conley, Anupam Datta, Helen Nissenbaum & Divya Sharma, Sustaining Privacy and Open Justice in the Transition to Online Court Records: A Multidisciplinary Inquiry, 71 Md. L. Rev. 772, 776 (2012). In this article, the authors look mainly at state, rather than federal, courts because of the “abundance of personal information” in their records, and they use New Jersey as their point of reference. Id. The article offers a “line of analysis toward developing such policy and regulation, whether through case law or administrative rules.” Id. at 777.
The authors define case files as “the sum total of documents, media files, and exhibits that are produced and collected by the parties and/or the court as a case makes its way through the system” and court records as “the subset of these documents that remain after a case has been resolved and become part of the permanent, public record.” Id. at 780-81. These records contain a great deal of personal information and could potentially include the entire case file. Id. Even if the information does not appear especially private at first glance, “when combined with other publicly available data . . . it may provide ample information for identity thieves.” Id. at 782. Usually in state courts, the lawyers and their clients redact sensitive information like Social Security numbers, but some items may slip through, and locating and removing sensitive information will become even harder once documents are scanned into PDF format. Id.
There is a longstanding legal tradition of providing public access to court records. Id. at 785. Access to judicial records is not constitutionally protected, and any restriction on access depends on state and local law and custom. Id. at 785-87. Generally, a person need only show a “legitimate interest in the public record requested in order to gain access.” Id. at 788. Most courts provide copies for a fee, and usually once a person has a copy of the publicly available record, “she is free to use them however she chooses, provided that she does not violate any state or federal laws.” Id. at 789. Some records can be “sealed,” i.e., withheld entirely because they contain sensitive information, or some information can be “redacted.” Id. at 791. The authors note that while there are judicial rules and state statutes to help decide what is excluded and included, in practice those considerations are often on the back burner given the custom and convenience of the rules. Id. at 796. A concern is that “when they are in electronic format, court clerks may in some instances, without oversight, decide to simply place all the records online to avoid having to complete paper requests at the courthouse and to provide greater accessibility to the interested parties.” Id. at 797. Courts are reluctant to seal records because of the “fear of intruding on a constitutionally protected right to access court records” as well as the burdens and inconvenience to both courts and attorneys. Id. at 801.
The authors introduce the theory of contextual integrity, which defines “a right to privacy in personal information (that is, information about persons) in terms of appropriate flow.” Id. at 804. The theory has three categories: actors (subjects, senders, and recipients of the information), information types (the nature of the information in question, or “what it’s all about”), and transmission principles (the terms under which such transfers of information should or should not happen). Id. at 804-05. The theory also has an “evaluative layer that allows an action or practice to be judged on moral and political grounds,” meaning that, for their purposes, the authors can use the theory to examine the value of privacy to society and individuals. Id. at 806. There is also a discussion of the costs associated with flows of information, such as the “time, money and effort that the recipient has to expend to cause a flow of information.” Id. at 808.
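The three-part structure the authors summarize (actors, information types, transmission principles) can be sketched as a small data model. The field names, the example court-records context, and the norms below are our own illustrative assumptions, not drawn from the article.

```python
# A toy model of the contextual integrity framework as summarized above.
# The example context and its norms are invented for illustration.

from dataclasses import dataclass

@dataclass(frozen=True)
class InformationFlow:
    subject: str                 # actor the information is about
    sender: str                  # actor transmitting the information
    recipient: str               # actor receiving the information
    info_type: str               # the nature of the information
    transmission_principle: str  # terms under which the transfer occurs

# Informational norms of a hypothetical court-records context: the flows
# judged appropriate within that context.
COURT_RECORD_NORMS = {
    ("court clerk", "requester", "divorce record",
     "in-person request showing a legitimate interest"),
}

def respects_contextual_integrity(flow: InformationFlow) -> bool:
    """A flow preserves contextual integrity when it matches its context's
    norms; a non-matching flow is a prima facie privacy violation."""
    key = (flow.sender, flow.recipient, flow.info_type,
           flow.transmission_principle)
    return key in COURT_RECORD_NORMS

in_person = InformationFlow("litigant", "court clerk", "requester",
                            "divorce record",
                            "in-person request showing a legitimate interest")
bulk_online = InformationFlow("litigant", "court clerk", "requester",
                              "divorce record",
                              "anonymous bulk download from the open web")

assert respects_contextual_integrity(in_person)
assert not respects_contextual_integrity(bulk_online)
```

The point of the model, and of the authors’ later empirical study, is that the same record can be appropriate to release under one transmission principle (an in-person request) and inappropriate under another (frictionless online access), even though the information itself is unchanged.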
The authors then present their eight-step information-retrieval model, designed to capture the key elements of retrieving information from court records. Id. at 810. They conducted an empirical study using two online search systems, PACER and Google Scholar, as well as searching records at two physical courthouses, the Superior Court Clerk’s office in New Jersey and the Superior Court of New Jersey. Id. at 814. They found that “between online and local access, there are significant differences in the cost of retrieving various types of personal information about a data subject.” Id. The study notes that it is easy to find personal information at a low cost if the researcher has a reasonable level of background on the subject. Id. at 818. A physical search costs more because of the time required to get to a courthouse and the restrictive nature of the courthouse. Id. at 821-22. Of note, the authors point out that “it is unclear how to enforce such access restrictions online without an expressive identification and authentication infrastructure,” which in itself will be costly though technically feasible. Id. There is a concern about “intimately personal and sometimes even embarrassing information” that could cause “scandal, defamation, harassment, ridicule or unnecessary attention and embarrassment” if put online. Id. The article mentions that open justice “must yield to the need to secure the administration of justice in the unusual event that publicity would be to its detriment.” Id. at 837.
In the final section of the paper, the authors provide brief descriptions of alternatives. The first is for the courts to “sanitize records by redacting proper names and possibly other immediately identifying information.” Id. at 839. The authors believe that sanitizing records prior to posting them online “promises protections against threats to privacy without significantly compromising values and goals of the courts.” Id. at 841. The second option is to “retain the status quo for local access but produce a sanitized version for the open, indexable Web.” Id. at 843. The third option would be to build differential access into the structure of the digital records. Id. at 844. All options should be discussed in further detail, according to the authors, and there should be further discussion of the courts’ responsibilities “as creators of databases of personal information.” Id. at 847.
Rob Kitchin, The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences (2014)
Professor Rob Kitchin’s overview of “the data revolution” is the best monograph we have discovered on open and big data. It defines the issues of open and big data and the potential consequences of the data revolution. In a balanced way, and without the hyperbole of trade press books on big data, Kitchin explains that the data revolution has implications for governance, the management of business, and even the understanding of science and knowledge. Therefore, the movement needs a “more critical and philosophical engagement.” Kitchin extensively reviews the literature to highlight the contours of such an engagement. Here we focus on chapters one and three, the most relevant to open data, but other chapters define big data and situate it in a larger historical and scientific context.
The discourse of big and open data is often infused with the idea that data are the objective, raw material of the world. In chapter one (available online), Kitchin explores this and other philosophical understandings of data. Relying on Bowker and Star, he elucidates the idea that data are defined in a “normative, political, and ethical” process “that is often contested and has consequences for subsequent analysis, interpretation and action.” This means that the open data movement must critically examine how data are assembled. This need is all the more urgent because of the emphasis open data places on allowing anyone to analyze data and come to her own conclusions. This raises the risk that researchers use data without “knowing the politics of why and how such databases were constructed, the technical aspects of their generation, or having personal familiarity with the phenomena captured.”
Chapter three analyzes the open data movement, setting forth the benefits of open data but paying particular attention to overlooked problems with its conceptualization and the politics of data. A transparency agenda can be politics by another means, directed at weakening a state or at privatizing an asset held by the state. Transparency movements are also selective in their targets; if openness were truly an end in itself, it would also militate in favor of weakening intellectual property regimes and opening proprietary corporate data. Finally, it is assumed that open data’s openness will benefit all, but open data systems require neither a duty of beneficence for access nor any diffusion of the benefits of their use to the public. There is a risk that open data empowers the empowered: most citizens cannot access, interpret, and translate this data into action, whereas the most empowered users are probably already located in data companies, with greater skills and with access to proprietary data to enhance open data. Kitchin raises these critiques not to argue for locking up datasets, but to uncover the confusion and hidden politics of the movement.
The book’s bibliography is available here: http://thedatarevolutionbook.wordpress.com/
Heather MacNeil, Without Consent: The Ethics of Disclosing Personal Information in Public Archives (1992)
Without Consent is an approachable, well-written adaptation of Professor Heather MacNeil’s (Professor, Faculty of Information, University of Toronto) master’s thesis on the ethics of disclosing personal information in public records. MacNeil’s book considers the ethical quandary of archivists, who must balance “…two social values—the individual’s need for privacy and society’s need to understand itself…”
To frame the issue, MacNeil starts with a well-informed and parsimonious summary of the philosophical underpinnings and definition of privacy. She then pivots to discuss the public interests in disclosure, including the conflicts between liberal interests in privacy-autonomy versus interests in historical and social research into marginalized populations. An entire chapter is devoted to “documenting the lives of the laboring and unlettered,” the point of which is to highlight that privacy rights might impede liberal goals of addressing social problems through telling the history of the poor. She observes, “…many social historians have focused their gaze on prisons, hospitals, mental asylums, homes for wayward girls, poor farms, reform school…as a means of determining the dominant assumptions of these institutions and their impact on client populations as well as the society at large.” Record linkage among agencies and examination of individual records may be necessary for good historical research into social processes and the structure of society.
The Stockholm Metropolitan Study is invoked to introduce ethical issues in research into archival records. In it, researchers secretly amassed computerized records on all persons born in Stockholm in 1953. The study came to light in 1986. The study “participants” were tracked in every fashion, including through different social service agencies. Data collected included school records, records of sexual problems, taxes, living situation, etc. How could this advanced, liberal democracy with a strong tradition of privacy laws have allowed this to happen?
One answer comes from researchers’ interest in academic freedom, but MacNeil quickly dispenses with the idea that a right to know could justify such privacy invasions. The more powerful argument comes from utilitarianism: “a scientific end justifies the means used to achieve it.” But such reasoning taken to an extreme could justify the Nazi medical experiments, the Tuskegee syphilis experiment, or the “tearoom trade” investigation. The question then becomes how to balance society’s interest in research versus individual autonomy.
Utilitarian supporters of research often argue that they are bound by a norm of beneficence that tilts the balance toward disclosure of personal information. MacNeil examines this point, noting that the benefits of most social science research (as opposed to medical research) are speculative. Openness advocates trumpet these speculative benefits and are willing to account for them cumulatively, yet at the same time they denigrate privacy harms as speculative and claim that such harms cannot be accounted for. Utility calculations are further skewed because the privacy risks that researchers ask society to take are often unevenly distributed, with investigation focusing on lower-socioeconomic-status individuals who may be unable to object. Researchers also substitute their own cost/benefit values for those of data subjects and form their opinions about harm without consulting the subject population.
MacNeil’s point about researcher beneficence deserves examination in the context of open data. For us, MacNeil’s work points to a critical question: what are the social obligations of open data users? If open data are truly open, what if the scope of users includes malicious actors, foreign governments, or even terrorists?
MacNeil then turns to a rights-based approach, which “requires that the social research involving human subjects be judged within a framework of moral reasoning that focuses on principles that are shared between people and to which we can imagine people contractually agreeing.” It is this rights-based approach that frames MacNeil’s recommendations, a large part of which focuses on an institutional-review-board-type entity to arbitrate openness questions.
Several aspects of MacNeil’s observations should be revisited in light of technological change. MacNeil points to David Flaherty’s argument that research and statistical uses “…do not directly affect a particular person on the basis of the specific data in question…” MacNeil also discusses anonymization techniques as protective of privacy. Modern advances in de-anonymization and record-linkage technologies undermine both of these points.
Evgeny Morozov, To Save Everything, Click Here (2013) (Chapter 3: So Open It Hurts)
Evgeny Morozov starts his chapter on openness with a turn on Louis D. Brandeis’ famous quote, “Sunlight is said to be the best of disinfectants.” Morozov quips, “disinfectants, alas, are of little use to sunburn victims.” Morozov delivers a dark view of openness arguments, showing how the liberal value of transparency can be illiberal, especially when openness becomes an end in itself rather than a means to accountability. Transparency advocates, perhaps unwittingly, buy into a rational-choice-theory framing of problems when calling for openness.
Morozov’s argument is nested in a larger objection he makes to invocations of “the Internet.” According to Morozov, thinking about technology policy through the lens of “the Internet” causes policymakers to adopt a defeatist posture, unable to conceive of technical interventions because these might “break the Internet”: “Solutions are not assessed based on their merits but rather on how well they sit with the idea of a free, open, transparent ‘network’ and its ‘architecture.’” The alternative is to deconstruct the term and to examine the individual technologies and platforms that are involved in any given policy.
Morozov surveys the literature of a number of commentators on open records and transparency, including Professors Peter Winn (warning that openness in court records could lead to less cooperation with government), Joel Reidenberg (warning of the transparent citizen and the problem of losing control over data, with subsequent aggregations causing less transparency and more inaccuracy), Onora O’Neill (how transparency can undermine trust, and how trust is a more important public value than transparency), Yu and Robinson (summarized below), Conley et al. (summarized above), and Johnson et al. (a McLuhanesque argument that uses an extended metaphor of a house of mirrors to describe the distorting effects of campaign finance disclosure).
Morozov argues that Silicon Valley companies use transparency as an instrument to pry information out of the government that they can then use for advertising purposes. He calls for a more critical look at the incentives and choices made by Google and other transparency advocates: “While Internet-centrists tend to be populist and unempirical, Internet realists start with no assumptions about the intrinsic values of ‘openness’ and ‘transparency’—let alone their inherent presence in digital networks—and pay particular attention to how these notions are involved and manifested in particular debates and technologies. While Internet-centrists believe that ‘openness’ is good in itself, Internet realists investigate what the rhetoric of ‘openness’ does for governments and companies—and what they do for it.”
Morozov concludes that information, “needs to be collected and distributed in full awareness of the institutional and cultural complexity of the institutional environment in which it is gathered. Sometimes preserving the social relations that enable that environment to exist—for example, to make policing of crimes possible—might require producing data that is only half transparent or half accessible…The tyranny of openness—the result of our infatuation with Internet-centrism—must be resisted.”
(As a historical note, Brandeis’ quote was delivered in an attack against the trusts; he concluded that transparency alone was not enough to address them and endorsed the creation of an unparalleled federal regulatory agency, the Federal Trade Commission, to serve both an information-forcing and a prosecutorial role.)
Neil M. Richards & Jonathan H. King, Big Data Ethics, 49 Wake Forest L. Rev. 393 (2014)
Society is on the cusp of a Big Data Revolution that will shape societal notions of the limits of big data predictions. The authors frame the definition of big data in terms of its impact on society, referring to big data’s capacity to extract “new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments and more.” Id. at 394. Because big data allows for the collection and storage of ever-increasing amounts of information, institutions are able to analyze data in ways that undermine individual privacy, confidentiality, transparency, identity, and free choice. Id. at 395. Thus, the development of Big Data Ethics is essential to preserving those values in society. Id.
Recent developments in computing technology allow for the storage and analysis of large quantities of data and, more importantly, metadata, which threatens privacy, confidentiality, and free choice. Id. at 400. Metadata is the “data about data themselves,” produced whenever individuals interact with technology. Id. at 402. The storage of and access to metadata allow various institutions to analyze and “reverse engineer past, present, and even future breaches of privacy, confidentiality, and identity,” Id. at 393, and to compile comprehensive data profiles about individuals, which can be sold to various agencies. Id. at 404. Those agencies can use the data profiles for a wide variety of activities, ranging from targeting consumers with individualized advertisements to informing law enforcement of potential crime hotspots. Id. at 407. Such potential for privacy invasion underscores the need for Big Data Ethics to protect individuals from excessive intrusions on privacy and confidentiality. Id.
The authors argue that Big Data Ethics needs a set of rules to regulate the use of big data so that it benefits society without unnecessarily intruding on privacy. Id. at 409. The authors further argue that those rules should be governed by four normative values: privacy, confidentiality, transparency, and identity. Id. Privacy must be recognized because rules and regulations governing personal information flows could resolve the essential privacy problem posed by the use of big data. Id. Because “virtually all information exists in intermediate states between completely public and completely private,” shared private information can be confidential. Id. at 413. Transparency is necessary for the regulation of big data because it fosters trust between individuals and the institutions that use big data. Id. at 419. Considerations of identity are necessary to preserve human autonomy. Id. at 422-23. Limiting the allowed types of big data predictions and inferences can preserve individual identity by preventing such inferences, as implemented in everyday life, from removing free choice and identity from individuals. Id.
The authors clarify that Big Data Ethics needs to be implemented via a combination of legal regulations, transparency policies, a set of standard ethical sensibilities concerning information technologies, and consumer review boards. Id. at 429-31. With those policies, big data can be used effectively without infringing on individuals’ rights to privacy and confidentiality. Id.
Harlan Yu & David G. Robinson, The New Ambiguity of “Open Government,” 59 UCLA L. Rev. Disc. 178 (2012).
The terms “open government” and “open data” were originally used to refer to disclosures of government information. Harlan Yu & David G. Robinson, The New Ambiguity of “Open Government,” 59 UCLA L. Rev. Disc. 178, 181 (2012). Though the terms once carried significant political weight, in recent years they have come to refer generally to open technology. Id. Yu and Robinson argue that this blurred vocabulary confuses policymakers and diminishes both the data’s usefulness and government accountability. Id. at 182. The authors suggest separating the politics of open government from the technologies of open data.
Yu and Robinson explain that though public sector data has increasingly been made available to the public, and in increasingly friendly formats, this availability does not by itself hold governments accountable. Id. at 181. The data can be practical and useful to people; however, it may not have any political relevance or importance. Id. at 188.
The authors trace the separate origins of “open government” and “open data” and discuss how the two terms became intertwined. They follow the idea of open government from the period immediately after the Second World War through the Freedom of Information Act of 1966 and federal court decisions of the 1970s and 1980s. Id. at 186. Yu and Robinson explain that the term “open data” originates in 1970s science policy, in data-sharing arrangements between NASA and its international partners. Id. at 187.
The Human Genome Project, for example, was later completed using public funding. Id. at 190. In 1998, Opensecrets.org became one of the first projects both to make data machine readable and to promote government transparency. Id. at 192. The authors argue that “open government” is still focused on new information disclosure rather than on improving already-public data. The two worlds of open data and open government converged in the 1990s, once individuals could make use of the internet. Id. at 190. Though data was available from various governments, it was hard for third parties to utilize the data and innovate upon existing programs. Id.
The authors present a spectrum for the two ideas in order to judge data’s usefulness and government accountability. Government data must be adaptable to new uses; to this end, offline data is considered inert and effectively useless on the spectrum. Id. at 207. Disclosures of data are also judged on a scale from service delivery (convenience, or higher quality of life) to public accountability (serving a purely civic role). Id.