Funded Projects

These six projects were selected for funding in response to our call for proposals on open data.

Managing the Inferential Possibilities of Open Government Data

This project will develop ways to reason through the fears that open government data will facilitate privacy-violating inference-making. In particular, it will evaluate as an empirical and technical matter when and under what conditions government datasets are likely to support inferences that have adverse consequences for citizens, focusing on applications of machine learning that reveal the range of surprising inferential possibilities opened up by certain datasets.

This project takes as its point of departure the fact that we have a robust debate about anonymization now only because we have had systematic demonstrations of successful attacks. It therefore seeks to assemble a corresponding set of demonstrations for the inferences facilitated by machine learning, showcasing both its surprising reach and its serious limits.

The project will entail (1) a systematic analysis of existing empirical work that assesses the strength of both the methods and the findings; (2) a compilation of case studies of failed attempts at inference and underwhelming results that reveal fundamental limits to inference-making; and (3) interviews with practitioners that make explicit the heuristics on which they rely to assess the promise or limited potential of different datasets for successful inference-making.

Investigators: Dr. Solon Barocas and Professor Arvind Narayanan

Privacy and Court Records: Online Access and the Loss of Practical Obscurity

Court records contain a number of types of information that could be characterized as sensitive or private, ranging from social security numbers to the names of minor children involved in sexual abuse. Little work has been done, however, to study how often this information appears in judicial records and the context in which it appears. The lack of empirical data hamstrings court personnel and other archivists who are attempting to balance privacy interests with the public’s right of access, as well as scholars looking to adapt privacy law and First Amendment doctrines to deal with the flood of public records going online.

This proposal aims to fill this critical gap in our knowledge by applying the tools of social science research to an extensive set of North Carolina Supreme Court records. Through sampling and content coding, we will (1) catalog the types of “sensitive information” that appear in the records; (2) determine the frequency of appearance of this information; and (3) analyze the context in which the information appears.

The results of this research will be valuable for a number of reasons. First, they will help us add much-needed detail to the term “sensitive information” as it applies in the context of judicial records. Second, an understanding of the context in which the various types of information appear will help policymakers and judges better evaluate the harm to privacy interests that might arise from the disclosure of sensitive information in court briefs and related records. Third, this research will have practical implications for court personnel and archivists as they develop rules and practices for electronic filing of court records or digitization of older records. Finally, this research will be valuable to privacy scholars, who can use our empirical data to ground their normative arguments.

Investigators: Professors David Ardia and Anne Klinefelter

Municipal Open Data

Municipalities across the US perceive the potential benefits to their organizations and the public at large from making the datasets they collect available online to the public. However, the same municipalities, along with numerous scholars and public policy advocates, are increasingly concerned about the consequences of releases of data about local residents. In particular, public entities collect and maintain databases that include personally identifiable and financially meaningful information about the people within their jurisdiction, and releases of data without consideration of privacy could have an adverse impact on individuals or society. Similarly, released datasets that allow the categorization of individuals into groups can raise concerns for social equity. The purpose of this research is to assist municipalities by way of a case study in Seattle on the City’s past and present releases of data, public preferences and awareness of open data releases, and evolving formats and implications of such releases with the adoption of new technologies. Furthermore, this research includes collaboration with the City to formulate a set of criteria and procedures for governing the release of datasets to the general public.

Investigators: Professors Jan Whittington and Ryan Calo

Towards a More Advanced Model for Privacy-Aware Government Data Releases

This collaboration between the Berkman Center for Internet & Society at Harvard University and the Program on Information Science at MIT Libraries will explore approaches to preserving privacy and utility in government releases of information. Using an interdisciplinary analytical framework, the team will critically examine real-world models used by governments to make data available to the public, and assess the advantages and disadvantages of alternative risk assessment or disclosure limitation methods for sharing sensitive information. The outcome will include recommendations for designing data releases that are informed by recent advances in data privacy from the fields of computation, statistics, law, and social science.

Investigators: Professor Urs Gasser, David O’Brien, Dr. Micah Altman, and Alexandra Wood

Reconciling Fair Information Principles and Open Data Policies

Public sector bodies are viewed as key sources of open data. Governments around the world have made opening up data a priority and an integral part of their wider open government agendas. However, there are widespread concerns that releasing government data sets with personal information threatens privacy and related rights and interests.

We propose to examine this tension between open data policy and privacy interests through the Fair Information Principles (FIPs), with special focus on their elaboration in the EU data protection laws. The Fair Information Principles are the common core of most data privacy laws and guidelines around the world, including those in the US.

The project will combine legal analysis by IViR, a leading research centre in the field of information law at the University of Amsterdam, with empirical research using state-of-the-art digital research methods by the University of Amsterdam Media Studies Department’s Digital Methods Initiative (DMI). The empirical study will highlight which actors (e.g. government, civil society, private sector) are talking about open data and privacy, what issues they are concerned about, and how these issues are being presented. As well as informing our legal analysis, it will contribute to a better understanding of the most pressing legal issues in public policy debates in this area.

Investigators: Professors Mireille van Eechoud and Richard Rogers, Dr. Frederik Zuiderveen Borgesius, and Jonathan Gray

Open Public Health Data as a Model for Cybersecurity

Researchers and practitioners routinely complain about the lack of data to support cybersecurity research and action. One could argue that the lack of open data in cybersecurity perpetuates information asymmetries within the market and limits the ability of some actors to engage in cybersecurity-enhancing actions. If open data is to include cybersecurity data, we must identify policy choices that will ensure that openness is consistent with, and takes account of, competing values. Public health presents a similarly complex set of values and tensions between individual and broader social interests, and it is an area where surveillance and analysis are used to advance the goals of identifying, containing, and managing disease.

We propose to compare these two information environments. First, we will consider existing and proposed policies that support open data on the public health side: their rationale, the politics of their adoption, their strengths and limitations, and their current efficacy and relevance. We will consider their utility in the context of cybersecurity: what gains might similar policies produce, and what risks might they create? Second, reexamining the system through the lens of cybersecurity, we will ask what data practices or platforms the government could adopt to enhance researcher access to sensitive government data without releasing the underlying datasets. We’ve chosen to explore these two areas because we believe they are among the most complicated in which to consider disclosure, yet may also offer great societal gains given the public-goods nature of the problem.

Investigators: Professor Deirdre Mulligan and Elaine Sedenberg