The ABCs of Data Management: Sampling, Weights, and Missing Data

Center for the Study of Law and Society
Miniseries in Empirical Research Methods

"The ABCs of Data Management:
Sampling, Weights, and Missing Data (using STATA)"

Friday, April 9, 2010, 9 a.m. - 12 noon. Lunch to follow.
JSP Seminar Room, 2240 Piedmont Avenue, Berkeley

Dr. Su Li
Statistician, School of Law,
University of California, Berkeley

The workshop covers the following topics:
1) Understanding the opportunities and constraints of third-party data sets.
2) Using syntax files (command files) for data management.
3) Selecting appropriate samples.
4) Applying weights and developing relative weights.
5) Dealing with missing values in the data (including listwise deletion,
imputation and multiple imputation, and other techniques).
6) Constructing new variables (recoding or combining several variables).

Abstract

One of the most challenging and time-consuming tasks for researchers conducting quantitative empirical studies is “cleaning up” existing datasets, especially survey data collected by a third party, so that they are usable for analysis. This workshop introduces participants to the basics of data cleaning while addressing the most common data management problems. Data cleaning, or data management, requires a good understanding of the nature of the dataset under consideration, since most existing survey data cover only a subset of the target population (e.g. "all American adults"), selected using probability sampling methods. Depending on the sampling method used in the survey, some groups may have been oversampled, which must be corrected through weights. Moreover, in many cases a researcher is interested in only a specific group within the target population (e.g. "female American adults"), which requires specifying the sample restrictions beforehand. Another important component of data cleaning is locating the missing cells in an existing dataset and finding a way to reduce the amount of missing data. Only after the missing-value, weighting, and sample-selection issues have been resolved can the data be used for statistical analysis. To keep track of the data management steps taken and to reduce human error in the cleaning process, it is also useful to write syntax commands for every data cleaning or management step applied to the original data. This workshop also introduces participants to strategies for constructing and managing syntax (command) files for the purpose of data management.

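As a rough illustration of the steps the abstract describes (restricting the sample, dropping cases with missing values, and rescaling weights), here is a minimal sketch. It is written in plain Python for readability; the workshop itself uses Stata, and this is not the workshop's code. The records, variable names, and helper functions are all hypothetical. "Relative weight" is assumed here to mean the raw weight divided by the mean weight of the analysis sample, so that the weights sum to the number of cases.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"sex": "F", "age": 34, "income": 52000.0, "weight": 1.5},
    {"sex": "F", "age": 51, "income": None,    "weight": 0.5},
    {"sex": "M", "age": 45, "income": 61000.0, "weight": 2.0},
    {"sex": "F", "age": 29, "income": 40000.0, "weight": 0.5},
    {"sex": "F", "age": 17, "income": 8000.0,  "weight": 1.0},
]


def restrict_sample(rows):
    """Keep only the subgroup of interest: female adults (age 18 or older)."""
    return [r for r in rows if r["sex"] == "F" and r["age"] >= 18]


def listwise_delete(rows, variables):
    """Drop every case that is missing a value on any listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def add_relative_weights(rows):
    """Rescale raw weights by their mean (assumed definition of relative weight)."""
    mean_w = sum(r["weight"] for r in rows) / len(rows)
    return [dict(r, rel_weight=r["weight"] / mean_w) for r in rows]


def weighted_mean(rows, variable, weight="rel_weight"):
    """Weighted mean of one variable, using the named weight column."""
    total_w = sum(r[weight] for r in rows)
    return sum(r[variable] * r[weight] for r in rows) / total_w


# Each step returns a new list, so the original records are never modified.
sample = restrict_sample(records)
complete = listwise_delete(sample, ["income", "weight"])
weighted = add_relative_weights(complete)

print("cases:", len(records), "->", len(sample), "->", len(complete))
print("weighted mean income:", weighted_mean(weighted, "income"))
```

Keeping every step in a script, rather than editing the data by hand, is the same idea as the syntax (command) files discussed above: the original data stay untouched and each decision can be reviewed or rerun.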

Video of event

Additional materials:
PowerPoint Slides
Handouts