
By Leslie A. Gordon
When law professor Justin McCrary was a UC Berkeley graduate student, he routinely brought the economics department server to its knees by analyzing 40 years of crime data, or roughly a half-million observations. A few years later, another one of McCrary’s research projects involved analyzing data related to more than a hundred million birth certificates. This time, the server survived.
For many academics, “that kind of scale was unimaginable,” according to McCrary, who teaches law and economics at Berkeley Law. “And the scale of data marches on each day. In a current research project, we’re analyzing a half trillion quotes for publicly traded securities.”
These so-called Big Data projects are “exciting,” McCrary said, because of the information they reveal. In recent years, Big Data analysis has led to conclusions about where meth labs are being set up, how the “mood” of the Twitter community affects the stock market, and how parents are two-and-a-half times more likely to ask Google “Is my son gifted?” than “Is my daughter gifted?”
Large-scale projects help us understand our world, McCrary explained, but conducting this kind of research isn’t easy.
Social scientists use sophisticated software to analyze Big Data, and today’s requisite programs bear little resemblance to the software used just ten years ago. That poses a huge challenge to the traditional model of graduate education, McCrary said.
Empirical analysis is changing so rapidly that professors don’t always know how to process the data. For example, many students today embark on projects that involve “scraping” data off a website, that is, extracting it automatically from web pages rather than downloading a ready-made dataset. But most faculty, even those who are empirically oriented, don’t know how to scrape.
Data explosion
Enter D-Lab, the Social Sciences Data Laboratory, a place where UC Berkeley social scientists go to learn about processing Big Data. A venue for methodological exchange, it is both a physical space (currently in Barrows Hall) and a virtual space for learning from, and interacting with, academics confronting Big Data challenges. McCrary is D-Lab’s first director, taking the helm this July.
Focused on research design, D-Lab provides faculty and students with infrastructure and services, including classes on new software packages and drop-in hours in its computer lab.
“This is amazingly efficient compared to how students used to learn to process data. Just as posting to Facebook is a lot faster than e-mailing each of your friends individually, learning software is easier and faster in a group setting,” said McCrary, who is also a faculty research fellow at the National Bureau of Economic Research, co-director of the law school’s Law and Economics Program, and a fellow at Berkeley’s Criminal Justice Research Program in the Institute for Legal Studies.
In partnership with the Berkeley Institute for Data Science, D-Lab, which is rolling out in phases, brings together students and faculty with related interests and assists them in storing data for projects that involve confidential information.
“All of these activities are inherently multi-disciplinary because every social science involves empirical analysis; D-Lab has folks from political science, sociology, history, psychology, demography, education and economics,” among other disciplines, according to McCrary.
Most universities are at least a little bit behind the “explosion of data” because the landscape is shifting so rapidly. But “Berkeley is investing at the right time and in a way that builds on its terrific strengths not only in the social sciences, but also in statistics, computer science, and engineering,” he said.