Author(s): Jason Schultz
Year: 2012
Abstract:
This case raises many legal, technical, and epistemological issues related to
the future of higher education, research, and scholarship – especially
those efforts that seek to take advantage of “big data” analytics and
methodologies. Advances in computer technology and the availability of
digital texts will give scholars of the humanities a chance to do what
biologists, physicists, and economists have been doing for decades –
analyze massive amounts of data. Large-scale quantitative projects like
those being undertaken at the Stanford Literary Lab are unearthing
previously unknowable information about individual works and entire
genres of literature.
Researchers working in Information
Retrieval frequently use text mining and computer-aided classification
to identify and retrieve relevant documents. Using similar techniques,
researchers in the Digital Humanities are able to identify and retrieve
relevant texts, often from unlikely places. Humanities researchers can
thereby expand their traditional study of a few canonical works to a
study of any one of the several million books in the larger archive of
literary history – an archive that has hitherto remained hidden because
of the limitations of humans’ reading capacity.
In this amicus
brief, scholars from disciplines including law, computer science,
linguistics, history, and literature ask the court to consider the impact
on this vital area of research when ruling on the legality of mass
digitization. Specifically, the brief addresses whether United States
copyright law should stand as an obstacle to statistical and
computational analysis of the millions of books owned by the nation’s
great university libraries.
The brief argues that, just as
copyright law has long recognized the distinction between protection for
an author’s original expression (e.g., the narrative prose describing
the plot) and the public’s right to access the facts and ideas contained
within that expression (e.g., a list of characters or the places they
visit), the law must also recognize the distinction between copying
books for expressive purposes (e.g., reading) and nonexpressive
purposes, such as extracting metadata and conducting macroanalyses. We
amici urge the court to follow established precedent with respect to
Internet search engines, software reverse engineering, and plagiarism
detection software, and to hold that the digitization of books for
text-mining purposes is a form of incidental or intermediate copying to
be regarded as fair use as long as the end product is also nonexpressive
or otherwise non-infringing.
Keywords: copyright, text-mining, digital humanities, nonexpressive use, HathiTrust
Link: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2102542