The rise of large-scale digitized book collections—such as those provided by Google Books, the HathiTrust and the
Internet Archive—is enabling a fundamentally new kind of text analysis that exploits the scale of collections to ask
questions not possible with smaller corpora. Many of these research questions are driven by historically deep textual
collections—corpora that span several decades or centuries in their publication. Moretti (2007) analyzes the changing
distribution of British novels from 1740 to 1900; Heuser and Le-Khac (2012) chart the changing semantic fields of
British novels from the long 19th century; Wilkens (2013) analyzes the changing attention given to geographical
locations throughout American texts published before and after the Civil War.

Historical analysis of this kind depends on accurate metadata for the books in the collection; in order to measure
how genre changes over time, for example, we need to be relatively confident in the dates on which we believe the books were
written. For HathiTrust and the other large-scale collections mentioned above, metadata comes from bibliographic
records from the original digitizing library; the date information available to us is the date of publication of the
specific digitized work. As Figure 1 shows, however, this date of publication can be very far removed from the original
date of composition: Jane Austen’s Pride and Prejudice was originally published in 1813, but none of the 107 editions
represented in the HathiTrust carries a date of publication this early; most, as illustrated here, date from the
20th century.

Figure 1: Sample record from the HathiTrust illustrating the date of publication for Pride and Prejudice, originally published in

Relying on the recorded publication date for a text can introduce substantial bias into any subsequent
computational analysis, simply because texts like Pride and Prejudice are then treated as typical of books
originally published in 1906. Underwood (2016) notes this is one of the “three nasty problems” inhibiting work in
large-scale literary history. Our proposal involves fundamental research to help solve this problem by (a) working with
librarians to develop best practices for annotating the first date of publication for texts in the HathiTrust, (b) annotating
those dates for a sample of 15,000 texts, and (c) developing computational models to predict the first date of publication
using that annotated data. In Functional Requirements for Bibliographic Records (FRBR) terms, we seek to
identify the publication date of a work as distinct from the publication date of a particular manifestation. Since many
research projects are now leveraging the HathiTrust for this kind of large-scale analysis, we expect this project to have
high impact both within the Berkeley community and beyond.
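
The work/manifestation distinction can be made concrete with a minimal sketch. It assumes that editions have already been grouped under a shared work identifier, which is not something the bibliographic records provide directly. The class name, field names, identifier string, and every year other than 1906 are hypothetical illustrations, not HathiTrust data. The sketch shows only that the earliest manifestation date in a collection gives an upper bound on a work's first publication date, which is why annotation and predictive modeling are still needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifestation:
    """One digitized edition (a FRBR manifestation) of a work."""
    work_id: str           # hypothetical identifier grouping editions of one work
    publication_year: int  # date taken from the bibliographic record


def earliest_year_by_work(records):
    """Return the earliest recorded publication year for each work.

    This is only an upper bound on the true first publication date:
    the first edition may be absent from the collection.
    """
    earliest = {}
    for record in records:
        previous = earliest.get(record.work_id)
        if previous is None or record.publication_year < previous:
            earliest[record.work_id] = record.publication_year
    return earliest


# Hypothetical records; only the 1906 date appears in the text above.
records = [
    Manifestation("austen_pride_and_prejudice", 1906),
    Manifestation("austen_pride_and_prejudice", 1894),
    Manifestation("austen_pride_and_prejudice", 1950),
]

upper_bounds = earliest_year_by_work(records)
```

With these illustrative records, the bound for Pride and Prejudice is still decades later than 1813, so the earliest date in the collection cannot simply be substituted for the date of first publication.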

Project type