Our Text Thresher software will enable digital humanities researchers to enlist the help of crowd workers and volunteers so they can scale up traditional ‘content analysis’ projects to identify and extract complex information from thousands (or even millions) of documents and web pages. With this capacity, digital humanists and social scientists will be able to pull useful data out of digital publications to explain and predict the dynamics of police/protester interactions; record trends in local climate change mitigation efforts; expose and quantify the long-term stagnation and broken promises of political speech; reveal cross-temporal and cross-cultural differences in the construction of gender and race categories; track the life of judicial opinions across cases and through time; and much more.

Text Thresher improves the social science practice of content analysis, making it vastly more transparent and scalable to millions of documents. Text Thresher is a web interface that operates in crowd-work environments such as Amazon's Mechanical Turk, CrowdFlower, and CrowdCrafting. The interface allows researchers to clearly specify hand-labeling and text-classification tasks in a user-friendly workflow that maximizes crowd-worker accuracy and efficiency. As crowd workers label and extract data from thousands of documents using Text Thresher, they simultaneously generate training sets that enable machine learning algorithms to augment or replace the efforts of researchers and crowd workers. Output is viewable in a web-based database and as labels layered over the original document text.
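
To illustrate how crowd labels could become a training set, here is a minimal sketch in Python. It is not Text Thresher's actual code: the function name `aggregate_labels`, the tuple layout of the annotations, the majority-vote rule, and the `min_agreement` threshold are all assumptions made for illustration. The idea is simply that several workers label the same text span, and only spans on which enough workers agree are kept as training examples.

```python
from collections import Counter


def aggregate_labels(annotations, min_agreement=0.6):
    """Turn raw crowd annotations into training examples by majority vote.

    annotations: iterable of (doc_id, span_text, worker_id, label) tuples.
    This record layout is hypothetical, not Text Thresher's data model.
    min_agreement: minimum share of workers that must choose the winning
    label for the span to be kept (an assumed quality threshold).
    """
    # Collect every label submitted for each (document, span) pair.
    votes = {}
    for doc_id, span_text, _worker_id, label in annotations:
        votes.setdefault((doc_id, span_text), []).append(label)

    training_set = []
    for (doc_id, span_text), labels in votes.items():
        # most_common(1) returns the label with the highest vote count.
        winner, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            training_set.append(
                {"doc_id": doc_id, "text": span_text, "label": winner}
            )
    return training_set


# Invented example data: three workers label one span, two label another.
example_annotations = [
    ("doc1", "Police fired tear gas", "w1", "police_action"),
    ("doc1", "Police fired tear gas", "w2", "police_action"),
    ("doc1", "Police fired tear gas", "w3", "protester_action"),
    ("doc1", "The crowd dispersed", "w1", "protester_action"),
    ("doc1", "The crowd dispersed", "w2", "police_action"),
]

training_examples = aggregate_labels(example_annotations)
```

With this invented data, the first span is kept because two of three workers agree, while the second is dropped because its two workers split evenly. The resulting records could then be passed to any text classifier as labeled examples.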

Project type