
E-Research Roundtable - New tools from the HathiTrust Research Center for digitized text analysis at scale: The HathiTrust+Bookworm tool and the Extracted Features dataset

Wednesday, February 10, 2016
12:30pm - 2:00pm

341 LIS

Event Details

Session leaders: Sayan Bhattacharyya and Peter Organisciak
As library digitization efforts produce large quantities of digitized textual content, they create the conditions of possibility for novel inferencing techniques at scale, raising tantalizing possibilities for producing new knowledge about history, linguistics, literary studies, and other related fields. However, this possibility will be realized only if tools and infrastructure to explore and analyze textual data can rise to the challenge posed by the data’s scale and access restrictions. To this end, the HathiTrust Research Center (HTRC), based jointly at the University of Illinois and Indiana University, is creating novel capabilities that give scholars access, for research purposes, to the millions of works that constitute the content of the HathiTrust Digital Library. We will discuss two such capabilities: (1) the HathiTrust+Bookworm (HT+BW) tool and (2) the HTRC Extracted Features (EF) dataset.

The first, HT+BW, is an NEH-funded multi-university initiative for visualizing language usage trends; the current prototype supports nearly five million books. The second, the HTRC Extracted Features Dataset, makes available, for the same works, certain kinds of extracted quantitative data at the page level.

In this talk, we will describe:

(1) Technical challenges posed by the massive scale of the HathiTrust Digital Library content, and how HT+BW and the HTRC EF dataset are meeting some of these challenges.
(2) Epistemic issues foregrounded by pedagogical uses of HT+BW and the HTRC EF dataset.

About the presenters:

Peter Organisciak is a Postdoctoral Research Associate with the HathiTrust Research Center and a GSLIS graduate. He has a background in digital humanities and large-scale text analysis.

Sayan Bhattacharyya is a Postdoctoral Research Associate [Council on Library and Information Resources (CLIR) Fellow] with the HathiTrust Research Center. He received his PhD in Comparative Literature from the University of Michigan, Ann Arbor.
