Class Projects

Year/Class Student Project
Ana Lucic and Catherine Blake Automatically Summarizing Medical Literature

Scientific articles in medicine often report results in the form of a comparison, where a new intervention is compared with either a control group or another intervention (or both). Comparisons are particularly important for Comparative Effectiveness Research (CER), where it was recently recommended that “CER should directly compare tests or active treatments - so-called head-to-head comparisons - of viable clinical alternatives within the current standard of practice (which in some cases may be no intervention)” (Sox, Helfand et al. 2010). Although from a language processing perspective the comparative clause construction is “almost notorious for its syntactic complexity” (Bresnan, 1973), our group has been developing methods that automatically discern comparison from non-comparison sentences (Park & Blake, 2012). This work addresses the more difficult task of assigning roles to the noun phrases in a comparison sentence. Specifically, our goal is to automatically identify the two entities being compared (the agent and the object) and the way in which those entities are compared (the basis of comparison). We describe how this goal can be framed as a classification problem, the features used, and preliminary results, which show an average accuracy of 75%, 85%, and 72% for the agent, object, and basis of the comparison, respectively, using a Support Vector Machine (SVM) classifier on sentences that are fewer than 40 words long. This work was presented as a research poster at the iSchool 2014 Research Showcase.
Bresnan, J. W. (1973). “Syntax of the Comparative Clause Construction in English.” Linguistic Inquiry 4(3): 275-343.
Park, D. H., & Blake, C. (2012). “Identifying comparative sentences in full-text scientific articles.” Association for Computational Linguistics, Workshop on Detecting Structure in Scholarly Discourse, July 12, Jeju, South Korea.
Sox, H. C., Helfand, M., Grimshaw, J., Dickersin, K., the PLoS Medicine Editors, Tovey, D., Knottnerus, J. A., & Tugwell, P. (2010). “Comparative effectiveness research: challenges for medical journals.” PLoS Medicine 7(4): e1000269.
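As a rough illustration of how the role-identification task above can be framed as a classification problem, the sketch below trains a linear SVM over per-noun-phrase feature dictionaries using scikit-learn. The feature names, labels, and example phrases are hypothetical stand-ins; the actual study used a richer feature set extracted from full-text medical articles.

```python
# Hypothetical sketch: each candidate noun phrase in a comparison sentence
# becomes one instance, labelled with its role (agent, object, basis, or none).
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = [
    ({"head": "aspirin",   "position": "before_verb", "follows_than": False}, "agent"),
    ({"head": "placebo",   "position": "after_than",  "follows_than": True},  "object"),
    ({"head": "mortality", "position": "after_in",    "follows_than": False}, "basis"),
    ({"head": "patients",  "position": "after_verb",  "follows_than": False}, "none"),
]
X, y = zip(*train)

# DictVectorizer one-hot encodes the string-valued features for the linear SVM
model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(X, y)

print(model.predict([{"head": "ibuprofen", "position": "before_verb", "follows_than": False}]))
```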
2014/LIS590AD Jooho Lee and Catherine Blake Information integration: a case study of air quality in Chicago and St. Louis.

Government agencies are increasingly making raw data available to citizens, but merely having access to data is not sufficient to realize the potential of “big data”. Answering questions in science, business, and public policy requires data integration, which is challenging when data from different sources were collected for different purposes. This project provides a detailed case study of how to integrate public data to understand the relationship between demographic factors and air quality. Demographic factors from the US Census and the American Community Survey were collected for two major cities (Chicago and St. Louis) and then integrated with air quality data from the US Environmental Protection Agency (US EPA). Results show that air quality improved in both cities between 2000 and 2012. Correlations between ethnicity, education, income level, and air quality warrant further exploration. This work was presented as a research poster at the ASIS&T 2014 conference in Seattle, WA.
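A minimal sketch of the integration step, assuming the census/ACS and EPA extracts have already been reduced to one row per city and year; every column name and value below is an invented placeholder, not the project's actual data.

```python
import pandas as pd

# Hypothetical, simplified extracts (values are invented for illustration)
demographics = pd.DataFrame({
    "city": ["Chicago", "Chicago", "St. Louis", "St. Louis"],
    "year": [2000, 2012, 2000, 2012],
    "median_income": [38625, 47408, 27156, 34582],
    "pct_bachelors": [25.5, 34.2, 19.1, 29.4],
})
air_quality = pd.DataFrame({
    "city": ["Chicago", "Chicago", "St. Louis", "St. Louis"],
    "year": [2000, 2012, 2000, 2012],
    "mean_pm25": [16.2, 11.1, 15.4, 10.3],   # annual mean PM2.5, ug/m3
})

# Join the two sources on the shared (city, year) keys, then inspect correlations
merged = demographics.merge(air_quality, on=["city", "year"], how="inner")
print(merged)
print(merged[["median_income", "pct_bachelors", "mean_pm25"]].corr())
```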
2014/LIS590AD Elizabeth Surbeck and Catherine Blake The Impact of Ground Ozone on Asthma: A Case Study Using Project INDICATOR in Champaign-Urbana

The purpose of this analysis is to examine the correspondence between short-term ozone exposure and asthma-related hospital reports, adapting the methods used in “Meta-analysis of the Association between Short-Term Exposure to Ambient Ozone and Respiratory Hospital Admissions” by Meng Ji, Daniel S. Cohan, and Michelle L. Bell. That meta-analysis, published in 2011, synthesized studies connecting hospital asthma cases with ozone measurements from air quality reports; those studies suggest a strong correlation between asthma and increases in outdoor ozone levels. The point of replicating their methods is to see whether data collected from a community such as Champaign-Urbana supports or contradicts the results and methods of the meta-analysis. The datasets were drawn from Project INDICATOR and the Environmental Protection Agency’s (EPA) Air Quality System (AQS). The analysis was carried out in R and MS Excel, using linear regression to examine the relationship between the two main quantitative variables. The results suggest a possible connection between asthma and ozone, though accounting for additional factors such as weather patterns may provide a more definitive answer. More information about this project is available in the final report and course presentation.
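A minimal sketch of the regression step, with Python's statsmodels standing in for the R/Excel workflow described above; it assumes daily ozone readings from AQS and daily asthma-related counts from Project INDICATOR have already been aligned by date, and the column names and values are invented.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical aligned daily records (values invented for illustration)
daily = pd.DataFrame({
    "ozone_ppb":     [28, 35, 41, 52, 60, 47, 33, 55, 62, 44],
    "asthma_visits": [3, 4, 5, 7, 8, 6, 4, 7, 9, 5],
})

# Ordinary least squares: asthma visits as a linear function of ozone
fit = smf.ols("asthma_visits ~ ozone_ppb", data=daily).fit()
print(fit.params)    # slope = expected change in visits per ppb of ozone
print(fit.pvalues)
```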
2013/LIS590AG Craig Evans To What Extent Is It Possible to Detect Seasonal Shift through Agricultural Data?

Inspired by the Cherry Blossom Diaries of Kyoto, Japan, which stretch back centuries to medieval times, this project used contemporary crop report data created by the US Department of Agriculture to map shifts in the seasonal planting of crops. The USDA CropReport data provide a rich set of crop growth statistics broken down by crop, stage of crop development, and state. The data do not stretch back centuries like the Kyoto diaries, but offer a weekly snapshot of crop conditions from April 1995 through October 2013. A file was produced each week with three exceptions: two were caused by extreme weather that shut down government services in Washington, DC and delayed the report, and one was due to the 2013 government sequester.

With this data, and inspired by the cherry blossom diaries, we ask whether it is possible to discern a shift in seasonal growth patterns. Crop planting time depends on a number of factors, but in this study we focus on local weather conditions, since optimal planting and harvesting times are highly dependent upon temperature and precipitation. Given this close connection, can the timing of crop planting serve as a proxy for the length of the growing season and its start and end dates in a given year? Do the seasonal crop data indicate a shift in seasons across the US? Is such a shift uniform, or does it vary on a state-by-state basis? More information about this project is available in the final report and course presentation.
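One way the shift question could be operationalized is sketched below: for each year, take the week in which planting of a given crop first reaches 50%, then fit a linear trend in that week across years. The data frame columns and values are hypothetical stand-ins for the parsed USDA crop report files.

```python
import numpy as np
import pandas as pd

# Hypothetical parsed crop-progress records for one crop in one state
progress = pd.DataFrame({
    "year":         [1996, 1996, 1996, 2005, 2005, 2005, 2013, 2013, 2013],
    "week_of_year": [17,   18,   19,   16,   17,   18,   18,   19,   20],
    "pct_planted":  [30,   48,   65,   40,   55,   72,   25,   41,   58],
})

# Week at which planting first reaches 50% in each year
planting_week = (
    progress[progress["pct_planted"] >= 50]
    .groupby("year")["week_of_year"]
    .min()
)

# Linear trend across years: a negative slope means planting is drifting earlier
slope, intercept = np.polyfit(planting_week.index, planting_week.values, deg=1)
print(f"shift of {slope:+.2f} weeks per year in the 50%-planted date")
```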
2013/LIS590AG Henry Gabb Correlation between Consumer Product Usage, Toxicant Exposure Levels, and Infant Physical Characteristics

The Illinois Kids Development Study (IKIDS) is studying whether prenatal exposure to bisphenol A (BPA), phthalates, or triclosan affects physical and cognitive development in infants. These toxicants are known endocrine disruptors: BPA mimics estrogen, phthalates have anti-androgenic properties that decrease testosterone, and triclosan is thought to mimic thyroid hormone. Exposure to these toxicants can come from many environmental sources. Phthalates are sometimes found in common household items such as plastics, cosmetics and perfumes, building materials, food wrappers, textiles, and toys. BPA is found in plastics, electronics, detergents, and food and drink containers. These toxicants are known to cause birth defects in animals, but less is known about their effects on human fetal and infant development.

IKIDS provided a static dataset from a pilot study of 200 women and their newborns. Maternal data include medical history, demographics and lifestyle, urinalysis, and dietary and product usage diaries; these data could help identify potential sources of toxicant exposure and exposure levels. Neonatal data consist of various physical (e.g., anogenital distance) and behavioral (e.g., memory, recognition, and attention span) measures, which could help assess the risk that toxicant exposure poses to fetal and infant development. To augment the IKIDS dataset, a database of consumer products and their ingredients was created by scraping information from retail websites. Mining the IKIDS product usage diaries together with this product ingredient database could point to common ingredients that are not currently thought to be hazardous; after all, the three toxicants being studied by the IKIDS project were in widespread use before their safety was called into question.
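A minimal sketch of how the product usage diaries could be cross-referenced with the scraped ingredient database to surface the ingredients that reach the most participants; the products, ingredients, and diary entries below are invented.

```python
from collections import Counter

# Hypothetical scraped database: product -> list of ingredients
ingredients = {
    "Brand A shampoo": ["water", "sodium laureth sulfate", "fragrance", "DMDM hydantoin"],
    "Brand B lotion":  ["water", "glycerin", "fragrance", "methylparaben"],
    "Brand C soap":    ["sodium tallowate", "water", "fragrance", "triclosan"],
}

# Hypothetical usage diaries: participant -> products reported
diaries = {
    "P001": ["Brand A shampoo", "Brand B lotion"],
    "P002": ["Brand B lotion", "Brand C soap"],
}

# Count, for each ingredient, how many participants report using a product containing it
exposure = Counter()
for products in diaries.values():
    reported = set()
    for product in products:
        reported.update(ingredients.get(product, []))
    exposure.update(reported)

print(exposure.most_common(5))
```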
Catherine Blake and Henry A. Gabb Parameter tuning: Exposing the gap between data curation and effective data analytics

The “big data” movement promises to deliver better decisions in all aspects of our lives, from business to science, health, and government, by using computational techniques to identify patterns in large historical collections of data. Although a unified view from curation to analysis has been proposed, current research appears to have polarized into two separate groups: those curating large datasets and those developing computational methods to identify patterns in large datasets. The case study presented here demonstrates the enormous impact that parameter tuning can have on the accuracy, precision, and recall of a computational model generated from data. It also illustrates the vastness of the parameter space that must be searched to produce optimal models, and curated to avoid redundant experiments. This highlights the need for research that focuses on the gap between collection and analytics if we are to realize the potential of big data. This work was presented as a research poster at the ASIS&T 2014 conference in Seattle, WA.
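As a rough illustration of how quickly the parameter space grows, the sketch below runs a small scikit-learn grid search over an SVM's parameters on a bundled dataset; this is a stand-in for whatever modelling tool the case study actually used. Even this tiny grid produces 27 configurations, each trained and evaluated five times under cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# 3 x 3 x 3 = 27 parameter combinations; 5-fold CV means 135 model fits
param_grid = {
    "C":      [0.1, 1, 10],
    "gamma":  [1e-4, 1e-3, "scale"],
    "kernel": ["rbf", "poly", "sigmoid"],
}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```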
2014/LIS590AD Jinlong Guo Information Behaviors at the Edge of Reason: The Role of Uncertainty, Science, and Culture on Environmental Policy

We all expect government agencies to use high-quality evidence when creating public policy; however, in a democratic society, cultural values also play a role. For example, Europe and America have different policies with respect to genetically modified foods and nuclear energy. Our goal is to explore the information behaviors that surround the process of transforming scientific evidence into public policy, in order to uncover where (and how) cultural values might be embedded. We analyzed more than a thousand citations from three scientific reviews, conducted by scientists based in the EU and US, on di(2-ethylhexyl) phthalate (DEHP), a plastic softener that is controversial because of its potential impacts on industry and public health. Our analysis suggests that culture may influence public policy by establishing the initial scope of a review, by determining which evidence should be included, and by framing how evidence is presented. This work was presented as a research poster at the ASIS&T 2014 conference in Seattle, WA.
2014/LIS590AD Matthias H. Landt Sentiment Analysis as a Tool for Understanding Fiction

The purpose of this project was to apply sentiment analysis techniques to the text of a work of fiction. The game script was ripped from the English fan-translated version of the demo of the Japanese visual novel Umineko no Naku Koro Ni (translation: When the Seagulls Cry), in such a way that each sentence within the text could be attributed either to the specific character speaking or to the narrator.

This text was compared against Bing Liu’s Opinion Lexicons: a list of words associated with positive sentiment and a list of words associated with negative sentiment. By linking the processed dataset to the opinion lexicons, sentiment analysis was performed to assess four aspects of the work: (1) the overall tone of the text, including how the tone changed as the story progressed; (2) the speaking styles of the individual characters, with regard to their use of sentiment words; (3) the relationships between the characters, based on how sentiment was used when one character directly referenced another character’s name; and (4) how the narrative perspective used sentiment to portray the various characters. Data collection and initial preprocessing were conducted by M. Landt. Subsequent transformations were conducted by M. Landt, C. Evans, and C. Blake as part of a grant from the Institute of Museum and Library Services [RE-05-12-0054-12]. More information about this project is available in the final report and course presentation.
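A minimal sketch of the lexicon-based scoring described above, assuming the script has already been parsed into (speaker, sentence) pairs and that the two Bing Liu opinion lexicon files are available locally as one-word-per-line text files; the sample dialogue lines are invented.

```python
import re
from collections import Counter

def load_lexicon(path):
    # The distributed lexicon files are latin-1 encoded with ';' comment lines
    with open(path, encoding="latin-1") as f:
        return {line.strip() for line in f if line.strip() and not line.startswith(";")}

positive = load_lexicon("positive-words.txt")
negative = load_lexicon("negative-words.txt")

# Hypothetical (speaker, sentence) pairs extracted from the game script
lines = [
    ("Battler",  "This is a wonderful, peaceful island."),
    ("Beatrice", "Your despair is useless, and your hope is a lie."),
]

net_sentiment = Counter()
for speaker, sentence in lines:
    words = re.findall(r"[a-z']+", sentence.lower())
    net_sentiment[speaker] += sum(w in positive for w in words)
    net_sentiment[speaker] -= sum(w in negative for w in words)

print(net_sentiment)   # net (positive minus negative) word count per character
```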
2014/LIS590AG Garrick T. Sherman Extracting Summary Sample Characteristics from Epidemiology Tables

Epidemiology papers frequently provide a tabular summary describing the sample included in the study. Unfortunately, because this information is generally stored within the report itself, readers must physically acquire any given paper in order to ascertain whether its sample is relevant to their own research. Making this information available as a form of metadata would greatly benefit researchers in their information-seeking activities. This project proposes and tests a system that receives an epidemiology article as input and then (1) identifies and extracts tables from the text; (2) parses tables into a structured format; (3) classifies tables as containing sample characteristics or not; and (4) aligns factors within those tables to a larger, automatically constructed ontology.

Table identification, extraction, and parsing were implemented in custom code, as were the experimental factor alignment approaches, which employed slot-filling techniques. Classification was performed in Oracle Data Miner using textual features from the tables. Results were promising, with up to 87% accuracy on the classification task using a naive Bayes classifier and up to 100% average precision on the factor alignment task.
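A rough sketch of the table-classification step, using scikit-learn's multinomial naive Bayes as a stand-in for the Oracle Data Miner model described above; each table is represented by the text of its caption and headers, and the example tables and labels are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical table text (caption + header terms) and labels
tables = [
    "Table 1. Baseline characteristics of study participants age sex BMI smoking status",
    "Table 2. Odds ratios for exposure quartiles 95% CI p-value",
    "Table 3. Demographic characteristics of the cohort race education income",
    "Table 4. Sensitivity analysis adjusted hazard ratios by follow-up period",
]
labels = ["sample_characteristics", "other", "sample_characteristics", "other"]

# Bag-of-words features feeding a naive Bayes classifier
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(tables, labels)

print(clf.predict(["Table 1. Characteristics of the study sample by age and sex"]))
```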

The system's strong performance serves as a proof of concept for a more robust implementation. The metadata captured by such a system has the potential to improve researcher productivity by, for example, facilitating the creation of faceted search systems that enable precise document discovery on the basis of sample characteristics. More information about this project is available in the final report and course presentation.
2014/LIS590AG Svetlozara Stoytcheva Exploring Cultural Differences in Language Usage: The Case of Negation

Prior research suggests that speakers of Asian languages are more likely to use negation than English speakers. Our goal in this work is to explore this theory using empirical data from news stories. Specifically, we used natural language processing to compare negation usage in two newspapers: the New York Times and Xinhua News (English Edition). Overall, negation represents 0.55% of typed dependencies in the New York Times (versus 0.18% in Xinhua News). Additionally, 9.28% of sentences and 86.56% of articles in the New York Times contain one or more instances of negation (compared to 3.33% of sentences and 24.94% of articles in Xinhua News). In contrast to the prevalent theory, negation is approximately three times more common in the New York Times than in Xinhua News (English Edition). This work was presented as a research poster at the ASIS&T 2014 conference in Seattle, WA.
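A minimal sketch of the counting step, using spaCy's dependency parser as a stand-in for whatever parser produced the typed dependencies in the study; it assumes the en_core_web_sm model is installed, and the sample article texts are invented.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed to be installed

articles = [
    "The committee did not approve the proposal. Officials were unavailable for comment.",
    "The plan was approved without objection. Officials praised the outcome.",
]

neg_deps = total_deps = sents_with_neg = total_sents = 0
for doc in map(nlp, articles):
    for sent in doc.sents:
        total_sents += 1
        neg_in_sent = sum(token.dep_ == "neg" for token in sent)
        neg_deps += neg_in_sent
        total_deps += len(sent)          # one dependency arc per token
        sents_with_neg += neg_in_sent > 0

print(f"negation as a share of dependencies: {neg_deps / total_deps:.2%}")
print(f"sentences containing negation: {sents_with_neg / total_sents:.2%}")
```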