Towards More Intelligent Extraction of Information from Documents

Author
Du, Xinya
Abstract
The speed with which new online content becomes available has exacerbated the well-known problem of information overload and motivated the development of techniques that help people read and consume information. In NLP specifically, it has motivated researchers to design general models that can extract informative structured information from large collections of documents, such as information about events (what is happening around the world). The extracted structured information (e.g., participants, locations, objects involved in an event) is essential for a variety of downstream tasks such as knowledge base population, question answering, and document analysis. In recent years, with progress in deep learning research, the community has seen improvements on many sentence-level information extraction tasks such as named entity recognition and relation extraction. But less progress has been made on document-level extraction problems (where the elements to be extracted are spread across the document), even though document-level extraction is closer to what end users need. Existing methods largely ignore the document-level context and split the full extraction problem into separate tasks, which causes error propagation. In addition, they rely heavily on manually annotated resources developed for a fixed, domain-specific output schema and, as a result, are neither data-efficient nor general enough to handle unanticipated schema changes at deployment time. In this dissertation, we introduce models and frameworks to address these shortcomings of prior work. To better incorporate the document-level context, we propose a multi-granularity machine reader, which interprets sentences in the context of preceding and following sentences.
To help neural network-based models better capture the output structure and dependencies between events, we propose a generative learning-based framework for the extraction problem, which tackles this complicated task in one pass and avoids the error propagation introduced by traditional pipeline-based systems. Finally, we formulate the (zero-shot) event extraction problem as a question answering task and develop a question answering-based framework that allows the model to perform extraction for roles with few or no annotated examples. To further exploit the advantages of the QA-based framework, we propose a learning-based method that automatically generates synthetic question-answer pairs for data augmentation.
Description
155 pages
Date Issued
2021-08
Subject
Event Extraction; Information Extraction; Natural Language Processing
Committee Chair
Cardie, Claire T.
Committee Member
Hopcroft, John E.; Trummer, Immanuel
Degree Discipline
Computer Science
Degree Name
Ph. D., Computer Science
Degree Level
Doctor of Philosophy
Type
dissertation or thesis