21-07-2012, 04:17 PM
Automatic document summarization
Introduction
In this project, we will build a query-focused multi-document summarization system and a headline generator.
As this implies, there are two major components to the assignment. In the first two parts, we will build a
sentence extraction system. In the final part, we will build a headline generator that takes as input the
output of the sentence extraction system. All of the data is on CADE in cs5964/p3/data or on the web at
The data in this assignment is a bit more complicated than in assignments past. In the data directory,
you’ll find two sub-directories: docs and models, and two files, topics.pp and topics.ss. The docs
directory contains all the source documents we’ll use in this assignment. If you look in docs, you’ll find 100
subdirectories, plus a file called filelist. Each directory, e.g. d0615f, is a collection of documents on the
same topic. If we look in d0615f, we’ll find 50 files, some of which are called *.pp, some of which are *.ss
(actually, there’s a one-to-one mapping between the two). These are the actual documents for this document
collection. For instance, if we look at NYT19990828.0021.pp, we see that it’s a story about the Kansas
board of education. The difference between the .pp and the .ss files is that the .pp files are the result
of standard tokenization but no additional processing. The .ss files are the result of running the Porter
stemmer, lower-casing, and removing stop words. You’ll need both versions for the assignment.
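Since every document comes in both flavors, it can help to pair them up front. Here is one way that might look; the function and variable names are my own, and it simply assumes the directory layout described above:

```python
import os

def load_doc_pairs(docset_dir):
    """Pair each tokenized .pp file with its stemmed .ss twin.

    Assumes docset_dir (e.g. data/docs/d0615f) holds matching
    NAME.pp and NAME.ss files, as described above.
    """
    pairs = {}
    for fname in sorted(os.listdir(docset_dir)):
        base, ext = os.path.splitext(fname)
        if ext in (".pp", ".ss"):
            pairs.setdefault(base, {})[ext] = os.path.join(docset_dir, fname)
    # Keep only documents that have both versions.
    return {b: v for b, v in pairs.items() if len(v) == 2}
```

The one-to-one mapping between .pp and .ss files means the filter on the last line should normally drop nothing, but it guards against stray files.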
A Cosine-based Sentence Extractor
In the second part of this assignment, we will build a sentence extraction system based on cosine similarity.
At a high level, what we’ll do is the following. For each (query, document set) pair, we’ll compute the cosine
similarity between the query and each sentence in each document in the document set. We’ll then sort the
sentences by their cosine similarity (most similar first) and extract until we hit a 250 word limit. For cosine
similarity, we’ll just use tf (term frequency), not tf-idf.
The only small complication in this assignment is that we will compute cosine similarities with respect to
the .ss files, but the summaries we produce will be based on the .pp files. This is why you need both.
Here’s how I recommend solving this part:
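As one possible concrete shape for the pipeline (the function names here are mine, not the handout’s): score each sentence’s .ss tokens against the query with plain tf cosine, then emit the corresponding .pp sentences until the 250-word budget runs out:

```python
from collections import Counter
from math import sqrt

def tf_cosine(a_tokens, b_tokens):
    """Cosine similarity over raw term-frequency vectors (tf only, no idf)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def extract_summary(query_ss, sentences, limit=250):
    """Greedy extraction under a word budget.

    `sentences` is a list of (ss_tokens, pp_tokens) pairs: we score on the
    stemmed .ss tokens but emit the untouched .pp tokens.
    """
    ranked = sorted(sentences, key=lambda s: tf_cosine(query_ss, s[0]), reverse=True)
    out, used = [], 0
    for ss_toks, pp_toks in ranked:
        if used + len(pp_toks) > limit:
            break  # stop once the next sentence would exceed the budget
        out.append(" ".join(pp_toks))
        used += len(pp_toks)
    return out
```

Note that scoring happens on the stemmed, stop-worded .ss side while the output sentence comes from the .pp side, exactly the split described above.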
Making Headlines
In the final part of the assignment, we will learn to map the extracts produced by the cosine system into
headlines. The general recipe we will follow is: to produce a headline from an extract, loop over the words
in the extract (in order) and choose to keep or drop each one. The resulting headline will be scored by a
language model (that I’ll provide for you). To ensure that the resulting headline length is what we want (in
our case, 20 words), we’ll compose this with a pfst (a probabilistic finite-state transducer) that will only
accept strings of the proper length.
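One way to realize this search is a beam search over the keep/drop decisions, with a hard length cutoff standing in for the length-accepting pfst. The interfaces below (p_keep, bigram_logprob) are placeholders for the channel probabilities and the provided language model, not the handout’s actual API:

```python
import math

def make_headline(words, p_keep, bigram_logprob, length=20, beam_width=50):
    """Beam search over keep/drop decisions for each extract word, in order.

    p_keep(w): channel probability of keeping word w.
    bigram_logprob(prev, w): log-probability from the language model.
    Only headlines of exactly `length` words are accepted, mimicking the
    length-restricting pfst.
    """
    beam = [(0.0, "<s>", ())]  # (log-score, previous word, headline so far)
    for w in words:
        candidates = []
        for score, prev, hl in beam:
            # Option 1: drop w.
            candidates.append((score + math.log(max(1e-9, 1.0 - p_keep(w))), prev, hl))
            # Option 2: keep w, if the headline still has room.
            if len(hl) < length:
                candidates.append((score + math.log(max(1e-9, p_keep(w)))
                                         + bigram_logprob(prev, w), w, hl + (w,)))
        beam = sorted(candidates, key=lambda c: -c[0])[:beam_width]
    finished = [c for c in beam if len(c[2]) == length]
    best = max(finished or beam, key=lambda c: c[0])
    return list(best[2])
```

Widening the beam trades speed for search quality; with a real language model in the second term, the keep decisions are no longer independent across words.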
We’ll base our “should this word be included in a summary or not” probabilities on the first model summary
for each document set. For your convenience, I have extracted these into the data/truth file. This is in the
exact same format as the summaries you have been producing. Note: since we’re doing this in a channel
model, the probabilities we are computing are of the form: the probability of including a word in the
document, given the headline.
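One plausible estimator over the (extract, model-summary) pairs in data/truth is a smoothed relative frequency: a word in the extract counts as "kept" when it also appears in the model summary. This is a sketch under my own assumptions; the smoothing choice (add-alpha) is mine, not the handout’s:

```python
from collections import Counter

def estimate_keep_probs(pairs, alpha=1.0):
    """Estimate P(keep w) by counting, with add-alpha smoothing.

    `pairs` is a list of (extract_tokens, headline_tokens); a word in the
    extract counts as 'kept' when it also appears in the model headline.
    """
    kept, total = Counter(), Counter()
    for extract, headline in pairs:
        in_headline = set(headline)
        for w in extract:
            total[w] += 1
            if w in in_headline:
                kept[w] += 1
    return {w: (kept[w] + alpha) / (total[w] + 2.0 * alpha) for w in total}
```

Smoothing matters here: a word seen only once in the truth data would otherwise get a keep probability of exactly 0 or 1, which the log-scores in the search cannot tolerate.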
Improving Either Model
If you’re feeling adventurous, I invite you to apply some of the things we talked about in class to
improve either the headline generation system or the sentence extraction system. If you can non-trivially
improve on either of them (say, by a 10% relative improvement), I’ll give you some extra credit. You should
hand in a description of what you did, your code, the resulting output, and your scores.