21-07-2012, 04:17 PM
Automatic document summarization
Introduction
In this project, we will build a query-focused multi-document summarization system and a headline generator.
As this implies, there are two major components to the assignment. In the first two parts, we will build a
sentence extraction system. In the final part, we will build a headline generator that takes as input the
output of the sentence extraction system. All of the data is on CADE in cs5964/p3/data or on the web at
The data in this assignment is a bit more complicated than in assignments past. In the data directory,
you’ll find two sub-directories: docs and models, and two files, topics.pp and topics.ss. The docs
directory contains all the source documents we’ll use in this assignment. If you look in docs, you’ll find 100
subdirectories, plus a file called filelist. Each directory, e.g. d0615f, is a collection of documents on the
same topic. If we look in d0615f, we’ll find 50 files, some of which are called *.pp, some of which are *.ss
(actually, there’s a one-to-one mapping between the two). These are the actual documents for this document
collection. For instance, if we look at NYT19990828.0021.pp, we see that it’s a story about the Kansas
board of education. The difference between the .pp and the .ss files is that the .pp files are the result
of standard tokenization but no additional processing. The .ss files are the result of running the Porter
stemmer, lower-casing, and removing stop words. You’ll need both versions for the assignment.
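Since every document comes in both flavors, it can help to pair them up front. Here is one way that might look; the function and variable names are my own, and it simply assumes the directory layout described above:

```python
import os

def load_doc_pairs(docset_dir):
    """Pair each tokenized .pp file with its stemmed .ss twin.

    Assumes docset_dir (e.g. data/docs/d0615f) holds matching
    NAME.pp and NAME.ss files, as described above.
    """
    pairs = {}
    for fname in sorted(os.listdir(docset_dir)):
        base, ext = os.path.splitext(fname)
        if ext in (".pp", ".ss"):
            pairs.setdefault(base, {})[ext] = os.path.join(docset_dir, fname)
    # Keep only documents that have both versions.
    return {b: v for b, v in pairs.items() if len(v) == 2}
```

The one-to-one mapping between .pp and .ss files means the filter on the last line should normally drop nothing, but it guards against stray files.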
A Cosine-based Sentence Extractor
In the second part of this assignment, we will build a sentence extraction system based on cosine similarity.
At a high level, what we’ll do is the following. For each (query, document set) pair, we’ll compute the cosine
similarity between the query and each sentence in each document in the document set. We’ll then sort the
sentences by their cosine similarity (most similar first) and extract until we hit a 250 word limit. For cosine
similarity, we’ll just use tf (term frequency), not tf-idf.
The only small complication in this assignment is that we will compute cosine similarities with respect to
the .ss files, but the summaries we produce will be based on the .pp files. This is why you need both.
Here’s how I recommend solving this part:
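As one possible concrete shape for the pipeline (the function names here are mine, not the handout’s): score each sentence’s .ss tokens against the query with plain tf cosine, then emit the corresponding .pp sentences until the 250-word budget runs out:

```python
from collections import Counter
from math import sqrt

def tf_cosine(a_tokens, b_tokens):
    """Cosine similarity over raw term-frequency vectors (tf only, no idf)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def extract_summary(query_ss, sentences, limit=250):
    """Greedy extraction under a word budget.

    `sentences` is a list of (ss_tokens, pp_tokens) pairs: we score on the
    stemmed .ss tokens but emit the untouched .pp tokens.
    """
    ranked = sorted(sentences, key=lambda s: tf_cosine(query_ss, s[0]), reverse=True)
    out, used = [], 0
    for ss_toks, pp_toks in ranked:
        if used + len(pp_toks) > limit:
            break  # stop once the next sentence would exceed the budget
        out.append(" ".join(pp_toks))
        used += len(pp_toks)
    return out
```

Note that scoring happens on the stemmed, stop-worded .ss side while the output sentence comes from the .pp side, exactly the split described above.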
Making Headlines
In the final part of the assignment, we will learn to map the extracts produced by the cosine system into
headlines. The general recipe we will follow is: to produce a headline from an extract, loop over the words
in the extract (in order) and choose to keep or drop each one. The resulting headline will be scored by a
language model (that I’ll provide for you). To ensure that the resulting headline length is what we want (in
our case, 20 words), we’ll compose this with a pfst (a probabilistic finite-state transducer) that will only
accept strings of the proper length.
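One way to realize this search is a beam search over the keep/drop decisions, with a hard length cutoff standing in for the length-accepting pfst. The interfaces below (p_keep, bigram_logprob) are placeholders for the channel probabilities and the provided language model, not the handout’s actual API:

```python
import math

def make_headline(words, p_keep, bigram_logprob, length=20, beam_width=50):
    """Beam search over keep/drop decisions for each extract word, in order.

    p_keep(w): channel probability of keeping word w.
    bigram_logprob(prev, w): log-probability from the language model.
    Only headlines of exactly `length` words are accepted, mimicking the
    length-restricting pfst.
    """
    beam = [(0.0, "<s>", ())]  # (log-score, previous word, headline so far)
    for w in words:
        candidates = []
        for score, prev, hl in beam:
            # Option 1: drop w.
            candidates.append((score + math.log(max(1e-9, 1.0 - p_keep(w))), prev, hl))
            # Option 2: keep w, if the headline still has room.
            if len(hl) < length:
                candidates.append((score + math.log(max(1e-9, p_keep(w)))
                                         + bigram_logprob(prev, w), w, hl + (w,)))
        beam = sorted(candidates, key=lambda c: -c[0])[:beam_width]
    finished = [c for c in beam if len(c[2]) == length]
    best = max(finished or beam, key=lambda c: c[0])
    return list(best[2])
```

Widening the beam trades speed for search quality; with a real language model in the second term, the keep decisions are no longer independent across words.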
We’ll base our “should this word be included in a summary or not” probabilities on the first model summary
for each document set. For your convenience, I have extracted these into the data/truth file. This is in the
exact same format as the summaries you have been producing. Note: since we’re doing this in a channel
model, the probabilities we are computing are of the form: the probability of including a word in the
document, given the headline.
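One plausible estimator over the (extract, model-summary) pairs in data/truth is a smoothed relative frequency: a word in the extract counts as "kept" when it also appears in the model summary. This is a sketch under my own assumptions; the smoothing choice (add-alpha) is mine, not the handout’s:

```python
from collections import Counter

def estimate_keep_probs(pairs, alpha=1.0):
    """Estimate P(keep w) by counting, with add-alpha smoothing.

    `pairs` is a list of (extract_tokens, headline_tokens); a word in the
    extract counts as 'kept' when it also appears in the model headline.
    """
    kept, total = Counter(), Counter()
    for extract, headline in pairs:
        in_headline = set(headline)
        for w in extract:
            total[w] += 1
            if w in in_headline:
                kept[w] += 1
    return {w: (kept[w] + alpha) / (total[w] + 2.0 * alpha) for w in total}
```

Smoothing matters here: a word seen only once in the truth data would otherwise get a keep probability of exactly 0 or 1, which the log-scores in the search cannot tolerate.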
Improving Either Model
If you’re feeling adventurous, I invite you to apply some of the things we talked about in class to
improve either the headline generation system or the sentence extraction system. If you can non-trivially
improve on either of them (say, by a 10% relative improvement), I’ll give you some extra credit. You should
hand in a description of what you did, your code, the resulting output, and your scores.