Dialog Generation for Voice Browsing
ABSTRACT
In this paper we present our voice browser system, HearSay,
which provides people with visual disabilities efficient
access to the World Wide Web. HearSay includes content-based
segmentation of Web pages and a speech-driven interface
to the resulting content. In our latest version of
HearSay, we focus on general-purpose browsing. In this paper
we describe HearSay’s new dialog interface, which includes
several different browsing strategies, gives the user
control over the amount of information read out, and offers
several methods for summarizing the information in a part
of a Web page. HearSay selects from its collection of
presentation strategies at run time using classifiers trained
on human-labeled data.
INTRODUCTION
The World Wide Web has become an indispensable part
of our society, used for education, commerce, medicine, and
entertainment. However, the primary means of accessing the
Web is via browsers designed for visual modes of interaction
(e.g., Internet Explorer and Firefox). This limits access for
an entire community of people with visual disabilities. This
target population faces particular difficulties in accessing,
scanning and summarizing/distilling information on a Web
page or group of pages, filling out Web forms, and using
Web search facilities.
Creating audio-browsable Web content has become the
focus of intensive research efforts by industrial enterprises
(e.g., IBM) and standardization organizations (e.g., W3C).
New markup languages, such as VoiceXML [9], SALT [8] and
XHTML+Voice [11], and new voice browser systems, such
as IBM's WebSphere Voice Server, have emerged to facilitate
the creation, publishing, and exchange of audio-browsable
Web content. However, adapting to voice browser technology
remains a significant burden for many Web content
providers. Furthermore, while current screen readers and
voice browsers are useful for reading HTML documents, they
impose significant overhead on users.
HearSay Architecture
The architecture of the HearSay voice browser is shown in
Figure 1. It includes three basic components: the Browser
Object Interface, the Content Analyzer, and the Interface
Manager. The Browser Object Interface fetches pages from
Web servers. Special features include automatic form fill-out
and retrieval of pages pointed to by navigable links
that require the execution of JavaScript.
The Content Analyzer partitions an input Web page into
a logical structure of segments containing related content
elements by analyzing the page’s structure and content. The
output of the Content Analyzer is a partition tree of the
content in the input page.
Content Analysis
Here we describe the content analysis algorithm that HearSay
uses to partition a Web page into semantically related segments.
It is based on our previous work on structural and
semantic analysis of Web content [24, 29, 25, 26]. Content
analysis (see [24] for details) is based upon the observation
that semantically related items in content-rich Web pages
exhibit consistency in presentation style and spatial locality.
Exploiting this observation, a pattern mining algorithm
working bottom-up on the DOM tree of a Web page aggregates
related content into subtrees. Briefly, the algorithm initially
assigns types, reflecting similarities in structural presentation,
to leaf nodes in the DOM tree and subsequently
restructures the tree bottom-up using pattern mining on
type sequences. The DOM tree fragment for the page in
Figure 2(a) is shown in Figure 3(a). The type of a leaf node
is the concatenation of the HTML tags on its root-to-leaf path,
and that of an internal node (or partition) is composed from
the types of its child nodes. In the restructured tree, also
known as the partition tree, there are three classes of partition:
(i) group, which encapsulates a repeating pattern in the type
sequence of its immediate children; (ii) pattern, which captures
each individual occurrence of the repeat; and (iii) block, which
is neither a group nor a pattern. Intuitively, the subtree of a
group node denotes homogeneous content consisting of semantically
related items. For example, observe how all the headline news
items in the central part of Figure 2(a) are rooted under a
group node in the partition tree. The
leaf nodes of the partition tree correspond to the leaf nodes
in the original DOM tree and have content associated with
them. The partition tree resulting from structural analysis
of the DOM in Figure 3(a) is shown in Figure 3(b). The
partition tree represents a logical organization of the page’s
content.
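
To make the restructuring step concrete, here is a simplified Python
sketch of the analysis (our own simplification, not the actual algorithm
of [24]): leaf types are root-to-leaf tag paths, and a node whose
children's type sequence is an exact repetition of one subsequence
becomes a group of pattern nodes. The Dom and Part classes are
hypothetical minimal stand-ins, and the real pattern mining also handles
partial and irregular repeats that this sketch ignores.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Dom:                   # hypothetical minimal DOM node
        tag: str
        children: List["Dom"] = field(default_factory=list)
        text: str = ""

    @dataclass
    class Part:                  # node of the partition tree
        kind: str                # "group", "pattern", "block", or "leaf"
        type: str                # structural type signature
        children: List["Part"] = field(default_factory=list)
        text: str = ""

    def analyze(node: Dom, path: str = "") -> Part:
        """Build a partition tree for `node` (simplified from [24])."""
        t = path + "/" + node.tag
        if not node.children:    # leaf type = tags on the root-to-leaf path
            return Part("leaf", t, text=node.text)
        kids = [analyze(c, t) for c in node.children]
        types = [k.type for k in kids]
        n = len(types)
        # Pattern mining, simplified: is the child type sequence an exact
        # repetition (two or more times) of its first plen types?
        for plen in range(1, n // 2 + 1):
            if n % plen == 0 and types == types[:plen] * (n // plen):
                reps = [Part("pattern", "".join(types[:plen]), kids[i:i + plen])
                        for i in range(0, n, plen)]
                return Part("group", t, reps)
        return Part("block", t, kids)

For example, a list of three headline links (a ul with three li/a
children) collapses into a single group node with three pattern children,
mirroring how the headline news items in Figure 2(a) collect under one
group node.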
Breadth-First and Depth-First Navigation
In breadth-first navigation (BFN), all the child partitions of a
partition are presented to the user, who then selects one for
further browsing. This
strategy is straightforward and gives users an overview of the
available selections from which they can choose. However,
if a partition has many children, it can be hard for a user to
listen to and remember all the browsing choices. Consider
the category news section of the New York Times shown in
Figure 4(a). The partition tree of this particular section is
shown in Figure 4(b). There are in fact 20 child partitions
of this partition, too many for the user to remember [23].
Depth-first navigation (DFN) is used in cases like these. In DFN, each child of
a partition is presented individually, with the user given a
yes/no choice about whether to navigate into that partition
right after it is presented. An alternative to DFN would be
to use BFN with barge-in, so that a user could interrupt the
system with “navigate” right after hearing about a partition
of interest. However, with speech input the use of barge-in
leads to more speech recognition errors. In addition, with
DFN the user never has to listen to the children of a partition
more than once (because the system resumes presenting
children at the location where the user last made a choice),
whereas with BFN+barge-in, the user would have to listen
to the whole list of children of a partition at each return to
the root of the partition.
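
The contrast between the two strategies can be sketched as two dialog
loops over a partition's children, reusing the Part class from the
sketch above. Here speak and ask stand in for the text-to-speech and
speech recognition components; this is an illustration of the strategies
as described, not HearSay's actual dialog code.

    def summarize(part):
        # Placeholder one-line summary of a partition's content.
        return part.text or part.kind

    def bfn(partition, speak, ask):
        """Breadth-first: announce all children, then ask for one choice."""
        for i, child in enumerate(partition.children, start=1):
            speak(f"Option {i}: {summarize(child)}")
        choice = int(ask("Which option?"))        # e.g., a spoken digit
        return partition.children[choice - 1]

    def dfn(partition, speak, ask, resume_at=0):
        """Depth-first: offer children one at a time with a yes/no prompt."""
        for i in range(resume_at, len(partition.children)):
            speak(summarize(partition.children[i]))
            if ask("Navigate into this?") == "yes":
                return partition.children[i], i + 1   # saved resume point
        return None, 0

The resume_at parameter captures the property noted above: after
returning from a child, DFN continues presenting siblings from the saved
position rather than restarting the whole list.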
Navigation: Searching vs. Browsing
It is well known that user activities over Web pages during
navigation consist of two basic types: searching and browsing.
Previously, researchers have looked at how users switch
between strategies across sequences of Web pages [12, 33].
Here, we apply these ideas to navigation across partitions
(possibly within a single Web page).
On the New York Times homepage shown in Figure 6(a),
there are two main partitions (labeled 1 and 2). Partition
1 is the header of the page, while partition 2 contains
the main content. Partition 2 is further divided into three
partitions: a menu on the left-hand side, a set of headline
news items in the middle, and a set of other news stories
and related content on the right-hand side. A visitor to this
page looking for news is probably not interested in listening
to partition 1 or the menu in partition 2. Instead, the user
will search her way to partition 3, the headline news items. At this
point, her activity turns from searching to browsing, i.e.,
listening to the news stories.
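
One way to realize this mode switch over a partition tree is sketched
below, again reusing the Part class from the content-analysis sketch;
find_partition is a hypothetical stand-in for HearSay's search facility.
A keyword query first searches down to a matching partition, and the
dialog then switches to browsing, reading the located content
sequentially.

    def leaves(part):
        """Yield the leaf partitions under part."""
        if not part.children:
            yield part
        else:
            for child in part.children:
                yield from leaves(child)

    def find_partition(part, query):
        """Return the deepest partition whose leaf text mentions query."""
        if not any(query.lower() in leaf.text.lower() for leaf in leaves(part)):
            return None
        for child in part.children:
            hit = find_partition(child, query)
            if hit is not None:
                return hit
        return part

    def search_then_browse(root, query, speak):
        target = find_partition(root, query)   # searching...
        if target is None:
            speak(f"Nothing found for {query}.")
            return
        for leaf in leaves(target):            # ...then browsing
            speak(leaf.text)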
CONCLUSION AND FUTURE WORK
We have described our new HearSay Web browser for people
with visual disabilities. HearSay is designed for efficient,
broad-coverage voice-driven Web browsing. In this paper,
we focused on the general-purpose browsing and content presentation
strategies employed in HearSay. In future work,
we plan to conduct a complete evaluation of HearSay and
refine our browsing and presentation strategies accordingly.