Event Driven Web Video Summarization by Tag Localization and Key-Shot Identification
Abstract
With the explosive growth of web videos on the Internet,
it becomes challenging to efficiently browse hundreds or
even thousands of videos. When searching an event query, users
are often bewildered by the vast quantity of web videos returned by
search engines. Exploring such results will be time consuming and
it will also degrade user experience. In this paper, we present an approach
for event driven web video summarization by tag localization
and key-shot mining. We first localize the tags that are associated
with each video into its shots. Then, we estimate the relevance
of the shots with respect to the event query by matching the shot-level
tags with the query. After that, we identify a set of key-shots
from the shots that have high relevance scores by exploring the repeated
occurrence characteristic of key sub-events. Following the
scheme in [6] and [22], we provide two types of summaries, i.e.,
threaded video skimming and visual-textual storyboard. Experiments
are conducted on a corpus that contains 60 queries and more
than 10 000 web videos. The evaluation demonstrates the effectiveness
of the proposed approach.
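As a rough illustration of the relevance-estimation step described above, the sketch below scores each shot by the overlap between its localized tags and the event query terms. The function names, the data layout, and the simple overlap score are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of shot-level relevance estimation: video-level tags
# are assumed to have already been localized to individual shots; each shot
# is then scored by how many of its tags match the event query terms.

def shot_relevance(shot_tags, query_terms):
    """Score a shot by the overlap between its localized tags and the query."""
    tags = {t.lower() for t in shot_tags}
    query = {t.lower() for t in query_terms}
    if not query:
        return 0.0
    return len(tags & query) / len(query)

def rank_shots(shots, query_terms):
    """Return (shot_id, score) pairs sorted by descending query relevance.

    `shots` is a list of (shot_id, localized_tags) pairs.
    """
    scored = [(sid, shot_relevance(tags, query_terms)) for sid, tags in shots]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy example: three shots from two hypothetical videos.
shots = [("v1_s3", ["obama", "inauguration", "speech"]),
         ("v2_s1", ["cat", "funny"]),
         ("v1_s7", ["inauguration", "crowd"])]
ranking = rank_shots(shots, ["obama", "inauguration"])
# ranking[0] is ("v1_s3", 1.0): both query terms match its localized tags.
```

Shots whose scores fall above a threshold would then feed into the key-shot identification stage.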
INTRODUCTION
RECENT years have witnessed the explosion of multimedia
contents on the web. For example, YouTube, as
one of the primary video sharing websites, serves over 100 million
distinct videos and 65 000 uploads daily [1]. The growing
number of videos has motivated a real necessity to provide
effective tools to support retrieval and browsing. However,
given an event query, search engines may return thousands or
even more videos that are diverse and noisy. The evolution of
the entire event is not directly observable by simply watching
these videos. Even worse, some videos are indeed weakly or
not relevant to the query. These facts distract users from the
gist of the event and force them to painstakingly explore the
returned videos for an overview of the event.
RELATED WORK
This work aims to generate event driven web video summarization.
The workflow is similar to multimedia question answering:
given a query, return a video answer. In terms of mechanism,
it is related to video summarization, as we need to summarize
multiple web videos into a concise summary. Here we briefly introduce
multimedia question answering and video summarization.
We also introduce some related work on video tagging.
Multimedia Question Answering
Question answering (QA) research was introduced in the 1990s
and gained popularity following the TREC evaluations [7].
Given that the vast amount of information on the web is in
the form of multimedia, multimedia QA has been proposed
in [8], as a multimedia answer can be more
informative for many questions. An early video QA system
is presented in [9] for news videos. It adopts an architecture similar
to that of text-based QA, with video content analysis
performed at various stages of the QA pipeline to obtain precise
video answers. Following this work, several video QA systems
have been proposed, most of them relying on text
transcripts derived from video optical character recognition
(OCR) and automatic speech recognition (ASR) outputs [10].
Video Summarization
Video summarization can be categorized into static and
dynamic summaries [41], [43]. A static summary presents the
content on a static storyboard with an emphasis on its importance
or relevance, whereas a dynamic summary (also known
as video skimming) combines video and audio information to
generate a shorter video clip [14]. It should be highlighted that,
in some cases, these two categories can be converted into each
other. Since 2005, rushes summarization has been established as
a task of the TREC Video Retrieval Evaluation (TRECVID) [19].
With the advance of Web 2.0, more external information has become
available on the Internet [15]. In this scenario, Neo et al. [16]
enhance news video search by leveraging extractable video
semantics coupled with relevant external information resources
to support event-based analysis. In [17], a hierarchical video
content description and summarization strategy supported by
a joint textual and visual similarity is presented. The approach
adopts a video content description ontology and utilizes video
processing to construct semi-automatic video annotation for a
multi-layer video summary with different granularities. Wu et
al. [3] propose a method for threading and auto-documenting
news stories according to topic themes. Story clustering is
performed by exploiting the duality between stories and
textual-visual concepts through a co-clustering algorithm. While
most video summarization algorithms focus on processing
a single video, there also exist several research efforts on
multi-video summarization. Li and Merialdo [42] propose a
maximal marginal relevance method that iteratively selects
frames into the summary by considering both audio and visual
information.
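The maximal marginal relevance idea mentioned above can be sketched in a few lines: at each step, pick the candidate that best trades off relevance to the query against redundancy with what has already been selected. The weighting parameter, the toy similarity function, and all names below are illustrative assumptions rather than the method of [42].

```python
# Minimal sketch of maximal marginal relevance (MMR) selection for
# multi-video summarization: iteratively add the item maximizing
# lam * relevance - (1 - lam) * redundancy_with_selected.

def mmr_select(candidates, relevance, similarity, k, lam=0.7):
    """Select up to k items; lam weighs relevance against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            # Redundancy = similarity to the closest already-selected item.
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: "a" and "b" are near-duplicates, "c" is distinct.
rel = {"a": 0.9, "b": 0.85, "c": 0.3}
sim = lambda x, y: 0.95 if {x, y} == {"a", "b"} else 0.1  # toy similarity
picked = mmr_select(["a", "b", "c"], rel, sim, k=2, lam=0.5)
# With lam=0.5, "b" is penalized as a near-duplicate of "a", so "c" is
# selected instead: picked == ["a", "c"].
```

The same scheme applies whether the items are frames, shots, or whole videos; only the relevance and similarity functions change.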
CONCLUSION AND DISCUSSION
This paper presents an event driven web video summarization
approach based on tag localization and key-shot mining.
It first localizes the tags that are associated with each video into
its shots. The relevance scores of the shots with respect to the
event query are then estimated. After that, a set of key-shots
are identified by performing near-duplicate keyframe detection.
We propose a clustering method to reduce the time cost by
avoiding an exhaustive traversal of all keyframe pairs. Experiments
have demonstrated the effectiveness of the approach. We
have also studied two other key-shot identification methods, i.e.,
AP-based and search-based methods, but the experimental results
show that they perform much worse than the NDK-based
method.
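The clustering idea used to avoid an exhaustive traversal of all keyframe pairs can be illustrated with a simple bucketing scheme: quantize each keyframe's feature vector into a coarse signature, then run pairwise near-duplicate comparison only within each bucket. This is a generic blocking sketch under assumed toy features, not the paper's actual clustering method.

```python
# Illustrative sketch: bucket keyframes by a coarse quantized signature so
# that near-duplicate comparison is restricted to within-bucket pairs,
# instead of comparing all O(n^2) keyframe pairs.

from collections import defaultdict

def signature(feature, step=0.25):
    """Quantize a feature vector into a coarse, hashable signature."""
    return tuple(int(x / step) for x in feature)

def near_duplicate_pairs(keyframes, threshold=0.1):
    """keyframes: list of (frame_id, feature_vector) pairs.

    Returns frame-id pairs whose features agree within `threshold`
    (Chebyshev distance), comparing only frames in the same bucket.
    """
    buckets = defaultdict(list)
    for fid, feat in keyframes:
        buckets[signature(feat)].append((fid, feat))
    pairs = []
    for group in buckets.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                (a, fa), (b, fb) = group[i], group[j]
                dist = max(abs(x - y) for x, y in zip(fa, fb))
                if dist <= threshold:
                    pairs.append((a, b))
    return pairs

# Toy example: f1 and f2 are near-duplicates; f3 lands in a different bucket.
kfs = [("f1", (0.10, 0.20)), ("f2", (0.12, 0.22)), ("f3", (0.90, 0.80))]
pairs = near_duplicate_pairs(kfs)
# pairs == [("f1", "f2")]
```

The trade-off of such blocking is that near-duplicates straddling a bucket boundary can be missed; in practice this is mitigated by overlapping buckets or a second pass over neighboring signatures.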