The explosive growth and widespread accessibility of community contributed media content on the Internet have led to a surge
of research activity in multimedia search. Approaches that apply text search techniques for multimedia search have achieved
limited success as they entirely ignore visual content as a ranking signal. Multimedia search re-ranking, which reorders visual
documents based on multimodal cues to improve initial text-only searches, has received increasing attention in recent years.
Such a problem is challenging because the initial search results often have a great deal of noise. Discovering knowledge or visual
patterns from such a noisy ranked list to guide the re-ranking process is difficult. Numerous techniques have been developed
for visual search re-ranking. The purpose of this paper is to categorize and evaluate these algorithms. We also discuss relevant
issues such as data collection, evaluation metrics, and benchmarking. We conclude with several promising directions for future
research.
INTRODUCTION
The proliferation of capture devices and the explosive growth of online social media have led to
countless private image and video collections on local computing devices such as personal computers,
cell phones, and personal digital assistants, as well as huge and ever-increasing public media collections
on the Internet [Boll 2007]. For example, Flickr [Flickr], the most popular photo sharing site, reached
five billion photo uploads in 2011, with 3-5 million new photos uploaded daily [Kennedy et al.
2007]. Facebook held more than 60 billion photos shared by its communities as of 2011 [Facebook],
while YouTube streams more than one billion videos per day worldwide [YouTube].
Such explosive growth and widespread accessibility of visual content have led to a surge of research
activity in visual search. The key problem is retrieving visual documents (such as images, video clips,
and Web pages containing images or videos) that are relevant to a given query or user search intention
from a large-scale database. In the last decade, visual search has attracted a great deal of attention, though it has been studied since the early 1990s (referred to as content-based image/video retrieval
[Lew et al. 2006], [Li et al. 2007], [Rui et al. 1999]). Many research demonstrations and commercial
applications have been developed. Due to the great success of text search, most popular image and
video search engines, such as Google [Google], Bing [Bing], Yahoo! [Yahoo!], and so on, build upon text
search techniques by using the non-visual information (such as surrounding text and user-provided
tags) associated with visual content. This kind of multimedia search approach cannot always achieve
satisfactory results, as it entirely ignores the visual content as a ranking signal [Chang et al. 2007],
[Datta et al. 2008], [Hauptmann et al. 2008a], [Hsu et al. 2007], [Snoek and Worring 2009].
To address these limitations, multimedia search re-ranking has received increasing attention in
recent years. It is defined as the reordering of visual documents, based on the information manifested
in the initial search results or in a knowledge base, to improve search performance. This information
consists of multimodal cues: knowledge or specific patterns mined from the initial ranked list, query
examples, or any available auxiliary knowledge.
From another perspective, re-ranking can be viewed as a post-process of core search. Figure 1 shows a
general process for multimedia search re-ranking. A visual document might be an image, a video clip,
or a Web page containing images or videos. Given an initial ranked list of visual documents returned
by any search approach, visual search re-ranking improves search performance by reordering these
visual documents based on multimodal cues. For example, in a real image search engine, the initial
text search results can be re-ranked according to visual similarity to a given example [Bing],
[Google] or color style (e.g., full color or grayscale). Another example, in an image retrieval system, is to re-rank
or filter the images according to some predefined categories [Fergus et al. 2004], [Wnuk and Soatto
2008]. In the setting of object retrieval, a geometric verification step is usually introduced to re-rank
the results returned by the bag-of-words (BoW) model by checking spatial configurations
[Philbin et al. 2007], [Jegou et al. 2010]. The challenges associated with multimedia search re-ranking
can be attributed to the following factors:
—Unsatisfying initial search performance. The initial search results usually contain only a small portion
of relevant documents. For example, the best automatic video search achieved only about 10%
mean average precision (MAP) in TRECVID 2008 [TRECVID].^1 The most popular BoW model
(without re-ranking) for object retrieval achieves only about 30%-70% MAP on the Oxford Buildings
dataset, depending on codebook training and visual features [Chum et al. 2007], [Chum et al.

^1 This evaluation was conducted over 380 hours of video with 24 queries in total, which indicates that on average
between 2 and 3 of the top 10 returned video clips are estimated to contain the desired content.
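For reference, the MAP metric quoted above can be sketched in a few lines. This is a minimal illustration, assuming every relevant document appears somewhere in the ranked list (the standard definition divides by the total number of relevant documents in the collection); the function names are ours, not the survey's:

```python
def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance is a list of 0/1 relevance
    judgments in rank order; we average precision at each relevant hit."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)  # precision at this rank
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(per_query_judgments):
    """MAP: the mean of AP over a set of queries."""
    return sum(average_precision(r) for r in per_query_judgments) / len(per_query_judgments)
```

With judgments [1, 0, 1], for instance, AP averages precision 1/1 at rank 1 and 2/3 at rank 3.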
MULTIMEDIA SEARCH RE-RANKING: METHODOLOGIES
In this section, we review existing approaches to visual search re-ranking. We first position re-ranking
as a key component in a typical visual search system and provide a Bayesian formulation as an overview
of the re-ranking problem. We then classify re-ranking approaches into four categories and discuss each
of them in detail. We also discuss other recent techniques related to re-ranking, such as query suggestion
and user interfaces, as well as re-ranking methods in the text domain.
2.1 The Role of Re-ranking in Multimedia Search
A typical visual search system consists of several components, including query analysis, an index
module, uni-modal search (e.g., text, visual, and concept searches), and re-ranking. Figure 3 shows
a generic visual search framework. Usually, the query in a visual search system consists of
a textual query (e.g., “find shots in which a boat moves past”) and/or a set of query examples
(e.g., objects, images, or video keyframes or clips).^2 Via query analysis, the meaningful or important
keywords and their expanded related terms are obtained from the textual query. Meanwhile, the
visual examples can be mapped to some relevant high-level concepts by the pre-trained classifiers for concept-based search (e.g., boat, water, and outdoor). Specifically, the confidence scores from those classifiers
can be treated as the weights for the corresponding concepts (i.e., hidden text), and further used
in a text-like search (e.g., an inverted index based on term and document frequency) or as a feature
vector in a concept space for searching via query-by-example (QBE). Moreover, a set of low-level visual
features (global and local features) is extracted to represent these query examples for visual-based
search. These multimodal queries are fed into individual search models, such as text, concept, and
visual-based searches, respectively. For example, a text-based search may use the speech recognition
transcript, the closed caption available from the program channel, or the recognized captions embedded
in video frames through Optical Character Recognition (OCR). The confidence vectors from concept
detectors or low-level feature vectors can be used in the same way as the QBE or tf-idf schemes [Baeza-Yates
and Ribeiro-Neto 1999] for searching. More comprehensive introductions to content-based image and
video retrieval can be found in [Datta et al. 2008], [Kennedy et al. 2008a], [Lew 2000], [Lew et al. 2006],
[Smeulders et al. 2000], [Snoek and Worring 2009], [Yan and Hauptmann 2007b], [Philbin et al. 2007].
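As a concrete illustration of QBE in the concept space described above, the sketch below ranks database documents by cosine similarity between concept-confidence vectors. The document IDs and vectors are hypothetical, and a real system would typically use an index rather than this brute-force scan:

```python
import math

def cosine(u, v):
    """Cosine similarity between two confidence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def concept_qbe(query_concepts, database):
    """Query-by-example in concept space: database maps a document id to
    its concept-score vector; return (id, score) pairs, best first."""
    scored = [(doc_id, cosine(query_concepts, vec))
              for doc_id, vec in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For a query mapped to concepts (boat, water, outdoor) with confidences [0.9, 0.8, 0.1], documents whose detector outputs emphasize the same concepts rank first.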
Based on these initial search results, as well as some knowledge, a re-ranking module is applied to
aggregate the search results and reorder the initial document list to improve the search performance.
In this paper, re-ranking is defined as improving the initial text-based search accuracy by considering
the other search results and prior knowledge. We can observe from Figure 3 that re-ranking plays
a key role in the visual search framework by improving the initial search performance.
2.2 A Bayesian View of Multimedia Search Re-ranking
Visual re-ranking problem can be traced back to the late 1990s when researchers started focusing on
improving content-based image retrieval (CBIR) results via relevance feedback techniques [Benitez
et al. 1998], [Rui et al. 1998], [Zhou and Huang 2002], [Zhou and Huang 2003]. It emerged as an independent
research topic and attracted increasing attention beginning in the early 2000s. In the most
common formulation, the re-ranking problem can be reduced to a problem of re-estimating relevance
for each document that has been ranked in an initial search result (e.g., a ranked list of documents
searched by a text-only approach, a ranked list of objects returned by a bag-of-words model, etc.). Intuitively,
this estimation is usually based on some knowledge mined from the initial search results or
the queries, prior knowledge from the Web, some domain-specific knowledge databases, or the interactions of users. Once we can estimate the relevance of each document, we can re-rank these documents
according to their relevance scores and obtain the best ranked list of these documents.
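This estimate-then-sort view can be sketched with one common heuristic, pseudo-relevance feedback: treat the top of the initial list as pseudo-positive examples and score every document by its mean visual similarity to them. The feature vectors and helper names below are illustrative, not the survey's prescribed method:

```python
import math

def sim(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def prf_rerank(ranked_ids, features, top_k=3):
    """Pseudo-relevance-feedback re-ranking: the top_k initially ranked
    documents serve as pseudo-positives; each document is re-scored by
    its mean similarity to them, then the list is re-sorted."""
    pseudo = [features[d] for d in ranked_ids[:top_k]]
    score = {d: sum(sim(features[d], p) for p in pseudo) / len(pseudo)
             for d in ranked_ids}
    return sorted(ranked_ids, key=score.get, reverse=True)
```

Documents that look like the (noisy) top of the initial list move up; visual outliers sink, which is exactly why a very noisy initial list makes this step risky.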
More formally, the re-ranking problem can be formulated as finding the optimal ranked list from the
perspective of Bayesian theory as follows. Let D = {d_i}_{i=1}^{N} denote the collection of documents to be
ranked or re-ranked, where d_i is a single document (such as an object, an image, a video, or a clip) and
N is the number of documents in D. Let r_0 denote the initial ranked list and r the best ranked list.
In these lists, each document d_i has a corresponding ranking order or relevance score r_i with respect
to the query q. Note that q may be a set of terms, or visual examples (i.e., a set of objects/images or
video clips), or any combination of them. Let R denote the set of all possible ranked lists (r_0, r ∈ R) and
r = [r_1, r_2, ..., r_N]^T, where r_i (0 ≤ r_i ≤ 1) is the relevance score for the i-th visual document d_i. If we
only consider the rank order of each document in r, then the size of R is N!, which can be huge when
N is large. Let K denote the knowledge for guiding the re-ranking process. From a probabilistic
view, given the initial ranked list r_0 and prior knowledge K, re-ranking can be formulated as deriving
the optimal list r* with the maximum a posteriori probability,

    r* = arg max_{r ∈ R} p(r | r_0, K).    (1)
According to Bayes’ formula, the posterior p(r | r_0, K) is proportional to the product of the conditional
prior probability and the likelihood,

    p(r | r_0, K) = p(r | K) p(r_0 | r, K) / p(r_0 | K) ∝ p(r | K) p(r_0 | r, K),    (2)
where p(r | K) is the conditional prior of the ranked list given the prior knowledge, and p(r_0 | r, K)
is the likelihood of the initial ranked list given the “true” list. Intuitively, the conditional prior
p(r | K) expresses how consistent the “true” list r is with the knowledge K. In other words, the
prior knowledge acts as the basis of re-ranking, ensuring maximal consistency between
the ranked list and the knowledge. For example, p(r | K) can be modeled by the visual consistency between
the reranked list and the knowledge in terms of some dominant patterns. The likelihood p(r_0 | r, K) expresses
the probability that the initial ranked list aligns with the “true” list and the knowledge. For
example, it can be modeled based on the disagreement between the reranked and initial lists. Note that
r_0 is obtained from text-based search; thus we can assume that r_0 is conditionally independent of K
given r [Tian et al. 2008]. Figure 4 shows the graphical model representation of the relationship between K,
r, and r_0. Then we can rewrite the likelihood as:

    p(r_0 | r, K) = p(r_0 | r).    (3)
By plugging equation (3) into (2), we obtain the formulation of re-ranking from the Bayesian
perspective,

    r* = arg max_{r ∈ R} p(r | K) p(r_0 | r).    (4)
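Since searching all N! permutations in Equation (4) is intractable, a common simplification is a point-wise surrogate: score each document independently by a weighted combination of a consistency term (standing in for the prior p(r | K)) and its initial score (standing in for the likelihood p(r_0 | r)), then sort. This is a sketch under those assumptions, not a method the survey prescribes:

```python
def bayesian_rerank(doc_ids, consistency, initial_score, alpha=0.5):
    """Point-wise surrogate for Eq. (4): each document gets a score
    alpha * consistency-with-knowledge + (1 - alpha) * initial score,
    and the list is re-sorted.  alpha trades off the two (somewhat
    conflicting) objectives of the Bayesian formulation."""
    score = {d: alpha * consistency[d] + (1 - alpha) * initial_score[d]
             for d in doc_ids}
    return sorted(doc_ids, key=score.get, reverse=True)
```

A large alpha trusts the knowledge K and reorders aggressively; a small alpha stays close to the initial text-based list.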
Equation (4) indicates that optimizing the reranked list involves two somewhat conflicting
objectives, i.e., maximizing consistency with the knowledge and minimizing disagreement with
the initial ranked list. Thus, the central problem of re-ranking is the modeling of consistency between