16-01-2013, 12:47 PM
LEARNING SIMILARITY METRICS FOR EVENT IDENTIFICATION IN SOCIAL MEDIA
INTRODUCTION
The ease of publishing content on social media sites brings to the Web an ever-increasing amount of content captured during, and associated with, real-world events. Sites such as Flickr, YouTube, and Facebook host user-contributed content for a wide variety of events. These range from widely known events, such as presidential inaugurations, to smaller, community-specific events, such as annual conventions and local gatherings. By automatically identifying these events and their associated user-contributed social media documents, which is the focus of this paper, we can enable powerful local event browsing and search, to complement and improve the local search tools that Web search engines provide. In this paper, we address the problem of how to identify events and their associated user-contributed documents over social media sites.
In one scenario, consider a person who is thinking of attending "All Points West," an annual music festival that takes place in early August in Liberty State Park, New Jersey. Prior to purchasing a ticket, this person could search the Web for relevant information, to make an informed decision. Unfortunately, Web search results are far from revealing for this relatively minor event: the event's website contains marketing materials, and traditional news coverage is sparse. Overall, these Web search results do not convey what this person should expect to experience at the event. In contrast, user-contributed content may provide a better representation of prior instances of the event from an attendee's perspective. This user-centric perspective, together with coverage of a wide span of events of varying type and scale, makes social media sites a valuable source of event information.
Identifying events and their associated documents over social media sites is a challenging problem, as social media data is inherently noisy and heterogeneous.
In our "All Points West" example, some photographs might contain the event's name in the title, description, or tag fields, while many others might not be as clearly linked, with titles such as "Radiohead" or "Metric" and descriptions such as "my favorite band." Photographs geo-tagged with the coordinates of Liberty State Park, and taken on August 8, 2008, are likely to be related to this event, regardless of their textual description, but not every photograph taken on August 8, 2008, or titled "Radiohead," necessarily corresponds to this event. Overall, social media documents generally include information that is useful for identifying the associated events, if any, but this information is far from uniform in quality and might often be misleading or ambiguous.
PROBLEM DEFINITION
Given a set of social media documents associated with events, the problem that we address in this paper is how to identify the events that are reflected in the documents (e.g., President Obama's inauguration, or Madonna's October 6, 2008 concert in Madison Square Garden), and to correctly assign the documents that correspond to each event. We cast our problem as a clustering problem over social media documents (e.g., photographs, videos, social network group pages), where each document includes a variety of "context features" with information about the document. Some of these features (e.g., title, description, tags) are manually provided by users, while other features (e.g., upload or content creation time) are automatically generated.
Problem Definition. Consider a set of social media documents where each document is associated with an (unknown) event. Our goal is to partition this set of documents into clusters such that each cluster corresponds to all documents that are associated with one event.
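To make the definition concrete, the following is a minimal Python sketch of what a document with context features and a valid partition into event clusters could look like. The `Document` class, its field names, the `is_partition` helper, and the sample data are all hypothetical illustrations, not part of the paper; the sketch only checks the structural requirement that every document lands in exactly one cluster, and says nothing about how the clusters are found.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Document:
    """One social media document; all field names are illustrative assumptions."""
    doc_id: str
    title: str = ""                       # user-provided
    description: str = ""                 # user-provided
    tags: frozenset = frozenset()         # user-provided
    created: Optional[datetime] = None    # automatically generated
    geo: Optional[tuple] = None           # (latitude, longitude), if geo-tagged


def is_partition(documents, clusters):
    """Return True if the clusters are non-empty, disjoint, and cover all documents."""
    all_ids = {d.doc_id for d in documents}
    seen = set()
    for cluster in clusters:
        ids = {d.doc_id for d in cluster}
        if not ids or ids & seen:
            # An empty cluster or a document assigned twice breaks the partition.
            return False
        seen |= ids
    return seen == all_ids


# Made-up sample data: two photographs from one event, one from another.
docs = [
    Document("p1", title="All Points West", created=datetime(2008, 8, 8)),
    Document("p2", title="Radiohead", created=datetime(2008, 8, 8)),
    Document("p3", title="Inauguration", created=datetime(2009, 1, 20)),
]
clusters = [[docs[0], docs[1]], [docs[2]]]
print(is_partition(docs, clusters))
```

A clustering that omits a document or assigns one document to two clusters fails this check, which mirrors the requirement that each cluster correspond to all documents of exactly one event.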
As the formal definition of "event," we adopt the version used for the Topic Detection and Tracking (TDT) event detection task over broadcast news.