09-09-2016, 01:48 PM
ABSTRACT
Twitter has become one of the most important communication channels owing to its ability to provide the most up-to-date and newsworthy information. Given the wide use of Twitter as a source of information, finding a tweet of interest to a user among the mass of tweets is challenging. With the huge number of tweets sent per day by hundreds of millions of users, information overload is inevitable. To let users easily reach the information they are interested in, tweet recommendation is an essential task. To extract information from this large volume of tweets, Named Entity Recognition (NER) is already being used by researchers. NER methods commonly applied to formal texts such as newspaper articles are built upon locally extracted linguistic features. However, given the short and noisy nature of tweets, the performance of these methods on tweets is inadequate, and new approaches have to be developed to deal with this type of data. Recently, segment-based tweet representation for extracting named entities has proven its validity in the NER field. Along with named entities extracted from tweets via tweet segmentation, a user's retweet and mention history and followed users are also considered strong indicators of interest, and a model representing user interest is generated from them. The main goal of this study is to reduce Twitter users' effort to access tweets carrying the information they are interested in, and a tweet recommendation approach based on a user interest model built from named entities is presented.
INTRODUCTION
Twitter has become a place to share and disseminate timely information. Organizations have been reported to create and monitor targeted Twitter streams to collect and understand users' opinions. A targeted Twitter stream is usually constructed from tweets filtered by predefined selection criteria (e.g., tweets published from a geographical region, or tweets matching predefined keywords). Due to the invaluable business value of the timely information in these tweets, it is imperative to understand tweets' language for a large body of downstream applications, such as named entity recognition (NER), event detection and summarization, opinion mining, sentiment analysis, and many others.
The error-prone and short nature of tweets often makes word-level language models for tweets less reliable: the length of a tweet is limited (i.e., 140 characters), writing styles are not restricted, and tweets can contain grammatical errors, spelling mistakes, and informal abbreviations. For example, given the tweet "He call me, no answer. my phone in the bag, i eatin", there is no clue to guess its true theme if word order is disregarded (i.e., under a bag-of-words model). The situation is further complicated by the limited context provided by the tweet; that is, different readers could derive more than one interpretation of this tweet if it is considered in isolation. On the other hand, despite the noisy nature of tweets, the core semantic information is well preserved in tweets in the form of named entities or semantic phrases.
We focus on the task of tweet segmentation. The goal of this task is to split a tweet into a sequence of consecutive n-grams (n ≥ 1), each of which is called a segment. A segment can be a named entity (e.g., a movie title "finding nemo"), a semantically meaningful information unit (e.g., "officially released"), or any other type of phrase that appears "more than by chance". Because these segments preserve the semantic meaning of the tweet more precisely than each of its constituent words does, the topic of the tweet can be better captured in subsequent processing. For instance, segment-based representation could enhance the extraction of geographical locations from tweets through a segment such as "circle line". In fact, segment-based representation has shown its effectiveness over word-based representation in the tasks of named entity recognition and event detection. A named entity is a valid segment, but a segment is not necessarily a named entity. To achieve high-quality tweet segmentation, we propose a generic tweet segmentation framework named HybridSeg.
HybridSeg learns from both global and local contexts, and has the ability to learn from pseudo feedback.
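To make the segmentation task concrete, the following is a minimal sketch of splitting a tweet into segments by maximizing the total "stickiness" of the chosen n-grams with dynamic programming. The `stickiness` function and its toy phrase table are hypothetical stand-ins for the global-context scores (e.g., web n-gram statistics) that HybridSeg actually uses; this is an illustration of the framing, not the paper's algorithm.

```python
# Minimal dynamic-programming tweet segmenter (illustrative sketch).
# PHRASE_SCORES and stickiness() are hypothetical stand-ins for the
# global-context scores (e.g., Microsoft Web N-Gram probabilities).

PHRASE_SCORES = {                    # toy phrase table
    "finding nemo": 5.0,
    "officially released": 4.0,
    "circle line": 4.5,
}

def stickiness(segment):
    """Score how likely a word sequence is a meaningful unit."""
    return PHRASE_SCORES.get(segment, 1.0 if " " not in segment else 0.1)

def segment_tweet(words, max_len=3):
    """Split words into segments maximizing total stickiness."""
    n = len(words)
    best = [0.0] * (n + 1)   # best[i]: best score for words[:i]
    back = [0] * (n + 1)     # back[i]: start index of the last segment
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            seg = " ".join(words[j:i])
            score = best[j] + stickiness(seg)
            if score > best[i]:
                best[i], back[i] = score, j
    # Recover the segmentation by walking the back-pointers.
    segs, i = [], n
    while i > 0:
        segs.append(" ".join(words[back[i]:i]))
        i = back[i]
    return segs[::-1]

print(segment_tweet("just watched finding nemo on the circle line".split()))
```

With the toy scores above, "finding nemo" and "circle line" emerge as segments because keeping them whole scores higher than splitting them into single words.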
PROPOSED WORK:
The aim of this work is to achieve high-quality tweet segmentation. Tweets are posted for information sharing and communication, and named entities and semantic phrases are well preserved in them. The global context derived from Web pages (e.g., the Microsoft Web N-Gram corpus) or Wikipedia therefore helps identify meaningful segments in tweets. The method realizing the proposed framework that relies solely on global context is denoted HybridSegWeb. However, tweets are highly time-sensitive, so many emerging phrases like "She Dancin" cannot be found in external knowledge bases. Considering the large number of tweets containing such a phrase published within a short time period (e.g., a day), it is not difficult to recognize "She Dancin" as a valid and meaningful segment. We therefore investigate two local contexts, namely local linguistic features and local collocation. Observe that tweets from many official accounts of news agencies, organizations, and advertisers are likely to be well written. The well-preserved linguistic features in these tweets facilitate named entity recognition with high accuracy, and each named entity is a valid segment. The method utilizing local linguistic features is denoted HybridSegNER. It obtains confident segments based on the voting results of multiple off-the-shelf NER tools.
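The voting step of HybridSegNER can be sketched as follows: each off-the-shelf NER tool emits a set of entity spans, and only spans confirmed by enough tools are kept as confident segments. The tool outputs below are hypothetical; the actual tools and vote threshold are implementation choices of the paper.

```python
from collections import Counter

def vote_entities(tool_outputs, min_votes=2):
    """Keep spans tagged as named entities by at least min_votes tools.

    tool_outputs: a list of entity sets, one per off-the-shelf NER tool.
    """
    counts = Counter(e for ents in tool_outputs for e in set(ents))
    return {e for e, c in counts.items() if c >= min_votes}

# Hypothetical outputs from three NER tools on the same tweet stream.
tool_a = {"she dancin", "new york"}
tool_b = {"new york"}
tool_c = {"she dancin", "new york", "monday"}

confident = vote_entities([tool_a, tool_b, tool_c])
print(sorted(confident))   # spans confirmed by at least two tools
```

Raising `min_votes` trades recall for precision: with three tools and `min_votes=3`, only "new york" would survive in this example.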
RELATED WORK:
Before developing the tool, it is necessary to determine the time factor, economy, and company strength. Once these requirements are satisfied, the next step is to determine which operating system and language can be used for developing the tool. Once the programmers start building the tool, they need a lot of external support, which can be obtained from senior programmers, from books, or from websites. These considerations are taken into account before building the proposed system.
Combined analysis of Named-Entity Recognition and Entity Linking. A. Sil showed that the MSNBC dataset can be used to combine a Named-Entity Recognition (NER) system and an Entity Linking (EL) system to make joint predictions, where NER finds the names present in the text, and those names can then be connected to entries in structured or semi-structured repositories such as Wikipedia. Entity Linking is the process of determining whether a name that appears in text refers to an entity in an already recognized set of named entities, such as a relational database or the set of articles in Wikipedia. The NER system cannot connect to the EL system directly when it fails to detect any mentions. The paper clearly shows that NER is performed first and is followed by Entity Linking; the MSNBC dataset was used to analyze the results. Using NER reduces errors by 60%, and using EL reduces errors by 68%.
Emoticon-Smoothed Language Models for Twitter Sentiment Analysis. The author, K. L. Liu, mainly focuses on a machine-learning-based text classification problem: identifying the attitude or opinion expressed in tweets. The process of identifying opinions ("what others think") is called Sentiment Analysis (SA). For this, [5] some approaches use manually labeled data to train fully supervised models, while others use noisy labels, such as emoticons and hashtags, for model training. In this paper, the author found that combining manually labeled data and noisily labeled data is the best strategy. Manually labeled data is used to train the language model, which is then smoothed using emoticons. An emoticon can be either positive or negative, where the negative emoticon is not considered. These methods use hashtags (e.g., #buy) or smileys to identify the sentiment type. However, the accuracy of such methods alone is not satisfactory due to the noisy labels.
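The noisy-labeling idea described above can be illustrated with a small sketch: tweets containing an unambiguous emoticon receive a provisional sentiment label that can later be combined with manually labeled data. The emoticon lists here are illustrative assumptions, not the ones used in the cited work.

```python
# Illustrative emoticon-based noisy labeling (not the cited paper's model).
POSITIVE = {":)", ":-)", ":D"}   # assumed positive emoticons
NEGATIVE = {":(", ":-(", ":'("}  # assumed negative emoticons

def noisy_label(tweet):
    """Assign a noisy sentiment label from emoticons, or None if ambiguous."""
    tokens = tweet.split()
    pos = any(t in POSITIVE for t in tokens)
    neg = any(t in NEGATIVE for t in tokens)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None   # no emoticon, or conflicting emoticons

print(noisy_label("love this song :)"))   # positive
print(noisy_label("my phone died :("))    # negative
```

Such labels are noisy by construction (sarcasm, mixed emoticons), which is why the cited work smooths a manually trained model with them rather than trusting them alone.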
Opinion Summarization using hashtags and human-annotated semantic tags in Twitter. The author, X. Meng, examined how automatic opinion summarization poses a great challenge to summarization systems when considering users' opinions or attitudes posted as tweets on Twitter. The main focus of the paper is identifying topic-specific opinion summaries, e.g., for [6] celebrities and brands. Hashtags and human-annotated semantic tags are used to calculate the similarity among tweets and thus provide better interpretation and representation. Next, #hashtags are grouped into coherent topics by an affinity algorithm. The authors then focus on entity-related opinions; i.e., when an entity is provided, the opinions about it are collected. Finally, a summary is generated from the different opinions, based on the various topics and opinions produced by Twitter users.
IMPLEMENTATION OF SYSTEM
The system mainly involves six phases: data gathering, knowledge base construction, data preprocessing, named entity recognition, user interest model generation based on named entities, and finally recommendation.
Data Gathering is the process of collecting a Twitter user's data, including the user's friends' posts as well as the user's own posts. In this phase, the user-friend relationship is also extracted, and a relative ranking of friends is generated as output.
Knowledge Base Construction is the process of generating a graph-based knowledge base of Turkish Wikipedia article titles and the links between them, in order to validate the named-entity candidates generated as output of the Named Entity Recognition phase. Keeping this knowledge base up to date is also part of this phase. Although the other phases follow each other iteratively, with one's output being the next's input, this phase is independent and conducted in parallel.
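A minimal sketch of such a graph-based knowledge base follows: article titles map to the titles they link to, a candidate segment is validated by title lookup, and the link structure supports relatedness checks. The titles and links below are illustrative placeholders, not actual Turkish Wikipedia data.

```python
# Toy graph-based knowledge base: article title -> set of linked titles.
# Entries are illustrative, not real Turkish Wikipedia content.
KB = {
    "istanbul": {"turkey", "bosphorus"},
    "turkey": {"istanbul", "ankara"},
    "galatasaray": {"istanbul", "football"},
}

def validate_candidate(candidate, kb):
    """Accept a candidate segment as a named entity iff it is a KB title."""
    return candidate.lower() in kb

def linked(a, b, kb):
    """Check whether article a links to article b (graph edge lookup)."""
    return b.lower() in kb.get(a.lower(), set())

print(validate_candidate("Istanbul", KB))   # True: a known article title
print(validate_candidate("eatin", KB))      # False: not a title
print(linked("galatasaray", "istanbul", KB))
```

Keeping the base up to date would amount to periodically re-crawling titles and refreshing the link sets, as the phase description above requires.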
Data Preprocessing includes removing unnecessary parts of tweet texts such as mentions, hashtags, smileys, vocatives, and links. Since an informal writing style is commonly adopted in tweets, this phase is also responsible for normalizing the tweet text, e.g., removing unnecessarily repeated characters, replacing slang words, and correcting asciification-related problems.
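The cleaning steps listed above can be sketched with a few regular expressions. This is a minimal, assumed pipeline (links, then mentions/hashtags, then repeated-character collapsing); the actual system also handles smileys, vocatives, slang, and asciification, which are omitted here.

```python
import re

def preprocess(tweet):
    """Strip links, mentions, and hashtags; collapse repeated characters."""
    tweet = re.sub(r"https?://\S+", " ", tweet)    # remove links
    tweet = re.sub(r"[@#]\w+", " ", tweet)         # remove mentions/hashtags
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)   # "sooooo" -> "soo"
    return re.sub(r"\s+", " ", tweet).strip().lower()

print(preprocess("@ali Sooooo good!! #music http://t.co/xyz"))  # -> "soo good!!"
```

The repeated-character rule keeps at most two copies of a letter rather than one, since legitimate words (e.g., "good") contain doubled letters.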
Named Entity Recognition follows the data preprocessing phase. In this phase, tweet segmentation is carried out on preprocessed tweets by means of global context, and segments are generated as candidate named entities. These candidates are then validated as named entities, or discarded, using the previously constructed knowledge base of Turkish Wikipedia article titles.
User Interest Model Generation is the next phase. Using the named entities extracted from the user's and the user's friends' tweets, together with user-friend relationships, a user interest model is generated. In other words, a Twitter user is represented via weighted named entities.
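A minimal sketch of such a model follows: entities from the user's own tweets count fully, entities from friends' tweets are discounted, and the weights are normalized. The single `friend_rank_weight` discount is a hypothetical simplification of the per-friend relative ranking described earlier.

```python
from collections import Counter

def interest_model(own_entities, friend_entities, friend_rank_weight=0.5):
    """Build a normalized entity-weight profile for a user.

    own_entities / friend_entities: lists of named entities (with repeats)
    extracted from the user's and the user's friends' tweets. The flat
    friend_rank_weight stands in for the per-friend ranking of the system.
    """
    weights = Counter()
    for e in own_entities:
        weights[e] += 1.0                 # own tweets count fully
    for e in friend_entities:
        weights[e] += friend_rank_weight  # friends' tweets are discounted
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()}  # weights sum to 1

model = interest_model(
    own_entities=["galatasaray", "galatasaray", "istanbul"],
    friend_entities=["istanbul", "ankara"],
)
print(model)
```

The resulting dictionary is exactly the "weighted named entities" representation: here "galatasaray" dominates because the user mentioned it twice themselves.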
Tweet Recommendation is the last phase, where two kinds of recommendation applications are applied by comparing candidate tweets with the generated user interest model: tweet classification, the task of deciding whether a candidate tweet is interesting for the user or not, and tweet ranking, which sorts tweets from the most recommendable to the least recommendable.
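Both applications can be sketched with one similarity function: cosine similarity between the user's entity-weight vector and each candidate tweet's entity vector. The threshold-based classification and the similarity choice are illustrative assumptions; the paper only specifies that candidates are compared with the interest model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse entity-weight vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_tweets(user_model, candidates, threshold=0.1):
    """Classify candidates (score >= threshold) and rank them by score.

    candidates: list of (tweet_text, entity_vector) pairs.
    """
    scored = [(cosine(user_model, vec), text) for text, vec in candidates]
    interesting = [(s, t) for s, t in scored if s >= threshold]
    return sorted(interesting, reverse=True)   # most recommendable first

user = {"galatasaray": 0.5, "istanbul": 0.375, "ankara": 0.125}
candidates = [
    ("match tonight", {"galatasaray": 1.0}),
    ("weather report", {"london": 1.0}),
]
print(rank_tweets(user, candidates))
```

Dropping candidates below the threshold implements tweet classification; sorting the survivors implements tweet ranking, so one score serves both applications.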
Summarization
Document summarization can be a vital solution to the information overload problem on the web. This summarization capability assists users in seeing at a glance what a collection is about and provides a new way of organizing a huge collection of information. The clustering-based method for multi-document text summarization is useful on the web because of its domain- and language-independent nature.
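One minimal form of clustering-based summarization is sketched below: sentences are greedily grouped by word overlap (Jaccard similarity here, an illustrative choice), and one representative per cluster forms the summary. Note how it is language-independent in the sense stated above: nothing beyond whitespace tokenization is assumed.

```python
def summarize(sentences, sim_threshold=0.3):
    """Greedy clustering-based summary: group sentences by word overlap
    and keep one representative per cluster."""
    def jaccard(a, b):
        a, b = set(a.split()), set(b.split())
        return len(a & b) / len(a | b) if a | b else 0.0

    reps = []
    for s in sentences:
        if all(jaccard(s, r) < sim_threshold for r in reps):
            reps.append(s)   # s starts a new cluster; keep it as representative
    return reps

docs = [
    "tweet segmentation preserves semantic meaning",
    "segmentation of tweets preserves semantic meaning",
    "named entity recognition benefits from segments",
]
print(summarize(docs))   # one sentence per topic cluster
```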
Ranking
Ranking looks for documents in which two or more independent occurrences of identical terms appear within a specified distance, where the distance is measured as the number of intervening words or characters. We use a modified proximity ranking, which applies a keyword weighting function to rank the resulting documents.
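A basic proximity score of the kind described above can be sketched as follows: count pairs of distinct keywords that occur within a fixed word distance of each other. The pair-counting scheme and the distance parameter are illustrative simplifications of the modified proximity ranking; the keyword weighting function is omitted.

```python
def proximity_score(doc_words, keywords, max_distance=5):
    """Count pairs of distinct keywords within max_distance words of each
    other (a simplified proximity-ranking score)."""
    positions = [(i, w) for i, w in enumerate(doc_words) if w in keywords]
    score = 0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            (pi, wi), (pj, wj) = positions[i], positions[j]
            if wi != wj and pj - pi <= max_distance:
                score += 1   # two different keywords close together
    return score

doc = "tweet segmentation improves entity recognition in tweet streams".split()
print(proximity_score(doc, {"segmentation", "entity", "recognition"}))  # -> 3
```

Documents where the query terms co-occur tightly score higher than documents where the same terms are scattered, which is the intuition behind proximity ranking.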
CONCLUSION
Tweet segmentation helps preserve the semantic meaning of tweets, which in turn benefits many downstream applications, e.g., named entity recognition. Segment-based named entity recognition methods achieve much better accuracy than word-based alternatives. The aim of this work, high-quality segmentation of highly time-sensitive tweets, has been achieved.