Truth Validation and Veracity Analysis with Information Networks
Motivation
Why truth validation and veracity analysis?
Information sharing
Sharing trustable, quality information
Identifying false information among many conflicting ones
Information security
Protecting trustable information and its sources
Identifying suspicious information providers, i.e., those that frequently provide false information
Tracing back suspicious information providers via information networks
Truth Validation and Veracity Analysis by Information Network Analysis
The trustworthiness problem of the web (according to a survey):
54% of Internet users trust news web sites most of the time
26% for web sites that sell products
12% for blogs
TruthFinder: Truth discovery on the Web by link analysis
Among multiple conflicting results, can we automatically identify which one is likely the true fact?
Veracity (conformity to truth):
Given a large amount of conflicting information about many objects, provided by multiple web sites (or other information providers), how to discover the true fact about each object?
Xiaoxin Yin, Jiawei Han, Philip S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, TKDE’08
Conflicting Information on the Web
Different websites often provide conflicting information on a subject, e.g., the authors of “Rapid Contextual Design”
Mapping It to Information Networks
Each object may have a set of conflicting facts
E.g., different author names for a book
And each web site provides some facts
How to find the true fact for each object?
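The mapping above can be sketched as three adjacency maps. This is a minimal illustration, not the authors' data structure; the site names, object names, and facts are hypothetical toy values.

```python
from collections import defaultdict

# Hypothetical toy claims: (web_site, object, fact). All names are illustrative.
claims = [
    ("site_a", "book_1", "author X"),
    ("site_b", "book_1", "author X"),
    ("site_c", "book_1", "author Y"),
    ("site_a", "book_2", "author Z"),
]

def build_network(claims):
    """Build the object-fact and site-fact links of the information network."""
    facts_of_object = defaultdict(set)  # object -> its (possibly conflicting) facts
    sites_of_fact = defaultdict(set)    # (object, fact) -> sites providing it
    facts_of_site = defaultdict(set)    # site -> (object, fact) pairs it provides
    for site, obj, fact in claims:
        facts_of_object[obj].add(fact)
        sites_of_fact[(obj, fact)].add(site)
        facts_of_site[site].add((obj, fact))
    return facts_of_object, sites_of_fact, facts_of_site

facts_of_object, sites_of_fact, facts_of_site = build_network(claims)
```

Here book_1 has two conflicting facts, which is exactly the situation the truth-discovery question addresses.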
Basic Heuristics for Problem Solving
There is usually only one true fact for a property of an object
This true fact appears to be the same or similar on different web sites
E.g., “Jennifer Widom” vs. “J. Widom”
The false facts on different web sites are less likely to be the same or similar
False facts are often introduced by random factors
A web site that provides mostly true facts for many objects will likely provide true facts for other objects
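The "same or similar" heuristic needs some notion of similarity between facts. The slides do not say how it is computed; the sketch below is one hypothetical rule for author names (same last name, first names agree in full or by initial), shown only to make the “Jennifer Widom” vs. “J. Widom” example concrete.

```python
def name_tokens(name):
    """Lowercase tokens with trailing periods stripped, e.g. 'J. Widom' -> ['j', 'widom']."""
    return [t.strip(".").lower() for t in name.split() if t.strip(".")]

def names_compatible(a, b):
    """Assumed rule: last names match and each leading token matches fully or as an initial."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb or ta[-1] != tb[-1]:
        return False
    for x, y in zip(ta[:-1], tb[:-1]):
        x_is_initial_of_y = len(x) == 1 and y.startswith(x)
        y_is_initial_of_x = len(y) == 1 and x.startswith(y)
        if x != y and not x_is_initial_of_y and not y_is_initial_of_x:
            return False
    return True

same = names_compatible("Jennifer Widom", "J. Widom")
different = names_compatible("Jennifer Widom", "K. Widom")
```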
Mutual Consolidation between Confidence of Facts and Trustworthiness of Providers
Confidence of facts ↔ Trustworthiness of web sites
A fact has high confidence if it is provided by (many) trustworthy web sites
A web site is trustworthy if it provides many facts with high confidence
The TruthFinder mechanism, an overview:
Initially, each web site is equally trustworthy
Based on the above four heuristics, infer fact confidence from web site trustworthiness, and then web site trustworthiness from fact confidence
Repeat until a stable state is reached
Analogy to Authority-Hub Analysis
Facts ↔ Authorities, Web sites ↔ Hubs
Difference from authority-hub analysis
Linear summation cannot be used
A web site is trustable if it provides accurate facts, instead of many facts
Confidence is the probability of being true
Different facts of the same object influence each other
Computation Model: t(w) and s(f)
The trustworthiness of a web site w: t(w)
Average confidence of facts it provides
The confidence of a fact f: s(f)
One minus the probability that all web sites providing f are wrong
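The two definitions above, combined with the iterate-until-stable overview, can be sketched as follows. This is a simplified illustration and not the full TruthFinder method: it ignores the influence between different facts of the same object and applies no damping, so scores tend to saturate near 1. The initial trust t0, the tolerance, and the toy data are assumptions.

```python
import math

def truthfinder_basic(sites_of_fact, t0=0.9, tol=1e-6, max_iter=100):
    """sites_of_fact: dict mapping each fact to the set of sites providing it.

    Returns (t, s): site trustworthiness and fact confidence.
    """
    facts_of_site = {}
    for fact, sites in sites_of_fact.items():
        for w in sites:
            facts_of_site.setdefault(w, set()).add(fact)

    t = {w: t0 for w in facts_of_site}  # initially every site is equally trustworthy
    s = {}
    for _ in range(max_iter):
        # s(f) = 1 - product over providers of (1 - t(w))
        s = {f: 1.0 - math.prod(1.0 - t[w] for w in sites)
             for f, sites in sites_of_fact.items()}
        # t(w) = average confidence of the facts w provides
        new_t = {w: sum(s[f] for f in facts) / len(facts)
                 for w, facts in facts_of_site.items()}
        delta = max(abs(new_t[w] - t[w]) for w in t)
        t = new_t
        if delta < tol:
            break
    return t, s

# Hypothetical toy input: two sites agree on author X, one site claims author Y.
toy = {
    ("book_1", "author X"): {"site_a", "site_b"},
    ("book_1", "author Y"): {"site_c"},
    ("book_2", "author Z"): {"site_a"},
}
t, s = truthfinder_basic(toy)
```

On this toy input the fact backed by two sites ends up with higher confidence than the fact backed by one, and the sites that agree end up more trusted than the lone dissenter.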
Experiments: Finding Truth of Facts
Determining authors of books
Dataset contains 1265 books listed on abebooks.com
We analyze 100 random books (using book images)
Experiments: Trustable Info Providers
Finding trustworthy information sources
Most trustworthy bookstores found by TruthFinder vs. Top ranked bookstores by Google (query “bookstore”)
Beyond TruthFinder: Extensions
Limitations of TruthFinder:
Only one version of truth
But people may have different, contrasting opinions
Does not consider the time factor
But truth may change with time, e.g., Obama’s status in 2008 and 2009
Needed Extensions
Multiple versions of truth or opinions
Evolution of truth
Philosophy
Truth is a relative, evolving, and dynamically changing judgment
Multiple Versions of Truth
Watch out for copy-cats!
Copy-cat: Some information providers or even news agencies simply copy each other
Falsity could be amplified by copy-cats
How to judge copy-cats: They always copy within a certain dimensional space
Treat copy-cats as one instead of multiples
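A minimal sketch of "treat copy-cats as one": providers whose fact sets overlap almost completely are collapsed into one group, so a copied claim counts once. The Jaccard measure, the 0.9 threshold, and the agency names are assumptions of this sketch, not the method from the slides. Note that independent providers reporting the same true facts also overlap, so a real detector would need more evidence than overlap alone.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def merge_copycats(facts_of_site, threshold=0.9):
    """Greedily group sites whose fact sets overlap a group's first member by >= threshold."""
    groups = []  # list of (representative_facts, member_sites)
    for site in sorted(facts_of_site):
        facts = facts_of_site[site]
        for rep_facts, members in groups:
            if jaccard(facts, rep_facts) >= threshold:
                members.append(site)
                break
        else:
            groups.append((set(facts), [site]))
    return [members for _, members in groups]

# Hypothetical providers: agency_b repeats agency_a exactly.
providers = {
    "agency_a": {"f1", "f2", "f3"},
    "agency_b": {"f1", "f2", "f3"},
    "agency_c": {"f1", "f4"},
}
groups = merge_copycats(providers)
```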
Transition/Evolution of Truth
Truth is not static: It changes dynamically with time
Associating different versions of truth with different time periods
Clustering statements based on time durations
Statements
Identifying clusters (density-based clustering)
Distinguishing time-based clusters from outliers
Information providers
Leaders, followers, and old-timers
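The clustering of statements by time can be illustrated with a one-dimensional, gap-based simplification of density-based clustering: consecutive timestamps closer than eps belong to the same cluster, and groups smaller than min_pts are treated as outliers. The parameter names and the sample day numbers are assumptions; this is a sketch of the idea, not a full DBSCAN implementation.

```python
def cluster_by_time(timestamps, eps, min_pts):
    """Split sorted timestamps at gaps larger than eps; small groups become outliers."""
    clusters, outliers, current = [], [], []

    def flush(group):
        if len(group) >= min_pts:
            clusters.append(group)
        else:
            outliers.extend(group)

    for ts in sorted(timestamps):
        if current and ts - current[-1] > eps:
            flush(current)
            current = []
        current.append(ts)
    if current:
        flush(current)
    return clusters, outliers

# Hypothetical statement times (in days): two dense periods and one stray statement.
days = [1, 2, 3, 4, 40, 100, 101, 103, 104]
clusters, outliers = cluster_by_time(days, eps=5, min_pts=3)
```

Each resulting cluster can then be associated with one version of the truth for that time period.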
Information-network based ranking and clustering
Powerful analysis by information network analysis
Why RankClus?
More meaningful clusters
Within each cluster, ranking score for every object is available as well
More meaningful ranking
Ranking within a cluster is more meaningful than in the whole network
Address the problem of clustering in heterogeneous networks
No need to compute pair-wise similarity of objects
Mapping each object into a low-dimensional measure space
Which type of objects is clustered: Target objects (specified by the user)
Clustering of target objects can induce a sub-network of the original network
Algorithm Framework - Summary
Step 0. Initialization
Randomly partition target objects into K clusters
Step 1. Ranking
Rank the sub-network induced from each cluster; the ranking serves as the feature of that cluster
Step 2. Generating new measure space
Estimate mixture model coefficients for each target object
Step 3. Adjusting clusters
Step 4. Repeat Steps 1-3 until stable
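The framework above can be sketched in plain Python for a bi-type network given as a matrix W[i][j] (target object i, attribute object j). This is a simplified illustration, not the published algorithm: it uses degree-based ranking, a basic EM for the mixture coefficients, smoothing toward the global ranking, and cosine similarity for cluster adjustment. The parameters smooth, em_iters, and init are assumptions of the sketch. Like any such iterative scheme, the result depends on the initial partition.

```python
import math
import random

def simple_rank(W, members, n_attr):
    """Degree-based rank of attribute objects in the sub-network induced by members."""
    scores = [sum(W[i][j] for i in members) for j in range(n_attr)]
    total = sum(scores)
    return [x / total for x in scores] if total > 0 else [1.0 / n_attr] * n_attr

def mixture_coefficients(row, ranks, em_iters=10):
    """EM estimate of how much of one target object's links each cluster explains."""
    K, weight = len(ranks), sum(row)
    pi = [1.0 / K] * K
    if weight == 0:
        return pi
    for _ in range(em_iters):
        new_pi = [0.0] * K
        for j, w in enumerate(row):
            joint = [pi[k] * ranks[k][j] for k in range(K)]
            z = sum(joint)
            if w == 0 or z == 0:
                continue
            for k in range(K):
                new_pi[k] += w * joint[k] / z
        pi = [p / weight for p in new_pi]
    return pi

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def rankclus(W, K, max_iter=20, smooth=0.1, seed=0, init=None):
    m, n = len(W), len(W[0])
    rng = random.Random(seed)
    # Step 0: random partition unless an initial labeling is supplied.
    labels = list(init) if init is not None else [rng.randrange(K) for _ in range(m)]
    global_rank = simple_rank(W, range(m), n)
    ranks = []
    for _ in range(max_iter):
        # Step 1: rank each induced sub-network (smoothed toward the global rank).
        ranks = []
        for k in range(K):
            members = [i for i in range(m) if labels[i] == k]
            r = simple_rank(W, members, n)
            ranks.append([(1 - smooth) * a + smooth * g for a, g in zip(r, global_rank)])
        # Step 2: map each target object to a K-dimensional coefficient vector.
        coords = [mixture_coefficients(W[i], ranks) for i in range(m)]
        # Step 3: recompute centers and reassign each object to the closest one.
        centers = []
        for k in range(K):
            pts = [coords[i] for i in range(m) if labels[i] == k]
            centers.append([sum(p[d] for p in pts) / len(pts) for d in range(K)]
                           if pts else [1.0 / K] * K)
        new_labels = [max(range(K), key=lambda k: cosine(coords[i], centers[k]))
                      for i in range(m)]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, ranks

# Hypothetical 2-area network: 4 conferences (rows) x 4 authors (columns).
W_demo = [[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 4, 3], [0, 0, 3, 4]]
labels, ranks = rankclus(W_demo, K=2, init=[0, 0, 0, 1])
```

Starting from the deliberately imperfect partition [0, 0, 0, 1], the third conference moves to the second cluster after one adjustment, and the within-cluster ranks then concentrate on each area's own authors.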
Focus on A Bi-type Network Case
Conference-author network, links can exist between
Conference (X) and author (Y)
Author (Y) and author (Y)
Use W to denote the links and their weights
W = \begin{pmatrix} W_{XX} & W_{XY} \\ W_{YX} & W_{YY} \end{pmatrix}
Step 1: Feature Extraction — Ranking
Simple Ranking
Rank is proportional to the degree of each object
E.g., number of publications of authors
Considers only immediate neighborhood in the network
Authority Ranking
Extension to HITS in weighted bi-type network
Rules:
Rule 1: Highly ranked authors publish many papers in highly ranked conferences
Rule 2: Highly ranked conferences attract many papers from many highly ranked authors
Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
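The two ranking functions can be sketched as follows. Simple ranking uses degrees only; authority ranking applies Rules 1 and 2 alternately and, when a co-author matrix is supplied, mixes in Rule 3. The mixing weight alpha, the iteration count, and the toy matrix are assumptions of this sketch rather than values from the slides.

```python
def normalize(v):
    """Scale a non-negative vector to sum to 1 (returned unchanged if all zero)."""
    total = sum(v)
    return [x / total for x in v] if total > 0 else v

def simple_ranking(W_xy):
    """Degree-based ranks: row sums for conferences, column sums for authors."""
    conf = normalize([sum(row) for row in W_xy])
    auth = normalize([sum(col) for col in zip(*W_xy)])
    return conf, auth

def authority_ranking(W_xy, W_yy=None, alpha=0.95, iters=50):
    """Iterative ranking on a conference (X) by author (Y) weight matrix."""
    n_conf, n_auth = len(W_xy), len(W_xy[0])
    r_conf = [1.0 / n_conf] * n_conf
    r_auth = [1.0 / n_auth] * n_auth
    for _ in range(iters):
        # Rule 1: authors score highly by publishing in highly ranked conferences.
        score = [sum(W_xy[i][j] * r_conf[i] for i in range(n_conf))
                 for j in range(n_auth)]
        if W_yy is not None:
            # Rule 3: co-authoring with highly ranked authors adds to the score.
            co = [sum(W_yy[j][k] * r_auth[k] for k in range(n_auth))
                  for j in range(n_auth)]
            score = [alpha * a + (1 - alpha) * b for a, b in zip(score, co)]
        r_auth = normalize(score)
        # Rule 2: conferences score highly by attracting highly ranked authors.
        r_conf = normalize([sum(W_xy[i][j] * r_auth[j] for j in range(n_auth))
                            for i in range(n_conf)])
    return r_conf, r_auth

# Hypothetical publication counts: 2 conferences (rows) x 3 authors (columns).
W_demo_xy = [[5, 1, 0], [1, 1, 1]]
conf_simple, auth_simple = simple_ranking(W_demo_xy)
conf_auth, auth_auth = authority_ranking(W_demo_xy)
```

Unlike simple ranking, authority ranking propagates scores through the whole network rather than looking only at immediate neighbors.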
Example: Authority Ranking in the 2-Area Conference-Author Network
Given the correct clusters, the rankings of authors are quite distinct from each other
Example: 2-D Coefficients in the 2-Area Conference-Author Network
The conferences are well separated in the new measure space
Scatter plots of two conferences and component coefficients
Time Complexity Analysis
Per iteration, where |E| is the number of edges in the network, m the number of target objects, and K the number of clusters:
Ranking for sparse network
~O(|E|)
Mixture model estimation
~O(K|E|+mK)
Cluster adjustment
~O(mK^2)
Overall, linear in |E|
~O(K|E|)
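The overall bound follows from summing the three per-iteration costs. The last step is not stated on the slides and assumes that mK = O(|E|), i.e., the number of clusters is small relative to the average number of links per target object:

```latex
\underbrace{O(|E|)}_{\text{ranking}}
+ \underbrace{O(K|E| + mK)}_{\text{mixture model}}
+ \underbrace{O(mK^2)}_{\text{cluster adjustment}}
= O(K|E| + mK^2)
= O(K|E|) \quad \text{when } mK = O(|E|).
```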