21-07-2014, 10:20 AM
EFFICIENT SEMISUPERVISED MEDLINE DOCUMENT CLUSTERING WITH MESH-SEMANTIC AND GLOBAL-CONTENT CONSTRAINTS
Introduction
We focus on searching biomedical text, a task that relies on clustering biomedical documents. For clustering biomedical documents, we can consider three different types of information: the local-content (LC) information from the documents, the global-content (GC) information from the whole MEDLINE collection, and the medical subject heading (MeSH) semantic (MS) information. Recently, the performance of MEDLINE document clustering has been enhanced by linearly combining the LC and MS information. However, a simple linear combination can be ineffective, because a single representation space is limited when it has to combine different types of information (similarities) that differ in reliability.
To overcome this limitation, we propose a new semisupervised spectral clustering method, SSNCut, which clusters over the LC similarities under two types of constraints: must-link (ML) constraints on document pairs with high MS (or GC) similarities and cannot-link (CL) constraints on pairs with low similarities. Experimental results show that SSNCut outperformed a linear combination method and several well-known semisupervised clustering methods, and that the improvement was statistically significant.
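The difference between the two strategies can be illustrated with a small sketch. It is not the SSNCut implementation: the function names, the toy similarity values, and the thresholds `high` and `low` are all assumptions made for illustration. The first function blends two similarity matrices linearly (the baseline), while the second turns one similarity matrix into ML and CL pair lists that a constrained clustering method could then use alongside the LC similarities.

```python
from itertools import combinations

def linear_combination(lc, ms, alpha=0.5):
    """Baseline: blend two similarity matrices entry by entry."""
    n = len(lc)
    return [[alpha * lc[i][j] + (1 - alpha) * ms[i][j] for j in range(n)]
            for i in range(n)]

def derive_constraints(sim, high=0.8, low=0.2):
    """Turn a similarity matrix into (must_link, cannot_link) pair lists.

    The thresholds are hypothetical; pairs between them get no constraint.
    """
    must_link, cannot_link = [], []
    for i, j in combinations(range(len(sim)), 2):
        if sim[i][j] >= high:
            must_link.append((i, j))
        elif sim[i][j] <= low:
            cannot_link.append((i, j))
    return must_link, cannot_link

# Toy similarities for four documents (made-up numbers).
lc = [
    [1.0, 0.5, 0.3, 0.2],
    [0.5, 1.0, 0.2, 0.3],
    [0.3, 0.2, 1.0, 0.6],
    [0.2, 0.3, 0.6, 1.0],
]
ms = [
    [1.0, 0.9, 0.1, 0.5],
    [0.9, 1.0, 0.15, 0.4],
    [0.1, 0.15, 1.0, 0.85],
    [0.5, 0.4, 0.85, 1.0],
]

combined = linear_combination(lc, ms, alpha=0.5)
must_link, cannot_link = derive_constraints(ms)
print("ML pairs:", must_link)
print("CL pairs:", cannot_link)
```

In the constraint-based view, only the confident pairs (very high or very low MS similarity) influence the clustering, whereas the linear blend lets every MS value, reliable or not, shift the combined similarity.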
Scope of the Project:
In our project we adopt an alternative approach that lets the user search biomedical text with improved performance. Normally, a search has to go through online databases: the user can get documents from PubMed, MEDLINE, PMC, MeSH, etc. These databases contain a very large amount of data, so retrieving documents from them is slow. We provide an option to choose where documents come from: either the online databases or our local database. We cluster all the documents in our local database, so documents can be retrieved from the different clusters.
Scalable Clustering Algorithms with Balancing Constraints
Authors: Arindam Banerjee, Joydeep Ghosh
Year: 2010
In this paper, the authors propose a general framework for scalable, balanced clustering. The data clustering process is broken down into three steps: sampling a small representative subset of the points, clustering the sampled data, and populating the initial clusters with the remaining data, followed by refinements. The third step has two parts:
1. Populate: The points that were not sampled, and hence do not yet belong to any cluster, are assigned to the existing clusters in a manner that satisfies the balancing constraints while ensuring good-quality clusters.
2. Refine: Iterative refinements are made to improve the clustering objective function while satisfying the balancing constraints throughout.
Hence, part 1 gives a reasonably good feasible solution, i.e., a clustering in which the balancing constraints are satisfied. Part 2 iteratively refines the solution while always remaining in the feasible space.
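The sample, cluster, populate, and refine steps can be sketched as follows. This is a simplified illustration, not the algorithm from the paper: it assumes a balancing constraint in the form of a maximum cluster size of ceil(n / k), uses plain k-means on the sample, populates greedily, and refines by swapping pairs of points between clusters (a swap never changes cluster sizes, so the solution stays feasible). All function and parameter names are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def balanced_clusters(points, k, sample_size, seed=0, passes=5):
    n = len(points)
    capacity = -(-n // k)  # ceil(n / k): assumed maximum cluster size
    if not k <= sample_size <= capacity:
        raise ValueError("need k <= sample_size <= ceil(n / k)")
    rng = random.Random(seed)

    # Step 1: sample a small subset of point indices.
    sample = rng.sample(range(n), sample_size)

    # Step 2: cluster the sample with a few k-means iterations.
    centers = [points[i] for i in sample[:k]]
    for _ in range(passes):
        groups = [[] for _ in range(k)]
        for i in sample:
            c = min(range(k), key=lambda c: sq_dist(points[i], centers[c]))
            groups[c].append(i)
        centers = [mean([points[i] for i in g]) if g else centers[c]
                   for c, g in enumerate(groups)]
    label = {i: c for c, g in enumerate(groups) for i in g}
    sizes = [len(g) for g in groups]

    # Step 3a (populate): nearest cluster that still has free capacity.
    for i in range(n):
        if i in label:
            continue
        open_clusters = [c for c in range(k) if sizes[c] < capacity]
        c = min(open_clusters, key=lambda c: sq_dist(points[i], centers[c]))
        label[i] = c
        sizes[c] += 1

    # Step 3b (refine): swap two points when that lowers the total distance.
    for _ in range(passes):
        members = [[points[i] for i in range(n) if label[i] == c]
                   for c in range(k)]
        centers = [mean(m) if m else centers[c] for c, m in enumerate(members)]
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                a, b = label[i], label[j]
                if a == b:
                    continue
                now = sq_dist(points[i], centers[a]) + sq_dist(points[j], centers[b])
                swapped = sq_dist(points[i], centers[b]) + sq_dist(points[j], centers[a])
                if swapped < now:
                    label[i], label[j] = b, a
                    improved = True
        if not improved:
            break
    return [label[i] for i in range(n)]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = balanced_clusters(points, k=2, sample_size=3)
print(labels)
```

With six points and k = 2, the size limit is three, so every run ends with exactly three points in each cluster regardless of which points were sampled.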
Class Diagram:
A class diagram in the UML is a type of static structure diagram that describes the structure of a system by showing the system’s classes, their attributes, and the relationships between the classes.
Private visibility hides information from anything outside the class partition. Public visibility allows all other classes to view the marked information.
Protected visibility allows child classes to access information they inherited from a parent class.
Object Diagram:
An object diagram in the Unified Modeling Language (UML) is a diagram that shows a complete or partial view of the structure of a modeled system at a specific time. An object diagram focuses on some particular set of object instances and attributes, and the links between the instances. A correlated set of object diagrams provides insight into how an arbitrary view of a system is expected to evolve over time. Object diagrams are more concrete than class diagrams, and are often used to provide examples, or act as test cases for the class diagrams. Only those aspects of a model that are of current interest need be shown on an object diagram.
Activity Diagram:
Activity diagrams are loosely defined diagrams that show workflows of stepwise activities and actions, with support for choice, iteration and concurrency. In UML, activity diagrams can be used to describe the business and operational step-by-step workflows of components in a system. UML activity diagrams can also model the internal logic of a complex operation. In many ways UML activity diagrams are the object-oriented equivalent of flow charts and data flow diagrams (DFDs) from structured development.
Sequence Diagram:
A sequence diagram in UML is a kind of interaction diagram that shows how processes operate with one another and in what order.
It is a construct of a message sequence chart. Sequence diagrams are sometimes called Event-trace diagrams, event scenarios, and timing diagrams.
The diagram below shows the sequence flow of the Compression of View on Anonymous Networks Folded View.
Component Diagram:
Components are wired together by using an assembly connector to connect the required interface of one component with the provided interface of another component. This illustrates the service consumer - service provider relationship between the two components. An assembly connector is a "connector between two components that defines that one component provides the services that another component requires. An assembly connector is a connector that is defined from a required interface or port to a provided interface or port." When using a component diagram to show the internal structure of a component, the provided and required interfaces of the encompassing component can delegate to the corresponding interfaces of the contained components.
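In code, the required/provided interface relationship can be sketched roughly as below. The component names (`DocumentSource`, `LocalClusterStore`, `SearchService`) and the sample documents are hypothetical and are not taken from the project; the point is only that the consumer depends on an interface, the provider implements it, and the wiring step plays the role of the assembly connector.

```python
from typing import List, Protocol

class DocumentSource(Protocol):
    """Required interface: what the consumer component needs."""
    def fetch(self, query: str) -> List[str]: ...

class LocalClusterStore:
    """Provider component: offers the DocumentSource interface."""
    def __init__(self, docs: List[str]) -> None:
        self._docs = docs

    def fetch(self, query: str) -> List[str]:
        # Case-insensitive substring match over the stored titles.
        return [d for d in self._docs if query.lower() in d.lower()]

class SearchService:
    """Consumer component: requires some DocumentSource."""
    def __init__(self, source: DocumentSource) -> None:
        self._source = source

    def search(self, query: str) -> List[str]:
        return sorted(self._source.fetch(query))

# Wiring the two components together (the "assembly connector").
store = LocalClusterStore([
    "Gene expression in yeast",
    "Protein folding",
    "Yeast protein kinases",
])
service = SearchService(store)
results = service.search("yeast")
print(results)
```

Because `SearchService` only knows the interface, a provider backed by an online database could be swapped in without changing the consumer.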
Conclusion
We have presented a new semisupervised spectral clustering method, i.e., SSNCut, which can incorporate both ML and CL constraints, for integrating different information for document clustering. We have emphasized that our idea behind this paper is to incorporate three different types of document similarities, i.e., the LC, GC and MS similarities. SSNCut realizes this new idea, providing a more flexible framework than a method of linearly combining the three similarities.