21-08-2012, 12:32 PM
Seminar Report On OpenMP
OpenMP.docx (Size: 47.03 KB / Downloads: 20)
ABSTRACT
Pair-wise sequence alignment plays a central role in the post-genomics analysis. Needleman Wunsch is a dynamic programming algorithm which provides an optimal solution to the global alignment of two Genome sequences. This work parallelized this algorithm by partitioning the columns of similarity matrix according to the number of the available Processors and their capacities. In parallel version of BLAST and FASTA master processor reads the query, the database, and the number of processors (n), and splits the database into “n” parts, which is distributed among "n" processors so as to search the database in parallel. After searching the database, the processors send the calculated scores to the master processor, which further sorts the scores and displays the alignments.
Also the serial version of BLAST and FASTA in both space and time domain are implemented.
This work makes a solid contribution to the area of genome sequencing. Experimental results demonstrate the strength of this parallelization. Good speedup has been achieved by this system. It can be used as the part of the standard biological analysis package for analyzing pairs of biological sequences with the support of back-end supercomputing facilities. A comparative study illustrates the advantage of this technique over existing technique.
Introduction
Deoxyribonucleic acid (DNA) is the hereditary molecule of life. Information encoded within DNA confers physical characteristics such as hair color, eye color, and susceptibility to certain diseases. By unraveling the secrets encoded within DNA, numerous biological and medical discoveries have been and continue to be made. One basic technique to glean information from a DNA molecule is to compare it to other DNA molecules. This comparison, generally called an alignment, is fundamental to the emerging field of bioinformatics revealing where sequences are the same and where they are different. The differences could be invaluable, for example, in helping to explain why individuals have different susceptibility to disease.
For an algorithm to align two sequences, the DNA must first be represented in a way understandable to a computer. A deoxyribonucleic acid (DNA) molecule is made up of three parts: phosphates, sugars, and nitrogenous bases. The phosphates and sugars repeat continuously throughout the molecule. However, the nitrogenous bases differ from position to position and are, to a great extent, the information-containing portion of DNA. There are only four common nitrogenous bases in DNA.
The alignment problem
Using our representation of DNA as a string, we now formally state the alignment problem. An alignment between two input sequences, s and t over the alphabet Σ = {A,C,G,T}, expresses an equivalence relationship between the pair of sequences by generating two sequences s' and t' of equal length by inserting spaces (or gaps) into s and t. An optimal alignment is one that minimizes the number of times that gaps are inserted while simultaneously minimizing the number of times that mismatches occur (i.e. a C aligned with a T). There are various scoring schemes that give different penalties for adding gaps as opposed to mismatches, but the particular scheme used does not affect the core of the algorithm described here. To illustrate the idea of an alignment
Organization
Chapter 2 describes need of parallelization, definition of parallelization and parallelization approach. Chapter 3 represents various existing sequence search algorithms like BLAST, FASTA and Dynamic programming. it also deals with implementation of their serial versions. Chapter 4 contains details of our work on parallel algorithms.
It describes parallel methods in detail. Their efficiency and complexities are calculated theoretically. Results and Conclusions are given in Chapter 5.
Need for parallelization
As CPU speeds increase, the size of the problems we wish to solve using those CPUs increase likewise. Therefore, there will never come a time where we do not need more computing power than we have.
There are two ways of getting more power: buy faster CPUs, or buy more CPUs. From a programming point of view, buying a faster CPU is by far the most feasible choice, because it does not involve re-writing code to take advantage of parallelism in the problem being solved. But from an economical point of view, buying more CPUs is a significantly cheaper way of achieving more raw power, as CPU prices do not correspond linearly with their speed.
We think it is pretty clear, that there will always be a need for exploiting parallelism in computing problems, so that those problems can be executed on parallel architectures. However, there is also a need to free the programmers from thinking about this parallelism.
There are compilers out there, that try to free the programmer from thinking about parallelism. High Performance Fortran is a language that, among other things, tries to make some of this abstraction from parallel programming a reality.
Study of various algorithms
The Dynamic Alignment Algorithm
Standard alignment algorithms utilize dynamic programming methods. The seminal work in this field is that of NEEDLEMAN and WUNSCH. Understanding this work is important because it is subsequently used in the parallel algorithm and also because it clarifies the computational requirements of this traditional method. The method described dynamic programming with the simple scoring scheme of one point for a match with no penalty for inserting spaces or gaps.
For sequences A and B of lengths m and n, respectively, first construct an m x n matrix M. Let Ai and Bj represent the ith and jth nucleotide of sequences A and B, respectively and let Mi,j represent the cell of M that corresponds to both Ai and Bj. Assign a value to each Mi,j based on the substitutability of nucleotide Ai for Bj. each value of matrix entry is maximum of three values – match or mismatch score, horizontal gapped value, vertical gapped value.