Evolution of Genome Science:
DNA sequencing has come a long way since the days of two-dimensional chromatography in the 1970s. With the advent of Sanger chain-termination sequencing in 1977, scientists gained the ability to sequence the full genome of any species in a reliable, reproducible manner. A decade later, Applied Biosystems introduced the first automated, capillary electrophoresis (CE)-based sequencing instruments, the AB370 in 1987 and the AB3730xl in 1998, instruments that became the primary workhorses for the NIH-led and Celera-led Human Genome Projects. While these first-generation instruments were considered high throughput for their time, the Genome Analyzer emerged in 2005 and took sequencing output from 84 kilobases (kb) per run to gigabases (Gb) per run. This short-read, massively parallel sequencing technique was a fundamentally different approach that revolutionized sequencing capabilities and launched the "next generation" in genome science. From that point forward, the data output of next-generation sequencing has outpaced Moore's law, more than doubling each year.
In 2005, a single run of the Genome Analyzer could produce roughly one gigabase of data. By 2014, that figure had climbed to 1.8 terabases in a single sequencing run, an astounding 1,000x increase. It is remarkable to reflect that the first human genome, famously co-published in Science and Nature in 2001, required more than a decade to sequence and cost nearly 3 billion dollars. In contrast, the HiSeq X Ten, released in 2014, can sequence over 45 human genomes in a single day for approximately $1,000 each (Figure 2).
Beyond the massive increase in data output, the introduction of NGS technology has transformed the way scientists think about genetic information. The $1,000 genome enables population-scale sequencing and establishes the foundation for personalized genomic medicine as part of standard medical care. Researchers can now analyze thousands to tens of thousands of samples in a single year.
Whole Genome Sequencing:
Whole genome sequencing (also known as WGS, full genome sequencing, complete genome sequencing, or entire genome sequencing) is a laboratory process that determines the complete DNA sequence of an organism's genome at a single time. This entails sequencing all of an organism's chromosomal DNA as well as DNA contained in the mitochondria and, for plants, in the chloroplast.
Whole genome sequencing should not be confused with DNA profiling, which only determines the likelihood that genetic material came from a particular individual or group, and does not provide additional information on genetic relationships, origin, or susceptibility to specific diseases. Also unlike full genome sequencing, SNP genotyping covers less than 0.1% of the genome. Almost all truly complete genomes are of microbes; the term "full genome" is thus sometimes used loosely to mean "greater than 95%". The remainder of this report focuses on nearly complete human genomes.
High-throughput genome sequencing technologies have largely been used as research tools and are now being introduced into the clinic. In the future of personalized medicine, whole-genome sequence data will be an important tool for guiding therapeutic intervention. Sequencing at the level of individual SNPs also helps pinpoint functional variants from association studies and improves the knowledge available to researchers interested in evolutionary biology, and hence may lay the foundation for predicting disease susceptibility and drug response.
Exome Sequencing:
Exome sequencing (also known as whole exome sequencing or WES) is a technique for sequencing all the protein-coding genes in a genome (known as the exome). It consists of first selecting only the subset of DNA that encodes proteins (known as exons), and then sequencing that DNA using any high-throughput DNA sequencing technology. There are about 180,000 exons, which constitute roughly 1% of the human genome, or approximately 30 million base pairs, but mutations in these sequences are much more likely to have severe consequences than mutations in the remaining 99%. The goal of this approach is to identify genetic variation that is responsible for both Mendelian and common diseases, such as Miller syndrome, without the high costs associated with whole-genome sequencing.
Sequence Alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignment techniques are also used for non-biological sequences, such as calculating the edit distance between strings in a natural language or in financial data.
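As a small illustration of the dynamic-programming idea underlying alignment, the following Python sketch computes the edit (Levenshtein) distance between two strings; the sequences shown are made up for the example.

def edit_distance(a, b):
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1            # match or substitution
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution/match
        prev = curr
    return prev[-1]

print(edit_distance("GATTACA", "GCATGCA"))  # prints 3

The same recurrence, with substitution scores instead of unit costs and gap characters recorded along the way, is the core of classical alignment algorithms such as Needleman-Wunsch.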
Sorting of Aligned DNA Sequences
The query sequence is broken down into multiple segments and aligned against the reference sequence. The position of each query segment with respect to the reference, together with the segment itself, is extracted using the SAMtools platform. Aligning the FASTQ file of the query sequence against the reference yields a file in the SAM format, which is then used to sort the query segments by their positions on the reference.
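For example, a SAM file of this kind can be produced by aligning the FASTQ reads against an indexed reference with a short-read aligner such as BWA (mentioned again below); the file names here are illustrative:

bwa index reference.fa
bwa mem reference.fa reads.fastq > sam_reads_aligned.sam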
FASTQ File Format
FASTQ is a text-based file format for storing a biological sequence and its corresponding quality scores. Both the sequence letters and the quality scores are encoded as ASCII characters. It was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its quality data.
Once the sequencing is finished, the data become available for download as FASTQ text files, in which each short read takes up four lines. The first line (starting with an @) is a read identifier; the second is the DNA sequence; the third is another identifier (the same as line 1, but starting with a +, or sometimes consisting only of a +); and the fourth holds a Phred quality score symbol for each base in the read. The quality score is based on the ASCII character code. Illumina's current sequencing pipeline uses an offset of 64, so that an @ (ASCII code 64) encodes 0 and an h (ASCII code 104) encodes 40 (other versions of the pipeline may use different offsets, however; if you have data with a different offset value, you will need to modify your commands accordingly to inform programs that this is the case). The quality score for each base ranges from -5 to 40 and is defined as Q_phred = -10 log10(p), where p is the estimated probability of the base call being wrong. A Q_phred of 20 therefore corresponds to a 99% probability of a correctly identified base.
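To make the encoding concrete, here is a minimal Python sketch that decodes a quality string under the offset-64 convention described above and recovers the error probability for each base; the quality string is made up for the example.

OFFSET = 64  # assumed Illumina-style offset; use 33 for Sanger-style FASTQ

def decode_qualities(quality_string, offset=OFFSET):
    # Each ASCII character encodes one per-base Phred score
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    # Invert Q = -10 * log10(p) to get the base-call error probability
    return 10 ** (-q / 10)

quals = decode_qualities("hhT@")
print(quals)                                  # [40, 40, 20, 0]
print([error_probability(q) for q in quals])  # [0.0001, 0.0001, 0.01, 1.0]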
SAMtools
SAMtools is a popular open-source tool suite used in next-generation sequence analysis. It helps in evaluating aligned reads and calling variants, and is useful for identifying genomic variants such as small insertions and deletions. SAMtools interacts with and post-processes short DNA sequence read alignments in the SAM (Sequence Alignment/Map) and BAM (Binary Alignment/Map) file formats. These files are generated as output by short-read aligners such as BWA. SAM files are human-readable text files, and BAM files are simply their binary equivalent. BAM files are typically compressed and more efficient for software to work with than SAM. SAMtools makes it possible to work directly with a compressed BAM file, without having to uncompress the whole file. Additionally, since the format of a SAM/BAM file is somewhat complex, containing reads, references, alignments, quality information, and user-specified annotations, SAMtools reduces the effort needed to use SAM/BAM files by hiding low-level details.
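As a sketch of how this abstraction looks in practice, the following Python snippet uses pysam (Python bindings built on the same htslib code base as SAMtools) to iterate over a compressed BAM file directly; the file name is illustrative.

import pysam

# Open a compressed BAM file directly ("rb" = read, binary)
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for read in bam:
        # Each record exposes its fields without manual decompression or parsing
        print(read.query_name, read.reference_name,
              read.reference_start, read.mapping_quality)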
SAMtools commands follow a stream model, where data run through each command as if carried on a conveyor belt. This allows multiple commands to be combined into a data-processing pipeline. We have utilized the sort command to sort the aligned sequences in the BAM file format. A short description of the command follows:
SAMtools sort
The sort command sorts a BAM file based on position in the reference, as determined by each read's alignment. The key used for ordering is the reference element plus the coordinate within it of the first matched base of the read. The sorted output is written to a new file by default, although it can be directed to stdout (using the -o option). As sorting is memory intensive and BAM files can be large, this command supports a sectioning mode (via the -m option) that uses at most a given amount of memory and generates multiple output files. These files can then be merged to produce a complete sorted BAM file.
Example Command Sequence is as follows:
Convert a BAM file into a SAM file.
samtools view sample.bam > sample.sam
Convert a SAM file into a BAM file. The -b option makes the output BAM (i.e., compressed), and the -S option indicates that the input is in the SAM format.
samtools view -bS sample.sam > sample.bam
Extract all the reads aligned to a specified range, here those aligned to the reference element named chr1 and covering its 10th, 11th, 12th, or 13th base. The result is saved to a BAM file, including the header. Extracting reads by their mapping position in the reference genome requires an index of the input file, as created by samtools index.
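Assuming the illustrative file names used above, the corresponding command pair would be:

samtools index sample.bam
samtools view -b sample.bam chr1:10-13 > subset.bam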
Sort
samtools sort unsorted_in.bam sorted_out
Read the specified unsorted_in.bam as input, sort it by aligned read position, and write the result to sorted_out.bam; the output name (without the .bam extension) is given as the second argument.
SAMtools Results:
A sample file in the SAM format (sam_reads_aligned.sam) is converted to BAM and sorted using the sort command available within the SAMtools platform. The resulting sorted BAM file is then converted back to a SAM file using the view command available in the tool.
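With the older sort syntax shown above, this workflow corresponds to a command sequence along these lines (only sam_reads_aligned.sam is named in the report; the other file names are illustrative):

samtools view -bS sam_reads_aligned.sam > sam_reads_aligned.bam
samtools sort sam_reads_aligned.bam sam_reads_sorted
samtools view sam_reads_sorted.bam > sam_reads_sorted.sam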
IMPLEMENTATION OF MERGE SORT USING FPGA
MERGE SORT
In computer science, merge sort (also commonly spelled mergesort) is an efficient, general-purpose, comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the implementation preserves the input order of equal elements in the sorted output.
Merge sort is a recursive sort of order n log(n).
It has a worst-case and average complexity of O(n log n); variants that detect already-sorted input (such as natural merge sort) achieve a best case of O(n). The basic idea is to split the collection into smaller groups by halving it until the groups have only one element or no elements (both of which are trivially sorted). The groups are then merged back together so that their elements are in order. This is how the algorithm earns its "divide and conquer" description.
ALGORITHM
Conceptually, a merge sort works as follows (a short code sketch follows the list):
1. Divide the unsorted list into n sublists, each containing 1 element (a list of 1 element is considered sorted).
2. Repeatedly merge sublists to produce new sorted sublists until there is only 1 sublist remaining. This will be the sorted list.
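A minimal Python sketch of the algorithm, as a plain software reference rather than the FPGA implementation itself, might look as follows:

def merge(left, right):
    # Merge two already-sorted lists into one sorted list
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:        # <= keeps equal elements in input order (stable)
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])            # at most one of these tails is non-empty
    merged.extend(right[j:])
    return merged

def merge_sort(items):
    # Step 1: a list of zero or one elements is already sorted
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    # Step 2: sort each half recursively, then merge the sorted halves
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))

print(merge_sort([38, 27, 43, 3, 9, 82, 10]))   # [3, 9, 10, 27, 38, 43, 82]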