13-08-2014, 12:24 PM
SEMINAR REPORT ON NEURAL NETWORKS AND GENOMIC ENGINEERING
INTRODUCTION
Recent developments in the genomics arena have resulted in techniques that can produce large amounts of expression level data. One such technique is Microarray technology, which relies on the hybridization properties of nucleic acids to monitor DNA or RNA abundance on a genomic scale. Microarrays have revolutionized the study of the genome by allowing researchers to study the expression of thousands of genes simultaneously for the first time. This is predicted to be essential to understanding the role of genes in various biological functions.
The task of discovering patterns of gene expression is closely related to correlating sequences of genes to specific biological functions, and thereby understanding the role of genes in biological functions on a genomic scale. The ability to simultaneously study thousands of genes under a host of differing conditions presents an immense challenge in the fields of computational science and data mining. New computational and data mining techniques need to be developed in order to properly comprehend and interpret expression data with the above goal in mind.
ANNs are used as a solution to various problems; however, their success as an intelligent pattern recognition methodology has been most prominently advertised. The most important, and attractive, feature of ANNs is their capability of learning (generalizing) from example (extracting knowledge from data). ANNs can do this without any prespecified rules that define intelligence or represent an expert's knowledge. This feature makes ANNs a very popular choice for gene expression analysis and sequencing. Due to their power and flexibility, ANNs have even been used as tools for relevant variable selection, which can in turn greatly increase the expert's knowledge and understanding of the problem.
From Cell to Gene
Cells are the basic structural and functional units of life. The blueprint for all cellular structures and functions of the organism is contained in an organelle known as the nucleus. This hereditary material is present in the form of long thin structures referred to as chromatin material. During cell division, chromatin material separates, shortens and thickens to form chromosomes. Every cell in the body contains the complete set of chromosomes of the organism. The number of chromosomes in the cell is specific to the species. Human beings have a set of 22 pairs of autosomes and one pair of sex chromosomes. Linearly located on the chromosomes are genes. Genes can be described as the physical and functional units of heredity. Chromosomes are constituted of an organic polymer known as DNA, which is the acronym for deoxyribonucleic acid.
GENOME INFORMATICS
Driven largely by the vast amounts of DNA sequence data, a new field of computational biology has emerged: Genome Informatics. The study includes functional genomics, the interpretation of the function of DNA sequence on a genomic scale; comparative genomics, the comparisons among genomes to gain insight into the universality of biological mechanisms and into details of gene structure and function; and structural genomics, the determination of the structure of all proteins. Thus genome informatics is not only a new area of computer science for genome projects but also a new approach to life science.
Genome informatics research is moving rapidly, with advances being made on several fronts. Methods for gene recognition and gene structure prediction provide the key to analyzing genes and functional elements from anonymous DNA sequence and to deriving protein sequences. Sequence comparison and database searching are the pre-eminent approaches for predicting the likely biochemical function of new genes or genome fragments. Information embedded within families of homologous sequences and their structures, which are derived from molecular data from humans and other organisms across a wide spectrum of evolutionary trees, provides effective means to detect distant family relationships and unravel gene functions.
ARTIFICIAL NEURAL NETWORKS
An artificial neural network is a parallel computational model composed of densely interconnected adaptive processing elements called neurons. It is an information-processing system that has certain performance characteristics in common with biological neural networks. It resembles the brain in that knowledge is acquired by the network through a learning process and that the interconnection strengths, known as synaptic weights, are used to store the knowledge.
A neural network is characterised by its pattern of connections between the neurons, referred to as the network architecture, and its method of determining the weights on the connections, called the training or learning algorithm. The weights are adjusted on the basis of data. In other words, neural networks learn from examples and exhibit some capability for generalisation beyond the training data. This feature makes such computational models very appealing in application domains where one has little or incomplete understanding of the problem to be solved, but where training data is readily available. Neural networks normally have great potential for parallelism, since the computations of the components are largely independent of each other.
Artificial neural networks are viable computational models for a wide variety of problems. Already, useful applications have been designed, built, and commercialised for various areas in engineering, business and biology. These include pattern classification, speech synthesis and recognition, adaptive interfaces between humans and complex physical systems, function approximation, image compression, associative memory, clustering, forecasting and prediction, combinatorial optimisation, nonlinear system modelling, and control. Although they may have been inspired by neuroscience, the majority of the networks have close relevance or counterparts to traditional statistical methods such as non-parametric pattern classifiers, clustering algorithms, nonlinear filters, and statistical regression models.
NEURAL NETWORK FOUNDATIONS
Artificial neural networks (ANNs) belong to the adaptive class of techniques in the machine learning arena. Most models of ANNs are organized in the form of a number of processing units called artificial neurons, or simply neurons, and a number of weighted connections, referred to as artificial synapses, between the neurons. The process of building an ANN, similar to its biological inspiration, involves a learning episode. During the learning episode, the network observes a sequence of recorded data and adjusts the strengths of its synapses according to a learning algorithm and based on the observed data. The process of adjusting the synaptic strengths in order to be able to accomplish a certain task, much like the brain, is called "learning". Learning algorithms are generally divided into two types, supervised and unsupervised. Supervised algorithms require labeled training data; in other words, they require more a priori knowledge about the training set.
TRAINING OF NEURAL NETWORKS
Neural networks are models that may be used to approximate, summarise, classify, generalise or otherwise represent real situations. Before models can be used they have to be trained or made to 'fit' the representative data. The model parameters, e.g., number of layers, number of units in each layer and weights of the connections between them, must be determined. In ordinary statistical terms this is called regression. There are two fundamental types of training with neural networks: supervised and unsupervised learning. For supervised training, as in regression, data used for the training consists of independent variables (also referred to as feature variables or predictor variables) and dependent variables (target values). The independent variables (input to the neural network) are used to predict the dependent variables (output from the network). Unsupervised training does not have dependent (target) values supplied: the network is supposed to cluster the data automatically into meaningful sets.
The fundamental idea behind training, for all neural networks, is in picking a set of weights and then applying the inputs to the network and gauging the network’s performance with this set of weights. If the network does not perform well, then the weights are modified by an algorithm specific to each architecture and the procedure is then repeated. This iterative process continues until some pre-specified criterion has been achieved. A training pass through all vectors of the input data is called an epoch. Iterative changes can be made to the weights with each input vector, or changes can be made after all input vectors have been processed. Typically, weights are iteratively modified by epochs.
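The iterative procedure above can be sketched in code. The following is an illustrative example only, not from the report: a single linear neuron trained with the delta rule, where weights are modified with each input vector and one pass through all vectors is an epoch. The data, learning rate and epoch count are made-up values.

```python
def train_epoch(weights, data, targets, lr=0.1):
    """One pass (epoch) over all input vectors, updating weights each step."""
    for x, t in zip(data, targets):
        y = sum(w * xi for w, xi in zip(weights, x))   # network output
        err = t - y                                    # error for this vector
        # delta rule: nudge each weight in the direction that reduces the error
        weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    return weights

# Toy training set (illustrative): predict the second input component.
data = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [1.0, 0.0, 1.0]
w = [0.0, 0.0]
for epoch in range(50):     # repeat until a pre-specified criterion (here: epoch limit)
    w = train_epoch(w, data, targets)
```

After enough epochs the weights settle near [0, 1], the exact solution for this toy mapping; in practice training would stop when an error tolerance or epoch limit is reached.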
DATA ENCODING
Once a desired input-output mapping task is determined, the design of a complete system then involves the choices for a pre-processor and a post-processor, in addition to the neural network model itself. For molecular applications, the input sequence encoding method converts molecular sequences into input vectors of the neural networks. Likewise, an output encoding method is used in the post-processing step to convert the neural network output to desired forms.
Molecular applications involve the processing of individual residues, sequence windows of n-consecutive residues, or the complete sequence string. Correspondingly the coding may be local involving only single or neighbouring residues in short sequence segment or global involving long-range relationship in full-length sequence or long sequence segment. The sequence encoding methods may be categorised as direct encoding or indirect encoding. Direct encoding converts each individual residue to a vector, whereas indirect coding provides overall information of a complete string. Direct encoding preserves positional information, but can only deal with fixed-length sequence windows. On the other hand, indirect encoding disregards the ordering information, but can be used for sequences of either fixed or variable lengths.
Direct Input Sequence Encoding
In direct encoding, each molecular residue can be represented by its identity or features. The most commonly used method for direct encoding involves the use of indicator vectors. The indicator vector is a vector of binary numbers that has only one unit turned on to indicate the identity of the corresponding residue. Here, a vector of 4 units with 3 zeros and a single 1 is required for a nucleotide, so the 4 nucleotides may be represented as 1000 (A), 0100 (T), 0010 (G), 0001 (C). The spacer residue may be represented as 0000, without an additional unit. Likewise, a vector of 20 input units is needed to represent an amino acid, or a vector of 21 units may be used to include a unit for the spacer in regions between proteins. These binary representations are dubbed BIN4, BIN20 and BIN21. The lengths of their input vectors are 4 x n, 20 x n, and 21 x n, respectively, where n is the total number of nucleotide or amino acid residues in the sequence window. An alternative to the sparse encoding scheme of BIN4 is a dense representation that uses two units for 4 nucleotides (e.g., 00 for A, 01 for T, 10 for G, and 11 for C). A comparative study, however, showed that the BIN4 coding was better, possibly due to its unitary coding matrix with identical Hamming distance between each pair of vectors.
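The BIN4 scheme described above can be sketched as a small lookup-and-concatenate routine (an illustrative sketch; the dictionary and function names are our own):

```python
# BIN4 indicator vectors: one unit per nucleotide, all zeros for the spacer.
BIN4 = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "C": [0, 0, 0, 1],
    "-": [0, 0, 0, 0],   # spacer residue
}

def encode_bin4(window):
    """Encode an n-residue sequence window as a flat input vector of length 4 x n."""
    vec = []
    for residue in window:
        vec.extend(BIN4[residue])
    return vec

# A 3-residue window yields a 12-unit input vector.
vec = encode_bin4("ATG")
```

Each residue contributes its indicator vector in order, so positional information is preserved, at the cost of requiring a fixed-length window.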
NETWORK DESIGN
While designing a network for application in genome informatics several factors have to be taken into consideration. The type of network architecture and learning algorithm chosen are closely linked to the type of application the network is to be involved in. To optimise the neural network design, important choices must be made for the selection of numerous parameters. Many of these are internal parameters that need to be tuned with the help of experimental results and experience with the specific application under study. The following discussion focuses specifically on back-propagation design choices for the learning rate, momentum term, activation function and termination criteria.
The selection of a learning rate is important in finding the true global minimum of the error distance. The convergence speed of back-propagation is directly related to the learning parameter. Too small a learning rate will slow down training progress, whereas too large a learning rate may simply produce oscillations between relatively poor solutions.
A momentum rate can be helpful in speeding convergence and avoiding local minima. In back propagation with momentum, the weight change is in a direction that is a combination of the current gradient and the previous gradient. Momentum allows the net to make reasonably large weight adjustments as long as the corrections are in the same general direction for several patterns, while using a smaller learning rate to prevent a large response to the error from any one training pattern. An activation function for a back propagation network should satisfy several important characteristics. It should be continuous and differentiable. For computational efficiency it should have a derivative that is easy to compute.
The back propagation network cannot, in general, be shown to converge, nor are there well defined stopping criteria. The training is usually stopped by a user-determined threshold value (tolerance) for the error function, or a fixed upper limit on the number of training iterations called epochs. The termination can also be based on the performance of a validation data set, used to monitor generalisation performance during learning and to terminate learning when there is no more improvement on the validation data.
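The weight-update rule with a momentum term can be written as a short sketch. This is illustrative only: the learning rate, momentum value and gradients are made-up numbers, and the function name is our own.

```python
def update_weights(weights, grads, prev_deltas, lr=0.2, momentum=0.9):
    """One back-propagation step with momentum: the weight change combines
    the step from the current gradient with the previous weight change."""
    new_weights, deltas = [], []
    for w, g, d_prev in zip(weights, grads, prev_deltas):
        d = -lr * g + momentum * d_prev   # blend current and previous direction
        new_weights.append(w + d)
        deltas.append(d)
    return new_weights, deltas

# When successive gradients point the same way, the steps grow,
# speeding convergence; opposing gradients damp the oscillation.
w, d = update_weights([0.5], [1.0], [0.0])   # first step: d = -0.2
w, d = update_weights(w, [1.0], d)           # same direction: step grows
```

With a constant gradient the step size approaches lr/(1 - momentum) times the plain gradient step, which is why momentum helps cross flat regions and shallow local minima.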
Self Organizing Map
A self-organizing map (SOM) is a type of neural network approach first proposed by Kohonen. SOMs have been used as a divisive clustering approach in many areas including genomics. Several groups have used SOMs to discover patterns in gene expression data. Tamayo and colleagues use self-organizing maps to explore patterns of gene expression generated using Affymetrix arrays, and provide the GENECLUSTER implementation of SOMs. Tamayo and his colleagues explain the implementation of SOM for expression level data as follows: "An SOM has a set of nodes with a simple topology and a distance function on the nodes. Nodes are iteratively mapped into k-dimensional 'gene expression space' (in which the ith coordinate represents the expression level in the ith sample)". A SOM assigns genes to a series of partitions on the basis of the similarity of their expression vectors to reference vectors that are defined for each partition. The summary of the basic SOM algorithm is perhaps best described in Quackenbush's review: "First, random vectors are constructed and assigned to each partition. Second, a gene is picked at random and, using a selected distance metric, the reference vector that is closest to the gene is identified. Third, the reference vector is then adjusted so that it is more similar to the vector of the assigned gene. The reference vectors that are nearby on the two-dimensional grid are also adjusted so that they are more similar to the vector of the assigned gene. Fourth, steps 2 and 3 are iterated several thousand times, decreasing the amount by which the reference vectors are adjusted and increasing the stringency used to define closeness in each step. As the process continues, the reference vectors converge to fixed values. Last, the genes are mapped to the relevant partitions depending on the reference vector to which they are most similar".
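The steps quoted above can be sketched as a toy SOM with a one-dimensional grid of nodes (the cited implementations use a two-dimensional grid; node count, learning rates and neighbourhood schedule here are illustrative choices, not from GENECLUSTER):

```python
import random

def som(genes, n_nodes=4, iters=5000, seed=0):
    """Toy SOM: assign each gene (expression vector) to its closest reference vector."""
    rng = random.Random(seed)
    dim = len(genes[0])

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Step 1: a random reference vector for each partition (node).
    refs = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for t in range(iters):
        x = rng.choice(genes)                                  # Step 2: pick a gene
        best = min(range(n_nodes), key=lambda j: sq_dist(x, refs[j]))
        alpha = 0.5 * (1 - t / iters)                          # shrinking adjustment
        radius = max(1, int(n_nodes / 2 * (1 - t / iters)))    # shrinking neighbourhood
        # Step 3: pull the winner and its grid neighbours toward the gene.
        for j in range(n_nodes):
            if abs(j - best) <= radius:
                refs[j] = [r + alpha * (xi - r) for r, xi in zip(refs[j], x)]
    # Last: map each gene to the node whose reference vector is most similar.
    return [min(range(n_nodes), key=lambda j: sq_dist(x, refs[j])) for x in genes]

# Two well-separated "expression profiles" end up in different partitions.
labels = som([[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]])
```

Step 4 is realised by decaying both the adjustment amount (alpha) and the neighbourhood radius as iterations proceed, so the reference vectors converge to fixed values.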
SOMs, like the gene shaving approach, have the distinct advantage that they allow a priori knowledge to be included in the clustering process. Tamayo explains this and other advantages of the SOM approach as follows: “The SOM has a number of features that make them particularly well suited to clustering and analyzing gene expression patterns. They are ideally suited to exploratory data analysis, allowing one to impose partial structure on the clusters (in contrast to the rigid structure of hierarchical clustering, the strong prior hypotheses used in Bayesian clustering, and the nonstructure of k-means clustering) facilitating easy visualization and interpretation. SOMs have good computational properties and are easy to implement, reasonably fast, and are scalable to large data sets”.
INTRODUCTION TO RBF
The essence of the difference between the operation of radial basis function networks and multilayer perceptrons can be seen in Fig. 4.3a. MLPs classify data by the use of hyperplanes that divide the data space into discrete areas; radial basis function networks cluster the data into a finite number of ellipsoid regions. Classification is then finding which ellipsoid is closest for a given test data point. The hidden units of a RBF network are not the same as those used for a MLP, and the weights between the input and hidden layer have different meanings. Transfer functions typically used include the Gaussian function, spline functions and various quadratic functions: they are all smooth functions, which taper off as distance from a centre point increases.
For radial basis function networks, each hidden unit represents the centre of a cluster in the data space. Input to a hidden unit in a RBF network is not the weighted sum of its inputs, but a distance measure: a measure of how far the input vector is from the centre of the basis function for that hidden unit. If x is an input vector and cj is the location vector, or centre of the basis function for hidden node j, the Euclidean distance between them is given by

Dj = ||x - cj|| = sqrt( sum_i (xi - cji)^2 )

The hidden node then computes its output as a function of the distance between the input vector and its centre. For the Gaussian RBF the hidden unit output is

hj = exp( -Dj^2 / (2 * sigma_j^2) )

where Dj is the Euclidean distance between the input vector and the location vector for hidden unit j, hj is the output of hidden unit j, and sigma_j is a measure of the size of cluster j (variance).
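A Gaussian RBF hidden unit as described above can be computed in a few lines (a minimal sketch; the function name is our own):

```python
import math

def rbf_hidden_output(x, centre, sigma):
    """Gaussian RBF hidden unit: output tapers off with distance from the centre."""
    d = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, centre)))  # Dj
    return math.exp(-d ** 2 / (2 * sigma ** 2))                      # hj

h = rbf_hidden_output([1.0, 2.0], [1.0, 2.0], 0.5)   # at the centre, h = 1.0
```

The output is 1 at the centre and decays smoothly with distance, with sigma controlling how quickly the unit's response falls off (the size of its cluster).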
DATA SETS
Genome sequences are composed of four kinds of nucleotides: Adenine, Cytosine, Guanine and Thymine (referred to as A, C, G, T). A contiguous segment of nucleotides is called a motif. For example, the sequence segment 'ATC' is a motif of length three. Given a motif length, the frequencies of the motifs compose the profile of a gene sequence, which is also called the genomic signature of the sequence. A sequence is divided into subsequences (windows) of a fixed size, and genomic signatures are built from each window. It has been shown that the "intergenomic differences" of signatures are generally higher than "intragenomic differences". Consequently, features of signatures from a given genome can be learned with a properly designed neural network. In the set of experiments from "Practice of Neural Networks for Genome Signature Analysis" by Liangyou Chen and Lois C. Boggess, a fixed window size of 100 bases is used to divide the genome sequences, and a motif length of three nucleotides is used for generation of signatures for each 100-base sequence window. An example signature is captured in the following table:
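The signature construction described above can be sketched as follows. This is an illustrative reading of the description, not the cited authors' code; in particular, counting motifs as overlapping and normalising to frequencies are our assumptions.

```python
from collections import Counter

def signature(window, motif_len=3):
    """Frequency profile of overlapping motifs in one sequence window."""
    counts = Counter(window[i:i + motif_len]
                     for i in range(len(window) - motif_len + 1))
    total = sum(counts.values())
    return {motif: n / total for motif, n in counts.items()}

def windows(sequence, size=100):
    """Divide a genome sequence into fixed-size windows."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]

# In 'ATCATCATC' the motif 'ATC' occurs 3 times among 7 overlapping motifs.
sig = signature("ATCATCATC")
```

Each 100-base window would yield one such frequency vector, which can then serve as an input pattern for the neural network.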
CONCLUSION
Artificial neural networks thus have paved the way for automatic analysis of biological data. The simultaneous analysis of millions of genes at a time has driven the new field of computational biology—Genome informatics.
Application of artificial neural networks in Genome informatics has great significance in this arena. The artificial neural network attains its knowledge through the process of learning, which is similar to its inspiration---the Human brain.
But, like all clouds with a silver lining, genomic engineering can be misused too, which can be a real threat to mankind. Let us hope that man puts his brain to the welfare of his fellow beings and not against them.