18-01-2013, 01:48 PM
Theoretical Bioinformatics and Machine Learning
Introduction
This course is part of the curriculum of the master of science in bioinformatics at the Johannes
Kepler University Linz. Machine learning has a major application in biology and medicine and
many fields of research in bioinformatics are based on machine learning. For example, one of the
most prominent bioinformatics textbooks “Bioinformatics: The Machine Learning Approach” by
P. Baldi and S. Brunak (MIT Press, ISBN 0-262-02506-X) sees the foundation of bioinformatics
in machine learning.
Machine learning methods, for example neural networks used for the secondary and 3D structure
prediction of proteins, have proven their value as essential bioinformatics tools. Modern measurement
techniques in both biology and medicine create a huge demand for new machine learning
approaches. One such technique is the measurement of mRNA concentrations with microarrays,
where the data is first preprocessed, then genes of interest are identified, and finally predictions
are made. In other examples, DNA data is integrated with other complementary measurements in order
to detect alternative splicing, nucleosome positions, gene regulation, etc. All of these tasks are performed
by machine learning algorithms. Alongside neural networks, the most prominent machine
learning techniques are support vector machines, kernel approaches, projection methods, and
belief networks. These methods provide noise reduction, feature selection, structure extraction,
classification and regression, and they assist in modeling. In the biomedical context, machine learning
algorithms predict cancer treatment outcomes based on gene expression profiles, classify novel
protein sequences into structural or functional classes, and extract new dependencies between DNA
markers (SNPs, single nucleotide polymorphisms) and diseases (e.g. schizophrenia or alcohol dependence).
In this course the most prominent machine learning techniques are introduced and their mathematical
foundations are shown. However, because of the restricted space, neither the mathematical nor the
practical details are presented in full. Only a few selected applications of machine learning in biology and
medicine are given, as the focus is on understanding the machine learning techniques. If the
techniques are well understood, then new applications will arise, old ones can be improved, and
the methods which best fit the problem can be selected.
Basics of Machine Learning
The conventional approach to solving problems with the help of computers is to write programs
which solve the problem. In this approach the programmer must understand the problem, find
a solution appropriate for the computer, and implement this solution on the computer. We call
this approach deductive because the human deduces the solution from the problem formulation.
However, in biology, chemistry, biophysics, medicine, and other life science fields a huge amount
of data is produced which is hard for humans to understand and interpret. A solution to a
problem may also be found by a machine which learns. Such a machine processes the data and
automatically finds structures in the data, i.e. it learns. The knowledge about the extracted structure
can be used to solve the problem at hand. We call this approach inductive. Machine learning is
about inductively solving problems by machines, i.e. computers.
Researchers in machine learning construct algorithms that automatically improve a solution
to a problem with more data. In general, the quality of the solution increases with the amount of
problem-relevant data which is available.
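The effect of more data can be illustrated with a deliberately simple toy problem: estimating an unknown quantity from noisy measurements. The sketch below is not taken from the course; the true value, the noise level, and the function name mean_abs_error are assumptions chosen for illustration only.

```python
import random
import statistics

def mean_abs_error(n, rng, true_value=2.0, noise=1.0, repeats=200):
    """Average error of the sample mean computed from n noisy measurements."""
    errors = []
    for _ in range(repeats):
        # Draw n synthetic measurements around the (assumed) true value.
        sample = [rng.gauss(true_value, noise) for _ in range(n)]
        errors.append(abs(statistics.fmean(sample) - true_value))
    return statistics.fmean(errors)

rng = random.Random(0)  # fixed seed so the run is reproducible
errors_by_size = {n: mean_abs_error(n, rng) for n in (5, 50, 500)}
for n, err in errors_by_size.items():
    print(f"n={n:4d}  mean absolute error={err:.3f}")
```

In this toy setting the average error shrinks as the number of measurements grows, which is the behaviour described above. Real learning problems are more complex, and the improvement depends on the data actually being relevant to the problem.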
Problems solved by machine learning methods include classifying observations, predicting
values, structuring data (e.g. clustering), compressing data, visualizing data, filtering data, selecting
relevant components from data, extracting dependencies between data components, modeling
the data generating systems, constructing noise models for the observed data, and integrating data from
different sensors.
Introductory Example
In the following we will consider a classification problem taken from “Pattern Classification”,
Duda, Hart, and Stork, 2001, John Wiley & Sons, Inc. In this classification problem salmon must
be distinguished from sea bass given pictures of the fish. The goal is an automated system that is
able to separate the fish in a fish-packing company, where salmon and sea bass are sold. We
are given a set of pictures for which experts have stated whether the fish in the picture is a salmon or a sea
bass. This set, called the training set, can be used to construct the automated system. The objective
is that future pictures of fish can be used to automatically separate salmon from sea bass, i.e. to
classify the fish. Therefore, the goal is to correctly classify the fish in the future on unseen
data. The performance on future novel data is called generalization. Thus, our goal is to maximize
the generalization performance.
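The distinction between the training set and future data can be sketched in a few lines. The example below is only an illustration under strong assumptions: each picture is reduced to a single hypothetical feature (called lightness here), the feature values are synthetic random numbers, and the classifier is a simple threshold rule. None of these choices come from the text above.

```python
import random

def make_fish(n, rng):
    """Generate n synthetic (lightness, label) pairs; all numbers are invented."""
    data = []
    for _ in range(n):
        if rng.random() < 0.5:
            data.append((rng.gauss(3.0, 1.0), "salmon"))
        else:
            data.append((rng.gauss(6.0, 1.0), "sea bass"))
    return data

def error_rate(data, threshold):
    """Fraction misclassified by the rule: lightness < threshold means salmon."""
    wrong = sum(1 for x, label in data
                if ("salmon" if x < threshold else "sea bass") != label)
    return wrong / len(data)

def fit_threshold(train):
    """Pick the candidate threshold with the lowest error on the training set."""
    candidates = sorted(x for x, _ in train)
    return min(candidates, key=lambda t: error_rate(train, t))

rng = random.Random(0)
train = make_fish(200, rng)   # pictures labeled by the experts
test = make_fish(1000, rng)   # stands in for future, unseen pictures

threshold = fit_threshold(train)
train_error = error_rate(train, threshold)
test_error = error_rate(test, threshold)
print(f"threshold={threshold:.2f}  train error={train_error:.3f}  test error={test_error:.3f}")
```

The error on the held-out set serves as an estimate of the generalization performance, since those examples were not used to choose the threshold.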
Supervised and Unsupervised Learning
In the previous example a human expert characterized the data, i.e. supplied the label (the class).
Tasks where the desired output for each object is given are called supervised, and the desired
outputs are called targets. This term stems from the fact that during learning a model can obtain
the correct value from the teacher, the supervisor.
If data has to be processed by machine learning methods where the desired output is not given,
then the learning task is called unsupervised. In a supervised task one can immediately measure
how well the model performs on the training data, because the optimal outputs, the targets, are given.
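The difference between the two settings can be made concrete with a small sketch. It assumes synthetic one-dimensional data from two groups, a fixed threshold model for the supervised case, and a basic 2-means clustering for the unsupervised case; these choices and all names are illustrative and not taken from the course.

```python
import random

rng = random.Random(1)
# Synthetic 1-D inputs from two groups (invented numbers).
inputs = ([rng.gauss(0.0, 1.0) for _ in range(100)]
          + [rng.gauss(5.0, 1.0) for _ in range(100)])
# Targets are available only in the supervised setting.
targets = [0] * 100 + [1] * 100

def model(x, threshold=2.5):
    """A fixed, hypothetical classifier used only to show error measurement."""
    return 1 if x > threshold else 0

# Supervised: with targets the training error can be computed directly.
training_error = sum(model(x) != t for x, t in zip(inputs, targets)) / len(inputs)

def two_means(xs, steps=20):
    """Unsupervised: find two cluster centers using the inputs alone."""
    c0, c1 = min(xs), max(xs)
    for _ in range(steps):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1

centers = two_means(inputs)
print(f"training error={training_error:.3f}  cluster centers={centers[0]:.2f}, {centers[1]:.2f}")
```

Note that the clustering never looks at the targets, so there is no directly comparable error measure for it; judging its result requires some other criterion.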
Reinforcement Learning
There are machine learning methods which do not fit into the unsupervised/supervised classification.
For example, with reinforcement learning the model has to produce a sequence of outputs
based on inputs but only receives a signal, a reward or a penalty, at the end of the sequence or during the sequence.
Each output influences the world in which the model, the actor, is located.
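The interaction described above can be sketched as a loop between an actor and its world. The example below shows only the interaction pattern with a reward at the end of the sequence, not a learning algorithm. The one-dimensional world, the goal position, and the two policies are hypothetical and chosen purely for illustration.

```python
import random

def run_episode(policy, rng, length=10, goal=4):
    """Run one sequence; the reward arrives only after the last output."""
    position = 0                          # state of the hypothetical world
    for _ in range(length):
        action = policy(position, rng)    # the actor's output: -1 or +1
        position += action                # the output changes the world
    # Single signal at sequence end: reward (+1) or penalty (-1).
    return 1.0 if position >= goal else -1.0

def random_policy(position, rng):
    """Actor that steps left or right at random."""
    return rng.choice([-1, 1])

def right_policy(position, rng):
    """Actor that always steps to the right."""
    return 1

rng = random.Random(0)
episodes = 1000
random_return = sum(run_episode(random_policy, rng) for _ in range(episodes)) / episodes
right_return = sum(run_episode(right_policy, rng) for _ in range(episodes)) / episodes
print(f"average reward: random policy={random_return:.2f}, right policy={right_return:.2f}")
```

No individual output is labeled as correct or incorrect here; the actor only learns at the end whether the whole sequence was successful, which is what separates this setting from supervised learning.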