25-08-2017, 09:32 PM
BioPerl Tutorial
I. Introduction
I.1 Overview
Bioperl is a collection of perl modules that facilitate the development of perl scripts for bio-informatics
applications. As such, it does not include ready to use programs in the sense that many commercial
packages and free web-based interfaces (e.g. Entrez, SRS) do. On the other hand, bioperl does provide
reusable perl modules that facilitate writing perl scripts for sequence manipulation, accessing of
databases using a range of data formats and execution and parsing of the results of various molecular
biology programs including Blast, clustalw, TCoffee, genscan, ESTscan and HMMER. Consequently,
bioperl enables developing scripts that can analyze large quantities of sequence data in ways that are
typically difficult or impossible with web based systems.
In order to take advantage of bioperl, the user needs a basic understanding of the perl programming
language including an understanding of how to use perl references, modules, objects and methods. If
these concepts are unfamiliar the user is referred to any of the various introductory / intermediate books
on perl. (I’ve liked S. Holzner’s Perl Core Language, Coriolis Technology Press, for example). This
tutorial is not intended to teach the fundamentals of perl to those with little or no experience in the perl
language. On the other hand, advanced knowledge of perl - such as how to write a perl object - is not
required for successfully using bioperl.
Bioperl is open source software that is still under active development. The advantages of open source
software are well known. They include the ability to freely examine and modify source code and
exemption from software licensing fees. However, since open source software is typically developed by
a large number of volunteer programmers, the resulting code is often not as clearly organized and its
user interface not as standardized as in a mature commercial product. In addition, in any project under
active development, documentation may not keep up with the development of new features.
Consequently the learning curve for actively developed, open source software is sometimes
steep.
This tutorial is intended to ease the learning curve for new users of bioperl. To that end the tutorial
includes:
Descriptions of what bio-informatics tasks can be handled with bioperl
Directions on where to find the methods to accomplish these tasks within the bioperl package
Recommendations on where to go for additional information.
A separate tutorial script (tutorial.pl - located in the top bioperl directory) with examples of many
of the methods described in the tutorial.
Running the tutorial.pl script while going through this tutorial - or better yet, stepping through it with an
interactive debugger - is a good way of learning bioperl. The tutorial script is also a good place from
which to cut-and-paste code for your scripts (rather than using the code snippets in this tutorial). The
tutorial script should work on your machine - and if it doesn’t it would probably be a good idea to find
out why, before getting too involved with bioperl!
This tutorial does not intend to be a comprehensive description of all the objects and methods available
in bioperl. For that the reader is directed to the documentation included with each of the modules as well
as the additional documentation referred to below.
I.2 Software requirements
I.2.1 Minimal bioperl installation
For a ‘‘minimal’’ installation of bioperl, you will need to have perl itself installed as well as the bioperl
‘‘core modules’’. Bioperl has been tested primarily using perl 5.005 and more recently perl 5.6. The
minimal bioperl installation should still work under perl 5.004. However, as increasing numbers of
bioperl objects are using modules from CPAN (see below), problems have been observed for bioperl
running under perl 5.004. So if you are having trouble running bioperl under perl 5.004, you should
probably upgrade your version of perl.
In addition to a current version of perl, the new user of bioperl is encouraged to have access to, and
familiarity with, an interactive perl debugger. Bioperl is a large collection of complex interacting
software objects. Stepping through a script with an interactive debugger is a very helpful way of seeing
what is happening in such a complex software system - especially when the software is not behaving in
the way that you expect. The free graphical debugger ptkdb (available as Devel::ptkdb from CPAN) is
highly recommended. ActiveState offers a commercial graphical debugger for Windows systems. The
standard perl distribution also contains a powerful interactive debugger - though with a more
cumbersome (command line) interface.
I.2.2 Complete installation
Taking full advantage of bioperl requires software beyond that for the minimal installation. This
additional software includes perl modules from CPAN, bioperl perl extensions, a bioperl xs-extension,
and several standard compiled bioinformatics programs.
Perl extensions
The following perl modules, available from bioperl (http://bioperlCore/external.shtml) or from
CPAN (http://www.perlCPAN/), are used by bioperl. The listing also indicates what bioperl
features will not be available if the corresponding CPAN module is not downloaded. If these modules
are not available (e.g. on non-unix operating systems), the remainder of bioperl should still function
correctly.
For accessing remote databases you will need:
File-Temp-0.09
IO-String-1.01
For accessing Ace databases you will need:
AcePerl-1.68
For remote blast searches you will need:
libwww-perl-5.48
Digest-MD5-2.12
HTML-Parser-3.13
libnet-1.0703
MIME-Base64-2.11
URI-1.09
IO-stringy-1.216
For xml parsing you will need:
libxml-perl-0.07
XML-Node-0.10
XML-Parser-2.30
XML-Writer-0.4
expat-1.95.1 from http://sourceforgeprojects/expat/
For more current and additional information on external modules required by bioperl, check
http://bioperlCore/external.shtml
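The idea that a feature is simply unavailable when its supporting modules are missing can be checked from a script before relying on that feature. The following is a minimal sketch, not bioperl's own mechanism: the helper names have_module and missing_for are hypothetical, and the choice of one representative module per distribution (for example LWP::UserAgent for libwww-perl) is an assumption made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Features from the listing above, each mapped to representative modules
# from the distributions it needs. The mapping is an assumption made for
# this sketch, not something bioperl itself defines.
my %feature_needs = (
    'remote database access' => [ 'File::Temp', 'IO::String' ],
    'Ace database access'    => [ 'Ace' ],
    'remote blast searches'  => [ 'LWP::UserAgent', 'Digest::MD5', 'HTML::Parser', 'URI' ],
    'xml parsing'            => [ 'XML::Parser', 'XML::Writer' ],
);

# Return 1 if the named module can be loaded, 0 otherwise.
sub have_module {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Return the modules that are missing for a given feature.
sub missing_for {
    my ($feature) = @_;
    return grep { !have_module($_) } @{ $feature_needs{$feature} || [] };
}

for my $feature (sort keys %feature_needs) {
    my @missing = missing_for($feature);
    if (@missing) {
        print "$feature: unavailable (missing @missing)\n";
    }
    else {
        print "$feature: available\n";
    }
}
```

Running the script prints one line per feature, so it is easy to see which optional parts of the installation are still incomplete.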
Bioperl C extensions & external bio-informatics programs
Bioperl also uses several C programs for sequence alignment and local blast searching. To use these
features of bioperl you will need an ANSI C or GNU C compiler as well as the actual programs, available
from sources such as:
for smith-waterman alignments - bioperl-ext-0.6 from http://bioperlCore/external.shtml
for clustalw alignments - http://corba.ebi.ac.uk/Biocatalog/Alignm...tware.html
for tcoffee alignments - http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html
for local blast searching - ftp://ncbi.nlm.nih.gov/blast
I.3 Installation
The actual installation of the various system components is accomplished in the standard manner:
Locate the package on the network
Download
Decompress (with gunzip or a similar utility)
Unpack the file archive (e.g. with tar -xvf)
Create a ‘‘makefile’’ (with ‘‘perl Makefile.PL’’ for perl modules or a supplied ‘‘install’’ or
‘‘configure’’ program for non-perl programs)
Run ‘‘make’’, ‘‘make test’’ and ‘‘make install’’
This procedure must be repeated for every CPAN module, bioperl-extension and external module to be
installed. A helper module CPAN.pm is available from CPAN which automates the process for installing
the perl modules.
For the external programs (clustal, Tcoffee, ncbi-blast), there is an extra step:
Set the relevant environment variable (CLUSTALDIR, TCOFFEEDIR or BLASTDIR) to the
directory holding the executable in your startup file - e.g. in .bashrc. (For running local blasts, it is
also necessary that the name of local-blast database directory is known to bioperl. This will
typically happen automatically, but in case of difficulty, refer to the documentation for
StandAloneBlast.pm)
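Before running a script that depends on one of these external programs, it can help to confirm that the variables are actually visible to perl. The following is a small sketch under stated assumptions: the helper check_program_dir is hypothetical (it is not part of bioperl), and it only tests that each variable is set and names an existing directory, not that the executable inside it works.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report on one environment variable.
# Returns 'unset', 'not a directory' or 'ok'.
# The optional second argument is a hash reference used in place of %ENV,
# which makes the function easy to try out without changing the environment.
sub check_program_dir {
    my ($var, $env) = @_;
    $env ||= \%ENV;
    my $dir = $env->{$var};
    return 'unset' unless defined $dir && length $dir;
    return 'not a directory' unless -d $dir;
    return 'ok';
}

# The three variable names come from the installation step above.
for my $var (qw(CLUSTALDIR TCOFFEEDIR BLASTDIR)) {
    printf "%-11s %s\n", $var, check_program_dir($var);
}
```

If a variable is reported as unset even though it appears in your startup file, the file has probably not been re-read by the shell from which the script was started.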
The only likely complication (at least on unix systems) that may occur is if you are unable to obtain
system level writing privileges. For instructions on modifying the installation in this case and for more
details on the overall installation procedure, see the README file in the bioperl distribution as well as
the README files in the external programs you want to use (e.g. bioperl-ext, clustalw, TCoffee,
NCBI-blast).
I.4 Additional comments for non-unix users
Bioperl has mainly been developed and tested under various unix environments (including Linux) and
this tutorial is intended primarily for unix users. The minimal installation of bioperl *should* work
under other OS’s (NT, windows, MacOS). However, bioperl has not been widely tested under these OS’s
and problems have been noted in the bioperl mailing lists. In addition, many bioperl features require the
use of CPAN modules, compiled extensions or external programs. These features probably will not
work under some or all of these other operating systems. If a script attempts to access these features
from a non-unix OS, bioperl is designed to simply report that the desired capability is not available.
However, since the testing of bioperl in these environments has been limited, the script may well crash
in a less ‘‘graceful’’ manner.
Todd Richmond has written of his experiences with BioPerl on MacOs at
http://bioperlCore/mac-bioperl.html
II. Brief introduction to bioperl’s objects
The purpose of this tutorial is to get you using bioperl to solve real-life bioinformatics problems as
quickly as possible. The aim is not to explain the structure of bioperl objects or perl object-oriented
programming in general. Indeed, the relationships among the bioperl objects are not simple; however,
understanding them in detail is fortunately not necessary for successfully using the package.
Nevertheless, a little familiarity with the bioperl object ‘‘bestiary’’ can be very helpful even to the
casual user of bioperl. For example there are (at least) six different ‘‘sequence objects’’ - Seq,
PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI. Understanding the relationships among these
objects - and why there are so many of them - will help you select the appropriate one to use in your
script.
II.2 Sequence objects: (Seq, PrimarySeq, LocatableSeq, LiveSeq, LargeSeq, SeqI)
Seq is the central sequence object in bioperl. When in doubt this is probably the object that you want to
use to describe a dna, rna or protein sequence in bioperl. Most common sequence manipulations can be
performed with Seq. These capabilities are described in sections III.3.1 and III.7.1.
Seq objects can be created explicitly (see section III.2.1 for an example). However usually Seq objects
will be created for you automatically when you read in a file containing sequence data using the SeqIO
object. This procedure is described in section III.2.1. In addition to storing its identification labels and
the sequence itself, a Seq object can store multiple annotations and associated ‘‘sequence features’’.
This capability can be very useful - especially in development of automated genome annotation systems,
see section III.7.1.
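As a quick illustration of creating a Seq object explicitly, here is a minimal sketch. It assumes the bioperl core modules are installed; the helper seq_summary, the id 'example1' and the six-base sequence are made up for this example. Only the Bio::Seq calls new, display_id, length, revcom and seq are used, and the helper returns nothing if Bio::Seq cannot be loaded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a Seq object by hand and collect a few of its properties.
# Returns a hash reference, or nothing if Bio::Seq cannot be loaded.
sub seq_summary {
    my ($id, $residues) = @_;
    return unless eval { require Bio::Seq; 1 };

    my $seq = Bio::Seq->new(-id => $id, -seq => $residues);
    return {
        id     => $seq->display_id,    # the identifying label
        length => $seq->length,        # number of residues
        revcom => $seq->revcom->seq,   # reverse complement as a string
    };
}

my $summary = seq_summary('example1', 'ATGGCC');
if ($summary) {
    print "$summary->{id}: length $summary->{length}, ",
          "reverse complement $summary->{revcom}\n";
}
else {
    print "Bio::Seq could not be loaded; is bioperl installed?\n";
}
```

In a real script the sequence would more often come from a file read with SeqIO, as the text above notes, rather than from a string typed into the code.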
On the other hand, if you need a script capable of handling many (hundreds or
thousands of) sequences at a time, then the overhead of adding annotations to each sequence can be
significant. For such applications, you will want to use the PrimarySeq object. PrimarySeq is basically a
‘‘stripped down’’ version of Seq. It contains just the sequence data itself and a few identifying labels (id,
accession number, molecule type = dna, rna, or protein). For applications with hundreds or thousands of
sequences, using PrimarySeq objects can significantly speed up program execution and decrease the
amount of RAM the program requires.
The LocatableSeq object is just a Seq object which has ‘‘start’’ and ‘‘end’’ positions associated with it.
It is used by the alignment object SimpleAlign and other modules that use SimpleAlign objects (e.g.
AlignIO, pSW). In general you don’t have to worry about creating LocatableSeq objects because they
will be made for you automatically when you create an alignment (using pSW, Clustalw, Tcoffee or
bl2seq) or when you input an alignment data file using AlignIO. However if you need to input a sequence
alignment by hand (e.g. to build a SimpleAlign object), you will need to input the sequences as
LocatableSeqs.
A LargeSeq object is a special type of Seq object used for handling very long (e.g. > 100 MB)
sequences. If you need to manipulate such long sequences see section III.7.2 which describes LargeSeq
objects.
A LiveSeq object is another specialized object for storing sequence data. LiveSeq addresses the problem
of features whose location on a sequence changes over time. This can happen, for example, when
sequence feature objects are used to store gene locations on newly sequenced genomes - locations which
can change as higher quality sequencing data becomes available. Although a LiveSeq object is not
implemented in the same way as a Seq object, LiveSeq does implement the SeqI interface (see below).
Consequently, most methods available for Seq objects will work fine with LiveSeq objects. Section
III.7.2 contains further discussion of LiveSeq objects.
SeqI objects are Seq ‘‘interface objects’’ (see section II.4). They are used to ensure bioperl’s
compatibility with other software packages. SeqI and other interface objects are not likely to be relevant
to the casual bioperl user.
* Having described these other types of sequence objects, the ‘‘bottom line’’ still is that if you store
your sequence data in Seq objects (which is where they’ll be if you read them in with SeqIO), you will
usually do just fine. *
II.3 Alignment objects (SimpleAlign, UnivAln)
There are two ‘‘alignment objects’’ in bioperl: SimpleAlign and UnivAln. Both store an array of
sequences as an alignment. However their internal data structures are quite different and converting
between them - though certainly possible - is rather awkward. In contrast to the sequence objects - where
there are good reasons for having 6 different classes of objects, the presence of two alignment objects is
just an unfortunate relic of the two systems having been designed independently at different times.
Since each object has some capabilities that the other lacks it has not yet been feasible to unify bioperl’s
sequence alignment methods into a single object (see section III.5.4 for a description of SimpleAlign’s
and UnivAln’s features). However, recent development in bioperl involving alignments has been
focused on using SimpleAlign and the new user should generally use SimpleAlign where possible.
II.4 Interface objects and implementation objects
Since release 0.6, bioperl has been moving to separate interface and implementation objects. An
interface is solely the definition of what methods one can call on an object, without any knowledge of
how it is implemented. An implementation is an actual, working implementation of an object. In
languages like Java, interface definition is part of the language. In Perl, you have to roll your own.
In bioperl, the interface objects usually have names like Bio::MyObjectI, with the trailing I indicating it
is an interface object. The interface objects mainly provide documentation on what the interface is, and
how to use it, without any implementations (though there are some exceptions). Although interface
objects are not of much direct utility to the casual bioperl user, being aware of their existence is useful
since they are the basis to understanding how bioperl programs can communicate with other
bioinformatics projects such as Ensembl and the Annotation Workbench (see section IV).
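To make the interface / implementation split concrete, here is one way of ‘‘rolling your own’’ interface in plain perl. This is only a sketch of the general technique, not bioperl's actual code: the packages My::SequenceI, My::Sequence and My::Broken and the function describe are hypothetical names invented for the example, following the trailing-I naming convention described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Interface object: names the methods but implements none of them.
package My::SequenceI;

sub seq { _abstract(shift, 'seq') }
sub id  { _abstract(shift, 'id') }

# Called when an implementation has not overridden a method.
sub _abstract {
    my ($self, $method) = @_;
    die ref($self) . " does not implement $method()\n";
}

# Implementation object: inherits from the interface and fills it in.
package My::Sequence;
our @ISA = ('My::SequenceI');

sub new {
    my ($class, %args) = @_;
    return bless { seq => $args{-seq}, id => $args{-id} }, $class;
}
sub seq { $_[0]{seq} }
sub id  { $_[0]{id} }

# An incomplete implementation, to show what the interface reports.
package My::Broken;
our @ISA = ('My::SequenceI');
sub new { return bless {}, shift }

package main;

# Code written against the interface works with any implementation of it.
sub describe {
    my ($obj) = @_;
    die "not a My::SequenceI object\n" unless $obj->isa('My::SequenceI');
    return $obj->id . ': ' . $obj->seq;
}

print describe(My::Sequence->new(-id => 'test1', -seq => 'ACGT')), "\n";

# Calling an unimplemented method fails with a clear message.
eval { My::Broken->new->seq };
print "Error: $@" if $@;
```

The point of the design is that describe only relies on the method names promised by the interface, so a different implementation (for example one backed by a database) could be passed in without changing it.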