Bioinformatics analysis of mass spectrometry-based proteomics data sets
1. Introduction
Biological systems function via intricately orchestrated cellular
processes in which various cellular entities – RNAs, metabolites
and proteins participate in a tightly regulated manner. Proteins
are at the ‘executive core’ of these cellular events, and their altered
behaviors have been implicated in myriad disease pathologies,
which also makes them by far the major class of drug targets.
Therefore, understanding the structure, dynamics and interactions
of proteins has been at the heart of biomedical research from its
very inception. Due to limitations of biochemical methods and allied
technologies, such studies have traditionally been carried out
on single proteins, rather than the entire population of expressed
proteins in a cell or tissue, the ‘proteome’ [1]. The discipline of proteomics
was initially equated with two-dimensional gel electrophoresis,
a low resolution technology that can only analyze the
most abundant proteins in a sample. In recent years, mass spectrometry (MS) has become a powerful technology to study proteins on a large scale [2,3]. Combined with innovative experimental strategies [4] and advances in computational methods [5,6], MS-based proteomics now enables global study of cellular proteomes.
This relatively novel development has led to a surge of qualitative
and quantitative data at the proteome level, which has posed
analytical challenges hitherto unseen by protein researchers. The
mapping of complex proteomics data to biological processes has
become impossible by manual means, and computer-aided data analysis is essential for further progress of the field. Proteomics is today at the same crossroads that genomics
was at a decade ago in terms of tackling these challenges. Bioinformatics,
the scientific field dealing with analyzing large numbers of
genes or their transcripts, in fact emerged largely from that challenge
[7]. It has evolved to deal with a multitude of different biological
data types and should now be well-equipped to aid
proteomics [8]. Indeed, proteomics researchers are already actively
collaborating with bioinformaticians for comprehensive functional
analysis and systematic knowledge mining of complex data sets.
We subscribe to the definition of bioinformatics as a means for functional analysis and data mining of data sets, leading to biologically
interpretable results and insights. In this review we highlight
recent advances, results, and challenges in proteome-based bioinformatics research. Thus the scope of our review is downstream
of the related and partially overlapping field of ‘computational proteomics’
which blends mathematical, computational and statistical
algorithms to address key problems related to protein identification
and quantitation from raw mass spectrometry data.
2. Bioinformatics for qualitative proteomics
Until a few years ago proteomics was largely a qualitative discipline.
The proteomic experiment typically consisted of identifying
as many proteins as possible in a protein complex, organelle or cell
or tissue lysate. In the course of obtaining the protein identities of
any protein mixture, an enzymatic digestion step is usually employed,
yielding a large collection of proteolytic peptides that are
then analyzed by ‘shotgun’ proteomics [9]. This is illustrated in the upper part of Fig. 1.
[FEBS Letters 583 (2009) 1703–1712. doi:10.1016/j.febslet.2009.03.035. © 2009 Federation of European Biochemical Societies. Published by Elsevier B.V. Corresponding author: M. Mann, mmann[at]biochem.mpg.de.]
The peptide inventory, which can include
normal peptides or peptides bearing post-translational modifications
such as phosphorylation, has been bioinformatically analyzed
for various purposes as symbolized by the left part of the figure.
The focus of qualitative proteomics was on the correctness and the depth of analysis, but the result of the experiment was typically simply a list of proteins. As proteome catalogs grew larger, they became unyielding to manual analysis due to the sheer number of proteins, and the immediate challenge was
to obtain biological insights into the system being studied (right
part of Fig. 1).
The peptides created and collated in shotgun proteomics projects
are not functional biological entities and they are therefore
usually only of interest for the technology of proteomics itself.
Nevertheless, they can be mined for physicochemical and amino acid residue patterns using machine learning approaches, which
form the basis for various classification and prediction routines.
This can be useful, for example, in predicting which peptides are
likely to be detected in proteomics experiments and which are unique to the parent protein – so-called proteotypic peptides [10,11].
These peptides can then be specifically targeted by specialized
mass spectrometric techniques such as multiple reaction monitoring
(MRM) during an analysis in which one is only interested in
monitoring the levels of selected proteins [12]. The ‘PeptideAtlas’
has been created for this purpose and extracts proteotypic peptides
and their associated fragmentation spectra from a large number of
submitted proteomic data sets [13].
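The featurization step that underlies such detectability predictors can be sketched in a few lines. This is a minimal illustration, not the method of [10,11]: the feature set (length, Kyte-Doolittle hydropathy, charged-residue fraction) is deliberately small, and a real predictor would feed many more properties into a trained classifier.

```python
# Sketch of physicochemical featurization for peptide detectability prediction.
# Illustrative only: published proteotypic-peptide predictors use far richer
# feature sets and a trained classifier on top of them.

# Kyte-Doolittle hydropathy values per amino acid residue
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def peptide_features(seq):
    """Length, mean hydropathy (GRAVY score) and charged-residue fraction."""
    gravy = sum(KD[aa] for aa in seq) / len(seq)
    charged = sum(seq.count(aa) for aa in "DEKR") / len(seq)
    return {"length": len(seq), "gravy": gravy, "charged_fraction": charged}
```

Vectors like these, computed for peptides that were and were not observed, form the training input for the classification routines mentioned above.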
Peptides sequenced in proteomics projects can be mapped onto
the positions in the genome that code for them. In this way the
peptides provide evidence that the gene is actually expressed
and is not, for example, a pseudogene. This is important because
a large fraction of the predicted genes in the genomes of eukaryotes
do not yet have any direct experimental protein information
associated with them. Peptide atlases have been used to find novel
transcripts, and to refine gene models, in principle leading to augmented
genome and proteome annotations [14,15]. A new sub-discipline
of bioinformatics called ‘comparative proteogenomics’ has
now emerged from such endeavors, which proposes to harness
MS-based proteomics data sets in conjunction with DNA sequence
data sets for large-scale genome and proteome annotation [16]. So
far efforts in this area have mined large-scale and usually low resolution
data. However, in our opinion, genome annotation should
only be done with very high accuracy data that has extremely low error rates. Such data can now readily be produced by the latest generation of high precision mass spectrometers [3].
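The core coordinate arithmetic of such peptide-to-genome mapping can be sketched as follows. This is a deliberately simplified example assuming an intronless coding sequence on the forward strand; real proteogenomic pipelines must handle splicing, frames and both strands. The function name and layout are our own illustration.

```python
def peptide_genomic_coords(protein_seq, peptide, cds_start):
    """Map an identified peptide onto genomic coordinates, assuming an
    intronless CDS beginning at cds_start (0-based, forward strand).
    Returns half-open (start, end) nucleotide coordinates, or None if
    the peptide does not occur in the protein."""
    i = protein_seq.find(peptide)
    if i < 0:
        return None
    start = cds_start + 3 * i          # three nucleotides per residue
    return (start, start + 3 * len(peptide))
```

A peptide located this way supplies direct protein-level evidence for the overlapping gene model, as discussed above.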
Additionally, the peptide identification information can serve as
a very rough indication of protein quantity in the sample. The basic
idea is that the abundance of each protein scales with the number
of identified peptides. One approach, termed peptide or spectral
counting uses the number of times that peptides belonging to a
protein are fragmented as a proxy for its abundance [17,18].
Kislinger et al. elucidated the proteome of six mouse tissues in this
way, followed by bioinformatics analysis of tissue specific proteome
function and regulation [19]. In a related approach, Ishihama
et al. showed that the absolute amount of protein in the sample
studied correlates with the exponential of the Protein Abundance
Index (peptides observed by MS divided by peptides that are
potentially observable) [20]. In a more advanced approach Lu
et al. used the peptide sampling information in a Bayesian framework
to define absolute protein expression measurements (APEX).
This measure provided an estimate of the relative contributions of
transcriptional and translational-level gene regulation in yeast and
Escherichia coli [21]. Nevertheless, these semi-quantitative approaches
are falling out of favor because modern, high resolution instruments allow direct comparison of peptide signals across experiments, in so-called ‘label-free’ and much more accurate quantitative proteomics (see below).

Fig. 1. Bioinformatics analysis paths for qualitative proteomics. The peptide inventory identified by ‘shotgun proteomics’ (left part of figure) can be mined for patterns by machine learning approaches – such as proteotypic peptides. The identified peptides can be mapped to genomic coordinates for identifying novel ORFs and for augmenting genome annotations. Post-translationally modified (PTM) peptides such as phosphopeptides can be analyzed for sequence motifs. The right hand side shows analysis directions after the peptide identifications are consolidated into protein identifications. The proteins can be examined for their physicochemical properties to uncover MS sampling and identification biases towards acidic or basic proteins or high molecular weight proteins in a sample, for example. On the functional level these proteins can be integrated with annotational databases like Gene Ontology and PFAM to find enriched biological processes, functions, cellular components and protein domains. Additionally these proteins can be mapped to network and pathway databases (STRING, KEGG) to visualize them in their modular functional contexts.
Mass spectrometry is especially well suited to analyzing post-translational modifications (PTMs) on peptides. In contrast to
unmodified peptides, these PTM-bearing peptides are of great biological
interest because they reveal sites of functional changes to a
protein. Large-scale studies of phosphorylation, ubiquitination,
acetylation and many other PTMs are now possible, especially if
the modified peptides can be specifically enriched with respect
to unmodified ones [22–24]. In phosphorylation studies, the site
specific information is used to extract enriched sequence motifs, which in turn provide insights into proteome regulation by upstream kinases, modular protein domain mediated interactions, and also a basis for prediction of novel PTM sites [25–27]. Recently,
an approach called NetworKIN was reported that mines large-scale
phosphorylation data sets in the context of the protein–protein
interaction network topology to predict kinase substrates in phosphorylation
networks [28]. The analysis of PTMs by mass spectrometry
– especially in a quantitative format as described below
– is increasing exponentially and will be one of the main contributions
of MS-based proteomics to biology.
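The first step of such motif extraction, aligning a fixed-width sequence window around each modified residue and tallying residue frequencies per position, can be sketched as follows. This is a simplified illustration (function names are our own); motif tools such as those in [25–27] add statistical significance testing on top of these counts.

```python
from collections import Counter

def site_window(seq, pos, flank=6):
    """Sequence window centered on the modified residue at index pos;
    '_' pads windows that extend past the protein termini."""
    left = seq[max(0, pos - flank):pos].rjust(flank, "_")
    right = seq[pos + 1:pos + 1 + flank].ljust(flank, "_")
    return left + seq[pos] + right

def position_frequencies(windows):
    """Residue counts at each aligned position across a set of site windows;
    positions with skewed counts hint at a kinase recognition motif."""
    return [Counter(w[i] for w in windows) for i in range(len(windows[0]))]
```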
Several types of bioinformatic analyses are almost invariably
performed on a measured proteome. Analyzing proteins for sequence
features like transmembrane domains and signal peptides
can provide clues about features of the studied sample – for instance
in cases where the membrane proteome is enriched, or where
one needs to determine the fraction of secreted proteins [29]. This
also helps in ascertaining and correcting experimental or MS-identification
related sampling biases in proteome catalogs [30]. For
example, Shi et al. used a kernel density based approach to correlate
the isoelectric point (pI) and molecular weight feature space of
an MS-based mouse liver proteome to show that it was much less biased than earlier 2D-gel based studies, and was largely representative of the complete mouse proteome (with a Pearson correlation of 0.98) [31]. The same approach was subsequently applied to
compare in-gel digestion and isoelectric focusing separation methods
in a study of a Drosophila cell line [32].
On the level of protein catalogs, the bioinformatics analysis typically
involves integration of proteome data with annotational databases,
such as Gene Ontology (GO) [33], protein domains
(InterPro, PFAM) [34] and pathway databases (KEGG) [35] – to determine if any of these properties are over- or underrepresented.
This type of analysis may directly yield functional insights into the
data set and is easily accomplished using standard tools. DAVID
[36], GoMiner [37], Cytoscape [38] plug-ins like BINGO [39] are
examples of readily available software that can be used. In our laboratory,
we typically employ Bioconductor [40] within the R statistical
platform [41]. This requires some more programming
experience but offers broader capabilities and flexibility in
analysis.
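The over-representation test at the heart of these tools is typically a hypergeometric (one-sided Fisher's exact) test: given N background proteins of which K carry an annotation, how surprising is it to see k annotated proteins among the n identified? A minimal stand-alone version, not the implementation of any particular tool, looks like this:

```python
from math import comb

def enrichment_p(k, n, K, N):
    """Hypergeometric upper-tail p-value P(X >= k): the probability of
    drawing at least k annotated proteins when sampling n proteins from a
    background of N, of which K carry the annotation of interest."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)
```

In practice one such test is run per GO term, domain or pathway, followed by multiple-testing correction across all categories tested.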
Adachi et al. studied the proteome of 3T3-L1 adipocytes and
performed advanced bioinformatic mining of a qualitative data
set [42]. The proteome was first mapped to GO, KEGG and InterPro
databases, providing insights into adipocyte biology and functions.
Statistical enrichment tests were performed to find significant
over-represented GO and InterPro categories, which in turn related
to signal transduction, redox system, protein transport, translation,
transcription, protein degradation, fatty acid synthesis, and
phospholipid biosynthesis; all characteristic functions of adipocytes.
Putative biological functions were assigned to more than
50% of un-annotated proteins identified in 3T3-L1 cells through sequence
similarity based annotation transfer [43]. Additionally, a
novel tool for functional association of proteins to protein-models
prototypical of the function of interest (insulin mediated vesicular
traffic in this case) was employed [44]. This led to the association
of several proteins in the adipocyte proteome with this function,
at least one of which was later independently validated [45].
Combining proteomic data sets with complementary ‘omics’
data sets such as transcriptome data can reveal interesting facets
of cellular functions. In an early example, Mootha et al. compared
the mitochondrial proteome with tissue microarray data to show that, for mitochondrial proteins on a bulk level, mRNA expression levels correlate with protein detection and abundance.
Exploiting the much more readily available microarray data sets,
they also found that mitochondrial proteins show tissue specific
patterns of expression and regulation to a much greater extent
than previously recognized [46]. Furthermore, they characterized
key transcriptional regulators of mitochondria organelle biogenesis
using expression neighborhood analysis, which identified these
proteins by the co-regulation of their messages with the messages
corresponding to the mitochondrial proteome.
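The bulk mRNA-protein comparisons above rest on the Pearson correlation coefficient, which is easy to compute from paired abundance lists. A self-contained sketch (in real analyses the abundances are usually log-transformed first):

```python
def pearson(xs, ys):
    """Pearson correlation between paired abundance measurements,
    e.g. mRNA levels (xs) versus protein levels (ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```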
In a mouse liver organellar proteome study Foster et al. used
regulatory motif analysis to elucidate key players mediating organelle
biogenesis [47]. Calvo et al. integrated protein identification
information in mitochondria along with a collection of functional
genomics data sets in a naive Bayesian method to predict novel
mitochondrial candidates in humans [48]. Recently, Pagliarini
et al. used tissue wide mitochondrial proteome data from mouse
as an input for phylogenetic profiling across 42 eukaryotic species
and identified 19 new candidates of respiratory chain complex I (CI). One of them, C8orf38, was directly implicated in a lethal CI deficiency [49]. Graumann et al. for the first time compared the mouse
embryonic stem cell (mESC) proteome with a genome wide chromatin
state map of mESC to show near perfect correlation between
protein expression and the presence of active rather than repressive
chromatin marks [50]. Mapping proteome data to pathway
databases like KEGG and network databases like STRING [51],
MINT/IntAct [52] and HPRD [53] can provide valuable clues about
the presence of signaling pathways and functionally interacting
modules of interest.
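The naïve Bayesian integration used in studies such as Calvo et al. boils down to combining independent evidence sources multiplicatively on the odds scale. The sketch below illustrates the principle only; it is not the published method, and the numbers in the test are invented.

```python
def naive_bayes_posterior(prior, likelihood_ratios):
    """Naive Bayesian evidence integration: posterior odds equal prior odds
    times the product of per-source likelihood ratios (each source, e.g.
    proteomic detection or co-expression, assumed conditionally independent)."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)
```

For example, a protein with a low prior probability of being mitochondrial can still reach a high posterior if several data sets each shift the odds in its favor.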
3. The nature of quantitative proteomics data
Functional insight most often requires quantitative comparison
between two or more biological states. In the last few years proteomics
has been catapulted into the realm of high-throughput ‘omics’
technologies, mainly due to significant advances in two aspects: first, the development of accurate methods of proteome-wide quantitation, and second, the development of computational proteomics algorithms and software for efficiently harnessing these quantitative proteomic data.
While mass spectrometry is not inherently quantitative, this
limitation has been successfully overcome by introduction of stable
isotopes into the molecules to be identified. Stable isotope
labeling has been done for decades in small molecule MS and can
be performed in proteomics either by chemical modification of
peptides after tryptic digestion or by metabolic labeling of intact
proteins during cell culture [54]. For example, iTRAQ is a commonly
used technique in quantitative proteomics in which amino
groups of peptides (lysine side chain and the N-terminus) are
chemically labeled by isotopically different forms of the derivatizing
agent. The most widespread metabolic labeling technique is SILAC
(stable isotope labeling by amino acids in cell culture) [55,56].
As the name indicates, SILAC incorporates the heavy labeled amino
acid into the entire proteome in the course of normal cell metabolism
and proliferation. SILAC therefore does not require any chemical
derivatization. It is generally considered the most accurate
quantitation strategy because all peptides of a protein are labeled
and because processing of proteins occurs after samples have already
been combined and therefore cannot contribute to any quantitation error. In our laboratory, the SILAC approach has enabled
comprehensive quantitation of the yeast proteome [57], and now
facilitates routine measurement of expression changes of 4000–
6000 proteins in more complex eukaryotic cells [32,50].
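Conceptually, SILAC protein quantitation reduces to forming heavy/light intensity ratios per peptide pair and summarizing them robustly per protein. The following is a bare sketch of that idea (the median summarization and the data layout are illustrative, not the MaxQuant algorithm):

```python
from statistics import median

def protein_ratio(peptide_intensities):
    """Heavy/light protein ratio as the median of peptide-level ratios.
    peptide_intensities: list of (light, heavy) intensity pairs for the
    peptides assigned to one protein; pairs missing a channel are skipped."""
    ratios = [h / l for l, h in peptide_intensities if l > 0 and h > 0]
    return median(ratios) if ratios else None
```

Because every peptide of a protein carries the label, many independent ratio estimates are available per protein, which is one reason for SILAC's accuracy.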
Alternatively, protein quantification without the use of isotopic
labels is emerging as a practical approach in MS-based proteomics.
This ‘label-free quantitation’ is especially important where isotope
labeling is not feasible or scalable, for example in many instances
in which tissues or clinical samples are measured. A precursor of
quantitative proteomics for patient samples is the SELDI method,
in which a low resolution MALDI spectrum is taken to be indicative
of the state of the proteome [58]. However, despite an extensive
literature on patient classification using these patterns, the accuracy
of such results and of the underlying data have been questioned
[59,60]. In contrast, high resolution instruments are now
making it possible to directly compare the integrated peptide ion
signals between experiments. With the inclusion of sophisticated MS data (signal) processing algorithms and advanced statistical procedures, accurate label-free proteomics appears to be becoming feasible, which may herald the beginning of successful clinical and in vivo tissue proteomics endeavors.
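The peptide ion signal compared in label-free quantitation is usually the area under an extracted ion chromatogram (XIC). Its core computation is a simple numerical integration over retention time; the sketch below omits the peak detection, alignment and normalization steps that real label-free pipelines require.

```python
def xic_area(times, intensities):
    """Trapezoidal area under an extracted ion chromatogram: the summed
    peptide ion signal across its elution profile, used as the
    label-free abundance measure for that peptide."""
    return sum((times[i + 1] - times[i]) * (intensities[i] + intensities[i + 1]) / 2
               for i in range(len(times) - 1))
```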
A typical modern proteomic experiment generates gigabytes of data. While advanced algorithms for protein identifications
have been in vogue for nearly a decade [61,62], robust algorithms
for extracting protein quantitation information from the
multidimensional MS data structures have only recently started
to emerge. Retrieving protein identification and quantitation from
MS data is an intensive multi-level algorithmic endeavor, now studied
in the above mentioned sub-discipline of ‘computational proteomics’.
It spans a gamut of computational, statistical and machine
learning algorithms especially applied or developed for peptide
and protein identification and quantitation. In comparison to efforts
aimed at microarray data, the lack of standardized and comprehensive
quantitation software for MS data has been one of the major challenges and bottlenecks for proteomics. Empirical methods like
spectral counting have been employed for protein quantitation
(see above). Even methods that took the great complexity of proteomics data [63] into account were inherently of low accuracy because they were developed for low resolution MS data. Therefore,
they fail to deliver when MS data is highly resolved and fine grained
as generated by the latest generation of mass spectrometers. Our
laboratory has developed MaxQuant – a suite of integrated algorithms
specifically designed for high resolution, quantitative MS
data based on state-of-the-art data reduction, correlation analysis
and graph theory [6]. Mueller et al. recently reviewed other existing
computational proteomics frameworks and software [64].
Fig. 2. Quantitative data generation in MS-based proteomics. Contemporary mass spectrometry based proteomics combines sophisticated experimental strategies, advanced MS instrumentation and computational proteomics platforms to generate high dimensional data sets. These data sets can come from isotopically labeled samples such as SILAC (left side of the figure), which are typically used to follow temporal trajectories of cellular signaling events or in comparative proteomic phenotyping experiments across two or more levels. Proteomes that were not isotopically labeled (right side of the figure), like tissue or clinical samples, can be analyzed by a ‘label-free’ approach. Computational software such as MaxQuant enables parallel processing of these complex data sets, generating a multidimensional data matrix which contains a wealth of information on peptide and protein identity, PTMs and their quantitative ratios.
These computational proteomics frameworks need to handle
the very large size of MS data sets that can readily be generated today.
Complex experimental schemes need to be accommodated –
enabling parallel processing of samples generated by isotope-labeled
and non-labeled proteomic experiments in replicates and
cross-over (isotope-swapping) experiments (Fig. 2). Accurate quantitation, associated statistics, and quality control metrics need to be generated and reported for thousands of proteins in each project.
One end product of such endeavors is a matrix containing expression values of thousands of proteins across many conditions, but also containing information pertaining to the peptides identified, their uniqueness to the protein and so on. Such
matrices are much more complex than data structures generated
by microarrays, and pose at least as difficult analysis and interpretation
challenges that accrue from their high dimensionality [65].
Furthermore, the wealth of proteomic information needs to be
mapped onto existing biological knowledge to generate new insights.
The general task for bioinformatics in this context is to provide
the framework for systematic knowledge mining of such
proteomics data sets thereby mapping them back onto their biological
context.
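A toy representation of the protein-by-condition matrix described above makes the filtering problem concrete. The row layout, field names and thresholds here are purely illustrative; real matrices from software such as MaxQuant carry many more annotation columns.

```python
# Hypothetical slice of a quantitation matrix: per-protein expression ratios
# across conditions plus peptide-level evidence (layout is illustrative).
matrix = {
    "ProteinA": {"ratios": [1.1, 0.9, 1.0], "unique_peptides": 5},
    "ProteinB": {"ratios": [2.0, None, 1.8], "unique_peptides": 1},
}

def quantifiable(matrix, min_unique=2):
    """Keep proteins quantified in every condition and supported by a
    minimum number of unique peptides - a typical first filtering step
    before downstream functional analysis."""
    return {p: row for p, row in matrix.items()
            if row["unique_peptides"] >= min_unique
            and all(r is not None for r in row["ratios"])}
```

Filters of this kind are the entry point to the knowledge-mining steps discussed above, since only reliably quantified proteins should be mapped onto biological context.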