PEPTIDE SEARCH ENGINES
- Genaro Pimienta

- Jun 11, 2024
- 8 min read
Updated: Feb 14
In this blogpost I talk about peptide search engines—the algorithms used in MS-proteomics to assign peptide sequences to raw mass spectra.
For those not familiar with proteomics jargon, the abbreviations used throughout the text are:
Data-dependent acquisition (DDA)
Data-independent acquisition (DIA)
Mass spectrometry-based proteomics (MS-proteomics), AKA shotgun proteomics
Peptide fragmentation spectra (MS2)
Precursor ion spectra (MS1)
Peptide-spectrum match (PSM)
Target-decoy strategy (TDS)
Trans-Proteomic Pipeline (TPP)
SOME NOTES BEFORE WE START
When performing shotgun (untargeted) proteomics experiments, two data collection approaches are available:
Data-dependent acquisition (DDA)
Data-independent acquisition (DIA)
In this and the following blogposts, I will focus on the DDA approach, and for convenience, will refer to it as shotgun proteomics (Figure 1).
Below is a figure with an oversimplified explanation of a typical bottom-up shotgun proteomics workflow. I have explained what bottom-up proteomics is in my previous blog post: “The Pre-Proteomics Era”.
A proteomics platform is a suite of algorithms, which can include one or more peptide search engines, often interfaced with statistical analysis tools, and visualization algorithms.
The most popular proteomics platforms and their corresponding peptide search engines are:
Proteome Discoverer™ — SEQUEST™ HT and Mascot®.
Crux — Comet and X!Tandem.
MaxQuant — Andromeda.
FragPipe — MSFragger.
Trans-Proteomic Pipeline — Comet, SEQUEST, Mascot®, and X!Tandem.

Figure 1. Bottom-up shotgun proteomics with data collected in DDA mode. The bottom-up approach is the most common shotgun proteomics workflow, and for it to happen, proteins must be digested with a site-specific protease (typically trypsin). Shown in this cartoon is a simplified illustration of a capillary column with an electrospray ionization tip. In DDA mode, the mass spectrometer scans the ionized peptides (precursor ions) entering the instrument and selects the most abundant ones for fragmentation in a higher-energy collisional dissociation (HCD) cell filled with neutral gas molecules (e.g., argon) — other dissociation strategies can be used. The precursor ions selected for fragmentation are isolated when passing through a quadrupole with a narrow isolation window (0.5-1.5 Da). The precursor ion isotopic cluster and the fragment ion masses are determined by a mass analyzer, in this case an Orbitrap. The cartoons of peptide sequences shown in the bottom left of the figure are hypothetical examples of tryptic peptides, one of them bearing an alkylated cysteine. The red line in the precursor ion isotopic cluster inset indicates the monoisotopic ion isolated by the mass spectrometer for downstream HCD fragmentation.
SEQUEST
Published in 1994, SEQUEST was the first algorithm that could interpret shotgun proteomics data in an automated manner.
The concept was simple and elegant. SEQUEST compared the fragment ion masses predicted from theoretical peptides in a protein database to those measured in a shotgun proteomics experiment. The best peptide-spectrum matches (PSMs) were shortlisted based on a significance score (cross-correlation score), and the corresponding peptide sequences were used to infer protein identifications. Figure 2 below is a simplified explanation of the PSM workflow.
At the time of its publication, SEQUEST was a major advancement in computational shotgun proteomics. Conceptually, the PSM approach paved the way for the many search engines to come.
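The PSM idea can be illustrated with a toy script: predict singly charged b- and y-ion masses for candidate peptides, then count how many fall within a tolerance of the observed peaks. This is a deliberately simplified shared-peak count, not SEQUEST's actual cross-correlation (XCorr) score; the residue masses are real monoisotopic values, but the peptides and the spectrum are hypothetical.

```python
# Toy PSM scoring: count theoretical fragment masses that land within a
# tolerance of an observed peak. A simplified shared-peak score, NOT
# SEQUEST's cross-correlation.

# Monoisotopic residue masses (Da) for an illustrative subset of amino acids
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "T": 101.04768, "E": 129.04259, "K": 128.09496}
WATER, PROTON = 18.01056, 1.00728

def fragment_mz(peptide):
    """Singly charged b- and y-ion m/z values for a peptide."""
    ions = []
    for i in range(1, len(peptide)):
        b = sum(RESIDUE[aa] for aa in peptide[:i]) + PROTON
        y = sum(RESIDUE[aa] for aa in peptide[i:]) + WATER + PROTON
        ions.extend([b, y])
    return ions

def shared_peak_score(peptide, observed_mz, tol=0.5):
    """Number of theoretical fragments matching an observed peak within tol (Da)."""
    return sum(any(abs(f - o) <= tol for o in observed_mz)
               for f in fragment_mz(peptide))

# Hypothetical "measured" spectrum that should favor "PEPTK" over "GASVK"
spectrum = fragment_mz("PEPTK")
best = max(["GASVK", "PEPTK"], key=lambda p: shared_peak_score(p, spectrum))
print(best)
```

Note that "GASVK" still scores one shared peak: both peptides end in K, so their y1 ions coincide, a small taste of why discriminating true from false PSMs needs more than peak counting.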

Figure 2. Simplified description of the PSM workflow. The PSM concept pioneered by SEQUEST can be explained in seven steps, performed iteratively for every experimental precursor ion (MS1) and its associated fragmentation pattern (MS2) in an input raw data file. Step 1 - MS1 mass and charge determination. The mass and charge values of each MS1 spectrum extracted from the input dataset are calculated. Step 2 - Theoretical MS1 prediction. The proteins in the proteome database are digested in silico assuming trypsin specificity, and the mass and charge values of the resulting peptides are calculated (a different specificity is used if a protease other than trypsin was used in the experiment). Step 3 - Set of theoretical MS1 masses per experimental MS1. Theoretical peptides with mass values sufficiently close to those of each experimental MS1 are selected for downstream MS2 fragmentation prediction. Step 4 - MS2 prediction. The amino acid sequences associated with the theoretical MS1 selected in Step 3 undergo in silico fragmentation as per the chemical principles assumed by the Mobile Proton Hypothesis. “The Mobile Proton Hypothesis in Fragmentation of Protonated Peptides: A Perspective” — 2010. Step 5 - PSM. Each experimental MS2 spectrum is cross-correlated to the set of in silico MS2 spectra obtained from Steps 3 and 4. Step 6 - Peptide sequence assignment. The amino acid sequence associated with the best PSM from Step 5 is assigned to the corresponding experimental MS1. Step 7 - Protein inference. The peptide sequences obtained are used to infer protein identities. Conceptually, the protein inference problem is nontrivial due to the presence of protein isoforms and homologues, which share overlapping peptide sequences.
In the figure, the amino acids K/R are highlighted in red to indicate that the peptide sequences are tryptic. In the left panel, the middle inset labeled PSM is composed of red, black and blue lines. The black and blue lines correspond to experimental and predicted spectra, respectively. The red lines indicate ion masses that matched in the PSM step. The right panel shows a hypothetical example of the "protein inference problem". The experimental and predicted MS2 spectra are compared in mirror image for convenience.
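Step 2 of the workflow (in silico digestion) is straightforward to sketch. Below is a minimal, hypothetical implementation of tryptic digestion using the common rule of cleaving C-terminal to K/R except when the next residue is P; real search engines additionally handle semi-tryptic peptides, peptide length and mass limits, and more elaborate missed-cleavage bookkeeping.

```python
import re

def tryptic_digest(protein, missed_cleavages=0):
    """In silico trypsin digestion: cleave after K/R, but not before P."""
    # Zero-width split at every K/R not followed by P (requires Python >= 3.7)
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        # Join up to `missed_cleavages` adjacent fragments
        for j in range(i, min(i + missed_cleavages + 1, len(pieces))):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides

# A short hypothetical protein sequence
print(tryptic_digest("MKWVTFISLLRGPK"))     # fully tryptic peptides
print(tryptic_digest("MKWVTFISLLRGPK", 1))  # plus one missed cleavage
print(tryptic_digest("AKPLR"))              # K before P is not cleaved
```

Each peptide produced here would then get a theoretical mass and, if it falls close to an experimental MS1 mass, an in silico fragmentation pattern (Step 4).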
Despite its innovative value, the first version of SEQUEST was computationally taxing and lacked a scoring function to discriminate false from true PSM assignments.
Many improvements to SEQUEST were made in the following years. By 2007 SEQUEST could be parallelized and its PSM scoring was more stringent.
These improvements were implemented by the private sector and academia alike, resulting in two widely used versions of SEQUEST.
SEQUEST™ HT, which is the default search engine in Proteome Discoverer™, a proteomics platform distributed by Thermo Fisher Scientific.
Comet, an optimized version of SEQUEST, which runs under two different platforms: Crux, and the Trans-Proteomic Pipeline (TPP).
A summary of SEQUEST’s evolution is nicely described in the following review article:
“The SEQUEST Family Tree” — 2015
PROBABILISTIC SEARCH ENGINES
The “probabilistic generation” of search engines started in 1999, with the publication of Mascot®, developed by Matrix Science.
Mascot® inherited the cross-correlational concept used by SEQUEST to compute PSMs. What changed was the implementation of a probability score to estimate the statistical significance of the PSM-derived peptide sequence assignments.
Mascot® quickly became popular, thanks to its probabilistic score, its support for parallelization on computing clusters, and its user-friendly visualization tools.
The problem with Mascot® was that the details of its probabilistic model remained undisclosed, preventing an objective evaluation of the algorithm’s performance and its comparison with other peptide search engines.
In response to Mascot’s undisclosed algorithm, several open-source probabilistic search engines were developed in the early 2000s.
Table 1 below summarizes the most popular search engines, which are based on the cross-correlational concept pioneered by SEQUEST.

ERROR RATE CONTROL
Despite Mascot's acceptable performance, it soon became clear that much more needed to be done to better control peptide and protein identification error rates.
Important improvements to search engine workflows were developed during 2002-2007. I mention below the three most important:
Machine learning algorithms for PSM rescoring
PeptideProphet
ProteinProphet
“A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry” — 2003
Percolator
“Semi-supervised learning for peptide identification from shotgun proteomics datasets” — 2007
Concatenated target-decoy strategy (TDS) to control the false discovery rate during the PSM search.
The Trans-Proteomic Pipeline (TPP), a suite of algorithms for the statistical validation of search engine results based on the XML file format.
It should be mentioned that the TPP was the first proteomics platform to centralize the use of machine learning algorithms, like PeptideProphet, ProteinProphet and Percolator, and many other tools for statistical data exploration and validation. TPP is still in use and is compatible with many search engines, including Comet, SEQUEST, Mascot® and X!Tandem. http://www.tppms.org/
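The target-decoy strategy mentioned above can be sketched in a few lines: generate decoys (here by sequence reversal, one common construction), search targets and decoys together, and estimate the FDR at a score threshold as the ratio of decoy to target hits above it. The PSM scores below are made up for illustration.

```python
# Toy concatenated target-decoy sketch. Decoys are reversed target
# sequences; the FDR at a threshold is estimated as decoys / targets.

def make_decoy(peptide):
    """One common decoy construction: reverse the sequence, keep the C-terminal residue."""
    return peptide[:-1][::-1] + peptide[-1]

def estimate_fdr(psms, threshold):
    """psms: list of (score, is_decoy). FDR ~ decoy hits / target hits above threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

# Hypothetical PSM scores from a concatenated target+decoy search
psms = [(9.1, False), (8.7, False), (7.9, False), (7.5, True),
        (6.8, False), (6.2, True), (5.9, False), (5.5, True)]
print(make_decoy("ELVISK"))       # reversed-sequence decoy
print(estimate_fdr(psms, 7.0))    # 1 decoy / 3 targets above 7.0
```

The assumption doing the work is that decoy matches behave like false target matches; as discussed in the Afterthoughts below, that assumption is exactly what large-scale and open searches have put under strain.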
MAXQUANT/ANDROMEDA
The proteomics platform MaxQuant was developed in 2008. The goal was to have a proteomics platform that took full advantage of high-mass accuracy and resolution data, generated by Orbitrap mass analyzers, which became available in 2005.
Because it lacked a peptide search engine, MaxQuant was bundled with Mascot® for PSM searches. The MaxQuant/Mascot® pipeline was used to analyze the first comprehensively quantified proteome: Saccharomyces cerevisiae.
“Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast” — 2008
But bundling Mascot® with MaxQuant was nontrivial, and this limited its use to a handful of reference laboratories.
MaxQuant became fully functional when Andromeda, a search engine with unique features, was added to its structure in 2011.
Shortly after its release, the MaxQuant-Andromeda pipeline became widely used by the proteomics community; in 2011 it was used to analyze the Human Proteome Draft.
FRAGPIPE/MSFRAGGER
The proteomics platform FragPipe and its search engine MSFragger were published in 2017. FragPipe has continued to be improved and is currently among the few traditional search engines featuring deep learning algorithms. https://fragpipe.nesvilab.org/
FragPipe can be considered a hybrid of the Trans-Proteomic Pipeline and MaxQuant, primarily for three reasons:
It takes in raw data in the XML file format
It uses the machine learning algorithms PeptideProphet, ProteinProphet and Percolator to refine the peptide-spectrum match score
It takes advantage of isotopic resolution to extract precursor ion features and recalibrate masses
The FragPipe/MSFragger pipeline has two unique attributes:
Ultrafast PSM searches by means of fragment ion indexing
Can perform open search PSMs to discover unexpected peptide modifications
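Fragment ion indexing, the key to MSFragger's speed, can be illustrated with a toy index: every theoretical fragment mass is binned once, up front, so each observed peak looks up its candidate peptides directly instead of scanning every peptide's fragment list. The bin width, peptide names, and masses below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Toy fragment ion index: map mass bin -> peptides with a fragment in that bin.
BIN_WIDTH = 0.02  # Da; a narrow bin consistent with high-accuracy MS2 data

def build_index(peptide_fragments):
    """peptide_fragments: {peptide: [fragment masses]} -> {bin: set of peptides}."""
    index = defaultdict(set)
    for peptide, fragments in peptide_fragments.items():
        for mz in fragments:
            index[round(mz / BIN_WIDTH)].add(peptide)
    return index

def count_matches(index, observed_mz):
    """Tally, per peptide, how many observed peaks hit an indexed fragment bin."""
    hits = defaultdict(int)
    for mz in observed_mz:
        for peptide in index.get(round(mz / BIN_WIDTH), ()):
            hits[peptide] += 1
    return dict(hits)

# Hypothetical theoretical fragments for two candidate peptides
theoretical = {"PEPA": [98.06, 227.10, 324.16],
               "PEPB": [98.06, 215.14, 344.20]}
index = build_index(theoretical)
print(sorted(count_matches(index, [98.06, 227.10, 324.16]).items()))
```

Because the index is keyed by fragment mass rather than precursor mass, the same lookup works when the precursor tolerance is opened to hundreds of daltons, which is what makes open searches for unexpected modifications tractable.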
AFTERTHOUGHTS
When stress-tested with ultralarge datasets or in “open search” mode, Proteome Discoverer™, MaxQuant and FragPipe have, each in a different way, opened a Pandora's box of unanticipated issues with false discovery rate discrimination.
MaxQuant and Proteome Discoverer™.
The completion of the Human Proteome Draft by two independent teams in 2014, one using MaxQuant and the other Proteome Discoverer™, revealed weaknesses in false discovery rate calculations, especially at the protein level, when analyzing very large datasets.
FragPipe.
The publication describing MSFragger revealed contradictory peptide identification results when “open” and “closed” PSM workflows were compared. These results suggested weaknesses in the target-decoy search concept for FDR control, widely adopted since its proposal in 2007.
The above has reignited a reevaluation of false discovery rate estimation methods, including alternatives to the currently used decoy strategy.
Recent preprints on bioRxiv address this issue.
This is a complex and very important topic; addressing it is nontrivial, and it deserves to be discussed in a separate blogpost.
Stay tuned!
GPR