THE PEPTIDE-SPECTRUM MATCH

Genaro Pimienta
Jul 2, 2024
6 min read

Updated: Feb 14

In my previous blogpost “PROTEOMICS SEARCH ENGINES” I wrote about peptide search engines in the context of the data-dependent acquisition (DDA) approach used in shotgun proteomics.

In this blogpost I discuss the strategies used by peptide search engines to mitigate the false discovery rate when analyzing DDA data.

For those not familiar with proteomics jargon, the abbreviations used throughout the text are:

Data-dependent acquisition (DDA)

False discovery rate (FDR)

Fragment ion masses (MS2)

Mass spectrometry-based proteomics (MS-proteomics), AKA shotgun proteomics

Peptide-spectrum match (PSM)

Post-translational modification (PTM)

Precursor ion (MS1)

Target-decoy competition (TDC)

Target-decoy strategy (TDS)

ORBITRAP MASS SPECTROMETERS

Commercialized in 2005, Orbitrap mass spectrometers made available the level of resolution and mass accuracy, required for isotope-level quantitative proteomics applications.

“The Orbitrap: a new mass spectrometer” — 2005

“The coming age of complete, accurate, and ubiquitous proteomes” — 2013

The field also benefited from the development of MaxQuant (2008) and its in-built search engine Andromeda (2011), which was the first proteomics platform dedicated to high-mass accuracy datasets.

MaxQuant also provided, for the first time, a platform that could preprocess large volumes of spectra with isotopic resolution.

“MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification” — 2008

“Andromeda: a peptide search engine integrated into the MaxQuant environment” — 2011

Combining Orbitrap technology and the MaxQuant platform contributed to two markedly breakthroughs in the proteomics field:

1. The first publications reporting ~10,000 proteins in cell lines using shotgun proteomics.

“Deep proteome and transcriptome mapping of a human cancer cell line” — 2011

“Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins” — 2012

2. The publication of a draft of the human proteome from the analysis of thousands of shotgun proteomics datasets collected from different tissues.

“A draft of the human proteome” — 2014

“Mass-spectrometry-based draft of the human proteome” — 2014

But the sudden surge of large-scale datasets raised an alarm among some experts in the proteomics community: false positive identifications were bound to accumulate due to the high-throughput nature of MS-proteomics and the impracticality of verifying PSM assignments by hand.

“Target-decoy approach and false discovery rate: when things may go wrong” — 2011

“The potential cost of high-throughput proteomics” — 2012

“The problem with peptide presumption and the downfall of targe-decoy false discovery rates” — 2012

Truly said, a reanalysis of the data from the two human proteome drafts revealed weaknesses in the target-decoy strategy (TDS) implemented to control the false discovery rate (FDR) at the protein level.

“Analyzing the First Drafts of the Human Proteome” — 2014

“The potential clinical impact of the release of two drafts of the human proteome” — 2015

But science is, in most cases, self-corrective.

Innovative adjustments to the traditional TDS were soon proposed, and the subject continues to be a focus of attention.

“A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets” — 2015

FALSE DISCOVERY RATE CONTROL

A proteomics spectra dataset is composed of mass/charge (m/z) values of the peptide ions (MS1) and their corresponding fragment ions (MS2), and among other things, their abundance intensities.

During the peptide-spectrum match (PSM) workflow, a search engine extracts MS1 and MS2 features from the dataset, and compare them to a repertoire of in silico counterparts, which are predicted from a reference proteome database.

The reference proteome database is composed of protein sequences (target) from the proteome of interest. These are concatenated to a decoy version of the proteome database, which has the protein sequences scrambled or reversed in silico.

The PSM search is prone to erroneous peptide sequence assignments, even when using stringent probability scores to estimate the probability that a PSM is significant.

Ideally, the error rate would be corrected with a manual inspection of the results, but, considering the volume of information generated, this task is unfeasible in shotgun proteomics.

Instead, search engines use the target decoy strategy (TDS), which is an indirect way of estimating the FDR. While not perfect, this approach is considered by many the best was to calculate the PSM error rate.

The TDS is based on the following assumptions:

PSMs to target and decoy sequences have an equal probability of occurrence
PSMs to decoy sequences are infrequent random events
Decoy sequences can be thought of as surrogates of false positive hits

Based on these assumptions, it is possible to estimate the false discovery rate (FDR) by dividing the number of decoys PSMs by the portion of target PSMs.

While not the only way to estimate the FDR, this strategy is the most common.

Figure 1 below provides a schematic explanation of how the PSM workflow helps estimating the FDR when coupled to the TDS.

Figure 1. The PSM workflow. The search engine extracts theoretical peptides from the target-decoy database and predicts their fragmentation patterns, based on specified protease specificity, mass shifts induced by amino acid modifications, and collisional fragmentation rules. I explain the PSM workflow as having three steps. Step 1 - Theoretical mass selection. Theoretical peptides are chosen for the PSM workflow if their masses match the one calculated for an MS1 spectrum. A narrow mass tolerance window (5-10 ppm) is used for this to assure specificity. Step 2 - PSM prediction. Experimental MS2 spectra are predicted from the selected theoretical peptides. Step 3 - The target-decoy competition (TDC). PMS from target and decoy MS2 spectra receive a probabilistic score and the one with the highest value is chosen for peptide sequence assignment. To estimate the FDR, the number of decoy matches is divided by the proportion of target ones. The estimated FDR is used to establish a probability score threshold in the target-decoy competition workflow. If for example, a 1% FDR is desired, then the PSM score cutoff will be one that only allows 1% of decoy PSMs.

AFTERTHOUGHT

It is important to point out that while error rate control is purely a computational procedure, a lot can be done at the experimental level to help minimize the occurrence of spurious PSM errors during data analysis: sample preparation, mass spectrometry data collection, and pre-processing feature extraction.

Table 1 below summarizes the many factors that may —alone or in combination— be responsible for an erroneous PSM.

Peptide search engines are prone to erroneous assignments of peptide property features. Amongst these, wrong precursor ion charge calculation is responsible for a large amount of erroneous PSMs (false positive identifications).

Three extreme examples are when:

The precursor ion charge is wrongly calculated.
The mass of an abundant and high-scoring modified peptide happens to be the same of an unmodified theoretical peptide in the target proteome database.
A chimeric MS2 interferes with the PSM assignment.

Large datasets, especially those produced from a poorly prepared sample (low quality spectra an unusually high amount of chemical modifications) accumulate erroneous PSM assignments.

Figure 2 below depicts the PSM in the context of what was discussed in this section.

Figure 2. The PSM workflow in the context of infrequent random events.

This figure depicts many of the circumstances in which the PSM workflow becomes error-prone. The left panel shows three cartoons. The top one is an illustration of the features extracted from a precursor ion isotopic envelop (monoisotopic ion, and charge determination). The two other cartoons depict common peptide sequence modifications in eukaryotic and bacterial proteomes. The right panel depicts the PSM workflow in the context of chimeric MS2, which is shown at the bottom of the figure (red an black vertical lines represent fragment masses from a different MS2). To illustrate the cross-correlational PSM, the mirror comparison of the experimental and predicted MS2 fragment masses is shown in the middle. In this case, blue and red lines indicate the matching masses. The experimental MS2 is inverted.

In my next blogpost I will talk to you about data loss and ways to control it in shotgun proteomics.

For now I hope this post is clear enough.

Stay tuned!

GPR

Disclosure: At BioTech Writing and Consulting we believe in the use AI in Data Science, but do not use AI to generate text or images.

THE PEPTIDE-SPECTRUM MATCH

Recent Posts

Comments