
PROSIT

Updated: Feb 22

Last year I dedicated three blogposts to mass spectrometry-based proteomics (MS-proteomics) data analysis:


To continue with this subject, I now discuss how using deep learning algorithms to predict fragment (MS2) ion intensities increases the number of peptide sequences identified by a peptide search engine.


From the handful of deep learning-based MS2 ion intensity predictors reported so far, I discuss here Prosit, a widely used and well-characterized algorithm that was updated in 2025 with new capabilities (Table 1).


Table 1. MS2 ion intensity and iRT predictors

To better understand this blogpost, I recommend that you read my previous contributions, which talk about deep learning algorithms:


Also helpful will be the following review publications, which discuss the use of deep learning in MS-proteomics.


You must also understand the peptide-spectrum match (PSM) concept. In Figure 1 below, I describe how peptide search engines compute a PSM from MS1 (precursor ion) and MS2 (precursor fragment ions) spectra.


Figure 1. The peptide-spectrum match (PSM) workflow. The search engine extracts theoretical peptides from the target-decoy database and predicts their fragmentation patterns, based on specified protease specificity, mass shifts induced by amino acid modifications, and collisional fragmentation rules. I explain the PSM workflow as having three steps. Step 1 - Theoretical mass selection. Theoretical peptides are chosen for the PSM workflow if their masses match the one calculated from an MS1 spectrum. A narrow mass tolerance window (5-10 ppm) is used for this to ensure specificity. Step 2 - PSM prediction. Theoretical MS2 spectra are predicted from the selected peptides and matched against the experimental MS2 spectra. Step 3 - The target-decoy competition. PSMs from target and decoy MS2 spectra receive a probabilistic score, and the one with the highest value is chosen for peptide sequence assignment. To estimate the FDR, the number of decoy matches is divided by the number of target ones. The estimated FDR is used to establish a probability score threshold in the target-decoy competition workflow. If, for example, a 1% FDR is desired, then the PSM score cutoff is set so that accepted decoy PSMs amount to only 1% of accepted target PSMs.
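The target-decoy competition in Step 3 can be sketched in a few lines of Python. This is a toy illustration only: the (score, is_decoy) tuple format is my own, and real search engines work on much richer PSM records.

```python
# Toy sketch of target-decoy FDR filtering (illustrative; the PSM
# representation is hypothetical, not any real search engine's output).
def fdr_filter(psms, fdr_target=0.01):
    """psms: list of (score, is_decoy) tuples, one winning PSM per spectrum.
    Returns the target PSMs that survive the FDR threshold."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    accepted, decoys, targets = [], 0, 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # estimated FDR at this score cutoff = decoy matches / target matches
        if targets and decoys / targets <= fdr_target:
            accepted = ranked[: decoys + targets]
    return [p for p in accepted if not p[1]]  # report target PSMs only
```

Walking down the ranked list, the cutoff is pushed to the lowest score at which the decoy-to-target ratio still satisfies the desired FDR.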



MOTIVATION

Calculating the false discovery rate (FDR) of unfiltered PSMs is an error-prone procedure, regardless of the peptide search engine used. To control the FDR, most peptide search engines rescore unfiltered PSMs with Percolator, a semi-supervised machine learning algorithm (Figure 2).



Figure 2. PSM rescoring with Percolator.

But even when using Percolator to rescore PSMs, a limitation remains: the MS2 predictions computed by peptide search engines lack realistic ion intensity values.


Deep learning models that predict MS2 ion intensities fill this gap. Adding an MS2 ion intensity predictor upstream of Percolator increases the number of true identifications computed by a peptide search engine (Figure 3).


Prosit and Percolator are complementary. To rescore PSMs, Prosit uses the MS2 ion intensities it predicts, whereas Percolator depends on the scores computed by the peptide search engine.
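A common way to turn predicted intensities into a rescoring feature is a spectral similarity measure such as the normalized spectral contrast angle, sketched below. This is a minimal illustration: it assumes the predicted and observed fragment ion vectors are already aligned to the same ion positions.

```python
import math

def spectral_angle(predicted, observed):
    """Normalized spectral contrast angle between a predicted and an
    observed MS2 intensity vector (1 = identical shape, 0 = orthogonal).
    Assumes aligned vectors of equal length with non-negative intensities."""
    norm_p = math.sqrt(sum(x * x for x in predicted))
    norm_o = math.sqrt(sum(x * x for x in observed))
    if norm_p == 0 or norm_o == 0:
        return 0.0                       # empty spectrum: no similarity
    dot = sum(p * o for p, o in zip(predicted, observed))
    cos = min(1.0, max(-1.0, dot / (norm_p * norm_o)))  # clamp rounding
    return 1.0 - 2.0 * math.acos(cos) / math.pi
```

A feature like this, computed per PSM, gives the rescorer information that the search engine's own score does not contain: how closely the observed fragment intensities match the intensities a true peptide of that sequence should produce.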


Figure 3. PSM rescoring with Prosit upstream Percolator.


PROSIT

Prosit was developed by the ProteomeTools project and is available as an online tool in ProteomicsDB. But, while easy to access, Prosit in ProteomicsDB is limited by input file size (≤ 2GB) and only processes results generated by Andromeda in MaxQuant.


To make Prosit a peptide search engine-agnostic tool, several standalone computational platforms have been developed:


Since its introduction in 2019, five versions of Prosit have been published:

  • Prosit.

  • Prosit Transformer.

  • Prosit-TMT.

  • Prosit-XL.

  • Prosit-PTM.


To train Prosit and its subsequent iterations, the ProteomeTools project used more than a million synthetic peptide sequences and their corresponding liquid chromatography and mass spectra features. This reference dataset includes tryptic, non-tryptic, and PTM-modified synthetic peptides, which cover a large portion of the human proteome.


The Prosit family of peptide feature predictors illustrates the modularity of deep learning models; it shows that the model's prediction task can be fine-tuned by adding or removing modules.


Below I describe Prosit's deep learning architecture and discuss the augmentation strategy, which makes Prosit-PTM sensitive to an unlimited number of peptide modifications.



PROSIT'S ARCHITECTURE

Prosit and its variants (Prosit-TMT, Prosit-XL, and Prosit-PTM) have a sequence-to-sequence (seq2seq) architecture, in which the encoder and decoder are recurrent neural networks (RNNs) connected by an attention layer (Figure 4).


In the case of Prosit Transformer, which I talk about in a separate blogpost, the RNNs in the encoder and decoder are replaced with multiple self-attention layers. To review what a self-attention layer is, see my blogpost TRANSFORMERS.


Figure 4. Seq2seq architecture.

To ensure neural network stability, Prosit uses bidirectional gated recurrent unit (BiGRU) RNNs in its encoder and decoder (Figure 5). This is important because RNNs are prone to vanishing gradients (see WHAT IS AN ARTIFICIAL NEURAL NETWORK? and RECURRENT NEURAL NETWORKS).


Figure 5. In BiGRU RNNs a sequential input is processed both start-to-end and end-to-start. The gates (reset and update) regulate the information fed into the next hidden state. GRUs prevent vanishing gradients during model training. To read more about RNNs and vanishing gradients, see WHAT IS AN ARTIFICIAL NEURAL NETWORK? and RECURRENT NEURAL NETWORKS.
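A minimal numpy sketch of one GRU step and a bidirectional pass follows. This is illustrative only; the parameter shapes, initialization, and packaging are my own, not Prosit's.

```python
import numpy as np

def gru_cell(x, h_prev, W, U, b):
    """One GRU step. W, U, b each hold three parameter sets:
    update gate (z), reset gate (r), and candidate state (n)."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])        # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])        # reset gate
    n = np.tanh(W["n"] @ x + U["n"] @ (r * h_prev) + b["n"])  # candidate
    return (1.0 - z) * n + z * h_prev  # gated blend keeps gradients alive

def bigru(seq, h0, params_fwd, params_bwd):
    """Bidirectional pass: run the sequence start-to-end and end-to-start,
    then concatenate the two hidden states at each position."""
    fwd, h = [], h0
    for x in seq:
        h = gru_cell(x, h, *params_fwd)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(seq):
        h = gru_cell(x, h, *params_bwd)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

The update gate's convex blend of the previous hidden state and the candidate state is what mitigates vanishing gradients: information can pass through a time step nearly unchanged when the gate stays close to one.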


PROSIT'S INPUT EMBEDDING

Prosit and Prosit-PTM ingest two inputs: a peptide sequence and a metadata layer.


The peptide sequence is tokenized to obtain a numeric vector, which is next embedded in an N-dimensional matrix that contains learned weights tuned during model training (Figure 6).


Figure 6. Peptide sequence embedding.
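The tokenize-then-embed step can be sketched as follows. The vocabulary indices, maximum length, and embedding dimension below are illustrative choices, and the embedding matrix is random here; in the trained model it holds learned weights.

```python
import numpy as np

# Hypothetical vocabulary: the 20 amino acids, indexed 1..20 (0 = padding).
AA = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i + 1 for i, aa in enumerate(AA)}
MAX_LEN = 30          # peptides padded to a fixed length (illustrative)
EMB_DIM = 32          # embedding dimension N is a model choice

def tokenize(peptide):
    """Map a peptide string to a fixed-length integer vector."""
    ids = [VOCAB[aa] for aa in peptide]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

# Embedding lookup: each token index selects a row of learned weights.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(VOCAB) + 1, EMB_DIM))

tokens = tokenize("PEPTIDEK")   # hypothetical peptide sequence
emb = embedding[tokens]         # shape: (MAX_LEN, EMB_DIM)
```

The result is the N-dimensional matrix the figure describes: one learned embedding row per sequence position.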

The metadata layer is a vector of seven nodes. Six of these nodes one-hot encode precursor ion charges 1 to 6. The seventh corresponds to a normalized collision energy value (Figure 7). Prosit-PTM has an eighth feature in its metadata layer: fragmentation strategy.


To embed the metadata layer, a two-layer MLP with ReLU activation is used, which produces an N-dimensional vector (Figure 7).


Figure 7. Metadata layer and the two-layer MLP that processes it.
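A sketch of the metadata vector and its two-layer MLP embedding. The layer widths and the NCE scaling are assumptions for illustration, not Prosit's documented values.

```python
import numpy as np

def metadata_vector(charge, nce):
    """Seven-node metadata layer: one-hot precursor charge (1-6)
    plus a normalized collision energy value."""
    v = np.zeros(7)
    v[charge - 1] = 1.0     # one-hot encoded precursor charge
    v[6] = nce / 100.0      # NCE scaled to [0, 1] (assumed normalization)
    return v

def mlp_embed(v, W1, b1, W2, b2):
    """Two-layer MLP with ReLU activation, producing an N-dim vector."""
    h = np.maximum(0.0, W1 @ v + b1)       # hidden layer, ReLU
    return np.maximum(0.0, W2 @ h + b2)    # output layer, ReLU
```

For Prosit-PTM, the same pattern applies with an eighth input node for the fragmentation strategy.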


PROSIT-PTM'S AUGMENTATION STRATEGY

Unlike Prosit, Prosit-PTM predicts MS2 ion intensities and iRT values for chemically modified peptides. In the case of phosphopeptides, Prosit-PTM predicts the modification site from the multiple candidate positions in the peptide sequence.


Instead of only focusing on the 22 post-translational modifications (PTMs) we know of, Prosit-PTM additionally considers the 342 protein chemical modifications documented in Unimod.


Training Prosit-PTM with MS-proteomics data from synthetic peptides decorated with 364 PTMs (22 known PTMs plus 342 moieties from Unimod) would be experimentally and computationally taxing.


To overcome this obstacle, Prosit-PTM uses an in silico data augmentation strategy to train the model with hypothetical chemical modifications (Figure 8).


In the augmentation strategy, each amino acid modification is described by the chemical composition of the moiety gained or lost during the unmodified-to-modified transition (Figure 8).


Figure 8. The data augmentation strategy.
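The composition-delta idea can be sketched like this. The element ordering and vector layout are my own; the compositions themselves are the standard deltas for these well-known modifications.

```python
# Encode a modification by the atoms gained (+) or lost (-) in the
# unmodified-to-modified transition (vector layout is illustrative).
ELEMENTS = ["C", "H", "N", "O", "P", "S"]

MODS = {
    "phospho":     {"H": 1, "P": 1, "O": 3},    # +HPO3
    "acetyl":      {"C": 2, "H": 2, "O": 1},    # +C2H2O
    "methyl":      {"C": 1, "H": 2},            # +CH2
    "deamidation": {"H": -1, "N": -1, "O": 1},  # -NH, +O
}

def composition_delta(mod):
    """Element-count delta vector for a named modification."""
    counts = MODS[mod]
    return [counts.get(e, 0) for e in ELEMENTS]
```

Because any modification, documented or hypothetical, reduces to such an element-count delta, the model can generalize beyond the specific moieties seen during training.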

To embed the chemical augmentation library, Prosit-PTM uses a three-layer MLP with linear activation in its neurons. Concatenating the chemical augmentation and peptide sequence embeddings results in a combined N-dimensional matrix, ready to be processed by the encoder (Figure 9).


Figure 9. Combined (peptide sequence + hypothetical chemical modifications) N-dimensional matrix.


PROSIT'S ENCODER

In Prosit and Prosit-PTM the encoder modules used to predict MS2 ion intensities and iRT values are the same.


The peptide sequence embedding (enriched or not with hypothetical amino acid modification features) is processed with a two-layer BiGRU RNN connected to an attention layer (Figure 10). A two-layer perceptron with ReLU-activated neurons is used to process the metadata layer vector (Figure 10).


A final representation vector with learned attention weights is obtained by multiplying the output vectors generated by the peptide sequence and metadata encoders (Figure 10).


For an explanation on how biGRU RNNs work, see my blogposts RECURRENT NEURAL NETWORKS and ENCODER-DECODER NEURAL NETWORKS.


Figure 10. The encoder and the final representation vector. Left panel: The number of neurons per BiGRU RNN is indicated. In the first BiGRU RNN (256 neurons), the forward and backward hidden states at each time step are concatenated and passed to the second BiGRU RNN (512 neurons). After passing through the attention layer, a representation vector is produced in the latent space. Right panel: The two-layer MLP that processes the metadata layer has 512 neurons per layer. Once processed by the MLP, the metadata layer becomes an N-dimensional vector, which is combined with the representation vector from the peptide sequence.
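The attention pooling and the final element-wise combination can be sketched as follows; shapes and weights are illustrative, not the model's actual parameters.

```python
import numpy as np

def attention_pool(H, w):
    """Collapse per-residue BiGRU outputs H (L x D) into a single
    representation vector using a learned attention weight vector w (D)."""
    scores = H @ w                           # one score per residue
    alpha = np.exp(scores - scores.max())    # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ H                         # weighted sum, shape (D,)

def combine(seq_vec, meta_vec):
    """Element-wise product of the sequence and metadata encodings,
    yielding the final representation vector."""
    return seq_vec * meta_vec
```

The multiplication lets the precursor charge and collision energy modulate every dimension of the sequence representation before it reaches the decoder.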


PROSIT'S DECODER

Common to Prosit and Prosit-PTM is a decoder with two prediction heads: a biGRU RNN, which predicts MS2 ion intensities; and a two-layer MLP, which calculates iRT values (Figure 11).


The BiGRU RNN is coupled to a time-distributed regressor with six neurons. This time-distributed regressor generates a 174-dimensional vector, which contains the predicted MS2 ion intensities (Figure 11).


Figure 11. The decoder. Left panel: The one-layer MLP has 518 neurons in its processing layer, followed by a one-neuron output. All the neurons in the MLP are ReLU activated. Right panel: The time-distributed regressor has a six-neuron layer at each time point. The MS2 ion intensity output is a 174-dimensional vector.
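The time-distributed regressor can be sketched as below: the same six-neuron dense layer is applied at each fragmentation position, and the outputs are flattened into one vector. My reading of the layout, which I state as an assumption, is 29 fragmentation positions (a peptide of up to 30 residues) times six ion channels per position, giving 29 × 6 = 174.

```python
import numpy as np

POSITIONS, IONS = 29, 6   # assumed layout: 29 positions x 6 ion channels

def time_distributed(H, W, b):
    """H: (POSITIONS, D) decoder states; W: (IONS, D); b: (IONS,).
    The same dense weights are reused at every position."""
    out = H @ W.T + b      # shape (POSITIONS, IONS)
    return out.ravel()     # flattened: 29 * 6 = 174 intensities

rng = np.random.default_rng(1)
H = rng.normal(size=(POSITIONS, 8))            # toy decoder states
W, b = rng.normal(size=(IONS, 8)), np.zeros(IONS)
intensities = time_distributed(H, W, b)        # 174-dimensional vector
```

Sharing the weights across positions is what "time-distributed" means: the regressor learns one mapping from a decoder state to ion intensities and reuses it along the whole backbone.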

 


Prosit is a useful tool in MS-proteomics data analysis.


When embedded in Oktoberfest, the MS2 ion intensities and iRT values predicted by Prosit are used to create in silico spectral libraries. Prosit is also used to rescore PSMs when embedded in MSBooster (in FragPipe) and INFERYS (in Proteome Discoverer).




I will be back with more about deep learning in MS-proteomics.

Stay tuned!

GPR


Disclosure: At BioTech Writing and Consulting we believe in the use of AI in Data Science, but do not use AI to generate text or images.


