
PROSIT

Updated: Feb 22

Last year I dedicated three blogposts to mass spectrometry-based proteomics (MS-proteomics) data analysis:


To continue with this subject, I now discuss how using deep learning algorithms to predict fragment (MS2) ion intensities increases the number of peptide sequences identified by a peptide search engine.


From the handful of deep learning-based MS2 ion intensity predictors reported so far, I discuss here Prosit, a widely used and well-characterized algorithm that was updated in 2025 with new capabilities (Table 1).


Table 1. MS2 ion intensity and iRT predictors

To better understand this blogpost, I recommend that you read my previous contributions, which talk about deep learning algorithms:


Also helpful will be the following review publications, which discuss the use of deep learning in MS-proteomics.


You must also understand the peptide-spectrum match (PSM) concept. In Figure 1 below, I describe how peptide search engines compute a PSM from MS1 (precursor ion) and MS2 (precursor fragment ions) spectra.


Figure 1. The peptide-spectrum match (PSM) workflow. The search engine extracts theoretical peptides from the target-decoy database and predicts their fragmentation patterns, based on specified protease specificity, mass shifts induced by amino acid modifications, and collisional fragmentation rules. I explain the PSM workflow as having three steps. Step 1 - Theoretical mass selection. Theoretical peptides are chosen for the PSM workflow if their masses match the one calculated from an MS1 spectrum. A narrow mass tolerance window (5-10 ppm) is used for this to ensure specificity. Step 2 - PSM prediction. Theoretical MS2 spectra are predicted from the selected peptides and matched against the experimental MS2 spectra. Step 3 - The target-decoy competition. PSMs from target and decoy MS2 spectra receive a probabilistic score, and the one with the highest value is chosen for peptide sequence assignment. To estimate the FDR, the number of decoy matches is divided by the number of target ones. The estimated FDR is used to establish a probability score threshold in the target-decoy competition workflow. If, for example, a 1% FDR is desired, then the PSM score cutoff is set so that accepted decoy PSMs amount to only 1% of accepted target PSMs.
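The target-decoy competition in Step 3 can be sketched in a few lines of Python. This is a toy illustration only: the (score, is_decoy) tuple format is my own, and real search engines work on much richer PSM records.

```python
# Toy sketch of target-decoy FDR filtering (illustrative; the PSM
# representation is hypothetical, not any real search engine's output).
def fdr_filter(psms, fdr_target=0.01):
    """psms: list of (score, is_decoy) tuples, one winning PSM per spectrum.
    Returns the target PSMs that survive the FDR threshold."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    accepted, decoys, targets = [], 0, 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # estimated FDR at this score cutoff = decoy matches / target matches
        if targets and decoys / targets <= fdr_target:
            accepted = ranked[: decoys + targets]
    return [p for p in accepted if not p[1]]  # report target PSMs only
```

Walking down the ranked list, the cutoff is pushed to the lowest score at which the decoy-to-target ratio still satisfies the desired FDR.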



MOTIVATION

Calculating the false discovery rate (FDR) of unfiltered PSMs is an error-prone procedure, regardless of the peptide search engine used. To control the FDR, most peptide search engines rescore unfiltered PSMs with Percolator, a semi-supervised machine learning algorithm (Figure 2).



Figure 2. PSM rescoring with Percolator.

But even when using Percolator to rescore PSMs, a limitation remains: the MS2 predictions computed by peptide search engines lack realistic ion intensity values.


Deep learning models that predict MS2 ion intensities fill this gap. Adding an MS2 ion intensity predictor upstream of Percolator increases the number of true identifications computed by a peptide search engine (Figure 3).


Prosit and Percolator are complementary. To rescore PSMs, Prosit uses the MS2 ion intensities it predicts, whereas Percolator depends on the scores computed by the peptide search engine.
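A common way to turn predicted intensities into a rescoring feature is a spectral similarity measure such as the normalized spectral contrast angle, sketched below. This is a minimal illustration: it assumes the predicted and observed fragment ion vectors are already aligned to the same ion positions.

```python
import math

def spectral_angle(predicted, observed):
    """Normalized spectral contrast angle between a predicted and an
    observed MS2 intensity vector (1 = identical shape, 0 = orthogonal).
    Assumes aligned vectors of equal length with non-negative intensities."""
    norm_p = math.sqrt(sum(x * x for x in predicted))
    norm_o = math.sqrt(sum(x * x for x in observed))
    if norm_p == 0 or norm_o == 0:
        return 0.0                       # empty spectrum: no similarity
    dot = sum(p * o for p, o in zip(predicted, observed))
    cos = min(1.0, max(-1.0, dot / (norm_p * norm_o)))  # clamp rounding
    return 1.0 - 2.0 * math.acos(cos) / math.pi
```

A feature like this, computed per PSM, gives the rescorer information that the search engine's own score does not contain: how closely the observed fragment intensities match the intensities a true peptide of that sequence should produce.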


Figure 3. PSM rescoring with Prosit upstream Percolator.


PROSIT

Prosit was developed by the ProteomeTools project and is available as an online tool in ProteomicsDB. But, while easy to access, Prosit in ProteomicsDB is limited by input file size (≤ 2GB) and only processes results generated by Andromeda in MaxQuant.


To make Prosit a peptide search engine-agnostic tool, several standalone computational platforms have been developed:


Since its introduction in 2019, five versions of Prosit have been published:

  • Prosit.

  • Prosit Transformer.

  • Prosit-TMT.

  • Prosit-XL.

  • Prosit-PTM.


To train Prosit and its subsequent iterations, the ProteomeTools project used more than a million synthetic peptide sequences and their corresponding liquid chromatography and mass spectra features. This reference dataset includes tryptic, non-tryptic, and PTM-modified synthetic peptides, which cover a large portion of the human proteome.


The Prosit family of peptide feature predictors illustrates the modularity of deep learning models; it shows that the model's prediction task can be fine-tuned by adding or removing modules.


Below I describe Prosit's deep learning architecture and discuss the augmentation strategy, which makes Prosit-PTM sensitive to an unlimited number of peptide modifications.



PROSIT'S ARCHITECTURE

Prosit and its variants (Prosit-TMT, Prosit-XL, and Prosit-PTM) have a sequence-to-sequence (seq2seq) architecture, in which the encoder and decoder are recurrent neural networks (RNNs) connected by an attention layer (Figure 4).


In the case of Prosit Transformer, which I talk about in a separate blogpost, the RNNs in the encoder and decoder are replaced with multiple self-attention layers. To review what a self-attention layer is, see my blogpost TRANSFORMERS.


Figure 4. Seq2seq architecture.

To ensure neural network stability, Prosit uses bidirectional gated recurrent unit (BiGRU) RNNs in its encoder and decoder (Figure 5). This is important because RNNs are prone to vanishing gradients (see WHAT IS AN ARTIFICIAL NEURAL NETWORK? and RECURRENT NEURAL NETWORKS).


Figure 5. In BiGRU RNNs a sequential input is processed both start-to-end and end-to-start. The gates (reset and update) regulate the information fed into the next hidden state. GRUs prevent vanishing gradients during model training. To read more about RNNs and vanishing gradients, see WHAT IS AN ARTIFICIAL NEURAL NETWORK? and RECURRENT NEURAL NETWORKS.
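A minimal numpy sketch of one GRU step and a bidirectional pass follows. This is illustrative only; the parameter shapes, initialization, and packaging are my own, not Prosit's.

```python
import numpy as np

def gru_cell(x, h_prev, W, U, b):
    """One GRU step. W, U, b each hold three parameter sets:
    update gate (z), reset gate (r), and candidate state (n)."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])        # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])        # reset gate
    n = np.tanh(W["n"] @ x + U["n"] @ (r * h_prev) + b["n"])  # candidate
    return (1.0 - z) * n + z * h_prev  # gated blend keeps gradients alive

def bigru(seq, h0, params_fwd, params_bwd):
    """Bidirectional pass: run the sequence start-to-end and end-to-start,
    then concatenate the two hidden states at each position."""
    fwd, h = [], h0
    for x in seq:
        h = gru_cell(x, h, *params_fwd)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(seq):
        h = gru_cell(x, h, *params_bwd)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

The update gate's convex blend of the previous hidden state and the candidate state is what mitigates vanishing gradients: information can pass through a time step nearly unchanged when the gate stays close to one.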


PROSIT'S INPUT EMBEDDING

Prosit and Prosit-PTM ingest two inputs: a peptide sequence and a metadata layer.


The peptide sequence is tokenized to obtain a numeric vector, which is next embedded in an N-dimensional matrix that contains learned weights tuned during model training (Figure 6).


Figure 6. Peptide sequence embedding.
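The tokenize-then-embed step can be sketched as follows. The vocabulary indices, maximum length, and embedding dimension below are illustrative choices, and the embedding matrix is random here; in the trained model it holds learned weights.

```python
import numpy as np

# Hypothetical vocabulary: the 20 amino acids, indexed 1..20 (0 = padding).
AA = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i + 1 for i, aa in enumerate(AA)}
MAX_LEN = 30          # peptides padded to a fixed length (illustrative)
EMB_DIM = 32          # embedding dimension N is a model choice

def tokenize(peptide):
    """Map a peptide string to a fixed-length integer vector."""
    ids = [VOCAB[aa] for aa in peptide]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

# Embedding lookup: each token index selects a row of learned weights.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(VOCAB) + 1, EMB_DIM))

tokens = tokenize("PEPTIDEK")   # hypothetical peptide sequence
emb = embedding[tokens]         # shape: (MAX_LEN, EMB_DIM)
```

The result is the N-dimensional matrix the figure describes: one learned embedding row per sequence position.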

The metadata layer is a vector of seven nodes. Six of these nodes one-hot encode precursor ion charges 1 to 6. The seventh corresponds to a normalized collision energy value (Figure 7). Prosit-PTM has an eighth feature in its metadata layer: fragmentation strategy.


To embed the metadata layer, a two-layer MLP with ReLU activation is used, which produces an N-dimensional vector (Figure 7).


Figure 7. Metadata layer and the two-layer MLP that processes it.
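A sketch of the metadata vector and its two-layer MLP embedding. The layer widths and the NCE scaling are assumptions for illustration, not Prosit's documented values.

```python
import numpy as np

def metadata_vector(charge, nce):
    """Seven-node metadata layer: one-hot precursor charge (1-6)
    plus a normalized collision energy value."""
    v = np.zeros(7)
    v[charge - 1] = 1.0     # one-hot encoded precursor charge
    v[6] = nce / 100.0      # NCE scaled to [0, 1] (assumed normalization)
    return v

def mlp_embed(v, W1, b1, W2, b2):
    """Two-layer MLP with ReLU activation, producing an N-dim vector."""
    h = np.maximum(0.0, W1 @ v + b1)       # hidden layer, ReLU
    return np.maximum(0.0, W2 @ h + b2)    # output layer, ReLU
```

For Prosit-PTM, the same pattern applies with an eighth input node for the fragmentation strategy.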


PROSIT-PTM'S AUGMENTATION STRATEGY

Unlike Prosit, Prosit-PTM predicts MS2 ion intensities and iRT values for chemically modified peptides. In the case of phosphopeptides, Prosit-PTM predicts the modification site from the multiple candidate positions in the peptide sequence.


Instead of only focusing on the 22 post-translational modifications (PTMs) we know of, Prosit-PTM additionally considers the 342 protein chemical modifications documented in Unimod.


Training Prosit-PTM with MS-proteomics data from synthetic peptides decorated with 364 PTMs (22 known PTMs plus 342 moieties from Unimod) would be experimentally and computationally taxing.


To overcome this obstacle, Prosit-PTM uses an in silico data augmentation strategy to train the model with hypothetical chemical modifications (Figure 8).


In the augmentation strategy, each amino acid modification is described by the chemical composition of the moiety gained or lost during the unmodified-to-modified transition (Figure 8).


Figure 8. The data augmentation strategy.
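The composition-delta idea can be sketched like this. The element ordering and vector layout are my own; the compositions themselves are the standard deltas for these well-known modifications.

```python
# Encode a modification by the atoms gained (+) or lost (-) in the
# unmodified-to-modified transition (vector layout is illustrative).
ELEMENTS = ["C", "H", "N", "O", "P", "S"]

MODS = {
    "phospho":     {"H": 1, "P": 1, "O": 3},    # +HPO3
    "acetyl":      {"C": 2, "H": 2, "O": 1},    # +C2H2O
    "methyl":      {"C": 1, "H": 2},            # +CH2
    "deamidation": {"H": -1, "N": -1, "O": 1},  # -NH, +O
}

def composition_delta(mod):
    """Element-count delta vector for a named modification."""
    counts = MODS[mod]
    return [counts.get(e, 0) for e in ELEMENTS]
```

Because any modification, documented or hypothetical, reduces to such an element-count delta, the model can generalize beyond the specific moieties seen during training.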

To embed the chemical augmentation library, Prosit-PTM uses a three-layer MLP with linear activation in its neurons. Concatenating the chemical augmentation and peptide sequence embeddings results in a combined N-dimensional matrix, ready to be processed by the encoder (Figure 9).


Figure 9. Combined (peptide sequence + hypothetical chemical modifications) N-dimensional matrix.


PROSIT'S ENCODER

In Prosit and Prosit-PTM the encoder modules used to predict MS2 ion intensities and iRT values are the same.


The peptide sequence embedding (enriched or not with hypothetical amino acid modification features) is processed with a two-layer BiGRU RNN connected to an attention layer (Figure 10). A two-layer perceptron with ReLU-activated neurons is used to process the metadata layer vector (Figure 10).


A final representation vector with learned attention weights is obtained by multiplying the output vectors generated by the peptide sequence and metadata encoders (Figure 10).


For an explanation on how biGRU RNNs work, see my blogposts RECURRENT NEURAL NETWORKS and ENCODER-DECODER NEURAL NETWORKS.


Figure 10. The encoder and the final representation vector. Left panel: The number of neurons per BiGRU RNN is indicated. In the first BiGRU RNN (256 neurons), the forward and backward hidden states at each time step are concatenated and passed to the second BiGRU RNN (512 neurons). After passing through the attention layer, a representation vector is produced in the latent space. Right panel: The two-layer MLP that processes the metadata layer has 512 neurons per layer. Once processed by the MLP, the metadata layer becomes an N-dimensional vector, which is combined with the representation vector from the peptide sequence.
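The attention pooling and the final element-wise combination can be sketched as follows; shapes and weights are illustrative, not the model's actual parameters.

```python
import numpy as np

def attention_pool(H, w):
    """Collapse per-residue BiGRU outputs H (L x D) into a single
    representation vector using a learned attention weight vector w (D)."""
    scores = H @ w                           # one score per residue
    alpha = np.exp(scores - scores.max())    # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ H                         # weighted sum, shape (D,)

def combine(seq_vec, meta_vec):
    """Element-wise product of the sequence and metadata encodings,
    yielding the final representation vector."""
    return seq_vec * meta_vec
```

The multiplication lets the precursor charge and collision energy modulate every dimension of the sequence representation before it reaches the decoder.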


PROSIT'S DECODER

Common to Prosit and Prosit-PTM is a decoder with two prediction heads: a biGRU RNN, which predicts MS2 ion intensities; and a two-layer MLP, which calculates iRT values (Figure 11).


The BiGRU RNN is coupled to a time-distributed regressor with six neurons. This time-distributed regressor generates a 174-dimensional vector, which contains the predicted MS2 ion intensities (Figure 11).


Figure 11. The decoder. Left panel: The one-layer MLP has 518 neurons in its processing layer, followed by a one-neuron output. All the neurons in the MLP are ReLU activated. Right panel: The time-distributed regressor has a six-neuron layer at each time point. The MS2 ion intensity output is a 174-dimensional vector.
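The time-distributed regressor can be sketched as below: the same six-neuron dense layer is applied at each fragmentation position, and the outputs are flattened into one vector. My reading of the layout, which I state as an assumption, is 29 fragmentation positions (a peptide of up to 30 residues) times six ion channels per position, giving 29 × 6 = 174.

```python
import numpy as np

POSITIONS, IONS = 29, 6   # assumed layout: 29 positions x 6 ion channels

def time_distributed(H, W, b):
    """H: (POSITIONS, D) decoder states; W: (IONS, D); b: (IONS,).
    The same dense weights are reused at every position."""
    out = H @ W.T + b      # shape (POSITIONS, IONS)
    return out.ravel()     # flattened: 29 * 6 = 174 intensities

rng = np.random.default_rng(1)
H = rng.normal(size=(POSITIONS, 8))            # toy decoder states
W, b = rng.normal(size=(IONS, 8)), np.zeros(IONS)
intensities = time_distributed(H, W, b)        # 174-dimensional vector
```

Sharing the weights across positions is what "time-distributed" means: the regressor learns one mapping from a decoder state to ion intensities and reuses it along the whole backbone.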

 


Prosit is a useful tool in MS-proteomics data analysis.


When embedded in Oktoberfest, the MS2 ion intensities and iRT values predicted by Prosit are used to create in silico spectral libraries. Prosit is also used to rescore PSMs when embedded in MSBooster (in FragPipe) and INFERYS (in Proteome Discoverer).




I will be back with more about deep learning in MS-proteomics.

Stay tuned!

GPR


Disclosure: At BioTech Writing and Consulting we believe in the use of AI in Data Science, but do not use AI to generate text or images.


