PROSIT
- Genaro Pimienta

- Feb 17
Updated: Feb 22
Last year I dedicated three blogposts to mass spectrometry-based proteomics (MS-proteomics) data analysis:
To continue with this subject, I now discuss how using deep learning algorithms to predict fragment (MS2) ion mass intensities increases the number of peptide sequences identified by a peptide search engine.
From the handful of deep learning-based MS2 ion intensity predictors reported so far, I discuss here Prosit—a widely used and well-characterized algorithm, which was updated in 2025 with new capabilities (Table 1).

To better understand this blogpost, I recommend that you read my previous contributions, which talk about deep learning algorithms:
Also helpful will be the following review publications, which discuss the use of deep learning in MS-proteomics.
You must also understand the peptide-spectrum match (PSM) concept. In Figure 1 below, I describe how peptide search engines compute a PSM from MS1 (precursor ion) and MS2 (precursor fragment ions) spectra.

MOTIVATION
Calculating the false discovery rate (FDR) of unfiltered PSMs is an error-prone procedure, regardless of the peptide search engine used. To control the FDR, most peptide search engines rescore unfiltered PSMs with Percolator, a semi-supervised machine learning algorithm (Figure 2).
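The standard way to estimate the FDR is target-decoy competition: count how many decoy PSMs survive a score threshold relative to target PSMs. The sketch below illustrates the idea with hypothetical scores; it is not Percolator's actual algorithm, which additionally learns a discriminant function over many PSM features.

```python
# Minimal target-decoy FDR sketch (hypothetical PSM scores, not Percolator itself).
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoys/targets among PSMs scoring at or above threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

# (score, is_decoy) pairs from a hypothetical database search
psms = [(9.1, False), (8.7, False), (8.2, True), (7.9, False), (7.5, True), (7.1, False)]
print(fdr_at_threshold(psms, 8.0))  # 1 decoy / 2 targets -> 0.5
```

Rescoring tools improve this picture by sharpening the score itself, so that fewer decoys survive any given threshold.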

But even when using Percolator to rescore PSMs, a limitation remains: the MS2 predictions computed by peptide search engines lack realistic ion intensity values.
Deep learning models that predict MS2 ion intensities fill this gap. Adding an MS2 ion intensity predictor upstream of Percolator increases the number of true identifications computed by a peptide search engine (Figure 3).
Prosit and Percolator are complementary. To rescore PSMs, Prosit uses the MS2 ion intensities it predicts, whereas Percolator depends on the scores computed by the peptide search engine.

PROSIT
Prosit was developed by the ProteomeTools project and is available as an online tool in ProteomicsDB. But, while easy to access, Prosit in ProteomicsDB is limited by input file size (≤ 2 GB) and only processes results generated by Andromeda in MaxQuant.
To make Prosit a peptide search engine-agnostic tool, several standalone computational platforms have been developed:
Oktoberfest (2023)—Compatible with any peptide search engine.
MSBooster—Compatible with MSFragger in FragPipe.
INFERYS Rescoring (2025)—Compatible with SEQUEST in Proteome Discoverer (INFERYS rescoring: Boosting peptide identifications and scoring confidence of database search results).
Since its introduction in 2019, five versions of Prosit have been published:
Prosit.
Prosit Transformer.
Prosit-TMT.
Prosit-XL.
Prosit-PTM.
To train Prosit and its subsequent iterations, the ProteomeTools project used more than a million synthetic peptide sequences and their corresponding liquid chromatography and mass spectra features. This reference dataset includes tryptic, non-tryptic, and PTM-modified synthetic peptides, which cover a large portion of the human proteome.
The Prosit family of peptide feature predictors illustrates the modularity of deep learning models; it shows that the model's prediction task can be fine-tuned by adding or removing modules.
Below I describe Prosit's deep learning architecture and discuss the augmentation strategy, which makes Prosit-PTM sensitive to an unlimited number of peptide modifications.
PROSIT'S ARCHITECTURE
Prosit and its variants (Prosit-TMT, Prosit-XL, and Prosit-PTM) have a sequence-to-sequence (seq2seq) architecture, in which the encoder and decoder are recurrent neural networks (RNNs) connected by an attention layer (Figure 4).
In the case of Prosit Transformer, which I talk about in a separate blogpost, the RNNs in the encoder and decoder are replaced with multiple self-attention layers. To review what a self-attention layer is, see my blogpost TRANSFORMERS.

To ensure neural network stability, Prosit uses bidirectional gated recurrent unit (BiGRU) RNNs in its encoder and decoder (Figure 5). Doing this is important because RNNs are prone to vanishing gradients (see WHAT IS AN ARTIFICIAL NEURAL NETWORK? and RECURRENT NEURAL NETWORKS).

PROSIT'S INPUT EMBEDDING
Prosit and Prosit-PTM ingest two inputs: a peptide sequence and a metadata layer.
The peptide sequence is tokenized into a numeric vector, which is then embedded in an N-dimensional matrix of learned weights tuned during model training (Figure 6).
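The tokenize-then-embed step can be sketched as follows. The vocabulary, padding length, and embedding dimension below are illustrative, not Prosit's actual values, and the embedding matrix is randomly initialized here rather than learned.

```python
import random

# Sketch of peptide tokenization and embedding lookup (illustrative sizes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_IDS = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 reserved for padding

def tokenize(peptide, max_len=30):
    """Map a peptide string to a fixed-length vector of integer tokens."""
    ids = [TOKEN_IDS[aa] for aa in peptide]
    return ids + [0] * (max_len - len(ids))  # right-pad to max_len

random.seed(0)
EMBED_DIM = 4  # Prosit uses a larger, learned dimension
embedding = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
             for _ in range(len(AMINO_ACIDS) + 1)]

tokens = tokenize("PEPTIDE")
embedded = [embedding[t] for t in tokens]  # shape: (max_len, EMBED_DIM)
print(len(embedded), len(embedded[0]))  # 30 4
```

During training, backpropagation adjusts the rows of the embedding matrix so that chemically similar residues end up with similar vectors.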

The metadata layer is a one-hot encoded vector composed of seven nodes. Six of these nodes represent precursor ion charges 1 to 6. The seventh node corresponds to a normalized collision energy value (Figure 7). Prosit-PTM has an eighth feature in its metadata layer: fragmentation strategy.
To embed the metadata layer, a two-layer MLP with ReLU activation is used, which produces an N-dimensional vector (Figure 7).
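The metadata encoding and its MLP embedding can be sketched like this. The weights are toy constants (in Prosit they are learned), the NCE normalization is a plausible simplification, and the output dimension is illustrative.

```python
# Sketch of Prosit's metadata layer: six one-hot charge nodes plus a
# normalized collision energy (NCE) node, embedded by a two-layer ReLU MLP.
def metadata_vector(charge, nce, max_charge=6):
    """One-hot precursor charge (1..6) followed by a normalized NCE value."""
    onehot = [1.0 if charge == c else 0.0 for c in range(1, max_charge + 1)]
    return onehot + [nce / 100.0]  # simple normalization, assumed here

def relu(x):
    return [max(0.0, v) for v in x]

def dense(x, weights, bias):
    """One fully connected layer: weights has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

meta = metadata_vector(charge=2, nce=30)
print(meta)  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.3]

# Two-layer MLP projecting the 7 metadata features to a 4-dimensional embedding
w1 = [[0.1] * 7 for _ in range(4)]; b1 = [0.0] * 4
w2 = [[0.5] * 4 for _ in range(4)]; b2 = [0.0] * 4
embedded_meta = dense(relu(dense(meta, w1, b1)), w2, b2)
print(len(embedded_meta))  # 4
```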

PROSIT-PTM'S AUGMENTATION STRATEGY
Unlike Prosit, Prosit-PTM predicts MS2 ion intensities and iRT values for chemically modified peptides. In the case of phosphopeptides, Prosit-PTM also predicts the modification site from among the multiple candidate residues in the peptide sequence.
Instead of focusing only on the 22 known post-translational modifications (PTMs), Prosit-PTM additionally considers the 342 protein chemical modifications documented in Unimod.
Training Prosit-PTM with MS-proteomics data from synthetic peptides decorated with 364 PTMs (22 known PTMs plus 342 moieties from Unimod) would be experimentally and computationally taxing.
To overcome this obstacle, Prosit-PTM uses an in silico data augmentation strategy to train the model with hypothetical chemical modifications (Figure 8).
In the augmentation strategy, each amino acid modification is described by the chemical composition of the moiety gained or lost during the unmodified-to-modified transition (Figure 8).
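The core of the idea is that any modification, known or hypothetical, reduces to a vector of per-element atom-count deltas. The element list and vector encoding below are an illustrative simplification of Prosit-PTM's scheme; the compositions themselves (phosphorylation adds HPO3, acetylation adds C2H2O) are standard chemistry.

```python
# Sketch of the augmentation idea: represent a modification by the atoms
# gained or lost in the unmodified-to-modified transition.
ELEMENTS = ["C", "H", "N", "O", "P", "S"]

def delta_vector(composition_change):
    """Encode a modification as per-element atom-count deltas."""
    return [composition_change.get(el, 0) for el in ELEMENTS]

phospho = delta_vector({"H": 1, "P": 1, "O": 3})  # phosphorylation: +HPO3
acetyl = delta_vector({"C": 2, "H": 2, "O": 1})   # acetylation: +C2H2O
print(phospho)  # [0, 1, 0, 3, 1, 0]
print(acetyl)   # [2, 2, 0, 1, 0, 0]
```

Because the representation is purely compositional, the model can be trained on randomly generated deltas and still generalize to real Unimod entries it never saw.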

To embed the chemical augmentation library, Prosit-PTM uses a three-layer MLP with linear activation in its neurons. Concatenating the chemical augmentation and peptide sequence embeddings results in a combined N-dimensional matrix, ready to be processed by the encoder (Figure 9).
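The concatenation step is simple: the per-residue modification embedding is appended, row by row, to the per-residue sequence embedding. Dimensions below are illustrative.

```python
# Sketch of concatenating a per-residue modification embedding with the
# peptide-sequence embedding, row by row.
def concat_embeddings(seq_embedding, mod_embedding):
    """Concatenate matching rows of two (seq_len x dim) matrices."""
    assert len(seq_embedding) == len(mod_embedding)
    return [s + m for s, m in zip(seq_embedding, mod_embedding)]

seq_emb = [[0.1, 0.2], [0.3, 0.4]]  # 2 residues, 2-dim sequence features
mod_emb = [[0.0, 0.0], [1.0, 0.5]]  # residue 2 carries a modification
combined = concat_embeddings(seq_emb, mod_emb)
print(combined)  # [[0.1, 0.2, 0.0, 0.0], [0.3, 0.4, 1.0, 0.5]]
```

Unmodified residues simply carry a zero modification row, so the same encoder handles plain and modified peptides uniformly.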

PROSIT'S ENCODER
In Prosit and Prosit-PTM the encoder modules used to predict MS2 ion intensities and iRT values are the same.
The peptide sequence embedding (enriched or not with hypothetical amino acid modification features) is processed with a two-layer BiGRU RNN connected to an attention layer (Figure 10). A two-layer perceptron with ReLU-activated neurons processes the metadata layer vector (Figure 10).
A final representation vector with learned attention weights is obtained by multiplying the output vectors generated by the peptide sequence and metadata encoders (Figure 10).
For an explanation of how BiGRU RNNs work, see my blogposts RECURRENT NEURAL NETWORKS and ENCODER-DECODER NEURAL NETWORKS.
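The fusion at the end of the encoder can be sketched as follows: an attention layer pools the per-residue hidden states into a single vector, which is then multiplied element-wise with the metadata embedding. The numbers are toy values, and the BiGRU that would produce the hidden states is omitted.

```python
import math

# Sketch of attention pooling followed by element-wise fusion with the
# metadata embedding (toy values; the BiGRU itself is omitted).
def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(states, scores):
    """Weighted sum of per-residue hidden states using softmaxed scores."""
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical BiGRU outputs
scores = [0.1, 0.2, 0.7]                       # learned attention scores
pooled = attention_pool(states, scores)

meta_embedding = [0.5, 2.0]                    # from the metadata MLP
fused = [p * m for p, m in zip(pooled, meta_embedding)]
print(fused)
```

The element-wise multiplication lets the charge and collision-energy information gate the sequence representation before it reaches the decoder.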

PROSIT'S DECODER
Common to Prosit and Prosit-PTM is a decoder with two prediction heads: a BiGRU RNN, which predicts MS2 ion intensities; and a two-layer MLP, which calculates iRT values (Figure 11).
The BiGRU RNN is coupled to a time-distributed regressor with six neurons. This time-distributed regressor generates a 174-dimensional vector, which contains the predicted MS2 ion intensities (Figure 11).
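The 174-dimensional output has a simple layout: for a peptide of up to 30 residues there are 29 possible backbone cleavage sites, and the six regressor neurons cover the (ion type, charge) combinations at each site (b/y ions at charges 1 to 3), giving 29 × 6 = 174 values. The sketch below only illustrates the flattening; the regressor outputs are placeholders.

```python
# Sketch of the decoder's output layout: 29 cleavage positions x 6 neurons
# per position (b/y ions, charges 1-3) -> 174 intensity values.
def decode_intensities(per_position_outputs):
    """Flatten per-position 6-neuron outputs into one 174-dim vector."""
    return [v for position in per_position_outputs for v in position]

# Placeholder regressor outputs for the 29 positions of a 30-residue peptide
outputs = [[0.0] * 6 for _ in range(29)]
intensities = decode_intensities(outputs)
print(len(intensities))  # 174
```

Fragment ions that cannot exist for a given peptide (e.g. positions beyond its length) are simply masked out of this fixed-size vector.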

Prosit is a useful tool in MS-proteomics data analysis.
When embedded in Oktoberfest, the MS2 ion intensities and iRT values predicted by Prosit are used to create in silico spectral libraries. Prosit is used to rescore PSMs when embedded in MSBooster (in FragPipe) and INFERYS Rescoring (in Proteome Discoverer).
I will be back with more about deep learning in MS-proteomics.
Stay tuned!
GPR
Disclosure: At BioTech Writing and Consulting we believe in the use of AI in Data Science, but do not use AI to generate text or images.