Peptide Sequencing Via Protein Language Models (2024)

Thuong Le Hoai Pham (University of Texas at Arlington, Arlington, Texas, USA), Jillur Rahman Saurav, Aisosa A. Omere, Calvin J. Heyl (University of Texas at Arlington, Arlington, Texas, USA), Mohammad Sadegh Nasr (University of Texas at Arlington, Arlington, Texas, USA), Cody Tyler Reynolds (University of Texas at Arlington, Arlington, Texas, USA), Jai Prakash Yadav Veerla (University of Texas at Arlington, Arlington, Texas, USA), Helen H Shang (UCLA Health, Los Angeles, California, USA), Justyn Jaworski (University of Texas at Arlington, Arlington, Texas, USA), Alison Ravenscraft (University of Texas at Arlington, Arlington, Texas, USA), Joseph Anthony Buonomo (joseph.buonomo@uta.edu; University of Texas at Arlington, Arlington, Texas, USA), and Jacob M. Luber (jacob.luber@uta.edu; University of Texas at Arlington, Arlington, Texas, USA)

(2024; 15 July 2024)

Abstract.

We introduce a protein language model for determining the complete sequence of a peptide based on measurement of a limited set of amino acids. To date, protein sequencing relies on mass spectrometry, with some novel Edman degradation-based platforms able to sequence non-native peptides. Current protein sequencing techniques face limitations in accurately identifying all amino acids, hindering comprehensive proteome analysis. Our method simulates partial sequencing data by selectively masking amino acids that are experimentally difficult to identify in protein sequences from the UniRef database. This targeted masking mimics real-world sequencing limitations. We then modify and fine-tune a ProtBert-derived transformer-based model for a new downstream task: predicting these masked residues, which provides an approximation of the complete sequence. Evaluating on three bacterial Escherichia species, we achieve per-amino-acid accuracy up to 90.5% when only four amino acids ([KCYM]) are known. Structural assessment using AlphaFold and TM-score validates the biological relevance of our predictions. The model also demonstrates potential for evolutionary analysis through cross-species performance. This integration of simulated experimental constraints with computational predictions offers a promising avenue for enhancing protein sequence analysis, potentially accelerating advancements in proteomics and structural biology by providing a probabilistic reconstruction of the complete protein sequence from limited experimental data.

Computational Biology, Protein Sequencing, High Performance Computing, Machine Learning, Language Modeling, Deep Learning

Conference: The 15th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics; Nov. 22–25, 2024; Shenzhen, Guangdong Province, PR China. CCS concepts: Applied computing → Sequencing and genotyping technologies; Applied computing → Bioinformatics; Applied computing → Computational proteomics.

1. Introduction

Protein sequences are fundamental to understanding biological processes, disease mechanisms, and therapeutic developments (Lieu et al., 2020; Maddocks et al., 2017). Despite significant advancements in genomics and proteomics, aided by machine learning (ML) techniques (Libbrecht and Noble, 2015; Wen et al., 2020), accurate and comprehensive protein sequencing remains a challenge in the field (Alfaro et al., 2021).

Protein sequencing methods primarily rely on techniques such as Edman degradation (Niall, 1973) and mass spectrometry (MS) (Hunt et al., 1986), including liquid chromatography tandem mass spectrometry (LC-MS/MS) (Vogeser and Parhofer, 2007). While these methods have advanced our understanding of proteins, they face significant limitations in accurately identifying all amino acids in a sequence, particularly for complex or low-abundance proteins (Alfaro et al., 2021). These limitations often result in partially known sequences, hindering comprehensive proteome analysis.

Despite these advancements, protein sequencing still faces significant challenges, including high error rates, complex data interpretation, and technological limitations (Smith et al., 2024; Filius et al., 2024; Searle, 2024). Overcoming these hurdles requires further advancements in sequencing technologies, sophisticated data processing algorithms, and improved experimental protocols to enhance accuracy, reproducibility, and scalability (Brady and Meyer, 2022; Searle, 2024).

Recent advancements in click chemistry and bioorthogonal chemistry (Stump, 2022; Koniev and Wagner, 2015; Scinto et al., 2021) have attempted to address this issue by enabling the identification of specific amino acids and their positions. For instance, Zheng et al. demonstrated the sequencing of short antibody peptides using targeted amino acid labeling (Zheng et al., 2024). However, these techniques are still limited by the number of amino acids that can be correctly identified, resulting in partially masked sequences (e.g., xCxxCxxx, where C is the experimentally identifiable amino acid) (Swaminathan et al., 2018). Additionally, the click chemistry platform demonstrated by Zheng et al. only works with non-native peptides that have undergone a priori chemical modifications (Zheng et al., 2024); limitations in this step mean that parts of the proteomic retinue are not measurable with this approach. Our language model can work with input from this non-native peptide platform, as well as with hypothetical future developments in bioorthogonal chemistry that would allow Edman degradation of native peptides.

To address this specific limitation, we propose a novel approach leveraging pretrained language models. Large language models (LLMs) have shown remarkable adaptability in interpreting protein sequences, excelling in predicting structures, functions, and evolutionary relationships (Bepler and Berger, 2021; Ruffolo and Madani, 2024; Lv et al., 2024). We hypothesize that these models can be used to predict the identity of amino acids that are conditionally difficult to determine experimentally.

In this paper, we present a method that simulates partial sequencing data by selectively masking amino acids that are experimentally challenging to identify in protein sequences from the UniRef database. This targeted masking mimics real-world sequencing limitations. We then utilize ProtBert (Elnaggar et al., 2021), a transformer-based model, to predict these masked residues, providing a probabilistic reconstruction of the complete protein sequence.

We evaluate our approach on three Escherichia bacterial species: E. coli, E. albertii, and E. fergusonii. Our results demonstrate high prediction accuracy even with extremely limited known amino acids. We also validate the biological relevance of our predictions through structural assessment using AlphaFold (Jumper et al., 2021) and standard structure evaluation metrics such as the template modeling score (TM-score) (Xu and Zhang, 2010; Zhang and Skolnick, 2005) and the local distance difference test (lDDT) (Mariani et al., 2013).

This innovative integration of simulated experimental constraints with computational predictions offers a promising avenue for enhancing protein sequence analysis. By improving our ability to interpret partially sequenced data, we aim to accelerate advancements in proteomics and structural biology, potentially unlocking new insights into protein structure and function.

The remainder of this paper is structured as follows: In the Methods section, we detail our data preparation, model fine-tuning process, and evaluation metrics. The Results section presents our comprehensive analysis of our model’s performance across various scenarios. Finally, we discuss the implications of our findings and potential future directions in the Discussion section.

2. Problem statement

Assuming that partial sequencing data can be acquired from Edman degradation enhanced by click chemistry, which provides the positions and identities of a limited set of amino acids within a protein, we aim to predict the complete protein sequence. This task involves using an LLM modified and fine-tuned from BERT/ProtBERT (Devlin et al., 2018; Brandes et al., 2022) to fill in the gaps left by unknown amino acids, given the context provided by the known ones in combination with the protein's domain constraint, determined at the species level. Our goal is to develop a computational approach that can accurately predict the full protein sequence from this partial information, potentially revolutionizing protein sequencing methodologies.
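The gap-filling task can be illustrated with a minimal sketch. It assumes a hypothetical list `probs` holding one probability distribution per position, standing in for the output of a fine-tuned language model; the function name and the use of `x` as the unknown-residue placeholder (as in xCxxCxxx) are illustrative choices, not the paper's implementation.

```python
from typing import Dict, List

MASK = "x"  # placeholder for an unidentified residue, as in "xCxxCxxx"


def reconstruct(partial: str, probs: List[Dict[str, float]]) -> str:
    """Fill each masked position with the most probable residue.

    `probs[i]` is a hypothetical distribution over amino acids for
    position i; it is only consulted where `partial[i]` is masked.
    """
    if len(partial) != len(probs):
        raise ValueError("need one distribution per position")
    out = []
    for residue, dist in zip(partial, probs):
        if residue != MASK:
            out.append(residue)  # experimentally known: keep as-is
        else:
            out.append(max(dist, key=dist.get))  # model's best guess
    return "".join(out)
```

Known residues are never overwritten, so the reconstruction always agrees with the experimental measurement wherever one exists.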

3. Methods

3.1. Protein Dataset

For model training, which we conducted on 8 NVIDIA DGX A100 80GB cards, we utilized the UniProt Reference Clusters (UniRef) database (Suzek et al., 2014), specifically UniRef100, focusing on three bacterial species: Escherichia coli (NCBI taxID 562), Escherichia albertii (NCBI taxID 208962), and Escherichia fergusonii (NCBI taxID 564). UniRef100 combines identical sequences and subfragments with 11 or more residues into one entry, reducing potential data leakage between the training, evaluation, and testing datasets. Additionally, we removed sequences annotated as hypothetical proteins to ensure the biological relevance of the dataset. Following the pretrained model's data processing, we mapped non-canonical or unresolved amino acids ([BOUZ]) to unknown (X) (Elnaggar et al., 2021). The frequency distribution of amino acids extracted from the three species is presented in Figure 1.

We propose working on two cases of targeted sets of amino acids. The first set (KCYM) contains amino acids with two or more publications supporting successful identification: Lysine (K) (Tantipanjaporn and Wong, 2023; Anderson et al., 1964, 1963), Cysteine (C) (Vantourout et al., 2020; Renault et al., 2018; Grant, 2017), Tyrosine (Y) (Ban et al., 2013; Abdul Fattah et al., 2018; Szijj et al., 2020; Liu et al., 2019), and Methionine (M) (Lin et al., 2017b; Zang et al., 2019). The second set (KCYMRHWST) includes the amino acids from the first set, with additional amino acids that have at least one publication supporting successful identification: Arginine (R) (Wanigasekara et al., 2018), Histidine (H) (Wan et al., 2022), Tryptophan (W) (Decoene et al., 2022), Serine (S) (Vantourout et al., 2020), and Threonine (T) (Webster et al., 2014).

Figure 1. Frequency distribution of amino acids in the three Escherichia species.

3.2. Training Model

We chose ProtBERT (Brandes et al., 2022) as our pretrained model due to its strong performance on general tasks, lightweight nature (420M parameters), and bidirectionality. We modified the model's architecture to use a masked language modeling head for our training task, making it compatible with our problem formulation. We trained one model per domain (species) and per set of amino acids, resulting in a total of six fine-tuned and architecturally modified models. For E. coli and E. albertii, we performed training and evaluation on 50k and 25k sequences, respectively. Due to data limitations, E. fergusonii was trained and evaluated on 40k and 5k sequences, respectively. Given the extremely high masking rate (67–88%, see Figure 1 and Table 2), we removed any totally masked sequences before constructing the training and evaluation datasets. The pretrained and architecturally modified model was then fine-tuned using HuggingFace transformers (Wolf et al., 2019).
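The masking rate and the removal of totally masked sequences can be sketched as follows. The function names are hypothetical, and treating empty sequences as removable is an assumption of this sketch rather than a documented step.

```python
from typing import Iterable, List


def masking_rate(seq: str, known: str) -> float:
    """Fraction of residues that would be masked, i.e. are not in `known`."""
    if not seq:
        raise ValueError("empty sequence")
    masked = sum(1 for aa in seq if aa not in known)
    return masked / len(seq)


def drop_fully_masked(seqs: Iterable[str], known: str) -> List[str]:
    """Keep only sequences with at least one known residue."""
    return [s for s in seqs if s and masking_rate(s, known) < 1.0]
```

For instance, with the known set KCYM the sequence `MKAA` has a masking rate of 0.5, while `AAAA` is fully masked and would be dropped.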

During training, we followed ProtBERT's pretrained tokenization scheme: one token per residue. Any residue not in the set of known amino acids was set to [MASK], and sequences were padded or truncated to a length of 1024. Training used a batch size of 50 and was evaluated using cross-entropy loss and unmasking accuracy. All models were trained on an A100 GPU and 16 CPUs. An overview of the training pipeline is shown in Figure 2.
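A minimal sketch of this encoding step is given below. It assumes the space-separated, one-residue-per-token input format used by the public ProtBert checkpoints; the function name is hypothetical, padding is left to the tokenizer, and the special tokens a tokenizer would add are ignored when truncating.

```python
from typing import Tuple

# Non-canonical or unresolved residues are mapped to the unknown token X.
TO_UNKNOWN = str.maketrans("BOUZ", "XXXX")


def encode(seq: str, known: str = "KCYM", max_len: int = 1024) -> Tuple[str, str]:
    """Return (masked input, label) strings with one token per residue."""
    residues = seq.upper().translate(TO_UNKNOWN)[:max_len]
    # Residues outside the known set are hidden from the model.
    inputs = " ".join(aa if aa in known else "[MASK]" for aa in residues)
    # The labels keep every residue so the loss can score masked positions.
    labels = " ".join(residues)
    return inputs, labels
```

The label string retains the full sequence, so the cross-entropy loss can be computed at exactly the positions that were replaced by `[MASK]`.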

Figure 2. Overview of the training pipeline.

3.3. Evaluation Strategies

The performance of the model predictions was evaluated based on two major aspects: prediction accuracy and secondary structure similarity. For prediction accuracy, we computed three measures to compare the primary sequences of the predicted and the true proteins: per-token accuracy, average per-sequence unmasking accuracy (i.e., excluding known amino acids), and average per-sequence total identity. Besides using an in-domain inference dataset to study the performance of the models (see Figures 4 and 5), we also examined cross-domain accuracy among the three species. This aims to observe how taxonomic relationships (prior knowledge about evolutionary distance in a phylogenetic tree) correlate with the performance of our model predictions (see Table 1). For the two sets of amino acids (KCYM and KCYMRHWST), we performed testing inference on 5,000 sequences per species (randomly sampled for 3 folds).
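The three accuracy measures can be sketched on plain strings as follows. The function names are hypothetical, and computing per-token accuracy by pooling all masked positions across sequences is an assumption of this sketch, since the exact pooling is not spelled out in the text.

```python
from typing import Iterable, Tuple


def _check(true: str, pred: str) -> None:
    if len(true) != len(pred) or not true:
        raise ValueError("sequences must be non-empty and equally long")


def unmasking_accuracy(true: str, pred: str, known: str) -> float:
    """Accuracy over masked positions only (known residues excluded)."""
    _check(true, pred)
    masked = [(t, p) for t, p in zip(true, pred) if t not in known]
    if not masked:
        raise ValueError("no masked positions")
    return sum(t == p for t, p in masked) / len(masked)


def identity(true: str, pred: str) -> float:
    """Fraction of all positions that match, known residues included."""
    _check(true, pred)
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def per_token_accuracy(pairs: Iterable[Tuple[str, str]], known: str) -> float:
    """Accuracy pooled over every masked token in the dataset."""
    hits = total = 0
    for true, pred in pairs:
        _check(true, pred)
        for t, p in zip(true, pred):
            if t not in known:
                total += 1
                hits += t == p
    return hits / total
```

Identity is always at least as high as unmasking accuracy for the same sequence, because the known residues count as matches.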

To provide useful suggestions for prioritizing amino acids in the experimental development of click-chemistry-based amino acid identification in the wet lab, we also trained models with the amino acids from the small set plus one additional amino acid from [RHWST], creating five additional study cases: KCYMR, KCYMH, KCYMW, KCYMS, and KCYMT. This configuration was applied only to the E. coli domain (with the same training protocols), and inference was done using 25k E. coli sequences (one fold) (see Table 2).

For the second aspect of measuring the quality of our predictions, we analyzed an important property of proteins: structure. AlphaFold (Jumper et al., 2021) is renowned for its high-accuracy prediction of protein three-dimensional structures from amino acid sequences using multiple sequence alignment in combination with a deep learning architecture. Recently, AlphaFold-predicted structures have been widely adopted in large annotated databases, such as the UniProt KnowledgeBase (UniProtKB) (uni, 2023). In this study, we used the AlphaFold platform to examine how prediction errors in sequences with less than 90% unmasking accuracy affect structural integrity. Our study centered on sequences from the E. coli inference (fold 1), derived from the KCYM case, with unmasking accuracy bounded to the range 50–90%. We used the reduced database setting in AlphaFold to generate structure predictions for our unmasked sequences, and used the available AlphaFold structures for the true sequences (only those annotated in UniProtKB). Filtering under these criteria yielded a total of 124 sequences for our structure analysis (see Results 4.4). Figure 3 visualizes the protein structure derived from the predicted sequence and the one derived from the actual UniProtKB sequence, the two structures overlaid, and the alignment between the predicted and actual amino acid sequences for one of these 124 proteins (PDB A0A7H9QJ10).

Figure 3. AlphaFold structures for the predicted and actual sequences of an example protein, their overlay, and the pairwise sequence alignment.

We computed the TM-score (Xu and Zhang, 2010; Zhang and Skolnick, 2005) to compare the global similarity between the topologies of two structures. For local similarity, we computed the local distance difference test on the backbone C$\alpha$ atoms (lDDT-C$\alpha$) (Mariani et al., 2013; Biasini et al., 2013) between the two structures, similar to the AlphaFold paper (see Figure 6).
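For reference, the TM-score sums a distance-dependent term over aligned residue pairs and normalizes by the target length, with a length-dependent scale d0 = 1.24 (L - 15)^(1/3) - 1.8. The sketch below evaluates that sum for a given list of per-residue distances; it does not perform the search over superpositions that the full TM-score maximizes, and the fixed d0 of 0.5 for short chains is a guard assumed in this sketch. The function names are hypothetical.

```python
from typing import Sequence


def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (in angstroms)."""
    if target_length <= 21:
        return 0.5  # guard for short chains (assumption of this sketch)
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score_from_distances(distances: Sequence[float], target_length: int) -> float:
    """TM-score term for one fixed superposition.

    `distances` holds the distance between each aligned residue pair
    after superposition; unaligned residues contribute nothing.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length
```

A perfect match (all distances zero over the full length) scores 1.0, and a structure whose aligned residues all sit exactly d0 away scores 0.5.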

Figure 4. Prediction accuracy with the known amino acid set KCYM.

4. Results

4.1. Inference Accuracy

The accuracy of sequence predictions generated using the known amino acid set KCYM is presented in Figure 4, and that for the set KCYMRHWST is presented in Figure 5. As shown in the confusion heatmap for KCYM, even with a masking rate of 88.5%, the per-amino-acid accuracy reaches 74.7–80.9% in E. coli, 85.3–90.5% in E. albertii, and 83.8–88.8% in E. fergusonii.

The top left panel of Figure 4 shows the average per-sequence accuracies (unmasking and identity) vs. sequence length, averaged per 50-residue bin and highlighted using the 75th percentile interval. The average per-sequence unmasking accuracy and identity are 73.53% and 76.75% for E. coli, 88.46% and 89.87% for E. albertii, and 88.33% and 89.73% for E. fergusonii. The performance of the model decays when the protein sequence length exceeds the model's maximum length of 1024 residues. This behavior is expected because the BERT model has linear positional embedding and the maximum training length is set to 1024 residues. Note that only about 5% of sequences in the data had a length exceeding this threshold.
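The per-bin averaging can be sketched as below. The function name is hypothetical, and the bin edges (a length of exactly 50 falls into the 50–99 bin) are an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mean_accuracy_by_length(
    records: Iterable[Tuple[int, float]], bin_width: int = 50
) -> Dict[int, float]:
    """Average per-sequence accuracy within fixed-width length bins.

    Each record is (sequence length, accuracy); keys of the result are
    the lower edge of each bin.
    """
    bins = defaultdict(list)
    for length, acc in records:
        bins[(length // bin_width) * bin_width].append(acc)
    return {start: sum(vals) / len(vals) for start, vals in sorted(bins.items())}
```

Sequences longer than the 1024-residue maximum simply land in the higher bins, which makes the decay beyond that length visible in the binned curve.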

After taking this into account, the line plots (left) indicate that the performance of the prediction is more stable and accurate for longer protein sequences. Specifically, with just the four known amino acids KCYM, the unmasking accuracy for sequences longer than 300 residues reached approximately 80% for E. coli and over 90% for E. albertii and E. fergusonii. However, it should be noted that only about half of the protein sequences in these species are longer than 300 residues.

In the case of knowing nine amino acids KCYMRHWST (see Figure 5), where the masking rate is 67.1%, the per-amino-acid accuracy reaches 84.1–89.1% in E. coli, 90.5–94.1% in E. albertii, and 90.5–94.0% in E. fergusonii. The average per-sequence unmasking accuracy and identity are 83.26% and 88.96% in E. coli, 93.38% and 95.62% in E. albertii, and 93.49% and 95.69% in E. fergusonii. For proteins longer than 200 residues, which represent 80% of protein sequences, the unmasking accuracy for E. coli exceeds 80%, while E. albertii and E. fergusonii exceed 90%.

Figure 5. Prediction accuracy with the known amino acid set KCYMRHWST.

4.2. Evolution Inference

We evaluated the model's performance in capturing evolutionary information by cross-inferring each species' protein sequences using models trained on each of the other species. Our three species, E. coli, E. albertii, and E. fergusonii, are all members of the Escherichia genus and are thus expected to share a significant amount of genetic information, implying considerable homology in their protein sequences.

Table 1. Average per-sequence unmasking accuracy for cross-species inference. Rows are the species used for inference; columns are the fine-tuned models.

Fine-tuned on known tokens: KCYM (3 folds)
Inference species	E. coli model	E. albertii model	E. fergusonii model
E. coli	73.53% (0.49)	51.38% (0.17)	45.22% (0.09)
E. albertii	65.15% (0.49)	88.46% (0.18)	50.16% (0.64)
E. fergusonii	60.06% (0.35)	50.61% (0.34)	88.33% (0.15)

Fine-tuned on known tokens: KCYMRHWST (3 folds)
Inference species	E. coli model	E. albertii model	E. fergusonii model
E. coli	83.26% (0.45)	69.08% (0.33)	64.02% (0.17)
E. albertii	82.35% (0.45)	93.38% (0.05)	71.04% (0.67)
E. fergusonii	79.58% (0.46)	72.49% (0.18)	93.49% (0.13)

As shown in Table 1, when the model only knows the four amino acids KCYM, the unmasking accuracy is high only when training and inference are from the same domain (see Results 4.1). The unmasking accuracy is significantly lower when the model predicts out-of-domain sequences. The results of the KCYM case reveal that, when the domain is specified, the model needs only a small set of known amino acids (in this case, KCYM) to capture the characteristics of the domain's protein sequences, achieving an average accuracy of at least 73%. However, with an amino acid set of this size, our model fails to capture the nuances of sequences beyond the specified domain.

In the case of knowing nine amino acids KCYMRHWST, the in–domain unmasking accuracy increased by 5–10% compared to the previous KCYM case. Besides the high in–domain accuracy, the model predictions for out–of–domain sequences also performed much better, with the lowest accuracy at 64.02% when training on E. fergusonii and inferring on E. coli, and the highest accuracy at 82.35% when training on E. coli and inferring on E. albertii.

Overall, the model trained on E. coli performed best on out–of–domain inference, followed by E. albertii, and lastly E. fergusonii. This outcome is expected due to the high yield of protein sequences available from E. coli and E. albertii compared to E. fergusonii. In summary, the cross–inference results indicate that knowledge of the species to which a sequence belongs increases prediction accuracy. Furthermore, they demonstrate the model’s potential for predicting protein sequences based on another related species when the sequences’ species identity may not be known.
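The ranking of models by out-of-domain performance can be checked directly from the Table 1 averages. The sketch below uses the reported mean accuracies; the helper names are hypothetical, and averaging the two off-diagonal entries per model is one simple way to summarize out-of-domain performance, not necessarily the authors' procedure.

```python
SPECIES = ("E. coli", "E. albertii", "E. fergusonii")

# Rows: inference species; columns: fine-tuned model (both in SPECIES order).
TABLE1 = {
    "KCYM": (
        (73.53, 51.38, 45.22),
        (65.15, 88.46, 50.16),
        (60.06, 50.61, 88.33),
    ),
    "KCYMRHWST": (
        (83.26, 69.08, 64.02),
        (82.35, 93.38, 71.04),
        (79.58, 72.49, 93.49),
    ),
}


def mean_out_of_domain(matrix, model_idx):
    """Mean accuracy of one model on the two species it was not trained on."""
    vals = [row[model_idx] for i, row in enumerate(matrix) if i != model_idx]
    return sum(vals) / len(vals)


def rank_models(matrix):
    """Species names ordered from best to worst out-of-domain model."""
    order = sorted(
        range(len(SPECIES)),
        key=lambda m: mean_out_of_domain(matrix, m),
        reverse=True,
    )
    return [SPECIES[m] for m in order]
```

For both known-token sets, this ordering places the E. coli model first, then E. albertii, then E. fergusonii, consistent with the ranking stated in the text.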

4.3. Generalizability

Table 2. E. coli inference results for each set of known amino acids.

Metric	KCYM	KCYMR	KCYMH	KCYMW	KCYMS	KCYMT	KCYMRHWST
Per-token acc. [%]	78.17%	80.26%	78.85%	79.52%	78.70%	80.38%	86.82%
Per-seq unmasking acc. [%]	73.53%	76.28% (32.26)	75.00% (33.25)	75.83% (32.92)	75.22% (32.91)	75.74% (32.53)	83.26%
Per-seq identity [%]	76.75%	80.55% (26.41)	78.57% (28.46)	79.12% (28.40)	79.83% (26.71)	80.08% (26.67)	88.96%
Masking ratio [%]	88.47%	82.78% (3.73)	86.30% (3.40)	86.92% (3.28)	82.34% (3.69)	82.88% (3.43)	67.09%

From the previous results (see Results 4.1 and 4.2), our work suggests that transformer models like BERT can predict protein sequences with high accuracy, given prior knowledge of a limited set of amino acids and the species domain. Additionally, the accuracy of the predictions increases significantly with a larger set of known amino acids. However, expanding the set of identifiable amino acids introduces exponential challenges in Edman degradation: peptides must undergo more chemical identification cycles, which increases sequencing noise and the risk of unstable peptide degradation. Therefore, we investigated which amino acids should be prioritized for future click-chemistry-based identification. In essence, we compare how much unmasking each new amino acid improves model performance, and use these results to prioritize future wet lab efforts. We evaluated the inference of five additional models, each trained on five known amino acids: the four in KCYM plus one from the set [RHWST] (see Table 2).

Among the five experiments, KCYMS has the highest sequence coverage from known amino acids, at 17.66% (corresponding to an 82.34% masking rate). However, KCYMR performs best on the per-sequence metrics, with an average per-sequence unmasking accuracy of 76.28% (2.75 percentage points above KCYM) and an average per-sequence identity of 80.55% (3.80 points above); its per-token accuracy is 80.26% (2.09 points above). The other four cases show results comparable to KCYMR (within 2 percentage points).
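The comparison against the KCYM baseline is simple arithmetic on Table 2, sketched below with the reported per-sequence unmasking accuracies and masking ratios. The dictionary layout and function names are illustrative choices.

```python
# (per-sequence unmasking accuracy %, masking ratio %) taken from Table 2.
TABLE2 = {
    "KCYM": (73.53, 88.47),
    "KCYMR": (76.28, 82.78),
    "KCYMH": (75.00, 86.30),
    "KCYMW": (75.83, 86.92),
    "KCYMS": (75.22, 82.34),
    "KCYMT": (75.74, 82.88),
}


def gains_over_baseline(table, baseline="KCYM"):
    """Percentage-point gain in unmasking accuracy from one extra residue."""
    base_acc, _ = table[baseline]
    return {
        name: round(acc - base_acc, 2)
        for name, (acc, _) in table.items()
        if name != baseline
    }


def coverage(table):
    """Share of residues that are known, i.e. 100 minus the masking ratio."""
    return {name: round(100.0 - mask, 2) for name, (_, mask) in table.items()}
```

The largest accuracy gain (KCYMR) and the largest coverage (KCYMS) belong to different sets, which illustrates that coverage alone does not determine which residue is most informative.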

4.4. Structure Analysis

The comparison of lDDT-C$\alpha$ vs. TM-score between the AlphaFold structures of the predicted and true sequences is shown in Figure 6, in which the left panel is colored by unmasking accuracy and the right panel is colored by sequence length.

The TM-score evaluates the global similarity between two structures, while lDDT-C$\alpha$ assesses the local distances of the backbones. According to the plot, a high lDDT-C$\alpha$ can occur even with a low TM-score, but the opposite case, a high TM-score with a low lDDT-C$\alpha$, does not. The former is often caused by large differences in the bending angles at coil regions, which make the global shapes of the structures diverge and hence result in a low TM-score. The local structural conformations, such as alpha-helices and beta-sheets, are nevertheless conserved, leading to a high lDDT-C$\alpha$. An AlphaFold example result (UniProtKB ID: A0A066T1W5) is presented in Figure 3, showing the molecular view of the two structures (using py3Dmol (Koes and Others, 2020)) and their pairwise alignment, colored by unmasking matches (green) and mismatches (red). An additional illustrative example of a different protein is presented in the appendix.
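Why lDDT-C$\alpha$ is insensitive to global rearrangements can be seen from a minimal sketch of the score. It assumes the commonly used parameters (a 15 Å inclusion radius in the reference structure and tolerance thresholds of 0.5, 1, 2, and 4 Å) and C$\alpha$ coordinates given as 3-tuples; the function name is hypothetical and this is not the implementation used in the paper.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def lddt_ca(
    ref: Sequence[Coord],
    model: Sequence[Coord],
    radius: float = 15.0,
    thresholds: Sequence[float] = (0.5, 1.0, 2.0, 4.0),
) -> float:
    """Superposition-free local distance difference test on C-alpha atoms."""
    if len(ref) != len(model):
        raise ValueError("structures must have the same number of residues")
    preserved = total = 0
    for i, j in combinations(range(len(ref)), 2):
        d_ref = math.dist(ref[i], ref[j])
        if d_ref >= radius:
            continue  # only local pairs in the reference are scored
        d_model = math.dist(model[i], model[j])
        total += len(thresholds)
        preserved += sum(abs(d_ref - d_model) < t for t in thresholds)
    if total == 0:
        raise ValueError("no residue pairs within the inclusion radius")
    return preserved / total
```

Because only pairwise distances within the inclusion radius are compared, a rigid shift of the whole model or a bend between distant regions leaves most scored pairs unchanged, whereas the TM-score depends on a single global superposition.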

To understand how the quality of our model's predictions (measured by unmasking accuracy) affects the predicted protein structure, we first need to know which TM-score values indicate that a pair of proteins shares the same topology. Xu and Zhang, working with the CATH and SCOP databases, reported that a high posterior probability of two structures having the same topology corresponds to a TM-score roughly between 0.4 and 0.6, with the specific threshold varying by dataset (Xu and Zhang, 2010). In our structure results, obtained from 124 E. coli samples with KCYM as the known set, we also observed a general decrease in lDDT-C$\alpha$ values when the TM-score falls below 0.6, and hence we use 0.6 as our evaluation threshold. For TM-score > 0.6, we have high statistical confidence that the two structures share the same topology. For TM-score < 0.6, we need to evaluate auxiliary metrics such as lDDT-C$\alpha$, unmasking accuracy, and sequence length to draw a conclusion about topological similarity.

Figure 6's left panel suggests that we have high confidence in the structural similarity between our predicted sequences and the true sequences (TM-score > 0.6) when the unmasking accuracy is above 75%. For outliers where the unmasking accuracy is > 75% but the TM-score is < 0.6, we notice that the sequences are often long (see Figure 6's right panel). Because of their length, these proteins are expected to have a higher chance of local divergence, which lowers the sensitive TM-score while the lDDT-C$\alpha$ remains high.

However, although low-accuracy predictions might be expected to have low structural similarity, this is not always the case. Among sequences with TM-score > 0.6, many have unmasking accuracy < 75%, and some even fall below 65%. These sequences are often observed to have lower lDDT-C$\alpha$ than those with high accuracy.
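The reading of the scatter plot described in this section amounts to sorting each protein into one of four quadrants. The sketch below uses the 0.6 TM-score threshold and the 75% unmasking accuracy level from the text; the labels and function names are hypothetical, and any input values would come from the reader's own analysis.

```python
from collections import Counter
from typing import Iterable, Tuple


def quadrant(
    tm_score: float, unmask_acc: float, tm_cut: float = 0.6, acc_cut: float = 75.0
) -> str:
    """Label one protein by TM-score and unmasking accuracy (in percent)."""
    topo = "same-topology" if tm_score > tm_cut else "check-auxiliary"
    acc = "high-acc" if unmask_acc > acc_cut else "low-acc"
    return f"{topo}/{acc}"


def summarize(samples: Iterable[Tuple[float, float]]) -> Counter:
    """Count proteins per quadrant; each sample is (TM-score, accuracy %)."""
    return Counter(quadrant(tm, acc) for tm, acc in samples)
```

Proteins labelled `check-auxiliary` are the ones for which lDDT-C$\alpha$ and sequence length would need to be inspected before drawing a conclusion about topology.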

5. Future Directions

Peptide sequencing enabled by our language model will have many important implications for the development of liquid biopsies, which could yield more information for treatment decisions in the oncology clinic. Liquid biopsy is a minimally invasive tool to identify cancer biomarkers within fluids such as blood plasma and urine. These liquid samples have been readily explored as sources of nucleotide biomarkers such as non-coding RNAs and tumor-specific DNA, but creating diagnostics based on proteins has been limited by signal-to-noise ratios for the detection of low-abundance hits, and by the difficulty of attributing proteins to cancer-specific cells without first isolating the circulating cancer cells (Marchioni et al., 2021). However, liquid biopsies are advantageous due to their safety, high repeatability, and ability to monitor disease progression and prognosis, all without the need for an inpatient procedure (Bergerot et al., 2018).

To this end, recent advances have focused on the annotation of the cell-free transcriptome of plasma (Koh et al., 2014; Vorperian et al., 2022) and urine (Vorperian et al., 2023; Lin et al., 2017a; Sin et al., 2017) to identify candidate RNAs for disease progression. There are notable protein signatures that are highly informative about disease progression and cancer treatment prognosis, which are present for clear-cell renal cell carcinomas (ccRCC) (Marchioni et al., 2021), prostate cancer (Massoner et al., 2014), and urothelial cancers (Dressler et al., 2024; Dressler et al., 2023). These changes are notable for multiple proteoforms: immune checkpoint proteins PD-1/PDL-1 (Gulati and Vogelzang, 2021; Larrinaga et al., 2021), CTLA4 (Gulati and Vogelzang, 2021), epithelial cell adhesion molecule (EpCAM), and glycosylation changes in the protein EpEX (Dressler et al., 2023; Fellinger et al., 2008). These changes are currently monitored by immunohistochemical (IHC) assessments of tumor biopsies. IHC is difficult to apply widely to novel biomarker discovery and is limited by the efficacy of the antibodies employed in these assays. Notably, unless antibodies are specifically generated for cancer-specific proteoforms, obtaining information on these proteoforms is an arduous task.

Thus, there is a significant technological gap between current diagnostic assays and the proteoform resolution necessary to characterize and quantifiably identify cancer-specific biomarkers and prognosis indicators (DiMeo et al., 2020). Taken together, an improvement in proteoform identification and quantification with single-molecule resolution would foster the rapid development and implementation of well-studied proteoforms as both diagnostic and prognostic biomarkers. Optimally, such a technology would enable the detection of protein analytes in fecal, urine, or plasma samples.

Urine samples are safe and relatively simple to work with, have minimal variation in sample complexity compared to blood plasma, and contain signatures of genitourinary cells. We have identified urine as a “gold standard” for a liquid biopsy that is facile to collect and prepare for analysis and that will likely contain invaluable information related to cancer diagnosis and prognosis. Thus, in the future, we aim to expand peptide sequencing via the language model presented in this paper to develop a platform to directly sequence proteins within a complex milieu through highly specific chemical ligation of amplifiable DNA barcodes for amino acid identity, sequence position, and peptide identity, which provides a quantifiable readout with higher sensitivity than mass spectrometry alone (Timp and Timp, 2020; Alfaro et al., 2021). This future platform will be driven by closely entwined advancements in machine learning algorithm design and chemical reaction development and characterization.

6. Discussion

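Figure 6. lDDT-C$\alpha$ vs. TM-score between the AlphaFold structures of the predicted and true sequences, colored by unmasking accuracy (left) and sequence length (right).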

We present a protein language model designed to determine the complete sequence of a peptide based on the measurement of a limited set of amino acids. Traditional protein sequencing primarily relies on mass spectrometry, with some novel Edman degradation-based platforms capable of sequencing non-native peptides. However, these techniques face significant limitations in accurately identifying all amino acids, thus hindering comprehensive proteome analysis. Our approach simulates partial sequencing data by selectively masking amino acids that are experimentally challenging to identify in protein sequences from the UniRef database, thereby mimicking real-world sequencing limitations. By modifying and fine-tuning a ProtBert-derived transformer-based model, we predict these masked residues, providing an approximation of the complete sequence. Unlike traditional multiple sequence alignment (MSA) approaches, our model views sequence data as partial sequences, providing a new perspective and methodology for protein sequence analysis.

Our method, evaluated on three bacterial Escherichia species, achieves per-amino-acid accuracy of up to 90.5% when only four amino acids ([KCYM]) are known. Structural assessments using AlphaFold and TM-score validate the biological relevance of our predictions, and the model demonstrates potential for evolutionary analysis through cross-species performance. This integration of simulated experimental constraints with computational predictions offers a promising avenue for enhancing protein sequence analysis. By improving our ability to interpret partially sequenced data, our approach has the potential to accelerate advancements in proteomics and structural biology, enabling a probabilistic reconstruction of complete protein sequences from limited experimental data.

When first introduced, Oxford Nanopore's DNA/RNA sequencing platform, which makes inferences from incomplete signal (squiggles) and converts them algorithmically to sequence space, had lower inferred-sequence accuracy (Brown and Clarke, 2016; Laver et al., 2015) than our model achieves. This suggests that our computational model, paired with a few additional wet lab improvements, has the potential to yield the first clinically useful protein sequencing platform.

However, several challenges remain. Experimental verification is essential to validate our computational predictions and ensure their biological relevance. Additionally, successfully implementing the hypothetical Edman degradation pipeline requires effective peptide immobilization techniques without C-terminus modification. Overcoming these hurdles will be crucial for the practical application of our method.

The future directions for our research involve expanding peptide sequencing via our language model to develop a platform capable of directly sequencing proteins within complex milieus. This will involve highly specific chemical ligation of amplifiable DNA barcodes for amino acid identity, sequence position, and peptide identity, providing a quantifiable readout with higher sensitivity than mass spectrometry alone (Timp and Timp, 2020; Alfaro et al., 2021). Such advancements will drive closely intertwined developments in machine learning algorithm design and chemical reaction characterization, ultimately fostering rapid implementation of proteoform-based diagnostics and prognostics.

In summary, our computational approach, validated through structural assessment using AlphaFold and TM-score, demonstrates significant potential for improving protein sequence analysis. By integrating these predictions with experimental techniques, we aim to bridge the gap between current technologies and the high-resolution identification required for advanced proteomics.

Acknowledgements.

This section will be written after the blind review.

7. Code Availability

Please use the following URL to access anonymized code during the review process: https://github.com/aauthors131/protein-sequencing-LLMs.git

References

uni (2023) 2023. UniProt: the universal protein knowledgebase in 2023. Nucleic acids research 51, D1 (2023), D523–D531.
Abdul Fattah et al. (2018) Tanzeela Abdul Fattah, Aamer Saeed, and Fernando Albericio. 2018. Recent advances towards sulfur (VI) fluoride exchange (SuFEx) click chemistry. Journal of Fluorine Chemistry 213 (2018), 87–112. https://doi.org/10.1016/j.jfluchem.2018.07.008
Alfaro et al. (2021) Javier Antonio Alfaro, Peggy Bohländer, Mingjie Dai, Mike Filius, Cecil J Howard, Xander F Van Kooten, Shilo Ohayon, Adam Pomorski, Sonja Schmid, Aleksei Aksimentiev, et al. 2021. The emerging landscape of single-molecule protein sequencing technologies. Nature methods 18, 6 (2021), 604–617.
Anderson et al. (1963) George W Anderson, Joan E Zimmerman, and Francis M Callahan. 1963. N-hydroxysuccinimide esters in peptide synthesis. Journal of the American Chemical Society 85, 19 (1963), 3039–3039.
Anderson et al. (1964) George W Anderson, Joan E Zimmerman, and Francis M Callahan. 1964. The use of esters of N-hydroxysuccinimide in peptide synthesis. Journal of the American Chemical Society 86, 9 (1964), 1839–1842.
Ban et al. (2013) Hitoshi Ban, Masanobu Nagano, Julia Gavrilyuk, Wataru Hakamata, Tsubasa Inokuma, and Carlos F Barbas III. 2013. Facile and stabile linkages through tyrosine: bioconjugation strategies with the tyrosine-click reaction. Bioconjugate chemistry 24, 4 (2013), 520–532.
Bepler and Berger (2021) Tristan Bepler and Bonnie Berger. 2021. Learning the protein language: Evolution, structure, and function. Cell systems 12, 6 (2021), 654–669.
Bergerot et al. (2018) Paulo G Bergerot, Andrew W Hahn, Cristiane Decat Bergerot, Jeremy Jones, and Sumanta Kumar Pal. 2018. The role of circulating tumor DNA in renal cell carcinoma. Current treatment options in oncology 19 (2018), 1–11.
Biasini et al. (2013) Marco Biasini, Tobias Schmidt, Stefan Bienert, Valerio Mariani, Gabriel Studer, Juergen Haas, Niklaus Johner, Andreas Daniel Schenk, Ansgar Philippsen, and Torsten Schwede. 2013. OpenStructure: an integrated software framework for computational structural biology. Acta Crystallographica Section D: Biological Crystallography 69, 5 (2013), 701–709.
Brady and Meyer (2022) Morgan M Brady and Anne S Meyer. 2022. Cataloguing the proteome: Current developments in single-molecule protein sequencing. Biophysics Reviews 3, 1 (2022).
Brandes et al. (2022) Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, and Michal Linial. 2022. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 38, 8 (02 2022), 2102–2110. https://doi.org/10.1093/bioinformatics/btac020
Brown and Clarke (2016) Clive G Brown and James Clarke. 2016. Nanopore development at Oxford nanopore. Nature biotechnology 34, 8 (2016), 810–811.
Decoene et al. (2022) Klaas W Decoene, Kamil Unal, An Staes, Olivier Zwaenepoel, Jan Gettemans, Kris Gevaert, Johan M Winne, and Annemieke Madder. 2022. Triazolinedione protein modification: from an overlooked off-target effect to a tryptophan-based bioconjugation strategy. Chemical Science 13, 18 (2022), 5390–5397.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
DiMeo et al. (2020) Ashley DiMeo, Ihor Batruch, Marshall D Brown, Chuance Yang, Antonio Finelli, Michael A Jewett, Eleftherios P Diamandis, and George M Yousef. 2020. Searching for prognostic biomarkers for small renal masses in the urinary proteome. International journal of cancer 146, 8 (2020), 2315–2325.
Dressler et al. (2024) Franz F Dressler, Falk Diedrichs, Deema Sabtan, Sofie Hinrichs, Christoph Krisp, Timo Gemoll, Martin Hennig, Paulina Mackedanz, Mareile Schlotfeldt, Hannah Voß, et al. 2024. Proteomic analysis of the urothelial cancer landscape. Nature Communications 15, 1 (2024), 4513.
Dressler et al. (2023) Franz F Dressler, Sofie Hinrichs, Marie C Roesch, and Sven Perner. 2023. EpCAM tumor specificity and proteoform patterns in urothelial cancer. Journal of Cancer Research and Clinical Oncology 149, 11 (2023), 8913–8922.
Elnaggar et al. (2021) Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Wang Yu, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, and Burkhard Rost. 2021. ProtTrans: Towards Cracking the Language of Lifes Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021), 1–1. https://doi.org/10.1109/TPAMI.2021.3095381
Fellinger et al. (2008) Markus Munz, Karin Fellinger, Tanja Hofmann, Barbel Schmitt, and Olivier Gires. 2008. Glycosylation is crucial for stability of tumour and cancer stem cell antigen EpCAM. Frontiers in Bioscience (2008), 5195–5201.
Filius et al. (2024) Mike Filius, Raman van Wee, Carlos de Lannoy, Ilja Westerlaken, Zeshi Li, Sung Hyun Kim, Cecilia de Agrela Pinto, Yunfei Wu, Geert-Jan Boons, Martin Pabst, et al. 2024. Full-length single-molecule protein fingerprinting. Nature Nanotechnology (2024), 1–8.
Grant (2017) Gregory A Grant. 2017. Modification of cysteine. Current protocols in protein science 87, 1 (2017), 15–1.
Gulati and Vogelzang (2021) Shuchi Gulati and Nicholas J Vogelzang. 2021. Biomarkers in renal cell carcinoma: Are we there yet? Asian journal of urology 8, 4 (2021), 362–375.
Hunt et al. (1986) Donald F Hunt, JR Yates 3rd, Jeffrey Shabanowitz, Scott Winston, and Charles R Hauer. 1986. Protein sequencing by tandem mass spectrometry. Proceedings of the National Academy of Sciences 83, 17 (1986), 6233–6237.
Jumper et al. (2021) John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596, 7873 (2021), 583–589.
Koes and Others (2020) David Koes and Others. 2020. py3Dmol: A Python Interface for 3Dmol.js. https://pypi.org/project/py3Dmol/ Accessed: 2024-07-09.
Koh et al. (2014) Winston Koh, Wenying Pan, Charles Gawad, H Christina Fan, Geoffrey A Kerchner, Tony Wyss-Coray, Yair J Blumenfeld, Yasser Y El-Sayed, and Stephen R Quake. 2014. Noninvasive in vivo monitoring of tissue-specific global gene expression in humans. Proceedings of the National Academy of Sciences 111, 20 (2014), 7361–7366.
Koniev and Wagner (2015) Oleksandr Koniev and Alain Wagner. 2015. Developments and recent advancements in the field of endogenous amino acid selective bond forming reactions for bioconjugation. Chemical Society Reviews 44, 15 (2015), 5495–5551.
Larrinaga et al. (2021) Gorka Larrinaga, Jon Danel Solano-Iturri, Peio Errarte, Miguel Unda, Ana Loizaga-Iriarte, Amparo Pérez-Fernández, Enrique Echevarría, Aintzane Asumendi, Claudia Manini, Javier C Angulo, et al. 2021. Soluble PD-L1 is an independent prognostic factor in clear cell renal cell carcinoma. Cancers 13, 4 (2021), 667.
Laver et al. (2015) Thomas Laver, J Harrison, PA O’neill, Karen Moore, Audrey Farbos, Konrad Paszkiewicz, and David J Studholme. 2015. Assessing the performance of the oxford nanopore technologies minion. Biomolecular detection and quantification 3 (2015), 1–8.
Libbrecht and Noble (2015) Maxwell W Libbrecht and William Stafford Noble. 2015. Machine learning applications in genetics and genomics. Nature Reviews Genetics 16, 6 (2015), 321–332.
Lieu et al. (2020) Elizabeth L Lieu, Tu Nguyen, Shawn Rhyne, and Jiyeon Kim. 2020. Amino acids in cancer. Experimental & molecular medicine 52, 1 (2020), 15–30.
Lin et al. (2017b) Shixian Lin, Xiaoyu Yang, Shang Jia, Amy M Weeks, Michael Hornsby, Peter S Lee, Rita V Nichiporuk, Anthony T Iavarone, James A Wells, F Dean Toste, et al. 2017b. Redox-based reagents for chemoselective methionine bioconjugation. Science 355, 6325 (2017), 597–602.
Lin et al. (2017a) Selena Y Lin, Jennifer A Linehan, Timothy G Wilson, and Dave SB Hoon. 2017a. Emerging utility of urinary cell-free nucleic acid biomarkers for prostate, bladder, and renal cancers. European urology focus 3, 2-3 (2017), 265–272.
Liu et al. (2019) Feng Liu, Hua Wang, Suhua Li, Grant AL Bare, Xuemin Chen, Chu Wang, John E Moses, Peng Wu, and K Barry Sharpless. 2019. Biocompatible SuFEx click chemistry: thionyl tetrafluoride (SOF4)-derived connective hubs for bioconjugation to DNA and proteins. Angewandte Chemie International Edition 58, 24 (2019), 8029–8033.
Lv et al. (2024) Liuzhenghao Lv, Zongying Lin, Hao Li, Yuyang Liu, Jiaxi Cui, Calvin Yu-Chian Chen, Li Yuan, and Yonghong Tian. 2024. Prollama: A protein large language model for multi-task protein language processing. arXiv preprint arXiv:2402.16445 (2024).
Maddocks et al. (2017) Oliver DK Maddocks, Dimitris Athineos, Eric C Cheung, Pearl Lee, Tong Zhang, Niels JF Van Den Broek, Gillian M Mackay, Christiaan F Labuschagne, David Gay, Flore Kruiswijk, et al. 2017. Modulating the therapeutic response of tumours to dietary serine and glycine starvation. Nature 544, 7650 (2017), 372–376.
Marchioni et al. (2021) Michele Marchioni, Juan Gomez Rivas, Anamaria Autran, Moises Socarras, Simone Albisinni, Matteo Ferro, Luigi Schips, Roberto Mario Scarpa, Rocco Papalia, and Francesco Esperto. 2021. Biomarkers for renal cell carcinoma recurrence: state of the art. Current Urology Reports 22, 6 (2021), 31.
Mariani et al. (2013) Valerio Mariani, Marco Biasini, Alessandro Barbato, and Torsten Schwede. 2013. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 21 (2013), 2722–2728.
Massoner et al. (2014) P Massoner, T Thomm, B Mack, G Untergasser, A Martowicz, K Bobowski, H Klocker, O Gires, and M Puhr. 2014. EpCAM is overexpressed in local and metastatic prostate cancer, suppressed by chemotherapy and modulated by MET-associated miRNA-200c/205. British journal of cancer 111, 5 (2014), 955–964.
Niall (1973) Hugh D Niall. 1973. [36] Automated edman degradation: The protein sequenator. In Methods in enzymology. Vol. 27. Elsevier, 942–1010.
Renault et al. (2018) Kévin Renault, Jean Wilfried Fredy, Pierre-Yves Renard, and Cyrille Sabot. 2018. Covalent modification of biomolecules through maleimide-based labeling strategies. Bioconjugate Chemistry 29, 8 (2018), 2497–2513.
Ruffolo and Madani (2024) Jeffrey A Ruffolo and Ali Madani. 2024. Designing proteins with language models. Nature biotechnology 42, 2 (2024), 200–202.
Scinto et al. (2021) Samuel L Scinto, Didier A Bilodeau, Robert Hincapie, Wankyu Lee, Sean S Nguyen, Minghao Xu, Christopher W Am Ende, MG Finn, Kathrin Lang, Qing Lin, et al. 2021. Bioorthogonal chemistry. Nature Reviews Methods Primers 1, 1 (2021), 30.
Searle (2024) Brian C Searle. 2024. Nanopore Protein Sequencing Achieves Significant New Milestones. Clinical Chemistry (2024), hvae041.
Sin et al. (2017) Mandy LY Sin, Kathleen E Mach, Rahul Sinha, Fan Wu, Dharati R Trivedi, Emanuela Altobelli, Kristin C Jensen, Debashis Sahoo, Ying Lu, and Joseph C Liao. 2017. Deep sequencing of urinary RNAs for bladder cancer molecular diagnostics. Clinical Cancer Research 23, 14 (2017), 3700–3710.
Smith et al. (2024) Matthew Beauregard Smith, Kent VanderVelden, Thomas Blom, Heather D Stout, James H Mapes, Tucker M Folsom, Christopher Martin, Angela M Bardo, and Edward M Marcotte. 2024. Estimating error rates for single molecule protein sequencing experiments. PLOS Computational Biology 20, 7 (2024), e1012258.
Stump (2022) Bernhard Stump. 2022. Click Bioconjugation: Modifying Proteins Using Click-Like Chemistry. ChemBioChem 23, 16 (2022), e202200016.
Suzek et al. (2014) Baris E. Suzek, Yuqi Wang, Hongzhan Huang, Peter B. McGarvey, Cathy H. Wu, and the UniProt Consortium. 2014. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 6 (11 2014), 926–932. https://doi.org/10.1093/bioinformatics/btu739
Swaminathan et al. (2018) Jagannath Swaminathan, Alexander A Boulgakov, Erik T Hernandez, Angela M Bardo, James L Bachman, Joseph Marotta, Amber M Johnson, Eric V Anslyn, and Edward M Marcotte. 2018. Highly parallel single-molecule identification of proteins in zeptomole-scale mixtures. Nature biotechnology 36, 11 (2018), 1076–1082.
Szijj et al. (2020) Peter A Szijj, Kristina A Kostadinova, Richard J Spears, and Vijay Chudasama. 2020. Tyrosine bioconjugation–an emergent alternative. Organic & Biomolecular Chemistry 18, 44 (2020), 9018–9028.
Tantipanjaporn and Wong (2023) Ajcharapan Tantipanjaporn and Man-Kin Wong. 2023. Development and recent advances in lysine and N-terminal bioconjugation for peptides and proteins. Molecules 28, 3 (2023), 1083.
Timp and Timp (2020) Winston Timp and Gregory Timp. 2020. Beyond mass spectrometry, the next step in proteomics. Science Advances 6, 2 (2020), eaax8978.
Vantourout et al. (2020) Julien C Vantourout, Srinivasa Rao Adusumalli, Kyle W Knouse, Dillon T Flood, Antonio Ramirez, Natalia M Padial, Alena Istrate, Katarzyna Maziarz, Justine N DeGruyter, Rohan R Merchant, et al. 2020. Serine-selective bioconjugation. Journal of the American Chemical Society 142, 41 (2020), 17236–17242.
Vogeser and Parhofer (2007) M Vogeser and KG Parhofer. 2007. Liquid chromatography tandem-mass spectrometry (LC-MS/MS)-technique and applications in endocrinology. Experimental and clinical endocrinology & diabetes 115, 09 (2007), 559–570.
Vorperian et al. (2023) Sevahn K Vorperian, Brian C DeFelice, Joseph A Buonomo, Hagop J Chinchinian, Ira J Gray, Jia Yan, Kathleen E Mach, Vinh La, Timothy J Lee, Joseph C Liao, et al. 2023. Multiomics characterization of cell type repertoires for urine liquid biopsies. bioRxiv (2023), 2023–10.
Vorperian et al. (2022) Sevahn K Vorperian, Mira N Moufarrej, Tabula Sapiens Consortium, et al. 2022. Cell types of origin of the cell-free transcriptome. Nature biotechnology 40, 6 (2022), 855–861.
Wan et al. (2022) Chuan Wan, Yuena Wang, Chenshan Lian, Qi Chang, Yuhao An, Jiean Chen, Jinming Sun, Zhanfeng Hou, Dongyan Yang, Xiaochun Guo, et al. 2022. Histidine-specific bioconjugation via visible-light-promoted thioacetal activation. Chemical Science 13, 28 (2022), 8289–8296.
Wanigasekara et al. (2018) Maheshika SK Wanigasekara, Xiaojun Huang, Jayanta K Chakrabarty, Alejandro Bugarin, and Saiful M Chowdhury. 2018. Arginine-Selective chemical labeling approach for identification and enrichment of reactive arginine residues in proteins. ACS omega 3, 10 (2018), 14229–14235.
Webster et al. (2014) Alexandra M Webster, Christopher R Coxon, Alan M Kenwright, Graham Sandford, and Steven L Cobb. 2014. A mild method for the synthesis of a novel dehydrobutyrine-containing amino acid. Tetrahedron 70, 31 (2014), 4661–4667.
Wen et al. (2020) Bo Wen, Wen-Feng Zeng, Yuxing Liao, Zhiao Shi, Sara R Savage, Wen Jiang, and Bing Zhang. 2020. Deep learning in proteomics. Proteomics 20, 21-22 (2020), 1900335.
Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019).
Xu and Zhang (2010) Jinrui Xu and Yang Zhang. 2010. How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 26, 7 (2010), 889–895.
Zang et al. (2019) Jia Zang, Yulin Chen, Wenxuan Zhu, and Shixian Lin. 2019. Chemoselective methionine bioconjugation on a polypeptide, protein, and proteome. Biochemistry 59, 2 (2019), 132–138.
Zhang and Skolnick (2005) Yang Zhang and Jeffrey Skolnick. 2005. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic acids research 33, 7 (2005), 2302–2309.
Zheng et al. (2024) Liwei Zheng, Yujia Sun, Michael Eisenstein, and Hyongsok Tom Soh. 2024. Peptide sequencing via reverse translation of peptides into DNA. bioRxiv (2024). https://doi.org/10.1101/2024.05.31.596913

Appendix A

ID = A0A0H3PHF4

Description = Probable csgAB operon transcriptional regulatory protein n=32 RepID=A0A0H3PHF4_ECO5C

Length = 240 (aa)

Predicted matches = 160 / 211 (75.83%)

TM-score = 0.6319

Superposition in the TM-score: Length(d<5.0) = 143
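
For readers unfamiliar with the quantities in this record, the sketch below shows how a match percentage and a TM-score-style value can be computed. It uses the standard TM-score definition (Zhang and Skolnick, 2005), in which each aligned residue pair at distance d_i contributes 1 / (1 + (d_i / d_0)^2) and d_0 = 1.24 (L - 15)^(1/3) - 1.8 for a target of length L. The function names are hypothetical, the score is evaluated for a single fixed superposition rather than maximized over superpositions as the TM-score program does, and the normalization length used for the example above is an assumption.

```python
def percent_matches(matches, total):
    """Share of compared positions that agree, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * matches / total


def tm_d0(target_length):
    """Length-dependent distance scale d0 of the TM-score (in angstroms)."""
    # Simplification for this sketch: reject short targets, for which the
    # formula yields a non-positive d0.
    if target_length < 19:
        raise ValueError("this sketch requires target_length >= 19")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score-style value for one fixed superposition.

    `distances` holds the distance (in angstroms) of each aligned residue
    pair; unaligned target residues contribute nothing to the sum.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


if __name__ == "__main__":
    print(round(percent_matches(160, 211), 2))  # counts from the record above
    print(round(tm_d0(240), 3))                 # assumes L = 240 as the target length
    print(tm_score([0.0] * 240, 240))           # identical structures
```

Because the sum is divided by the target length, a perfect superposition of every residue gives a score of one, and distant or unaligned residues pull the score toward zero.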

Alignment between predicted sequences and UniProtKB sequence:
MFNEVHSIHGHTLLLITKPSLQATALLQHLKHSLAITGKLHNIQRSLDDISSGSIIIVDMMEADKKLIHYWQDTLSRKNNNIKILLLNTPEDYPYRDIENWPHINGVFYA
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
MFNEVHSIHGHTLLLITKPSLQATALLQHLKQSLAITGKLHNIQRSLDDISSGSIILLDMMEADKKLIHYWQDTLSRKNNNIKILLLNTPEDYPYRDIENWPHINGVFYAMEDQERVVNGLQGVLRGECYFTQKLASYLITHSGNYRYNSTESALLNHREKPILEKLRILASNNVIADTSFFIEQIVKGHLYVLFKKIVNKSRERAAILGLTRSADSTLI
:::::::::::::::::::::::::::::::::::::::
MEDQERVVNGLQGVLRGECYFTQKLASYLITHSGNYRYNSTESALLTHREKEILNKLRIGASNNEIARSLFISENTVKTHLYNLFKKIAVKNRTQAVSWANDTSAISHET

AGNNP

CSVSASSSSAGATLW

LFTLDCGGRISVRRRESSRR