Protein language models: promises, pitfalls and applications

Overview of PLM applications in protein design

Category: Science
Date: June 18, 2024
Read time: 10 min
Protein language models have proven very valuable for learning the underlying patterns relevant to antibody design

Introduction

Tremendous advancements have recently been made in the field of artificial intelligence, most notably the development of large language models (LLMs). Within the last few years, we have been introduced to numerous LLMs that show the potential to improve our lives, thanks to their unprecedented performance not only in everyday use but also in antibody engineering and biologics design 1.

How do Large Language Models (LLMs) work?

LLMs learn statistical patterns from large text training datasets derived from natural languages (e.g. English, Danish or Arabic), after breaking down the bulk text into smaller tokens such as words and symbols (Figure 1). Based on what they learn during the training process, LLMs can perform several tasks such as text completion, translation, summarisation and question answering. Famous examples of LLMs include BERT (Google) 2, RoBERTa (Facebook AI) 3 and GPT-3 (OpenAI) 4.

Figure 1. Simple example of tokenization in the setting of natural language taken from a piece of English literature 5.
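For readers who like to see this concretely, here is a minimal sketch in Python (not taken from any particular LLM implementation) of word-level tokenization and the mapping of tokens to the integer IDs a language model actually consumes; real LLMs typically use subword tokenizers such as BPE or WordPiece, but the principle is the same.

```python
# Minimal sketch of word-level tokenization and numericalisation.
# Real LLMs typically use subword tokenizers (e.g. BPE or WordPiece),
# but the principle of mapping text to integer token IDs is the same.

sentence = "It is a truth universally acknowledged"

# Split the text into tokens (here, simply on whitespace).
tokens = sentence.lower().split()

# Build a vocabulary mapping each unique token to an integer ID.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Encode the sentence as a sequence of IDs, the model's actual input.
token_ids = [vocab[token] for token in tokens]

print(tokens)     # ['it', 'is', 'a', 'truth', 'universally', 'acknowledged']
print(token_ids)  # [3, 2, 0, 4, 5, 1]
```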

The emergence of protein language models (PLMs)

The strong performance of these LLMs is mainly attributed to their transformer-based deep neural network architecture and, in particular, the attention mechanism 6, which allows the models to capture long-range dependencies in text. Through attention, LLMs learn the relative importance (technically known as the “weight”) of each token in relation to the rest of the tokens in a given sentence or paragraph 7.
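As an illustration only, the sketch below implements scaled dot-product attention, the core transformer operation, in plain NumPy; the token embeddings and dimensions are invented for demonstration and are not taken from any particular model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention in the spirit of Vaswani et al. (2017).

    Q, K, V: arrays of shape (sequence_length, d_model).
    Returns the attended values and the attention weight matrix,
    where weights[i, j] reflects how much token i attends to token j.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V, weights

# Toy example: 4 tokens, each represented by an 8-dimensional embedding.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
output, weights = scaled_dot_product_attention(x, x, x)       # self-attention
print(weights.round(2))                                       # each row sums to 1
```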

Because of this strong performance, and the hierarchical similarities between natural languages and protein sequences (Figure 2), scientists have applied similar approaches to develop protein language models (PLMs). In PLMs, analogously to LLMs (Figure 2), amino acid sequences are treated as stretches of text on which machine learning models are trained 8.

Thus, PLMs aim to learn the relationship between protein sequences, their structures, and ultimately their functions (sequence-structure-function) 9,10.

 

However, natural language texts contain symbols, spaces and punctuation (orthography; Figure 1) that divide linguistic sequences into meaningful tokens, which makes learning the rules and grammar of a language feasible and interpretable 9,11. Such symbols and means of division are absent from protein sequences. Moreover, proteins are multi-dimensional molecules whose complexity extends far beyond their linear sequences (Figure 2). Altogether, these factors hinder meaningful tokenization of protein sequences beyond the single amino acid level.
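To make this contrast concrete, the sketch below tokenizes a short, made-up antibody-like amino acid stretch at the single amino acid level, the default granularity used by most PLMs in the absence of natural word boundaries; the sequence and vocabulary are illustrative only.

```python
# Single amino acid tokenization: with no spaces or punctuation to guide us,
# each residue becomes its own token. The sequence below is a made-up example.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
aa_to_id = {aa: idx for idx, aa in enumerate(AMINO_ACIDS)}

sequence = "EVQLVESGGGLVQPGG"          # hypothetical heavy-chain fragment
tokens = list(sequence)                # one token per amino acid
token_ids = [aa_to_id[aa] for aa in tokens]

print(tokens)                          # ['E', 'V', 'Q', 'L', ...]
print(token_ids)
```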

Figure 2 | The conceptual similarities and hierarchical structure as seen in natural languages and proteins. Inspired by 8. Monomer IgM from 12. Pentamer IgM structure from 13. This analogy is for approximation and communication purposes only.

Applications of PLMs in protein design

While the core differences between textual and protein sequences present a challenge, PLMs have proven very valuable for learning the underlying patterns of protein sequences even in the absence of prior biological knowledge (technically known as self-supervised or unsupervised learning) 14–16.

When biological knowledge is included during training (technically known as supervised learning), PLMs perform better across a variety of protein engineering and function prediction tasks, as such knowledge is important for learning relationships and patterns that are not obvious from the raw protein sequences 17. In this context, biological knowledge can take the form of gene ontology (GO) annotations 18, multiple sequence alignments (MSAs) 19, cDNA 20 and/or cellular compartments 21,22.
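Returning to the self-supervised setting described above, here is a minimal sketch of the masked language modelling objective that most PLMs are trained with: residues are hidden at random and the model must recover them from context alone. The masking fraction and mask token below are illustrative choices, not those of any specific PLM.

```python
import random

MASK_TOKEN = "<mask>"

def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Randomly replace a fraction of residues with a mask token.

    Returns the corrupted sequence (the model's input) and a dictionary
    of masked positions and their true residues (the prediction targets).
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    n_mask = max(1, int(len(tokens) * mask_fraction))
    positions = rng.sample(range(len(tokens)), n_mask)
    targets = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK_TOKEN
    return tokens, targets

masked, targets = mask_sequence("EVQLVESGGGLVQPGGSLRLSCAAS")
print(masked)   # sequence with some residues replaced by <mask>
print(targets)  # {position: true residue} pairs the model must predict
```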

PLMs can perform several tasks in silico that promise to reduce the time and effort required by gold-standard lab-based techniques. For example, Meier et al. developed the ESM-1v model, which can predict the effect of sequence modifications on protein function 23, bypassing the need for labour-intensive deep mutational scanning experiments 24. Other tasks that PLMs can help with include structure 25 and binding site prediction 26.
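For illustration, below is a minimal sketch of zero-shot variant-effect scoring in the spirit of ESM-1v 23, assuming the publicly released fair-esm package is installed and the checkpoint can be downloaded; the sequence and mutation are made up, and the paper's masked-marginal protocol is simplified here to wild-type marginals.

```python
# Sketch of zero-shot variant-effect scoring with ESM-1v (pip install fair-esm).
# Simplified "wild-type marginals" scoring; the sequence and mutation are invented.
import torch
import esm

model, alphabet = esm.pretrained.esm1v_t33_650M_UR90S_1()
model.eval()
batch_converter = alphabet.get_batch_converter()

wild_type = "EVQLVESGGGLVQPGGSLRLSCAAS"   # hypothetical sequence
position, wt_aa, mut_aa = 10, "L", "P"    # hypothetical substitution at residue 11 (1-based)

_, _, tokens = batch_converter([("wt", wild_type)])
with torch.no_grad():
    logits = model(tokens)["logits"]
log_probs = torch.log_softmax(logits, dim=-1)

# The +1 offset accounts for the beginning-of-sequence token prepended by the alphabet.
score = (log_probs[0, position + 1, alphabet.get_idx(mut_aa)]
         - log_probs[0, position + 1, alphabet.get_idx(wt_aa)]).item()
print(f"{wt_aa}{position + 1}{mut_aa} score: {score:.3f} (more negative suggests more deleterious)")
```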

When PLMs are trained on large datasets of proteins from public databases, such as UniProt 27, they are referred to as general-purpose PLMs. A few examples of these PLMs are showcased in Table 1. However, antibodies are among several protein families that harbour structural uniqueness (as we discussed in this article) and are under-represented in these databases. This jeopardises the generalisability of rules learned from all proteins to antibodies and their design for therapeutic purposes. These factors have given rise to voices that either praise 28 or criticise 16,29 the performance and advantages of using general PLMs for antibody-specific design tasks. They have also motivated the development of antibody family-specific language models (AbLMs), built either by fine-tuning pre-trained general PLMs or by training similar machine learning models solely on antibody sequence data.

Table 1. A few examples of general protein language models.

PLM | Developed by | Training set | Prior biological knowledge | Reference
ESM-1v | Facebook AI (USA) | UniRef90: 98 million sequences | No | 23
ESMFold | Meta AI, NYU, Stanford and MIT (USA) | UniRef and PDB | No | 25
ProGen | Salesforce and Profluent Bio (USA) | UniParc: 280 million sequences | Yes: cellular compartment, biological process, molecular function, etc. | 22
ProteinBERT | The Hebrew University of Jerusalem (Israel) | UniProtKB/UniRef90: 106 million proteins | Yes: gene ontology (GO) | 18

Antibody-specific protein language models (AbLMs)

Monoclonal antibodies are among the most prominent and successful biotherapeutics, but their clinical development and market success rates remain hindered by lengthy discovery and engineering steps 11,30. AbLMs are rapidly evolving to address these hurdles by enabling computational generation, screening and design of antibody candidates. These AbLMs (Table 2) are mainly trained on large-scale (and still growing) databases of natural antibody sequences made available through public resources such as iReceptor 31 and the Observed Antibody Space (OAS) 32.
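As a concrete illustration of one route to an AbLM, here is a minimal sketch of fine-tuning a small general-purpose PLM on antibody sequences via masked language modelling, assuming the Hugging Face transformers and datasets libraries; the checkpoint name and the two toy sequences are illustrative stand-ins, not the recipe used by any published AbLM.

```python
# Sketch of masked language model fine-tuning of a small public ESM-2 checkpoint
# on antibody sequences. In practice the training set would be millions of
# OAS-derived sequences; two toy sequences are used here as placeholders.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "facebook/esm2_t6_8M_UR50D"   # small public backbone, for illustration
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

antibody_seqs = ["EVQLVESGGGLVQPGGSLRLSCAAS", "DIQMTQSPSSLSASVGDRVTITCRAS"]
dataset = Dataset.from_dict({"sequence": antibody_seqs})
dataset = dataset.map(
    lambda batch: tokenizer(batch["sequence"], truncation=True, max_length=256),
    batched=True, remove_columns=["sequence"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ablm_finetune", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```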

Several currently available AbLMs can aid the optimisation of developability or target binding of antibody candidates. For example, Sapiens and DeepAb are two models that have shown good results in suggesting antibody variants with decreased immunogenicity and enhanced thermal stability 33,34. AntiBERTa, an AbLM fine-tuned for paratope prediction, showed strong performance in predicting the paratope location on the antibody structure in the absence of target information (an antigen-agnostic approach) 16.
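As a sketch of how such a paratope predictor can be framed, the code below treats paratope prediction as per-residue binary classification on top of a pretrained backbone; the checkpoint is a small public ESM-2 model used as a stand-in (not AntiBERTa itself), the sequence is made up, and the outputs are meaningless until the classification head is fine-tuned on labelled paratope data.

```python
# Sketch of paratope prediction framed as per-residue (token) classification.
# The backbone and sequence are illustrative placeholders only.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

checkpoint = "facebook/esm2_t6_8M_UR50D"       # stand-in backbone, not AntiBERTa
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=2)

sequence = "EVQLVESGGGLVQPGGSLRLSCAAS"         # hypothetical heavy-chain fragment
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, num_tokens, 2)

# Probability that each residue belongs to the paratope (class 1); the special
# start/end tokens added by the tokenizer are dropped before reporting.
probs = torch.softmax(logits, dim=-1)[0, 1:-1, 1]
for residue, p in zip(sequence, probs):
    print(residue, round(p.item(), 3))
```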

Challenges & perspectives

While major advancements have been made in the development of general PLMs and AbLMs, and despite the benefits they offer to the scientific and medical communities, areas for improvement do exist.

Generally, the biased nature of scientific publishing tends to highlight successes over failures, which hinders our overall understanding of the applicability of PLMs and how much we can really trust them.

Technically, as most PLMs are trained in a self-supervised manner, they are regarded as black-box models: even when they perform well, it is challenging to pinpoint which rules they extract from the training data and how biologically relevant those rules are 8,9.

Also, until recently, AbLMs were trained using only unpaired antibody sequences due to the scarcity and high cost associated with generating paired-chain antibody sequence data 35. Thus, with the exception of two newly introduced models 30,36, most current AbLMs miss out on chain-pairing information, a vital factor for a complete and functional antibody molecule.
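To illustrate why pairing matters at the input level, here is a minimal sketch, under the assumption of a simple separator-token scheme, of how natively paired heavy and light chains could be presented to a model as a single sequence so that attention can span both chains; the token name and sequences are invented and do not reproduce the exact scheme of any published paired AbLM.

```python
# One common strategy for exposing chain pairing to a language model:
# concatenate the natively paired heavy and light chains with a separator token.
SEPARATOR = "<sep>"

def encode_paired(heavy: str, light: str) -> list[str]:
    """Single amino acid tokens for a natively paired antibody."""
    return list(heavy) + [SEPARATOR] + list(light)

heavy_chain = "EVQLVESGGGLVQPGGSLRLSCAAS"   # hypothetical VH fragment
light_chain = "DIQMTQSPSSLSASVGDRVTITCRAS"  # hypothetical VL fragment

tokens = encode_paired(heavy_chain, light_chain)
print(tokens[23:29])  # ['A', 'S', '<sep>', 'D', 'I', 'Q'] around the chain boundary
```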

Additionally, the absence of large-scale antibody-antigen data (specificity-labelled antibodies) presents another challenge for the implementation of AbLMs 37. A reasonable approach to tackle this issue is to learn the rules of target binding from a large set of antibody-antigen variants that originate from a single pair and harbour only subtle differences at the sequence level 38,39. This could provide a “baseline” for understanding these rules from relatively less diverse training data, before attempting to learn them from a smaller but more complex and fundamentally different set of antibody-antigen pairs.

Overall, we are invited to go back to the basics of linguistics in order to identify linguistic features in antibody sequences and formalise the language of antibodies (immunolinguistics), which will ultimately help us design interpretable and biologically relevant AbLMs 40. Finally, we list below a few AbLMs, with the aim of presenting short summaries of some of them in our next blog posts while introducing our readers to more of the machine learning concepts and techniques used to develop these AbLMs.

Table 2. Examples of antibody-specific PLMs (AbLMs).

AbLM | Developed by | Paired-chain training | Reference
IgLM | Johns Hopkins University and University of California (USA) | No | 41
AbLang and AbLang2 | OPIG at Oxford University and GSK (UK) | No | 42,43
AntiBERTa | Alchemab Therapeutics Ltd (UK) | No | 16
Bio-inspired Antibody Language Model (BALM) | Fudan University (China) | No | 44
BALM-paired and BALM-unpaired (trained on BALM) | The Scripps Research Institute (USA) | Yes | 30
OAS-trained RoBERTa | Absci Corporation (USA) | No | 45
IgBert and IgT5 | Exscientia and OPIG at Oxford University (UK) | Yes | 36
AntiBERTy | Johns Hopkins University (USA) | No | 46
Immune2vec | Bar-Ilan University (Israel) | No | 47
Sapiens | Merck (USA) and University of Chemistry and Technology (Czech Republic) | No | 33
FAbCon | Alchemab Therapeutics Ltd (UK) | Yes | 48

References


1. Chang, Y. et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 15, 1–45 (2024).

2. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv [cs.CL] (2018).

3. Liu, Y. et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv [cs.CL] (2019).

4. Makridakis, S., Petropoulos, F. & Kang, Y. Large Language Models: Their Success and Impact. Forecasting 5, 536–549 (2023).

5. Austen, J. Pride and Prejudice. (1813).

6. Vaswani, A. et al. Attention is All you Need. Adv. Neural Inf. Process. Syst. 5998–6008 (2017).

7. Wolf, T. et al. Transformers: State-of-the-Art Natural Language Processing. in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds. Liu, Q. & Schlangen, D.) 38–45 (Association for Computational Linguistics, Online, 2020).

8. Ferruz, N. & Höcker, B. Controllable protein design with language models. Nature Machine Intelligence 4, 521–532 (2022).

9. Vu, M. H. et al. Linguistically inspired roadmap for building biologically reliable protein language models. Nature Machine Intelligence 5, 485–496 (2023).

10. Ruffolo, J. A. & Madani, A. Designing proteins with language models. Nat. Biotechnol. 42, 200–202 (2024).

11. Akbar, R. et al. Progress and challenges for the machine learning-based design of fit-for-purpose monoclonal antibodies. MAbs 14, 2008790 (2022).

12. Chen, Q., Menon, R., Calder, L. J., Tolar, P. & Rosenthal, P. B. Cryomicroscopy reveals the structural basis for a flexible hinge motion in the immunoglobulin M pentamer. Nat. Commun. 13, 6314 (2022).

13. Lyu, M., Malyutin, A. G. & Stadtmueller, B. M. The structure of the teleost Immunoglobulin M core provides insights on polymeric antibody evolution, assembly, and function. Nat. Commun. 14, 7583 (2023).

14. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

15. Elnaggar, A. et al. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).

16. Leem, J., Mitchell, L. S., Farmery, J. H. R., Barton, J. & Galson, J. D. Deciphering the language of antibodies using self-supervised learning. Patterns 3, 100513 (2022).

17. Bepler, T. & Berger, B. Learning the protein language: Evolution, structure, and function. Cell Syst 12, 654–669.e3 (2021).

18. Brandes, N., Ofer, D., Peleg, Y., Rappoport, N. & Linial, M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 38, 2102–2110 (2022).

19. Lupo, U., Sgarbossa, D. & Bitbol, A.-F. Protein language models trained on multiple sequence alignments learn phylogenetic relationships. Nat. Commun. 13, 6298 (2022).

20. Outeiral, C. & Deane, C. M. Codon language embeddings provide strong signals for use in protein engineering. Nature Machine Intelligence 6, 170–179 (2024).

21. Unsal, S. et al. Learning functional properties of proteins with language models. Nature Machine Intelligence 4, 227–245 (2022).

22. Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. (2023) doi:10.1038/s41587-022-01618-2.

23. Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv 2021.07.09.450648 (2021) doi:10.1101/2021.07.09.450648.

24. Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807 (2014).

25. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

26. Barton, J., Gaspariunas, A., Galson, J. D. & Leem, J. Building Representation Learning Models for Antibody Comprehension. Cold Spring Harb. Perspect. Biol. 16, (2024).

27. UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).

28. Hie, B. L. et al. Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. 42, 275–283 (2024).

29. Abanades, B., Georges, G., Bujotzek, A. & Deane, C. M. ABlooper: Fast accurate antibody CDR loop structure prediction with accuracy estimation. Bioinformatics (2022) doi:10.1093/bioinformatics/btac016.

30. Burbach, S. M. & Briney, B. Improving antibody language models with native pairing. arXiv [q-bio.BM] (2023).

31. Corrie, B. D. et al. iReceptor: A platform for querying and analyzing antibody/B-cell and T-cell receptor repertoire data across federated repositories. Immunol. Rev. 284, 24–41 (2018).

32. Kovaltsuk, A. et al. Observed Antibody Space: A Resource for Data Mining Next-Generation Sequencing of Antibody Repertoires. The Journal of Immunology 201, 2502–2509 (2018).

33. Prihoda, D. et al. BioPhi: A platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning. MAbs 14, (2022).

34. Hutchinson, M. et al. Enhancement of antibody thermostability and affinity by computational design in the absence of antigen. bioRxiv 2023.12.19.572421 (2023) doi:10.1101/2023.12.19.572421.

35. Mhanna, V. et al. Adaptive immune receptor repertoire analysis. Nature Reviews Methods Primers 4, 1–25 (2024).

36. Kenlay, H. et al. Large scale paired antibody language models. arXiv [q-bio.BM] (2024).

37. Akbar, R. et al. In silico proof of principle of machine learning-based antibody design at unconstrained scale. MAbs 14, 2031482 (2022).

38. Wang, Y. et al. An explainable language model for antibody specificity prediction using curated influenza hemagglutinin antibodies. bioRxiv (2023) doi:10.1101/2023.09.11.557288.

39. Chinery, L. et al. Baselining the Buzz Trastuzumab-HER2 Affinity, and Beyond. bioRxiv 2024.03.26.586756 (2024) doi:10.1101/2024.03.26.586756.

40. Vu, M. H. et al. ImmunoLingo: Linguistics-based formalization of the antibody language. arXiv [q-bio.QM] (2022).

41. Shuai, R. W., Ruffolo, J. A. & Gray, J. J. IgLM: Infilling language modeling for antibody sequence design. Cell Syst 14, 979–989.e4 (2023).

42. Olsen, T. H., Moal, I. H. & Deane, C. M. AbLang: an antibody language model for completing antibody sequences. Bioinform Adv 2, vbac046 (2022).

43. Olsen, T. H., Moal, I. H. & Deane, C. M. Addressing the antibody germline bias and its effect on language models for improved antibody design. bioRxiv 2024.02.02.578678 (2024) doi:10.1101/2024.02.02.578678.

44. Jing, H. et al. Accurate Prediction of Antibody Function and Structure Using Bio-Inspired Antibody Language Model. bioRxiv 2023.08.30.555473 (2023) doi:10.1101/2023.08.30.555473.

45. Bachas, S. et al. Antibody optimization enabled by artificial intelligence predictions of binding affinity and naturalness. bioRxiv 2022.08.16.504181 (2022) doi:10.1101/2022.08.16.504181.

46. Ruffolo, J. A., Gray, J. J. & Sulam, J. Deciphering antibody affinity maturation with language models and weakly supervised learning. arXiv [q-bio.BM] (2021).

47. Ostrovsky-Berman, M., Frankel, B., Polak, P. & Yaari, G. Immune2vec: Embedding B/T Cell Receptor Sequences in ℝN Using Natural Language Processing. Front. Immunol. 12, 680687 (2021).

48. Barton, J. et al. A generative foundation model for antibody sequence understanding. bioRxiv 2024.05.22.594943 (2024) doi:10.1101/2024.05.22.594943.

Want to know more about ML applications in antibodies?
