As promised in our previous blog post, we intend to introduce our readers to a number of antibody-specific language models (AbLMs) in a series of articles. In this article we present the first one – IgLM.
Model definition – in brief
Immunoglobulin Language Model (IgLM) is a generative antibody-specific language model that leverages bidirectional context for generating, infilling and diversifying antibody sequences, conditioned on chain type and species of origin. It was published in Cell Systems in 2023.
Training data
IgLM was trained on 558M non-redundant, unpaired antibody Fv sequences (both heavy and light chains) from human, mouse, rat, rabbit, rhesus and camel. These sequences are sourced from the Observed Antibody Space (OAS) 1.
Training strategy
IgLM uses a decoder-only transformer architecture based on GPT-2 2. It is trained on sequence infilling by adopting the infilling language model formulation from natural language processing 3. Briefly, the term "infilling" refers to the model's ability to predict missing spans of protein sequence based on the knowledge obtained during training.
Overall, IgLM uses a combination of three strategies during training (Figure 1):
- Sequence masking and rearrangement: For each antibody, a random-length segment of the sequence is masked (hidden) during training and appended to the end of the sequence, creating a rearranged sequence. This enables IgLM to learn to predict the masked segment from the surrounding amino acid tokens (see the sketch after this list).
- Bidirectionality: Sequences are presented to IgLM in both directions during training. This enables learning both left-to-right and right-to-left contexts and dependencies, while maintaining the autoregressive nature of the model, so that antibody sequences are ultimately generated in the correct left-to-right direction.
- Conditioning tags: Training data is presented to the model together with two tags that specify the species of origin and the chain type (heavy chain, VH, or light chain, VL) of each antibody sequence in the training set.
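To make these strategies concrete, here is a minimal Python sketch of how a single infilling training example could be assembled: a random-length span is masked and moved behind a separator at the end of the sequence, the example is optionally reversed, and conditioning tags are prepended. The token names ([MASK], [SEP], [ANS]) and the helper function are illustrative assumptions, not IgLM's actual tokenizer or code.

```python
import random

def make_infilling_example(sequence, species_tag, chain_tag, reverse=False, rng=random):
    """Illustrative construction of one infilling training example.

    A random-length span is hidden behind a [MASK] token and appended to the end
    of the sequence after a [SEP] token, so an autoregressive (left-to-right)
    model learns to predict the hidden span from the surrounding context.
    Token names are assumptions for illustration only.
    """
    tokens = list(sequence)

    # Pick a random-length span to mask (e.g. part of a CDR loop).
    span_len = rng.randint(1, min(10, len(tokens)))
    start = rng.randint(0, len(tokens) - span_len)
    masked_span = tokens[start:start + span_len]

    # Replace the span with a single [MASK] token; the answer is appended at the end.
    rearranged = tokens[:start] + ["[MASK]"] + tokens[start + span_len:]

    # Bidirectionality: a fraction of training examples are presented right-to-left.
    if reverse:
        rearranged = rearranged[::-1]
        masked_span = masked_span[::-1]

    # Conditioning tags come first, so generation can later be steered by species and chain type.
    return [species_tag, chain_tag] + rearranged + ["[SEP]"] + masked_span + ["[ANS]"]

# Example usage with a short heavy-chain fragment (sequence truncated for brevity).
print(make_infilling_example("EVQLVESGGGLVQPGGSLRLSCAAS", "[HUMAN]", "[HEAVY]"))
```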
What can IgLM be used for?
IgLM can:
- Generate antibodies conditioned on chain type and species of origin.
- Evaluate the likelihood that an antibody sequence belongs to a specific chain type and species of origin.
- Infill and/or diversify a region of the antibody sequence that is missing or needs improvement (see the usage sketch after this list).
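In practice, these capabilities are exposed through the released iglm Python package. The sketch below assumes an IgLM class with generate, infill and log_likelihood methods and the argument names shown; check the GitHub repository's README for the actual interface. The parent sequence and infill positions are purely illustrative.

```python
from iglm import IgLM  # assumed package/class names; verify against the repository

model = IgLM()

chain_token = "[HEAVY]"
species_token = "[HUMAN]"

# 1. Generate full heavy-chain sequences conditioned on species and chain type
#    (method name and arguments are assumptions).
generated = model.generate(chain_token, species_token, num_to_generate=10)

# 2. Score how likely a sequence is under the chosen conditioning tags.
parent = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYY"
    "ADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKDRLSITIRPRYYGLDVWGQGTTVTVSS"
)
log_likelihood = model.log_likelihood(parent, chain_token, species_token)

# 3. Infill / diversify a region, e.g. a CDR-H3 span given by residue indices
#    (the infill_range positions here are illustrative, not from the paper).
variants = model.infill(parent, chain_token, species_token,
                        infill_range=(99, 108), num_to_generate=25)
```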
The infilling and diversification functions were shown to improve the in silico developability 4,5 and humanness 6 of the original input antibody. The training strategies (Figure 1) also helped IgLM perform well at infilling the CDRs, which are the most diverse, and hence hardest to infill, regions of the antibody sequence. For example, IgLM achieved higher certainty (lower perplexity scores) for generated or infilled CDR loops than ProGen2-OAS 7. Of note, ProGen2-OAS is an antibody-specific language model with a training dataset size and transformer architecture similar to IgLM's, but it does not support bidirectional training or conditioning tags 8.
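As a reminder, perplexity is simply the exponential of the average negative log-likelihood per residue, so lower values mean the model is more certain about the residues it assigns. A short illustration with made-up log-likelihood values:

```python
import math

# Hypothetical per-residue log-likelihoods (natural log) for an infilled CDR loop.
per_residue_logls = [-1.2, -0.8, -1.5, -0.9, -1.1, -1.3, -0.7, -1.0]

# Perplexity = exp(mean negative log-likelihood); lower = higher model certainty.
perplexity = math.exp(-sum(per_residue_logls) / len(per_residue_logls))
print(f"Perplexity: {perplexity:.2f}")
```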
Model inventors and code availability
Richard Shuai, Jeffrey Ruffolo, and Jeffrey Gray at Johns Hopkins University (USA). Commercial use of IgLM requires a licensing agreement. Model code and installation instructions are provided via this GitHub repository. The model publication is available from Cell Press.