Antibody hit expansion by mining antibodies in large NGS repertoires

How to use known target-specific antibodies to find related clones from next-generation repertoire sequencing data

September 27, 2022
A diagram showing a clonal expansion strategy by mining NGS repertoires with characterized antibody sequences with the help of bioinformatics analysis on PipeBio bioinformatics cloud


NGS is often used to explore the depth of the immune response against a given antigen, as the cost of sequencing millions of antibodies is very low. The drawback is that it is neither practical nor economically feasible to perform functional assays on millions of individual antibodies or clones. Thus, the typical approach is to look at the most abundant sequences in the NGS pool, synthesize a range of the most expressed clones, and then perform functional assays on a few hundred of them.

Single-cell technologies from companies such as 10x Genomics and Berkeley Lights can be used to analyze thousands of individual B cells, although still not millions. An alternative and cost-effective approach is to use Sanger-sequenced antibodies from the initial screening as input to find similar sequences in a large pool of NGS repertoire sequences. In this way, antibodies from the initial screening and larger NGS repertoires can be used together for clonal hit expansion.

Mining for good antibody clones in a pool of repertoire sequences

When mining for good antibodies in large repertoires there are a couple of different approaches. One approach is to pool target-specific antibodies from the initial screening together with NGS sequences and cluster them all jointly. However, this approach may result in some of the sequences in the cluster not being closely related to the original hit sequences (within the specified identity threshold).

Another approach is to seed individual clusters from original hit antibodies that have shown target-specificity, in order to find closely related sequences with similar binding properties. This way a more diverse set of antibodies can be created around the original hits. Clustering around the original hit antibodies will place the original hit sequences as centroids of the clusters and will ensure that all sequences in the cluster relate to the original hit sequences within the identity threshold specified in the clustering parameters.

Illustration showing regular sequence clustering versus clustering around a characterized hit sequence from a previous experiment
Figure 1. On the left: Regular cluster with NGS sequences as centroid sequences. On the right: Sanger sequence as centroid sequence.
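
The seeded approach can be sketched in a few lines of Python. This is only an illustration of the idea, not PipeBio's clustering tool: the function names, the example CDR-H3 strings and the position-wise identity measure (which only compares sequences of equal length) are all assumptions made for this sketch.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions. Assumption: sequences of unequal
    length are treated as unrelated and score 0."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_around_seeds(seeds, repertoire, threshold=0.9):
    """Assign each repertoire CDR-H3 to its best-matching seed.

    seeds: dict mapping a hit name to its CDR-H3 (the fixed centroids).
    repertoire: iterable of CDR-H3 strings from the NGS data.
    Sequences below the threshold for every seed are left unassigned.
    """
    clusters = {name: [] for name in seeds}
    for cdr3 in repertoire:
        best_name, best_score = None, 0.0
        for name, seed_cdr3 in seeds.items():
            score = identity(seed_cdr3, cdr3)
            if score >= threshold and score > best_score:
                best_name, best_score = name, score
        if best_name is not None:
            clusters[best_name].append(cdr3)
    return clusters


# Hypothetical seeds and repertoire sequences, for illustration only.
seeds = {"hit_A": "ARDYYGSSYFDY", "hit_B": "AKGGWLRPFDY"}
repertoire = ["ARDYYGSSYFDV", "ARDYYGTTYFDV", "AKGGWLRPFDY", "AKGGWLRPFDH"]
clusters = cluster_around_seeds(seeds, repertoire)
```

Because the seeds never move, every member of a cluster is guaranteed to be within the threshold of the original hit, which is the property that joint clustering does not give.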

Identifying related clones in NGS antibody repertoire

We use data from Miyazaki et al. (2015). The data consists of

  • 13 antigen-specific antibodies derived from a single-round biopanning experiment
  • 171,483 NGS repertoire sequences, before panning
  • 107,720 NGS repertoire sequences, after panning

Here, we use 11 of the 13 original hit sequences from the paper as “seed sequences” for clustering in order to find related antibodies in the larger antibody repertoire.

We start by merging paired-end NGS reads and annotating both the NGS sequences and the Sanger-sequenced hits from the biopanning experiment. After removing sequences with frameshifts, stop codons and missing CDR or FR regions, we are left with 157,632 NGS sequences for the pre-panning dataset and 102,223 sequences for the post-panning dataset.

             Original hits    Pre-panning (NGS)    Post-panning (NGS)
Before QC    11               171,483              107,720
After QC     11               157,632              102,223

Table 1 above displays the remaining sequences after annotation and quality control.
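
A quality-control filter of this kind could look roughly like the sketch below. The record layout (`AnnotatedRead`), the region names and the use of "length not divisible by three" as a stand-in for frameshift detection are assumptions for illustration; a real annotation pipeline detects frameshifts against germline alignments.

```python
from dataclasses import dataclass, field

# Assumed region names; an annotation tool may label them differently.
REQUIRED_REGIONS = ("FR1", "CDR1", "FR2", "CDR2", "FR3", "CDR3", "FR4")


@dataclass
class AnnotatedRead:
    read_id: str
    nt_length: int   # length of the annotated coding region in nucleotides
    aa: str          # translated sequence, with '*' marking a stop codon
    regions: dict = field(default_factory=dict)  # region name -> sequence


def passes_qc(read: AnnotatedRead) -> bool:
    """Return True if the read has no frameshift, stop codon or missing region."""
    if read.nt_length % 3 != 0:   # crude frameshift proxy (assumption)
        return False
    if "*" in read.aa:            # premature stop codon
        return False
    # Every framework and CDR region must be present and non-empty.
    return all(read.regions.get(name) for name in REQUIRED_REGIONS)


def filter_reads(reads):
    """Keep only the reads that pass all QC checks."""
    return [read for read in reads if passes_qc(read)]
```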

We then cluster all three datasets on the CDR-H3 region at 90% identity and calculate, for each sequence cluster, the fold change between the pre-panning and post-panning datasets.
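
Since the two NGS datasets differ in size, a fold change is more meaningful when computed on frequencies rather than raw counts. The sketch below shows one common way to do this; it is an assumption about the general approach, not a description of how PipeBio normalizes. The dataset totals are the post-QC numbers from Table 1, while the cluster counts (40 and 520) are made up for the example.

```python
import math


def fold_change(pre_count, pre_total, post_count, post_total):
    """Enrichment of a cluster after panning, normalized by dataset size.

    Counts are converted to frequencies (count / total sequences in that
    dataset) before taking the ratio, so a smaller post-panning dataset
    does not understate enrichment.
    """
    if pre_count == 0:
        # Cluster not observed before panning: the ratio is undefined.
        return math.inf if post_count > 0 else math.nan
    pre_freq = pre_count / pre_total
    post_freq = post_count / post_total
    return post_freq / pre_freq


PRE_TOTAL = 157_632    # pre-panning sequences after QC (Table 1)
POST_TOTAL = 102_223   # post-panning sequences after QC (Table 1)

# Hypothetical cluster sizes: 40 members before panning, 520 after.
fc = fold_change(40, PRE_TOTAL, 520, POST_TOTAL)
```

With these made-up counts, the raw count ratio is 13, while the size-normalized fold change is about 20, which shows how much the difference in dataset size can matter.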

As seen in Figure 2 below, there is redundancy among the original hits: some of their CDR-H3 sequences are identical. As a result, some clusters will contain multiple original hit sequences with identical CDR-H3 sequences. Ideally, only sequences with a unique CDR-H3 should be used as seeds, but this ultimately depends on the questions the analysis is meant to answer. As we can observe in the Cdrh3Cluster id column, we have 5 unique CDR-H3 clusters among our original hit antibody sequences.

Figure 2. The figure shows 11 original hit sequences, the CDR-H3 clusters they belong to, the hV-gene, hJ-gene and fold changes across panning rounds

In order to obtain unique clonotypes across multiple CDRs, we extract each unique CDR-H3 cluster and cluster again on CDR-H1, CDR-H2 and CDR-H3 with 100% identity to obtain a combined cluster. We then pick the 5 most abundant sequences (1 original hit + 4 repertoire sequences) that each represent both a unique CDR-H3 cluster and a unique combined cluster. When multiple original hit sequences were found to be the most abundant sequences in a cluster, we picked the next most abundant repertoire sequence instead.
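
The selection step for a single CDR-H3 cluster can be sketched as follows. Clustering at 100% identity on CDR-H1, CDR-H2 and CDR-H3 is equivalent to grouping by the exact triple of CDR sequences, which is what the sketch does. The function name, the tuple layout and the tie-breaking rule for choosing among several original hits are assumptions for illustration.

```python
from collections import Counter


def pick_representatives(hits, reads, n_repertoire=4):
    """Pick 1 original hit plus the most abundant repertoire clonotypes.

    hits:  non-empty list of (cdr_h1, cdr_h2, cdr_h3) tuples, one per
           original hit in this CDR-H3 cluster.
    reads: list of (cdr_h1, cdr_h2, cdr_h3) tuples, one per NGS read in
           this CDR-H3 cluster.
    Returns a list of (clonotype, combined_cluster_size, source) tuples.
    """
    # Exact-match grouping: each distinct triple is one combined cluster.
    sizes = Counter(reads)
    hit_set = set(hits)

    # Keep a single original hit: the one whose clonotype is most abundant
    # in the repertoire (assumed rule; ties go to the first hit listed).
    chosen_hit = max(hits, key=lambda clonotype: sizes[clonotype])
    picks = [(chosen_hit, sizes[chosen_hit], "Sanger")]

    # Walk repertoire clonotypes from most to least abundant, skipping any
    # that duplicate an original hit, until enough have been picked.
    for clonotype, size in sizes.most_common():
        if len(picks) == 1 + n_repertoire:
            break
        if clonotype in hit_set:
            continue
        picks.append((clonotype, size, "NGS"))
    return picks
```

Running this once per CDR-H3 cluster yields up to 5 representatives per cluster, each from a different combined cluster.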

After picking 5 sequences for each of the 5 CDR-H3 clusters, we aligned the 25 sequences in order to analyze sequence diversity and phylogeny for each CDR-H3 cluster and each combined cluster representative. Original hit sequences are referred to as “Sanger” in the cladogram.

Figure 3. The unedited phylogenetic tree from PipeBio shows four distinct CDR-H3 clusters and variability across CDR-H1 and CDR-H2 regions

We can clearly see the distinct phylogeny of the different clusters as well as the clonal diversity added by the repertoire sequencing. The combined clusters represent unique CDR subclusters of the larger CDR-H3 clusters and each sequence in the dendrogram represents one of these subclusters. The column CombinedCluster_size denotes the number of sequences in that subcluster. The fold changes between panning rounds have been normalized by PipeBio’s clustering tool, since differing sample sizes can skew the representation. Displayed fold changes are calculated based on the CDR-H3 cluster and range between 3 and 49.

In addition to the presented data, other immunoassay data, such as ELISA values and titers, can be added to the phylogenetic tree for more detailed analysis. This helps in picking the most promising sequences for subsequent synthesis and functional assays.

Hit pick and synthesize antibody clones

Following the phylogenetic analysis of antibody sequences, the most promising antibody sequences can be extracted from PipeBio and sent to be synthesized. The synthesized antibodies can then be sub-cloned and further functional analysis can be performed on the expanded clones. This functional data can then be reimported into the PipeBio platform and associated with the hit-picked clones for a complete analysis of the original and expanded hits.

Hit-picked antibody clones together with functional data are thus a valuable resource for future analyses and can easily be overlaid visually on PipeBio together with sequences, sequence liabilities, protein properties, binding affinity and other immunoassay data.



Miyazaki, N., Kiyose, N., Akazawa, Y., Takashima, M., Hagihara, Y., Inoue, N., Matsuda, T., Ogawa, R., Inoue, S., & Ito, Y. (2015). Isolation and characterization of antigen-specific alpaca (Lama pacos) VHH antibodies by biopanning followed by high-throughput sequencing. Journal of Biochemistry, 158(3), 205–215.
