Identify sequence candidates from biopanning

A key question in therapeutic antibody drug discovery is how to identify the most promising sequence candidates from a biopanning experiment using next-generation sequencing (NGS) data. The built-in tools in the PipeBio platform make this possible without prior bioinformatics or coding experience. In this post, we re-analyze the NGS data published by Hanke et al., where a nanobody from an immunized alpaca named ‘Tyson’ was shown to neutralize SARS-CoV-2 by binding its spike protein. We identify a shortlist of novel sequence candidates, in addition to the Ty1 sequence reported in the paper.

Tyson the alpaca. Image credits: preclinics GmbH

Data used in the analysis

The NGS data used in this post were generated by Hanke et al., available in the Sequence Read Archive under BioProject ID PRJNA638614. The data consist of four samples:

  • two replicates prior to the first panning round (pre-panning)
  • a sample after the first panning round
  • a sample after the second panning round.

We imported the data into the PipeBio platform and merged the forward and reverse reads of each pair. A summary of the data is shown in the table below.

Sample accession   Group             Number of read pairs   Merged reads, %
SRR11974622        Panning round 2   1,371,777              77.25%
SRR11974623        Panning round 1   2,206,992              70.39%
SRR11974624        Pre-panning       4,053,568              75.51%
SRR11974625        Pre-panning       3,193,305              76.65%
Summary of the data produced in PipeBio. NGS data from Hanke et al.

Annotation and clustering

Next, we annotated the samples in PipeBio using the alpaca germline database. Reads were filtered so that only those with a complete VHH and no stop codons or frameshifts were kept in each sample. The number of sequences annotated with a VHH and passing the quality filters is shown below.

Sample accession   Number of merged reads   Number of reads passing quality filters
SRR11974622        1,059,733                729,650
SRR11974623        1,553,531                989,317
SRR11974624        3,060,933                2,417,651
SRR11974625        2,447,743                1,930,294
Number of annotated reads passing quality filters built into the PipeBio annotation pipeline. NGS data from Hanke et al.
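
The stop-codon and frameshift filter can be illustrated with a minimal Python sketch. This is not PipeBio's implementation: the function names are hypothetical, the input is assumed to be an already-extracted VHH nucleotide sequence, and a length that is not a multiple of three is used as a crude stand-in for frameshift detection, which in practice relies on alignment to the germline.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt: str) -> str:
    """Translate a nucleotide string in frame 1; unknown codons become 'X'."""
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


def passes_filters(vhh_nt: str) -> bool:
    """Hypothetical quality filter: in-frame length and no stop codon.

    Assumption: a length that is not a multiple of three signals a frameshift.
    Checking that the VHH is complete requires annotation and is not covered here.
    """
    if len(vhh_nt) == 0 or len(vhh_nt) % 3 != 0:
        return False
    return "*" not in translate(vhh_nt)
```

A sample's pass count would then be the number of reads for which `passes_filters` returns `True`.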

We then clustered the reads from the four samples together on the CDR-H3 region, using an 85% identity cutoff. This identified 285,769 clusters in the data. Of these, the 20 largest clusters accounted for over 50% of the sequences.

The 20 largest clusters account for over 50% of the data.
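
The clustering algorithm used by PipeBio is not described in this post. As a rough illustration of what clustering CDR-H3 sequences at an 85% identity cutoff can look like, here is a greedy, centroid-based sketch. The identity definition (one minus edit distance divided by the longer length), the abundance-ordered greedy strategy, and all function names are assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Assumed identity measure: 1 - edit distance / longer sequence length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def greedy_cluster(cdr3_counts: dict, cutoff: float = 0.85) -> dict:
    """Assign each CDR-H3 to the first centroid it matches at >= cutoff.

    Sequences are visited from most to least abundant, so the centroid of
    each cluster is its most abundant member. Returns centroid -> members.
    """
    clusters = {}
    ordered = sorted(cdr3_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq, _count in ordered:
        for centroid in clusters:
            if identity(seq, centroid) >= cutoff:
                clusters[centroid].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no centroid matched: start a new cluster
    return clusters
```

With toy input such as `{"ARDYYGSSYFDY": 10, "ARDYYGSSYFDV": 3, "AKGGWELLRFDP": 5}`, the first two sequences differ at one of twelve positions and fall into the same cluster, while the third starts its own. The comparison against every centroid makes this approach slow for millions of reads; it is only meant to show the idea.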

Differential enrichment analysis

Next, we carried out differential enrichment analysis, treating the two pre-panning replicates as one group. To detect clusters that become increasingly enriched with each panning round, we carried out the following statistical comparisons:

  • Panning round 2 vs. Pre-panning
  • Panning round 1 vs. Pre-panning
  • Panning round 2 vs. Panning round 1

Using an FDR-corrected p-value < 0.05 as the significance cutoff in each comparison, we identified 87 CDR-H3 clusters that were enriched at least 200-fold from the pre-panning library to panning round 2, at least 2-fold from pre-panning to round 1, and at least 2-fold from round 1 to round 2.

Volcano plot illustrating the results of differential enrichment analysis when comparing panning round 2 to the two pre-panning samples. The most differentially enriched clusters (FDR-corrected p-value < 0.05, abs(fold-change) > 200) are highlighted in red. The top rightmost point represents the cluster including the Ty1 sequence.
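
The combined filter across the three comparisons can be sketched in Python as follows. The statistical test behind the p-values is not specified in this post, so the sketch assumes raw p-values and fold-changes are already available per cluster, and it uses the Benjamini-Hochberg procedure as one common way to obtain FDR-corrected p-values. The function names and input layout are hypothetical.

```python
def benjamini_hochberg(pvalues: list) -> list:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_candidate(r2_vs_pre, r1_vs_pre, r2_vs_r1, fdr_cutoff: float = 0.05) -> bool:
    """Apply the three-way filter to one cluster.

    Each argument is a (fold_change, fdr) tuple for one comparison.
    Thresholds follow the text: >= 200-fold for round 2 vs. pre-panning,
    >= 2-fold for the other two, and FDR < 0.05 in every comparison.
    """
    comparisons = (r2_vs_pre, r1_vs_pre, r2_vs_r1)
    min_fold_changes = (200.0, 2.0, 2.0)
    return all(
        fold_change >= minimum and fdr < fdr_cutoff
        for (fold_change, fdr), minimum in zip(comparisons, min_fold_changes)
    )
```

Requiring all three comparisons to pass keeps only clusters that grow in every round, rather than clusters that jump once and then plateau.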

We retrieved a diverse set of representative sequences from these differentially enriched clusters using the following criteria. For each of the 87 clusters, we identified the 3 top-ranked unique CDR-H3 amino acid sequences. Each of these sequences may represent multiple nanobodies that differ by mutations outside CDR-H3. Among the nanobodies in the cluster carrying these top-ranked CDR-H3s, we chose the most abundant full VHH amino acid sequences such that up to 75% of the whole cluster was covered, with a maximum of 5 sequences picked per cluster.

This selection procedure resulted in a total of 506 unique nanobodies. Of these, 32 were supported by more than 10 reads in the data.
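
One way to read the per-cluster selection rule is sketched below in Python. Two points are interpretations rather than facts from the analysis: CDR-H3 sequences are ranked by total read count, and picking stops as soon as the chosen sequences cover 75% of the cluster's reads. The function name and the input layout are hypothetical.

```python
from collections import Counter


def pick_representatives(vhh_counts: dict, top_cdr3: int = 3,
                         coverage: float = 0.75, max_picks: int = 5) -> list:
    """Pick representative VHH sequences for a single cluster.

    vhh_counts maps (cdr3_aa, vhh_aa) -> read count within the cluster.
    """
    cluster_total = sum(vhh_counts.values())

    # Rank unique CDR-H3 sequences by their total read count (assumption).
    cdr3_totals = Counter()
    for (cdr3, _vhh), reads in vhh_counts.items():
        cdr3_totals[cdr3] += reads
    top = {cdr3 for cdr3, _ in cdr3_totals.most_common(top_cdr3)}

    # Full VHH sequences carrying a top-ranked CDR-H3, most abundant first.
    eligible = sorted(
        ((vhh, reads) for (cdr3, vhh), reads in vhh_counts.items() if cdr3 in top),
        key=lambda item: (-item[1], item[0]),
    )

    picked, covered = [], 0
    for vhh, reads in eligible:
        # Stop at the pick limit or once the coverage target is reached.
        if len(picked) >= max_picks or covered >= coverage * cluster_total:
            break
        picked.append(vhh)
        covered += reads
    return picked
```

Running this over every enriched cluster and pooling the unique results would give the candidate list; the read-support cutoff (more than 10 reads) can then be applied as a final filter.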

An alignment of the 32 sequences shows that they belong to a handful of distinct families. The selected sequence in the alignment is the Ty1 sequence. The CDR-H3 region is highlighted in the alignment.

Screenshot from PipeBio, showing an alignment of the 32 most differentially enriched nanobodies supported by more than 10 reads in the data.

A shortlist of candidate sequences from biopanning

The last step in our analysis is to identify a shortlist of candidate sequences best representing the families in the 32 selected differentially enriched VHH sequences.

We extracted the most abundant representative from each of the 8 main families in the alignment. The alignment of the candidate sequences below illustrates the sequence diversity in the CDRs, all obtained from this single panning experiment.

Screenshot from PipeBio, showing an alignment of the 8 candidate sequences selected based on the NGS data from Hanke et al. The Ty1 sequence is number 4 from the top.
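
Choosing the most abundant representative per family amounts to a grouped maximum. The Python sketch below assumes each sequence already carries a family label (in this analysis the families were read off the alignment) and a read count; the tuple layout and function name are hypothetical.

```python
def shortlist(candidates: list) -> dict:
    """Return family -> ID of its most abundant sequence.

    candidates is a list of (sequence_id, family, read_count) tuples.
    On ties, the sequence listed first is kept.
    """
    best = {}
    for seq_id, family, reads in candidates:
        if family not in best or reads > best[family][1]:
            best[family] = (seq_id, reads)
    return {family: seq_id for family, (seq_id, _reads) in best.items()}
```

For example, `shortlist([("seq_a", "F1", 120), ("seq_b", "F1", 45), ("seq_c", "F2", 30)])` keeps `seq_a` for family `F1` and `seq_c` for family `F2`.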

Summary

We have shown that the PipeBio platform makes it easy to identify the most promising sequence candidates from biopanning NGS data. We were able to narrow down nearly 11 million paired-end reads to 8 candidate sequences, one of which was the Ty1 sequence. The bioinformatics analysis described here took just under 8 hours and required no coding or scripting experience.

Summary of the bioinformatics selection pipeline to narrow down nearly 11 million paired-end reads to 8 most promising candidate sequences.
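
The read counts behind this funnel can be tallied directly from the two tables above. The short Python snippet below only sums the published per-sample numbers; the variable names are chosen for the example.

```python
# Per-sample counts copied from the tables in this post.
read_pairs = {
    "SRR11974622": 1_371_777,
    "SRR11974623": 2_206_992,
    "SRR11974624": 4_053_568,
    "SRR11974625": 3_193_305,
}
merged_reads = {
    "SRR11974622": 1_059_733,
    "SRR11974623": 1_553_531,
    "SRR11974624": 3_060_933,
    "SRR11974625": 2_447_743,
}
passing_reads = {
    "SRR11974622": 729_650,
    "SRR11974623": 989_317,
    "SRR11974624": 2_417_651,
    "SRR11974625": 1_930_294,
}

total_pairs = sum(read_pairs.values())
total_merged = sum(merged_reads.values())
total_passing = sum(passing_reads.values())

# Stages after filtering, as reported in the text.
funnel = [
    ("read pairs", total_pairs),
    ("merged reads", total_merged),
    ("reads passing quality filters", total_passing),
    ("CDR-H3 clusters", 285_769),
    ("differentially enriched clusters", 87),
    ("unique nanobodies selected", 506),
    ("nanobodies with more than 10 reads", 32),
    ("shortlisted candidates", 8),
]

for stage, count in funnel:
    print(f"{stage:36s}{count:>12,}")
```

The four samples add up to 10,825,642 read pairs, which is the "nearly 11 million" quoted in the summary.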

If you are interested in more details about this analysis, or would like to run a similar analysis on your own data, contact the PipeBio team or sign up for a trial.