A key question in therapeutic antibody drug discovery is how to identify the most promising sequence candidates from a biopanning experiment using next-generation sequencing (NGS) data. This can easily be done using built-in tools in the PipeBio platform, without prior bioinformatics or coding experience. In this post, we’ll re-analyze the NGS data published in Hanke et al., where an antibody from the immunized alpaca named ‘Tyson’ was shown to neutralize the SARS-CoV-2 spike protein. We identify a shortlist of novel sequence candidates, in addition to the Ty1 sequence found in the paper.
Data used in the analysis
The dataset consists of four samples:
- two replicates prior to the first panning round (pre-panning)
- a sample after the first panning round
- a sample after the second panning round
We imported the data into the PipeBio platform, and merged the forward-reverse pairs. A summary of the data is shown in the table below.
Annotation and clustering
Next, we annotated the samples in PipeBio, using the alpaca germline database. Reads were filtered, so only reads with a complete VHH without stop codons and frameshifts were kept in each sample. The number of sequences annotated with VHH and passing the quality filters is shown below.
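To illustrate the kind of check behind this filter, here is a minimal Python sketch. It is not PipeBio's implementation: the function names are hypothetical, and it assumes the VHH nucleotide region has already been extracted by annotation. It treats a length that is not a multiple of three as a crude proxy for a frameshift, whereas a real annotation tool would detect frameshifts from the germline alignment. Rejecting codons with ambiguous bases is a further assumption of this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt):
    """Translate a nucleotide string in frame 1; unknown codons become 'X'."""
    nt = nt.upper()
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3))


def passes_quality_filter(vhh_nt):
    """Hypothetical filter: in-frame length, no stop codon, no ambiguous codon."""
    if len(vhh_nt) == 0 or len(vhh_nt) % 3 != 0:
        return False  # crude frameshift proxy (assumption of this sketch)
    protein = translate(vhh_nt)
    return "*" not in protein and "X" not in protein
```

For example, `passes_quality_filter("ATGGCC")` accepts the read, while `"ATGTAA"` (stop codon) and `"ATGGC"` (length not divisible by three) are rejected.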
We then clustered the reads from the four samples together on the CDR-H3 region, using an 85% identity cutoff. This identified 285,769 clusters in the data. Of these, the 20 largest clusters accounted for over 50% of the sequences.
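For readers curious what identity-based clustering looks like in code, the sketch below shows one common approach: greedy clustering around the most abundant sequences. The clustering algorithm PipeBio uses is not described here, so this is an assumption, as is the definition of identity (one minus the edit distance divided by the longer sequence length). The function names are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def identity(a, b):
    """Fractional identity: 1 - edit distance / longer length (assumed definition)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster_cdr3(cdr3_reads, cutoff=0.85):
    """Greedy clustering: each unique CDR-H3, most abundant first, joins the
    first centroid it matches at >= cutoff identity, or founds a new cluster."""
    clusters = {}  # centroid sequence -> list of (sequence, read count)
    for seq, n in Counter(cdr3_reads).most_common():
        for centroid in clusters:
            if identity(seq, centroid) >= cutoff:
                clusters[centroid].append((seq, n))
                break
        else:
            clusters[seq] = [(seq, n)]
    return clusters
```

This all-against-centroids loop is fine for a toy example but would be far too slow for millions of reads, where dedicated clustering tools use indexing and heuristics instead.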
Differential enrichment analysis
Next, we carried out differential enrichment analysis, using the two pre-panning replicates as one group. Because we wanted to detect clusters that become increasingly enriched with each panning round, we carried out the following statistical comparisons:
- Panning round 2 vs. Pre-panning
- Panning round 1 vs. Pre-panning
- Panning round 2 vs. Panning round 1
Using a significance cutoff of FDR-corrected p-value < 0.05 for each comparison, we identified 87 CDR-H3 clusters that were enriched at least 200-fold from the pre-panning library to panning round 2, at least 2-fold from pre-panning to round 1, and at least 2-fold from round 1 to round 2.
The most differentially enriched clusters (FDR-corrected p-value < 0.05, abs(fold-change) > 200) are highlighted in red in Figure 2. The top rightmost point represents the cluster that includes the Ty1 sequence.
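The combined filter can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's method: the statistical test that produces the p-values is not shown, the Benjamini-Hochberg procedure is assumed as the FDR correction, and the dictionary keys (`fc_r2_vs_pre`, `p_r2_vs_pre`, and so on) are hypothetical names.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Minimum fold-change required in each comparison (thresholds from the text).
MIN_FOLD_CHANGE = {"r2_vs_pre": 200.0, "r1_vs_pre": 2.0, "r2_vs_r1": 2.0}


def select_enriched(clusters, alpha=0.05):
    """Return ids of clusters that are significant and pass the fold-change
    threshold in all three comparisons. Each cluster is a dict with an 'id'
    plus 'fc_<name>' and 'p_<name>' entries (hypothetical layout)."""
    keep = [True] * len(clusters)
    for name, min_fc in MIN_FOLD_CHANGE.items():
        fdr = benjamini_hochberg([c["p_" + name] for c in clusters])
        for k, c in enumerate(clusters):
            if not (fdr[k] < alpha and c["fc_" + name] >= min_fc):
                keep[k] = False
    return [c["id"] for c, ok in zip(clusters, keep) if ok]
```

Requiring all three comparisons to pass is what restricts the result to clusters that keep growing from one round to the next, rather than clusters that jump once and then plateau.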
We retrieved a diverse set of representative sequences from these most differentially enriched clusters using the following filtering criteria. For each of the 87 differentially enriched clusters, we identified the 3 top-ranked unique CDR-H3 amino acid sequences. Each of these sequences potentially represents multiple nanobodies (with mutations outside CDR-H3). From the nanobodies carrying these top-ranked CDR-H3s, we chose the most abundant full VHH amino acid sequences until up to 75% of the whole cluster was covered, with a maximum of 5 sequences picked per cluster.
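The selection rule can be expressed as a short Python sketch. It reflects one reading of the criteria, namely that sequences are added in order of abundance until the picked sequences cover 75% of the cluster's reads or 5 sequences have been chosen. The function name and the input layout (one `(cdr3_aa, vhh_aa)` tuple per read) are hypothetical.

```python
from collections import Counter


def pick_representatives(cluster_reads, top_cdr3=3, coverage=0.75, max_picks=5):
    """Pick representative VHH sequences for one cluster.

    cluster_reads: list of (cdr3_aa, vhh_aa) tuples, one per read.
    Returns a list of (vhh_aa, read count), most abundant first.
    """
    total = len(cluster_reads)
    # Step 1: the top-ranked unique CDR-H3 amino acid sequences.
    cdr3_counts = Counter(cdr3 for cdr3, _ in cluster_reads)
    top = {cdr3 for cdr3, _ in cdr3_counts.most_common(top_cdr3)}
    # Step 2: full VHH sequences carrying one of those CDR-H3s.
    vhh_counts = Counter(vhh for cdr3, vhh in cluster_reads if cdr3 in top)
    # Step 3: add the most abundant VHHs until the coverage target or cap is hit.
    picked, covered = [], 0
    for vhh, n in vhh_counts.most_common():
        if len(picked) >= max_picks or covered >= coverage * total:
            break
        picked.append((vhh, n))
        covered += n
    return picked
```

In a cluster dominated by one or two sequences the coverage target is reached quickly, so only a couple of sequences are picked. In a cluster with many low-abundance variants the cap of 5 applies instead.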
This selection procedure resulted in a total of 506 unique nanobodies. 32 of these nanobodies were supported by more than 10 reads in the data.
An alignment of the 32 sequences shows that they belong to a handful of distinct families. The selected sequence in the alignment is the Ty1 sequence. The CDR-H3 region is highlighted in the alignment.
A shortlist of candidate sequences from biopanning
The last step in our analysis is to identify a shortlist of candidate sequences best representing the families in the 32 selected differentially enriched VHH sequences.
We extracted the most abundant representative from each of the 8 main families in the alignment. The alignment of the candidate sequences below illustrates the sequence diversity in the CDRs, all obtained from this single panning experiment.
We have shown that the most promising sequence candidates from biopanning can be identified easily by analyzing the NGS data in the PipeBio platform. We narrowed down nearly 11 million paired-end reads to 8 candidate sequences, one of which was the Ty1 sequence. The bioinformatics analysis described here took just under 8 hours and required no coding or scripting experience.
If you are interested in more details about this analysis, or would like to do similar analysis on your own data, then contact the PipeBio team here.