NGS is often used to explore the depth of the immune response against a given antigen, and the cost of sequencing millions of antibodies is very low. The drawback is that it is neither practical nor economically feasible to perform functional assays on millions of individual antibodies or clones. The typical approach is therefore to look at the most abundant sequences in the NGS pool, synthesize a selection of the most abundant clones, and then perform functional assays on a few hundred of them. Single-cell technologies from companies such as 10x Genomics and Berkeley Lights can be used to analyze thousands of individual B cells, although this is still far from millions of individual cells.
An alternative, cost-effective approach is to use already characterized Sanger sequences as input for finding similar sequences in a large pool of NGS data.
Fishing for good clones in a pool of repertoire sequences
When fishing for good sequences in large repertoires, there are a couple of different approaches. One approach is to pool Sanger and NGS sequences and cluster them all together. In this case the identity threshold applies to the relationship between each sequence and the cluster centroid, which is often the most abundant sequence. The Sanger sequence will therefore be closely related to the centroid, but some of the other sequences in the cluster may not be related to the Sanger sequence within the specified identity threshold.
Another approach is to seed individual clusters from single Sanger sequences. This places the Sanger sequence at the centroid of the cluster and ensures that all sequences in the cluster relate to the Sanger sequence within the identity threshold specified in the clustering parameters.
These two scenarios are illustrated here
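The difference between the two scenarios can also be sketched in a few lines of Python. This is a toy illustration and not the PipeBio implementation: the `identity` function (fraction of matching positions, equal-length sequences only), the greedy abundance-ordered clustering, the 0.8 threshold, and all CDR-H3 strings are assumptions made up for the example.

```python
from collections import Counter

def identity(a: str, b: str) -> float:
    """Fraction of identical positions. Sequences of different length
    score 0.0, which is a simplification made for this sketch."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def seeded_clusters(seeds, pool, threshold):
    """Seeded approach: every seed is a centroid, and a pool sequence
    joins a cluster only if it is within the threshold of that seed."""
    return {
        seed: sorted({s for s in pool
                      if s != seed and identity(seed, s) >= threshold})
        for seed in seeds
    }

def pooled_clusters(sequences, threshold):
    """Pooled approach: greedy clustering in order of decreasing abundance.
    A sequence joins the first centroid within the threshold, otherwise
    it becomes a new centroid."""
    counts = Counter(sequences)
    clusters = {}
    for seq, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for centroid in clusters:
            if identity(centroid, seq) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters

# Hypothetical CDR-H3 sequences, invented for illustration only.
sanger = "ARDYYGSGAF"                      # characterized Sanger clone
abundant = "ARDYYGSGSY"                    # most abundant NGS sequence
ngs = [abundant] * 3 + ["TKDYYGSGSY", "ARDYYGSGAY"]

seeded = seeded_clusters([sanger], ngs, threshold=0.8)
pooled = pooled_clusters([sanger] + ngs, threshold=0.8)

print("seeded:", seeded)
print("pooled:", pooled)
```

In the pooled result the abundant NGS sequence becomes the centroid, and the cluster also contains `TKDYYGSGSY`, which is only 60% identical to the Sanger sequence. In the seeded result that sequence is excluded, because every member must be within the threshold of the Sanger sequence itself.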
How many related sequences can we find in a pool of NGS data?
We use data from Miyazaki et al. (2015).
Here, we use 11 Sanger sequences as “seed sequences” for clustering to find related antibodies in a larger antibody repertoire.
The comparison is carried out on the CDR-H3 region. As seen in the table below, there is redundancy in the CDR-H3 sequences of the Sanger data, and some of the CDR-H3 sequences are identical. As a result, some clusters are “empty” and consist only of the seeding Sanger sequence. Ideally, only sequences with unique CDR-H3 regions should be used, but this depends on the questions asked in the analysis. Clustering on the entire VH region or on other clonotype definitions will give a different result.
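One way to spot redundant seeds before clustering is to group the Sanger sequences by their CDR-H3. The sketch below is a minimal illustration with hypothetical sequence IDs and CDR-H3 strings; it does not use the actual data from the study, and the helper name `group_seeds_by_cdr3` is an assumption made for this example.

```python
from collections import defaultdict

def group_seeds_by_cdr3(seeds):
    """Map each distinct CDR-H3 to the IDs of the Sanger sequences carrying it."""
    groups = defaultdict(list)
    for seq_id, cdr3 in seeds:
        groups[cdr3].append(seq_id)
    return dict(groups)

# Hypothetical (ID, CDR-H3) pairs, invented for illustration only.
seeds = [
    ("sanger_01", "AADRGYSGSYYYTDY"),
    ("sanger_02", "AADRGYSGSYYYTDY"),
    ("sanger_03", "NAKGWLRPSEYDY"),
    ("sanger_04", "AADRGYSGSYYYTDY"),
]

groups = group_seeds_by_cdr3(seeds)

# Keep one representative seed per unique CDR-H3.
unique_seeds = [(ids[0], cdr3) for cdr3, ids in groups.items()]

print(f"{len(seeds)} seeds, {len(groups)} unique CDR-H3 sequences")
for cdr3, ids in groups.items():
    print(cdr3, ids)
```

Seeds that share a CDR-H3 would compete for the same NGS sequences, so using one representative per group avoids clusters that contain only the seed.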
Clusters are formed around five unique CDR-H3 sequences from the initial Sanger sequences, and three of those clusters contain diversity, as seen in the “clusterAaDistribution” column.
In one cluster we find 37 different CDR-H3 sequences, a number that can easily be analyzed in depth with various functional assays.
Extracting the sequences from that cluster and aligning them shows the sequence diversity within the cluster. The top sequence in the alignment is the Sanger sequence, labelled “Binder” and “Cherry picked”. Functional or assay data associated with the individual Sanger sequences is kept during the clustering analysis and alignment, making it easy to spot trends, e.g. among good binders. Functional data in the alignment view is not shown here.
Hit pick and synthesize clones
The next step is to hit-pick and extract interesting sequences/clones and have the sequences synthesized. The synthesized sequences can then be sub-cloned and additional functional analysis can be performed. Functional data can then be reimported into the PipeBio platform and associated with the hit-picked clones.
Hit-picked clones with functional data become a valuable resource for future analyses.
The analysis steps of the above experiment are very easy to perform and some of the steps are shown here
Miyazaki, N., Kiyose, N., Akazawa, Y., Takashima, M., Hagihara, Y., Inoue, N., Matsuda, T., Ogawa, R., Inoue, S., & Ito, Y. (2015). Isolation and characterization of antigen-specific alpaca (Lama pacos) VHH antibodies by biopanning followed by high-throughput sequencing. Journal of Biochemistry, 158(3), 205–215. https://doi.org/10.1093/jb/mvv038