How to analyze 1.6 billion sequences of NGS data in a day

We are excited to share the latest advances in PipeBio’s capabilities for NGS data analysis. Our customers now routinely analyze Illumina NovaSeq data at a scale of 1.3 to 1.6 billion reads.

Analyzing data at this scale is a challenge in itself. Fortunately, PipeBio's workflow tools automate most of it. Analysis of this amount of data typically takes around 24 hours, although the subsequent interpretation of the results by scientists to pick hits can take a few more days.

Image illustrating analysis of 1.6 billion sequences in one day with a DNA sequence icon and a clock with the labels 1.6bn and ~1d displayed on top of the icons — **Figure 1.** Analysis of 1.6 billion NGS reads can be done in around one day on PipeBio

We work with leading sequencing vendors to import NovaSeq reads directly into PipeBio, saving hours of processing time compared to downloading and re-uploading the files, which are typically 50-150 GB in size.

Once Sanger or NGS sequence data has been uploaded to PipeBio, our customers can run their analysis workflows with the suite of tools readily available on the platform. These include tools for quality control (QC), alignment, annotation and enrichment, among others.

Out-of-the-box tools and NovaSeq

In the case of NovaSeq, however, we typically need to help our customers design a custom workflow. Although PipeBio has most tools available “out of the box”, NovaSeq analysis often requires linking these tools together in more complicated ways. Completing a custom configuration typically takes a day or two. To ensure that the setup yields the expected results, we also perform dry runs with the data. In addition to helping configure the pipeline, we can give technical advice on questions such as required read depth and primer positions.

Consequently, we are able to support advanced studies with complicated comparisons across multiple panning rounds. In these studies the read data can range from tens to hundreds of gigabytes in size. These capabilities add to PipeBio’s existing strong suite of QC and analysis tools for Sanger and NGS deep sequencing data and enable end-to-end analysis workflows.

Common challenges with NGS and deep sequencing

The ability of NovaSeq to sequence up to 6 terabases and 20 billion paired-end reads in less than two days makes the platform extremely powerful. However, it also raises the requirements for data analysis capabilities.

Higher throughput, more data

The unprecedented throughput of NovaSeq, combined with multiplexing through barcoding technology, makes large, comprehensive analysis studies possible and enables sequencing across multiple animals, multiple panning strategies and multiple panning rounds all in one go. Paired-end sequencing with forward and reverse reads gives greater confidence in read accuracy and, unlike single-read data, allows detection of indels. Paired-end sequencing also gives you twice the number of reads for the same amount of time spent in library prep.

All this means that data can be obtained faster from your affinity-based library screening, but it also comes with data management challenges: the analysis workflow must take into account the complex metadata relating to each pair of FASTQ files.

Image showing file attributes for mapping sequence samples in a biopanning experiment — **Figure 2.** The ability to correctly map FASTQ files to their metadata is essential for complex workflows
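As a rough illustration of the kind of mapping involved (not PipeBio's implementation), the sketch below pairs forward and reverse FASTQ files by name and attaches sample metadata to each pair. The `_R1`/`_R2` file naming convention, the sample names and the metadata fields are all assumptions made for this example.

```python
import re

# Hypothetical naming pattern: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq(\.gz)?$")


def pair_fastq_files(filenames, metadata):
    """Group R1/R2 files per sample and attach that sample's metadata.

    Raises ValueError if a file name does not match the pattern, a mate
    file is missing, or a sample has no metadata entry.
    """
    pairs = {}
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            raise ValueError(f"Unrecognized FASTQ file name: {name}")
        reads = pairs.setdefault(match["sample"], {})
        reads["R" + match["read"]] = name

    mapped = []
    for sample, reads in sorted(pairs.items()):
        if set(reads) != {"R1", "R2"}:
            raise ValueError(f"Sample {sample} is missing a mate file")
        if sample not in metadata:
            raise ValueError(f"Sample {sample} has no metadata")
        mapped.append({"sample": sample, **reads, **metadata[sample]})
    return mapped


# Illustrative metadata and file names, invented for this example.
metadata = {
    "A1_round1": {"animal": "A1", "strategy": "S1", "round": 1},
    "A1_round2": {"animal": "A1", "strategy": "S1", "round": 2},
}
files = [
    "A1_round2_R2.fastq.gz",
    "A1_round1_R1.fastq.gz",
    "A1_round1_R2.fastq.gz",
    "A1_round2_R1.fastq.gz",
]
samples = pair_fastq_files(files, metadata)
```

Failing loudly on an unmatched file or a missing metadata entry is a deliberate choice here: in a study with many samples, a silently dropped pair is much harder to notice later than an error at import time.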

A typical experiment incorporates at least two panning rounds. In addition, several animals, tissue samples and strategies may be included to find the right candidate sequences in, for example, antibody discovery research. This type of workflow requires deep comparisons that span multiple levels and run in parallel (as illustrated below in Figure 3), which is challenging.

An illustration of a complex biopanning strategy with multiple panning rounds by phage display technology — **Figure 3.** The comparisons can span multiple panning rounds and strategies and amount to billions of reads

Analyzing deep sequencing data

Fortunately, this is something that PipeBio can help you with. The biopanning workflow in PipeBio takes into account the metadata associated with the samples and enables automated high-throughput workflows for anything from several libraries of tens of millions of reads down to a few hundred sequences.

Normalization can be run across varying sample sizes, so imported sample libraries may differ in size. PipeBio’s suite of tools supports all workflows from manual analysis of Sanger sequences to high-throughput NGS data analysis, so after completing the desired rounds of panning and arriving at sufficient enrichment, it is possible to augment the NGS data with Sanger sequencing data.
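One common way to compare libraries of different sizes is to convert raw counts to relative frequencies before computing enrichment between rounds. The sketch below illustrates that general idea. It is an assumption about a typical approach, not a description of how PipeBio normalizes data, and the sequence labels, counts and pseudocount are invented.

```python
from collections import Counter


def relative_frequencies(counts):
    """Convert raw per-sequence counts to fractions of the library."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("Library is empty")
    return {seq: n / total for seq, n in counts.items()}


def enrichment(earlier, later, pseudocount=1):
    """Fold change in relative frequency from an earlier to a later round.

    A pseudocount is added to every count so that sequences absent from
    one round do not cause a division by zero.
    """
    sequences = set(earlier) | set(later)
    early = relative_frequencies(
        {s: earlier.get(s, 0) + pseudocount for s in sequences}
    )
    late = relative_frequencies(
        {s: later.get(s, 0) + pseudocount for s in sequences}
    )
    return {s: late[s] / early[s] for s in sequences}


# Two invented libraries of different sizes (1,000 and 10,000 reads).
round1 = Counter({"CDR3_A": 10, "CDR3_B": 980, "CDR3_C": 10})
round2 = Counter({"CDR3_A": 4000, "CDR3_B": 5000, "CDR3_C": 1000})

fold = enrichment(round1, round2)
ranked = sorted(fold, key=fold.get, reverse=True)
```

In this toy data `CDR3_B` has the highest raw count in round 2 but is depleted relative to round 1, which is exactly the kind of effect that raw counts hide when libraries differ in size.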

What does the NGS workflow look like on PipeBio?

A table showing the biopanning experiment setup with antigen, animal, tissue and choice of strategy leading to several panning rounds

1. Experimental & workflow design

After designing the experiment itself, the first step in an experiment like the one illustrated above is to create a comparison sheet that lists the sample comparisons of interest.
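What a comparison sheet contains depends on the experiment. As a minimal sketch, assuming each sample is described by an animal, a strategy and a panning round (hypothetical field names, not PipeBio's format), comparisons between consecutive rounds could be listed like this:

```python
from itertools import groupby
from operator import itemgetter


def build_comparison_sheet(samples):
    """List (baseline, compare_to) sample pairs for consecutive rounds.

    Samples are grouped by animal and strategy; within each group every
    round is compared with the round that follows it.
    """
    group_key = itemgetter("animal", "strategy")
    ordered = sorted(samples, key=itemgetter("animal", "strategy", "round"))
    rows = []
    for (animal, strategy), group in groupby(ordered, key=group_key):
        group = list(group)
        for earlier, later in zip(group, group[1:]):
            rows.append({
                "animal": animal,
                "strategy": strategy,
                "baseline": earlier["name"],
                "compare_to": later["name"],
            })
    return rows


# Invented example samples.
samples = [
    {"name": "A1_S1_R1", "animal": "A1", "strategy": "S1", "round": 1},
    {"name": "A1_S1_R2", "animal": "A1", "strategy": "S1", "round": 2},
    {"name": "A1_S1_R3", "animal": "A1", "strategy": "S1", "round": 3},
    {"name": "A2_S1_R1", "animal": "A2", "strategy": "S1", "round": 1},
    {"name": "A2_S1_R2", "animal": "A2", "strategy": "S1", "round": 2},
]
sheet = build_comparison_sheet(samples)
```

Generating the sheet from sample metadata, rather than typing it by hand, keeps the list of comparisons consistent as samples are added.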

2. Import FASTQ files

Optionally, also import Sanger sequences and assay data to mine alongside the NGS reads.

A phred-score diagram showing quality scores for NGS reads

3. Review QC report and spot check

You can view graphs on quality scores, chromatograms for individual nucleotides in the alignment view and much more.
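For readers unfamiliar with the quality scores shown in such a report: a Phred score Q corresponds to a base-call error probability of 10^(-Q/10), and standard Illumina FASTQ files encode each score as an ASCII character with an offset of 33. The sketch below decodes a quality string under that assumption. The example string is made up, and the code is independent of PipeBio's own QC tools.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality_line]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


def mean_error_probability(quality_line, offset=33):
    """Average per-base error probability across a read."""
    scores = phred_scores(quality_line, offset)
    if not scores:
        raise ValueError("Empty quality string")
    return sum(error_probability(q) for q in scores) / len(scores)


quality = "IIII5+"  # made-up quality string
scores = phred_scores(quality)
```

Averaging error probabilities rather than the scores themselves gives a more faithful picture of read quality, because Phred scores are logarithmic and a single poor base dominates the expected number of errors.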

A workflow for merging, annotating and clustering NGS sequences on PipeBio

4. Run the analysis in a single pipeline

Sequence viewer showing CDRs and FRs on PipeBio

5. Analyze your data with the interactive interface and out-of-the-box tools

You can screen smaller datasets nucleotide by nucleotide, apply filters, or use features like hit picking based on factors such as enrichment, diversity and Sanger sequences.
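Conceptually, hit picking of this kind amounts to filtering and ranking candidates. The following sketch is a simplified illustration, not PipeBio's algorithm: it keeps candidates above an enrichment threshold and, to preserve diversity, takes at most one candidate per cluster. The field names, thresholds and data are hypothetical.

```python
def pick_hits(candidates, min_enrichment=5.0, max_hits=3):
    """Pick the most enriched candidates, at most one per cluster."""
    ranked = sorted(candidates, key=lambda c: c["enrichment"], reverse=True)
    hits, seen_clusters = [], set()
    for candidate in ranked:
        if candidate["enrichment"] < min_enrichment:
            break  # remaining candidates are even less enriched
        if candidate["cluster"] in seen_clusters:
            continue  # cluster already represented by a better candidate
        hits.append(candidate)
        seen_clusters.add(candidate["cluster"])
        if len(hits) == max_hits:
            break
    return hits


# Invented candidates: seq1 and seq2 belong to the same cluster.
candidates = [
    {"id": "seq1", "cluster": "c1", "enrichment": 40.0},
    {"id": "seq2", "cluster": "c1", "enrichment": 35.0},
    {"id": "seq3", "cluster": "c2", "enrichment": 12.0},
    {"id": "seq4", "cluster": "c3", "enrichment": 2.0},
]
hits = pick_hits(candidates)
```

Here `seq2` is skipped because its cluster is already represented by `seq1`, and `seq4` falls below the enrichment threshold.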

6. Export and share the data

After completing the analysis on PipeBio, the data or parts of it, for example specific rows or columns, can easily be exported in convenient formats (.tsv, .csv, Excel), so you can share results with colleagues and have them in the desired format for further analysis, if needed.
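Once exported, such a table is easy to post-process with a script. The sketch below shows one way to write selected rows and columns as tab-separated text with Python's standard `csv` module. It is only an example of downstream handling, and the column names and rows are invented.

```python
import csv
import io


def export_rows(rows, columns, handle, delimiter="\t"):
    """Write the selected columns of each row as delimited text."""
    writer = csv.DictWriter(
        handle,
        fieldnames=columns,
        delimiter=delimiter,
        extrasaction="ignore",  # silently drop columns that were not selected
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


# Invented rows; only the "id" and "enrichment" columns are exported.
rows = [
    {"id": "seq1", "cdr3": "ARDYW", "enrichment": 40.0, "notes": "x"},
    {"id": "seq3", "cdr3": "ARGGW", "enrichment": 12.0, "notes": "y"},
]
buffer = io.StringIO()
export_rows(rows, ["id", "enrichment"], buffer)
tsv_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")`; passing `delimiter=","` produces CSV.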

From Sanger and 10x to NGS

We hope that you find PipeBio to be a powerful suite of bioinformatics tools that really does help bench scientists easily analyze sequence reads and assay data. PipeBio has optimized tools for everything from small-scale Sanger QC all the way up to extensive NovaSeq analysis. The tools are simple to use, typically requiring only a few mouse clicks. Additionally, the results are highly interactive, making interpretation easy and fun.