Advances in whole-genome sequencing have revolutionized digital biology.

Genomic programs around the world are gaining momentum as the cost of high-throughput next-generation sequencing has come down.

Whether used for sequencing intensive care patients with rare diseases or in population genetic studieswhole genome sequencing is becoming a fundamental step in clinical workflows and drug discovery.

But sequencing the genome is only the first step. Analysis of genome sequencing data requires accelerated computing, data science and artificial intelligence to read and understand the genome. WITH the end of Moore’s Lawnoting that the number of transistors in an integrated circuit doubles every two years, new computational approaches are needed to lower the cost of data analysis, increase readout throughput and accuracy, and ultimately unlock the full potential of the human genome.

The explosion in Bioinformatics data

Sequencing the entire human genome generates approximately 100 gigabytes of raw data. This more than doubles after genome sequencing with sophisticated algorithms and applications such as deep learning and natural language processing.

As the cost of sequencing the human genome continues to decrease, the amount of sequencing data is increasing exponentially.

Rating 40 exabytes by 2025, all human genome data will need to be stored. For comparison, that’s 8 times more than it takes to store every word spoken in history.

There are many channels for genome analysis trying to keep up with large levels of raw data generation.

Accelerated Analysis of genome sequencing Work processes

Sequencing analysis is complex and computationally intensive, with multiple steps required to identify genetic variants in the human genome.

Deep learning is becoming essential for base calling directly in the genomic tool using RNN and Convolutional Neural Network (CNN) based models. Neural networks interpret the image and signal data generated by the tools and infer the 3 billion nucleotide pairs of the human genome. This improves the accuracy of reads and ensures that base calling occurs in closer to real-time, further accelerating the entire genomics workflow, from sample to variant call format to final report.

For secondary genomic analysis, alignment technologies use a reference genome to help connect the genome after sequencing the DNA fragments.

BWA-MEMleading algorithm for alignment, helps researchers rapidly map DNA sequence reads to a reference genome. STAR is another gold standard alignment algorithm used for RNA-sequencing data that provides accurate, ultra-fast alignment for better understanding of gene expression.

The Smith-Waterman dynamic programming algorithm is also widely used for alignment, a step that is speeded up by a factor of 35 NVIDIA H100 Tensor Core GPUwhich includes a dynamic programming accelerator.

Disclosure Genetic variants

One of the most important steps in sequencing projects is variant identification, when researchers identify differences between a patient sample and a reference genome. This helps clinicians determine which genetic disease a critically ill patient may have, or helps researchers study populations to discover new targets for treatment. These variants can be single nucleotide changes, small insertions and deletions, or complex rearrangements.

GPU optimized and accelerated calls such as GATK of the Shiroky Institute — a genome analysis toolkit for calling germline variants — increasing the speed of analysis. To help researchers eliminate false positives from GATK results, NVIDIA has partnered with the Broad Institute to present NVScoreVariantsa deep learning tool for option filtering using CNNs.

Challenge options based on deep learning, such as Google DeepVariant improve call accuracy without the need for a separate filtering step. DeepVariant uses a CNN architecture to call variants. It can be retrained to fine-tune for increased accuracy with the results of each genomic platform.

Secondary analysis software in NVIDIA Clara Parabricks the toolkit accelerated these challenge variants up to 80x. For example, the runtime of the germline HaplotypeCaller has been reduced from 16 hours in a CPU-based environment to less than five minutes with GPU-accelerated Clara Parabricks.

Accelerating the next wave of genomics

NVIDIA is helping to create the next wave of genomics by providing both short- and long-read sequencing platforms with accelerated AI base calling and variant calling. Industry leaders and startups are partnering with NVIDIA to push the boundaries of whole-genome sequencing.

For example, a biotechnology company PacBio recently announced about Review system, a new long sequence readout system with NVIDIA Tensor Core GPU. With a 20-fold increase in computing power over previous systems, Revio is designed to sequence human genomes with high-accuracy long reads at scale for less than $1,000.

Oxford Nanopore Technologies offers the only technology that can sequence DNA or RNA fragments of any length in real time. These features make it possible to quickly detect more genetic variations. Seattle Children’s Hospital recently used the PromethION high-throughput nanopore sequencing tool to understand a genetic disorder in the first few hours of a newborn’s life.

Ultima Genomics offers high-throughput whole-genome sequencing for as little as $100 per sample Singular genomicsG4 is the most powerful desktop system.

Learn more

on NVIDIA GTCfree artificial intelligence conference to be held online March 20-23, speakers from PacBio, Oxford Nanopore, Genomic England, KAUST, Stanford, Argonne National Labs and other leading institutions will share recent AI advances in genomic sequencinganalysis and genome large language patterns for understanding gene expression.

A. was presented at the conference a speech by Jensen Huang, founder and CEO of NVIDIA on Tuesday, March 21 at 8 a.m. PT.

NVIDIA Clara Parabricks is free for students and researchers. Get started today or try the free hands-on lab experience the toolkit in action.