Fri. Dec 2nd, 2022

Research named a finalist for the Gordon Bell Special Prize for High Performance Computing-Based COVID-19 Research has taught large language models (LLMs) a new lingo, gene sequences, that can unlock insights in genomics, epidemiology and protein engineering.

Published in October, the groundbreaking work is the result of a collaboration between more than two dozen academic and commercial researchers from Argonne National Laboratory, NVIDIA, the University of Chicago and others.

The research team trained an LLM to track genetic mutations and predict variants of concern in SARS-CoV-2, the virus that causes COVID-19. While most LLMs applied to biology today are trained on datasets of small molecules or proteins, this project is one of the first to train a model on raw nucleotide sequences, the smallest units of DNA and RNA.

“We hypothesized that moving from protein-level data to gene-level data could help us build better models for understanding COVID variants,” said Arvind Ramanathan, a computational biologist at Argonne who led the project. “By training our model to track the entire genome and all the changes that occur in its evolution, we can make more accurate predictions not only about COVID, but about any disease with enough genomic data.”

Regarded as the Nobel Prize of high performance computing, the Gordon Bell Prize will be presented this week at the SC22 conference by the Association for Computing Machinery, which represents about 100,000 computing professionals worldwide. Since 2020, the group has also awarded a special prize for outstanding research that advances the understanding of COVID through high performance computing.

Teaching LLMs a Four-Letter Language

LLMs have long been trained on human languages, which usually contain a couple of dozen letters that can be combined into tens of thousands of words and joined into longer sentences and paragraphs. The language of biology, by contrast, has just four letters representing nucleotides: A, T, G and C in DNA, or A, U, G and C in RNA, arranged into different sequences as genes.

While fewer letters may sound like an easier task for AI, language models for biology are actually far more difficult to build. That is because the genome, made up of over 3 billion nucleotides in humans and about 30,000 nucleotides in coronaviruses, is hard to break down into distinct, meaningful units.
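The article does not spell out how the team split genomes into units, but as a rough illustration of the problem, here is a minimal Python sketch of one common approach: breaking a raw nucleotide string into fixed-length k-mers (here codon-sized, three letters) that a language model can treat as tokens. The function name, the choice of k and the validation logic are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (not from the paper): turning a raw nucleotide string
# into fixed-length "words" (k-mers) that a language model can tokenize.
from typing import List

NUCLEOTIDES = set("ATGC")  # DNA alphabet; RNA would swap T for U

def to_kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a nucleotide sequence into non-overlapping k-mers.

    k=3 corresponds to codon-sized chunks; other choices are possible.
    """
    seq = sequence.upper()
    if not set(seq) <= NUCLEOTIDES:
        raise ValueError("sequence contains non-ACGT characters")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

if __name__ == "__main__":
    # A toy fragment, nothing like a real 30,000-nucleotide coronavirus genome.
    print(to_kmers("ATGGCTACCGTA"))  # ['ATG', 'GCT', 'ACC', 'GTA']
```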

“When it comes to understanding the code of life, the main challenge is that the sequencing information in the genome is quite vast,” said Ramanathan. “The meaning of a nucleotide sequence can be influenced by another sequence that is much farther away than the next sentence or paragraph would be in human text. It could be the equivalent of chapters in a book.”

The NVIDIA collaborators on the project developed a hierarchical diffusion method that allowed the LLM to process long strings of around 1,500 nucleotides as if they were sentences.
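The diffusion method itself is not described in detail here, but the windowing idea can be illustrated with a short sketch. Assuming the genome is simply cut into roughly 1,500-nucleotide chunks (the chunk size comes from the article; the function, stride and everything else is hypothetical), it might look like this:

```python
# Illustrative sketch only: cutting a long genome into ~1,500-nucleotide
# windows so each window can be handled like a "sentence".
from typing import Iterator

def genome_windows(genome: str, window: int = 1500, stride: int = 1500) -> Iterator[str]:
    """Yield successive windows of the genome.

    A stride equal to the window gives non-overlapping chunks; a smaller
    stride would give overlapping ones. Both choices are assumptions here.
    """
    for start in range(0, max(len(genome) - window + 1, 1), stride):
        yield genome[start:start + window]

if __name__ == "__main__":
    toy_genome = "ATGC" * 1000  # 4,000 nucleotides, far shorter than a real genome
    chunks = list(genome_windows(toy_genome))
    # 2 full windows of 1,500; the trailing 1,000 nucleotides are dropped in this simple sketch
    print(len(chunks), [len(c) for c in chunks])
```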

“Standard language models have trouble generating coherent long sequences and learning the underlying distribution of different variants,” said paper co-author Anima Anandkumar, senior director of AI research at NVIDIA and Bren Professor in the Computing and Mathematical Sciences department at the California Institute of Technology. “We developed a diffusion model that operates at a higher level of detail, which allows us to generate realistic variants and capture better statistics.”

Predicting COVID Variants of Concern

Using open-source data from the Bacterial and Viral Bioinformatics Resource Center, the team first pre-trained its LLM on more than 110 million gene sequences from prokaryotes, single-celled organisms such as bacteria. It then fine-tuned the model on 1.5 million high-quality genome sequences of the COVID virus.
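As a hedged sketch of that two-stage workflow, the toy Python example below pre-trains a deliberately tiny next-token model on a stand-in for the large prokaryotic corpus and then fine-tunes the same weights on a stand-in for SARS-CoV-2 genomes. The model architecture, data and hyperparameters are placeholders for illustration, not the team's actual setup.

```python
# Hedged sketch of a pre-train-then-fine-tune workflow for nucleotide data.
# Everything here is a toy stand-in, not the GenSLM training code.
import torch
import torch.nn as nn

VOCAB = {c: i for i, c in enumerate("ATGC")}

def encode(seqs):
    """Turn equal-length nucleotide strings into a LongTensor of token ids."""
    return torch.tensor([[VOCAB[c] for c in s] for s in seqs])

class TinyNucleotideLM(nn.Module):
    """A deliberately small next-token model (embedding -> GRU -> logits)."""
    def __init__(self, vocab_size=4, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)

def train(model, batch, steps, lr):
    """Teach the model to predict each next nucleotide from the previous ones."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    inputs, targets = batch[:, :-1], batch[:, 1:]
    for _ in range(steps):
        opt.zero_grad()
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        loss.backward()
        opt.step()
    return loss.item()

if __name__ == "__main__":
    # Stage 1: "pre-train" on a stand-in for the large prokaryotic gene corpus.
    pretrain_batch = encode(["ATGGCTAATTGC", "GGCATTACGATG"])
    # Stage 2: "fine-tune" the same weights on a stand-in for SARS-CoV-2 genomes.
    finetune_batch = encode(["ATGTTTGTTTTT", "ATGTTTGTATTC"])

    model = TinyNucleotideLM()
    print("pretrain loss:", train(model, pretrain_batch, steps=50, lr=1e-2))
    print("finetune loss:", train(model, finetune_batch, steps=50, lr=1e-3))
```

A lower learning rate is used in the second stage, a common (though here purely illustrative) choice when adapting pre-trained weights to a smaller, more specific dataset.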

By pre-training on a larger dataset, the researchers also ensured that their model could generalize to other prediction tasks in future projects, making it one of the first genome-scale models with this capability.

After fine-tuning on the COVID data, the LLM was able to distinguish between the genome sequences of the virus's variants. It was also able to generate its own nucleotide sequences, predicting potential mutations in the COVID genome that could help scientists anticipate future variants of concern.
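Continuing the toy sketch above, generating new nucleotide sequences from a trained next-token model is typically done by autoregressive sampling. The helper below assumes a model like the TinyNucleotideLM from the earlier sketch (mapping a batch of token ids to per-position logits); it is an assumption-laden illustration, not the team's generation code.

```python
# Illustrative sketch: autoregressive sampling of nucleotide sequences from a
# trained next-token model. `model` is assumed to map a (1, T) tensor of token
# ids to (1, T, 4) logits, like the toy TinyNucleotideLM above.
import torch

ALPHABET = "ATGC"

def sample_sequence(model, prompt: str, length: int = 30, temperature: float = 1.0) -> str:
    ids = torch.tensor([[ALPHABET.index(c) for c in prompt]])
    model.eval()
    with torch.no_grad():
        for _ in range(length):
            logits = model(ids)[:, -1, :] / temperature   # logits for the next nucleotide
            next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            ids = torch.cat([ids, next_id], dim=1)
    return "".join(ALPHABET[i] for i in ids[0].tolist())

# Example usage (assuming `model` is the trained TinyNucleotideLM from above):
#   print(sample_sequence(model, prompt="ATG", length=27))
```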

Trained on a year of SARS-CoV-2 genome data, the model can infer the distinctions between different viral strains. Each dot on the left corresponds to a sequenced strain of the SARS-CoV-2 virus, color-coded by variant. The figure on the right zooms in on one particular strain and captures the evolutionary relationships between viral proteins specific to that strain. Image courtesy of Argonne National Laboratory's Bharat Kale, Max Zvyagin and Michael E. Papka.

“Most researchers are tracking mutations in the spike protein of the COVID virus, especially in the domain that binds to human cells,” Ramanathan said. “But there are other proteins in the viral genome that are often mutated and are important to understand.”

The model can also be integrated with popular protein structure prediction models such as AlphaFold and OpenFold, the paper says, helping researchers model the structure of the virus and study how genetic mutations affect its ability to infect its host. OpenFold is one of the pre-trained language models included in the NVIDIA BioNeMo LLM service for developers applying LLMs to digital biology and chemistry applications.

Accelerating AI Training With GPU-Powered Supercomputing

The team developed its AI models on supercomputers powered by NVIDIA A100 Tensor Core GPUs, including Argonne's Polaris, the U.S. Department of Energy's Perlmutter and NVIDIA's in-house Selene system. By scaling up to these powerful systems, they achieved over 1,500 exaflops of performance in training runs, creating the largest biological language models to date.

“Today we are working with models with up to 25 billion parameters, and we expect this number to increase significantly in the future,” said Ramanathan. “The size of the models, the length of the genetic sequences, and the amount of training data needed mean that we really need the computing power provided by supercomputers with thousands of GPUs.”

The researchers estimate that training a version of their model with 2.5 billion parameters took over a month on roughly 4,000 GPUs. The team, which was already investigating LLMs for biology, spent about four months on the project before publicly releasing the paper and code. The GitHub page includes instructions for other researchers to run the model on Polaris and Perlmutter.
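The GitHub instructions cover the team's actual setup; purely as a generic illustration of the kind of multi-GPU scaffolding such runs rely on, the sketch below wraps a placeholder PyTorch model in DistributedDataParallel so each GPU runs one process and gradients are synchronized across them. The model, data and step count are toy stand-ins, not the published training code.

```python
# Generic sketch of multi-GPU data-parallel training with PyTorch's
# DistributedDataParallel. NOT the team's training stack; it only illustrates
# the scaffolding such runs need.
# Launch with e.g.:  torchrun --nproc_per_node=4 train_ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real run would build a genome-scale language model here.
    model = torch.nn.Linear(128, 4).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(10):                         # toy loop; real runs take weeks
        x = torch.randn(32, 128, device=local_rank)
        y = torch.randint(0, 4, (32,), device=local_rank)
        opt.zero_grad()
        loss_fn(model(x), y).backward()            # gradients are all-reduced across GPUs
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```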

Available in early access through the NVIDIA NGC hub for GPU-optimized software, the NVIDIA BioNeMo framework helps researchers scale large biomolecular language models across multiple GPUs. Part of the NVIDIA Clara Discovery collection of drug discovery tools, the framework will support chemistry, protein, DNA and RNA data formats.

Find NVIDIA at SC22.
