Pangenome: a new, more complete and global 'map' of the human genome that will help medicine

HEALTH / By Paul Torres

April 14, 2003 is an important date in the history of science. On that day, a little over 20 years ago, the completion of the Human Genome Project was announced: the essential sequence of our DNA had been deciphered after many years of effort.

However, this 'map', which cost 3 billion dollars and came to be used as the reference for the human genome, was not complete. It had gaps in several genetic regions and was based primarily on the DNA of a few individuals of European origin. And although technological advances over these two decades have made it possible to 'map' those gaps – the complete sequence was obtained in 2022 – and to make the process cheaper, a more global and diverse reference was still missing.

From now on, thanks to an international consortium funded by the US National Human Genome Research Institute, this resource – a reference pangenome – will be available.

The new 'map', presented as a first draft, includes the complete genetic sequence of 47 individuals of different origins. This provides detailed information on 94 genomes, because each individual carries two copies of their genes in their DNA, inherited from their father and mother. The aim of the project is to keep adding data to the 'map': by mid-2024 it is expected to include genetic information from 350 people of diverse ethnic ancestry.

“Until now, the reference used by the scientific community was incomplete and lacked diversity,” Benedict Paten, associate director of the University of California's Santa Cruz Genomics Institute and one of the project leaders, told a news conference. This new resource, on the other hand, provides a more complete picture and will allow for more accurate analyses when characterizing the genetic variability of the human population, whatever its origin, he highlighted.

In fact, the new pangenome has already brought to light more than 100 million new bases (the individual letters that make up the genome) and has uncovered new alleles in structurally complex regions that until now were not included in the reference genome. Details of the research are published in four articles in the latest issues of the journals Nature and Nature Biotechnology.

Using state-of-the-art computational techniques, the researchers have built a resource that, instead of being a single linear sequence like the GRCh38 reference used until now, offers different versions of the same sequence at the same time, giving researchers a greater range of options for their analyses. A team from the Barcelona Supercomputing Center (BSC) led by Santiago Marco-Sola has participated in the project.

What does the new pangenome mean for research?

“Until now we have been content with a single genome sequence that was once arbitrarily decided to be the reference sequence, made up of bits of sequence from a handful of people of mainly European descent. And while this has been very useful, it also has many limitations,” says Jorge Ferrer, a researcher at the Barcelona Center for Genomic Regulation (CRG). “For example, surprising as it may seem, each of us may be missing a few very large chunks of the genome, or carry a few extra ones. If the piece of genome chosen to be the reference comes from someone who does not have that piece (or has it sufficiently altered), the reference map we currently use would not work for a person who has a mutation that affects that part,” he clarifies. To further complicate matters, he continues, “the genome can vary enormously in different parts of the world. And if the reference map is made with European variants, it is less useful for interpreting the genome of a person from Cameroon or China.”

The current work, says Ferrer, “is the first step toward solving these problems”. “They have created a complex system that allows one person's genomic sequence to be matched against all of these possible human sequences, rather than just one, and the consortium plans to extend this strategy to the sequences of many more individuals.”

A resource for medicine

For José Manuel Castro Tubío, leader of the Genomes and Disease Research Group at the Center for Research in Molecular Medicine and Chronic Diseases (CIMUS) in Santiago, this new resource will help, first of all, to “better understand our identity, to know what makes us genetically different from each other”. And “knowing what makes us different, what sequences of genetic material make us different, will allow us to know things about our evolution and it will also allow us to know things about the genetic diseases that affect us”.

“Genetic variability is associated with biological traits and also with the predisposition to develop diseases,” he explains. “These new genomes that are now being published will make it possible to discover many variants whose associations we still do not know.”

“All the people who are sequencing genomes right now are going to be able to compare their sequences with these reference genomes, which are very well characterized. And that will give us much more information than we have been able to obtain up to now with the reference human genome that was obtained in the early 2000s”, underlines the researcher, who points out that the new resource represents “a more quantitative than qualitative leap”.

“In the last 20 years there have been great technological advances that have made it possible to go from a single incomplete reference genome to 47 complete genomes with a very good level of sequencing precision,” he points out. This has been possible thanks to second- and third-generation sequencing technologies, which have made it possible, for example, to obtain very long DNA reads.

In the reference genome that existed until now, there were large knowledge gaps. “There were sequences that could not be assembled,” explains Tubío. “But the development of third-generation sequencing technology has made it possible to get very long sequencing reads, which allow us to bridge these complex regions and reconstruct the complete architecture of chromosomes. Last year the first complete genome was published thanks to these advances and, continuing with this work, this pangenome is now being presented”, the researcher underlines.

Although the new reference is still a first draft and represents only a small number of individuals, it contains information that will be very useful for advancing biomedical research, concludes the researcher.

What role have the BSC supercomputers played?

“The construction of a pangenome is complex and involves different phases of analysis and processing (some manual and others automatic). In fact, various methods have been tried in order to understand which methods and tools are the most appropriate for its construction and subsequent analysis”, explains Santiago Marco, a specialist in algorithms, bioinformatics and high-performance computing, who points out that his contribution “focuses on the development of high-performance algorithms and software tools, not on observations or biological/genomic results”.

“It must be understood that if the reconstruction of a linear genomic reference (such as the first human genome) requires aligning and assembling hundreds of billions of DNA bases, the construction of a pangenomic reference may require processing orders of magnitude more information. In addition, assembly and processing pipelines are composed of multiple processing phases that require complex and computationally expensive algorithms. For this reason, this project would not be possible without supercomputers, since only supercomputers like the Marenostrum4 have the capacity to process and store such large amounts of data”, explains Marco, who also stresses that “the methods that our research group proposes/researches have been developed and tested thanks to the power of Marenostrum4. Subsequently, these methods have been incorporated and used in this pangenome project. However, the computation and processing of the final results, of this publication in particular, has been carried out in other supercomputing infrastructures (outside Spain)”.