Exploiting Population Genomics to Construct Ideal Reference Genomes

John Hall and Jason de Koning

The human genome reference sequence, published by the Human Genome Project (HGP) in 2001, is widely used as if it were a representative human sequence. For example, when patient genomes or exomes are sequenced, their reads are mapped against the reference genome to identify genetic variants. However, “the” human genome represents the genetic material of only a few individuals who happened to donate DNA to the HGP. Recently, next-generation DNA sequencing has allowed huge numbers of diverse individual genome and exome sequences to be determined, including complete diploid genome sequences from thousands of living humans, complete genomes from several extinct hominid lineages, and exome sequences from nearly 100,000 people. Using these data, we sought to reconstruct reference genome sequences that are more representative of specific human populations, and of humans in general, than the HGP reference. We have also preliminarily reconstructed ancestral human and ancestral hominid genome sequences. The reconstructed ancestral human sequence represents the “most human” genome and is more representative of humans in general than is the HGP reference. Using these reconstructed genomes as reference sequences will aid more natural identification of variants in population and medical genomics studies, and will allow more efficient data storage in human genomics. We make these new reference sequences and related tools available through our laboratory website (http://lab.jasondk.io).