Thursday, November 1, 2012

A genetic cartography of humans



The Phase I paper of the 1000 genomes project has been published in Nature. Like the completion of the first draft of the human genome sequence, this work is a milestone on the path to understanding the complex relationships between genotype and phenotype in our species. The first human sequence gave us, for the first time, a broad view of the genetic constituents of our species, and there is no doubt that it has advanced our understanding in many fields related to human biology and disease. What, then, is the significance of having 999 more genomes? Several journalists have asked me this question in the last few days.

If one wanted to describe our species in purely genetic terms, a single genome would be a good approximation, but only an approximation. We know that we all differ from each other genetically, and that some of these differences explain part of the observable differences (the phenotype). What is the extent and nature of the genetic differences that currently exist, or that existed in the past, in the human population? Which of these differences matter for phenotypic variability, including the propensity to suffer from certain diseases? What fraction of them have no important effect and can vary freely? None of these questions can be answered by analysing a single genome; only the comparison of a large set of genomes can give us a better idea of what the genome of our species is.


The analogy of a map has been used several times to illustrate how the genome sequence has helped us navigate the genome and has enabled dramatic improvements in how we address questions in human biology. I think the analogy is a very good one: a map in itself has only limited scientific value because it is, basically, a description. However, just as ancient maps dramatically affected the course of history, having these maps enables unanticipated scientific discoveries. These first 1000 (1092 to be exact) genomes constitute a first cartography of human genetic variability, providing detailed information on which mutations occur in different populations. The map is not complete, of course, but it offers a good level of resolution. The authors estimate that we now have a catalogue of more than 98% of the mutations that occur at a frequency of at least 1%. To continue with the analogy, what we still miss are the specific details of the coastlines, as if we were seeing them from very far away. This missing variability may be important, since variants involved in deleterious phenotypes (disease) are expected to occur at very low frequencies. Thus the effort to improve this cartography will continue, and 1500 additional genomes are planned within the consortium. In parallel, many other projects, some of them even driven by private individuals, are producing more individual genome sequences. It will be important to ensure that all this information ends up in public repositories, so that it can be exploited efficiently by the scientific community.




The 1000 genomes paper is very descriptive, but it already shows some important results that affect how we think about the relationships between genotypes and phenotypes. The authors report that an individual carries on average 200-300 variants that affect conserved residues in non-coding sequences, and even 2-4 variants that have been associated with disease in other studies. All the sequenced individuals are healthy, so this result tells us about the plasticity of the genome, which can tolerate mutations that may be deleterious in other genetic backgrounds. There is much to learn from this, and the 1000 genomes will be a useful resource for studies trying to associate genetic backgrounds with disease propensity. In addition, the genome sequences carry the footprints of the recent evolution of human populations, and the level of observable variability at a site can be informative about its potential functionality. Thus the possible applications of these data are many and, as I put it to a journalist, the main scientific discoveries enabled by this article are yet to come.

Finally, there is one important aspect to which journalists do not pay much attention. Putting this project together has been a gigantic effort and has required the development of new tools and algorithms to work with this massive amount of data. Only the coordinated efforts of many groups have made this possible. This comes at a time when such tools are desperately needed, given the growing impact of individual genome sequencing in medicine and other fields. Just as an ambitious mission to bring a rover to Mars impacts scientific development beyond the particular purpose of that mission, the tools developed by the 1000 genomes project are already playing a role in hundreds of other genomics projects. Thus the merit of this big consortium project lies not entirely in its immediate scientific discoveries (at times disappointing, because they are inevitably only descriptive) but also in its catalytic effect on a scientific field.