Thursday, November 1, 2012

A genetic cartography of humans

The Phase I paper of the 1000 Genomes Project has been published in Nature. Like the completion of the first draft of the human genome sequence, this work is a milestone on the path to understanding the complex relationships between genotype and phenotype in our species. The first human sequence gave us, for the first time, a broad view of the genetic constituents of our species, and there is no doubt that it has advanced our understanding in many fields related to human biology and disease. What, then, is the significance of having 999 more genomes? Some journalists have asked me this question in the last few days.

If one wanted to describe our species purely in genetic terms, a single genome would be a good approximation, but only that: an approximation. We know that we all differ from each other genetically, and that some of these differences explain part of the observable differences (the phenotype). What is the extent and nature of the genetic differences that currently exist, or that existed earlier, in the human population? Which of these differences are important in terms of phenotypic variability, including the propensity to suffer from certain diseases? What fraction of these differences have no important effect and can vary freely? None of these questions can be answered by analysing a single genome; only the comparison of a large set of genomes can give us a better idea of what the genome of our species is.

The analogy of a map has been used several times to illustrate how the genome sequence has helped us navigate the genome and has enabled dramatic improvements in how we address questions related to human biology. I think the analogy is a very good one: a map in itself has only limited scientific value, since it is basically a description. However, just as ancient maps dramatically affected the course of history, having these maps enables unanticipated scientific discoveries. These first 1000 (1092 to be exact) genomes constitute a first cartography of human genetic variability, providing detailed information on which mutations occur in different populations. This map is not complete, of course, but it offers a good level of resolution. The authors estimate that we now have a catalogue of more than 98% of the mutations that occur at a frequency of at least 1%. Continuing with the analogy, what we still miss are the specific details of the coastlines, as if we were seeing them from very far away. This missing variability may be important, since variants involved in deleterious phenotypes (disease) are expected to be at very low frequencies. Thus the effort to improve this cartography will continue, and 1500 additional genomes are planned within the consortium. In parallel, many other projects, and even some private individuals, are producing more individual genome sequences. It will be important to ensure that all this information ends up in public repositories, so that it can be efficiently exploited by the scientific community.

The 1000 Genomes paper is very descriptive, but it already shows some important results that have an impact on how we think about the relationships between genotypes and phenotypes. The authors report that an individual carries on average 200-300 variants that affect conserved residues in non-coding sequences, and even 2-4 that have been associated with disease in other studies. All the individuals sequenced are healthy, so this result tells us about the plasticity of the genome in tolerating mutations that may be deleterious in other genetic backgrounds. There is much to learn from this, and the 1000 genomes will be a useful resource for studies trying to associate genetic backgrounds with disease propensity. In addition, the genome sequences carry the footprints of the recent evolution of human populations, and the level of observable variability at a site can be informative of its potential functionality. Thus the possible applications of these data are many and, as I put it to a journalist, the main scientific discoveries enabled by this article are yet to come.

Finally, there is one important aspect that journalists do not pay much attention to. Putting this project together has been a gigantic effort and has required the development of new tools and algorithms to work with this massive amount of data. Only the coordinated efforts of many groups have made this possible. This comes at a time when such tools are desperately needed, given the growing impact of individual genome sequencing in medicine and other fields. Just as an ambitious mission to bring a rover to Mars impacts scientific development beyond the particular purpose of the mission, the tools developed by the 1000 Genomes Project are already playing a role in hundreds of other genomics projects. Thus the merit of this big consortium project lies not only in its immediate scientific discoveries (at times disappointing, because they are inevitably only descriptive) but also in its catalytic effect on a scientific field.


Saturday, September 22, 2012

Can genomics save endangered species?

Nowadays genomics is pervading many research fields in biology, and conservation biology is no longer an exception. The giant panda was perhaps the first organism selected for sequencing primarily because of its status as an endangered species. Since then, other species have been selected for sequencing in an effort to contribute to their conservation. To name a few, the Californian condor, the tiger, the Tasmanian devil and the Iberian lynx are also entering the genomic era. Our group is contributing to the efforts of sequencing and analyzing the genome of the Iberian lynx, an emblematic predator of our peninsula which has the dubious honor of being the most endangered feline species on the planet. With a population below 400, a fragmented and restricted distribution area and a dangerously low level of genetic diversity, its situation is rather critical. Two years ago a consortium of Spanish research groups joined forces to sequence this species' genome.

"Candiles" the sequenced Iberian Lynx male 

I have been asked many times whether this effort will definitely save the species, or even whether the money would not be better invested in other efforts. How can a genome help in saving an endangered species? Are we feeding unreasonable expectations about the possible role of genomics in species conservation? Although only time will tell whether such efforts will pay off, I consider that genomics can certainly provide a new, very useful angle on species conservation. In any case, genomics should be considered just another tool for species conservation, rather than the definitive solution. Species are endangered for various reasons, mostly territory loss and degradation, overexploitation, and alteration of their ecological networks. It is obvious that the main focus should be on fighting the causes that triggered the population drops and on creating the conditions necessary for the populations to recover safely. As a powerful tool to understand a species' biology, and as a way to investigate past and current population dynamics, a genome can greatly help in understanding some of the factors that may have been decisive in the population decline. Having a reference genome opens the door to closer genetic monitoring of wild populations, not only because it enables the selection of new marker genes that can be sampled in many individuals, but also because it paves the way for obtaining genome-wide population data through re-sequencing strategies. Indeed, our project already includes the re-sequencing of additional individuals from the main fragmented territories occupied by the species.

Having this kind of data is key to understanding gene flow among the different populations, since it will provide a better picture of their gene pools. This will help to better plan crosses among captive individuals (mainly those with permanent injuries that cannot be successfully released into the wild) and future releases of their progeny. This will have a direct impact in the case of the Iberian lynx, where high levels of inbreeding and low genetic diversity expose fragmented populations to a higher rate of diseases with a genetic basis (particularly a renal disease) and reduce their potential to overcome infectious diseases. Better knowledge of the gene pool of both wild and captive populations will undoubtedly help in guiding strategies to help them recover. In addition, individuals and their territories could be tracked from materials such as faeces or hairs. Other applications may be more specific to a particular endangered species. For instance, in the Tasmanian devil genomics has been used to track a transmissible cancer, a facial tumor disease that is transmitted by biting.

Tasmanian devil with transmissible facial tumor

Other applications of conservation genomics go beyond the sequencing of the endangered species itself and involve monitoring, with similar genomic tools, important pathogens or symbionts of endangered species. Of course, all these efforts will be of little help if the causes that drove the decline are still around. Still, there is a growing number of promising applications of genomics to the conservation of endangered species, some of them already at work. I expect this field to grow fast in the coming years. As a concerned scientist, I am proud that my particular corner of expertise can contribute to the noble cause of helping to preserve the biodiversity of our planet.

Thursday, June 28, 2012

Wrap-up of the orthology, paralogy, and function symposium at SMBE 2012

I promised some people that I would write a short summary of the symposium that Matthew Hahn, Marc Robinson-Rechavi, Iddo Friedberg, and I co-organized at SMBE 2012. I particularly enjoyed the symposium, and the room was pretty full all the time, despite running in parallel with other interesting topics. I will just write an overall summary without going into too much detail on each of the talks, and at the end I will list a number of papers that were mentioned in the various talks. I should clarify that this informal wrap-up only contains my own views and has not been agreed upon with the other organizers. I invite any of the attendees to add comments highlighting important aspects that I may have missed.

I’ll start by summarizing how all this started... which was in a rather unusual way, I believe. Indeed, the idea of the symposium was born in the blogosphere, on Jonathan Eisen’s popular Tree of Life blog, where he invited Matthew Hahn to write a special guest post on the “history behind” his paper on testing the orthology conjecture. One of the conclusions of that paper was that paralogous sequences were more similar in function (and in expression patterns) than orthologs, which contradicted one of the major expectations (and assumptions) behind the theories of duplication-driven functional divergence and the strategies for inferring function from orthologous sequences. That paper had already caused a bit of turmoil in the orthology community (I remember this was a hot discussion during the last Quest for Orthologs meeting, in Cambridge), and several concerns were being raised about the suitability of comparing functional annotations from different species, and about the conclusions derived in the paper. Rather rapidly, several people commented on Matt’s post and a lively discussion started (more than 40 comments in total!). The discussion was so interesting that Marc Robinson-Rechavi suggested we should turn this scientific debate into a symposium at one of the upcoming conferences, and that is how some of us started to work on this idea. For me, the symposium was the first time I met the other organizers in person.

The symposium started with Eugene Koonin, who nicely introduced the topic of which conjectures could be implied by the definition of orthology, a purely evolutionary one as introduced by Walter Fitch in 1970. He then showed results from his lab indicating that these conjectures tend to hold, but that there may be exceptions. For instance, the conjecture that orthologs should be best reciprocal hits can be broken by accelerated evolution in one of the true orthologs. He then showed work from other groups (Sali, Sonnhammer) on the higher conservation of structure and domain architecture in orthologs as compared to paralogs. He criticized the use of GO terms by Hahn and others and argued that one should look at a variety of data on function to test the conjecture. He presented results from his own group which show higher conservation of expression across species. He concluded that the functional conjecture still holds, although he observed that the differences may not be spectacular. Katerina Guschanski was next, talking about changes in gene expression following segmental duplications in mammals. Her group has produced an impressive dataset of expression from different tissues in various mammalian species. She used that dataset to ask whether duplication contributes more to divergence than time alone, and showed that levels of expression decreased in younger duplicates and that the changes differed across tissues. She observed no differences between one-to-one orthologs and old duplicate pairs, and she also found no differences in tissue specificity between orthologs and paralogs. Next on stage was Nicholas Furnham, who presented new implementations in FunTree that allow functional evolution to be explored on trees. He warned that the EC classification is not univocal, which can also cause problems for functional comparisons. His group has developed “EC-Blast”, which directly measures distances between enzymatic reactions based on the molecular structures of substrates and products. Christophe Dessimoz presented results from his recent paper, which shows important biases in GO term annotations: genes from the same species and families tend to be annotated with more similar terms because of experimental and author biases. When these biases are corrected for, the conjecture still holds. However, he admitted that the differences were not very big, although still significant. Romain Studer came next. He measured selection and changes in structural stability in orthologs and duplicated genes. He showed that selected sites tend to be more clustered in the structure in paralogs than in orthologs; however, he observed no differences in the evolution of stability between orthologs and paralogues. He concluded that differences between paralogues may be smaller than previously thought.
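The best-reciprocal-hit criterion mentioned in Koonin's talk, and the way accelerated evolution can break it, can be illustrated with a small sketch. Everything here is hypothetical: the function name `reciprocal_best_hits`, the gene names and the similarity scores are invented for illustration and do not come from any of the talks or from a real search tool.

```python
def reciprocal_best_hits(scores_ab, scores_ba):
    """Return pairs (a, b) where b is the best hit of a and a is the best hit of b.

    scores_ab maps each gene of species A to a dict {gene of species B: score};
    scores_ba is the same in the opposite direction. Higher scores mean
    more similar sequences (hypothetical values, e.g. alignment bit scores).
    """
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scenario: a1/b1 and a2/b2 are assumed to be the true ortholog pairs,
# but b2 has evolved fast, so its similarity to everything is low.
scores_ab = {
    "a1": {"b1": 200, "b2": 150},
    "a2": {"b1": 180, "b2": 90},
}
scores_ba = {
    "b1": {"a1": 200, "a2": 180},
    "b2": {"a1": 150, "a2": 90},
}

pairs = reciprocal_best_hits(scores_ab, scores_ba)
print(pairs)  # only the a1/b1 pair is recovered
```

In this toy case the assumed ortholog pair a2/b2 is missed because the fast-evolving b2 is no longer the best hit of a2, which is the kind of exception to the conjecture described above.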

After the coffee break Jianzhi Zhang told us about his work on probing the orthology conjecture. After giving them a try, he gave up on using GO terms because of the many inconsistencies and biases observed. He therefore turned to testing the conservation of protein-protein interactions, using experimentally determined interactions in various yeast species. Unfortunately, the large number of interactions that need to be tested experimentally for duplicated proteins prevented him from showing a comparison of orthologs and paralogs in this talk. Nevertheless, he found that all the PPIs tested for orthologs were conserved; even the apparent exceptions turned out to be caused by possible errors in previous large-scale yeast two-hybrid experiments. Alex Nguyen also showed results on gene duplications in budding yeast. His group focused on a more specific aspect of function: the presence of short conserved linear motifs in proteins. They found that these were more likely to disappear or diverge after a duplication event, consistent with neo- or sub-functionalization models. We moved to Drosophila with our next speaker, Lev Yampolsky, who exploited expression and genomic data from the 12 Drosophila genomes. His group showed larger differences in rates of divergence in paralogs than in orthologs, and these differences were also more asymmetrical. They also found that these differences varied between fast- and slow-evolving families. Finally, they also found larger differences in expression in paralogs. Then it was my turn, and I mainly showed our results on the comparison of expression patterns in human and mouse. Our experimental design differs from others in two ways: first, we use topological dating (not sequence divergence) to establish orthologs and paralogs of a similar age and, second, we always compared orthologs to inter-species paralogs to get rid of species-specific biases in the comparisons. Our results support a larger divergence of paralogs as compared to orthologs in tissue expression patterns. Thanks to our experimental design we could also establish that most of the differences between paralogs were gained shortly after the duplication, linking the duplication event to a large fraction of the divergence. Our last speaker was Paul Thomas, who gave an overview of what one can and cannot expect from GO annotations. He also showed progress on how the consortium is trying to model functional evolution across gene families, and how these models can help in the study of the relationship between orthology, paralogy and gene function.

Thus we had a diverse set of talks, most of them focusing on the comparison of different aspects of functional evolution (GO annotations, expression, functional motifs, interactions, divergence, structure), and using varying experimental designs and species. I would say that one of the main conclusions is that GO annotations (and even EC numbers) can be misleading when assessing functional evolution. My personal view is that most talks showed results consistent with the conjecture, although the differences between paralogs and orthologs were sometimes small. Function can be described at multiple levels, and I would expect functional divergence after duplication to affect only one or a few of these. Thus, an experimental design that focuses on one such level can be expected to miss divergence at the others. In addition, designs that average over all levels will inevitably dilute small but important aspects of functional divergence. In conclusion, this is an exciting topic, and with the number and variety of groups now interested in it, I am sure that we will get closer and closer to understanding the complex relationships between orthology, paralogy and functional divergence.

Some links and papers mentioned during the symposium (I have probably missed some):

Abstracts from oral presentations in SMBE, including our symposium

Another post on the orthology conjecture 

Announcement of our symposium

FunTree: a resource for exploring the functional evolution of
structurally defined enzyme superfamilies.
Furnham N, Sillitoe I, Holliday GL, Cuff AL, Rahman SA, Laskowski RA,
Orengo CA, Thornton JM.
Nucleic Acids Res. 2012 Jan;40(Database issue):D776-82

Brawand et al. The evolution of gene expression levels in mammalian organs.

Forslund et al. Domain architecture conservation in orthologs.

Huerta-Cepas and Gabaldón. Assigning duplication events to relative temporal scales in genome-wide studies.

Nehrt et al. Testing the Ortholog Conjecture with Comparative Functional Genomic Data from Mammals.

Nguyen et al. Proteome-Wide Discovery of Evolutionary Conserved Sequences in Disordered Regions.

Peterson et al. Evolutionary constraints on structural similarity in orthologs and paralogs.

Thomas et al. On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report.

Large-scale analysis of orthologs and paralogs under covarion-like and
constant-but-different models of amino acid evolution.
Studer RA, Robinson-Rechavi M.
Mol Biol Evol. 2010 Nov;27(11):2618-27.

How confident can we be that orthologs are similar, but paralogs differ?
Studer RA, Robinson-Rechavi M.
Trends Genet. 2009 May;25(5):210-6.

Pervasive positive selection on duplicated and nonduplicated vertebrate
protein coding genes.
Studer RA, Penel S, Duret L, Robinson-Rechavi M.
Genome Res. 2008 Sep;18(9):1393-402.

Friday, June 1, 2012

Publicly available or not?

I have always had the naive understanding that databases such as GenBank were public, and that one was free to do research on data accessed from them and eventually publish the results. However, nothing seems to be as simple as that, since many of the genomes deposited there have not been published yet. I have experienced myself, and heard about from many colleagues, problematic situations regarding the use of genome data taken from public databases but not yet published. Current guidelines are open to different interpretations, and different stakeholders (editors, reviewers, users, data producers) may have entirely different and conflicting views. With the current trend we will soon have more unpublished than published genomes in public databases, so I think it is worth re-assessing the policies. Here I share some views.

Policy guidelines regarding the use of genomic sequences prior to publication are available (see the NHGRI rapid data release policy) and set reasonable rules. For instance, data producers should deposit the data publicly and should produce, within a short period of time, a paper that can be cited as the source of the data. This could precede a full genome paper in which a more thorough analysis is presented. Users should not take the public data to publish an analysis focused on that genome, but this situation should not be prolonged too much. The underlying idea is to reserve the opportunity to describe the main characteristics and findings for the researchers who make the effort of sequencing, assembling, and annotating a genome, while ensuring that the data serve the advancement of science by allowing other groups to perform research on the genome data as soon as they are produced. However, there are many interpretations of which uses of the data should be allowed. Moreover, although indicative time-frames for the preferential exploitation of the data are given (e.g. 6 months), these are only indications. In the absence of clear-cut rules, the situation invites conflict. With the current flow of sequencing data, we will increasingly face the situation in which data produced for public use and accessible through public databases are not associated with a paper, so that it is unclear whether their use requires permission. In such situations there may be different interpretations of the existing rules: that of the leader of the sequencing project, that of the researcher accessing the data, that of the agency that financed the sequencing, and even that of the editors and reviewers of papers using available but unpublished data. Below I list some undesirable situations that highlight the contradictions of the current system. These situations are not hypothetical but correspond to real cases that I have experienced or heard about from colleagues.

  • Users of public databases may inadvertently download unpublished data, especially when they download data at large scale. After all, they are using a public repository, and it is contradictory that public databases provide data that are not usable.
  • Most genome sequencing projects are financed with public money or by agencies that require the data to be made publicly available as soon as they are produced, but this leads to the situation above and makes it difficult for sequencing project leaders to know what use is being made of their data.
  • Referees may specifically ask authors to use genomes that are in databases, or simply reject a paper because it does not use this or that “publicly available” genome in the comparative analyses. In addition, referees or editors may ask for evidence of specific permission to use unpublished data.
  • Authors who ask for permission to use an unpublished genome may be required to explain the exact use of the data, which exposes their ideas to possible direct competitors.
  • Leaders of genome projects may feel entitled to ask for authorship in exchange for data that are available in public databases.
  • Leaders of genome projects may intentionally delay the publication of the genome paper to extend the period of preferential use. They may even decide to publish partial analyses before the genome paper.
  • Some unpublished genomes have been in public databases for several years, and still different interpretations are possible as to whether these data can be freely used.
  • Some genomes may never be published in the form of a genome paper, because they were sequenced with a very particular purpose.

In my opinion the current situation is too ambiguous, generates conflicts and ultimately jeopardizes the advance of science. We need clear rules, rather than guidelines, and below I propose four simple rules that would simplify the process.

  • Granting agencies and sequencing centers should specify a reasonable time-frame for preferential use (6-12 months) before the data are released. This should suffice to give the upper hand to the research team that is making the sequencing effort, but will also force them to focus on publishing a genome paper as soon as possible.
  • During this period, sequencing projects may announce the availability of the data for restricted use, through a specific repository that can be accessed only after specific permission is granted. This will enable use of the data from time 0.
  • Data are released to the public repositories (at least in the form of bulk download) only after that period.
  • All data in public repositories should thus be free to be used for any purpose, regardless of whether a genome paper has been published.

Personally, as far as the activity of my lab is concerned, I have decided that we will use any data that have been publicly deposited in GenBank for more than a year, for any purpose other than doing a “genome paper” (of course!). I think this is in perfect agreement with the NHGRI recommendations and will definitely save us time and worries.

Sunday, March 25, 2012

Challenges in phylogenetic tree visualization

I recently read an excellent review by Roderic Page on the challenges in phylogenetic tree representation and visualization. It provides an overview of existing software and tools (although he missed our ETE package; see the image below for an example of ETE's visualization features). The number and diversity of existing tools is overwhelming, but probably matches the diversity of interests and possible applications of phylogenetic trees. One researcher may be interested in overlaying sequence information (see below), while another would be interested in displaying information on the geographical distribution of the species. Some may need to represent uncertainty and overlay different topologies, or use networks to represent transfers of genetic material. The possibilities are unlimited.

Most importantly, he mentions some of the challenges for tree visualization software, such as the ability to represent huge trees and to allow interactive behavior with the user. In our group we have encountered such needs, and this is the reason behind implementing more visualization features in ETE. Fortunately, new technologies are offering new opportunities as well, and I enjoyed imagining the possibilities that 3D visualization and touchscreen technologies will provide to researchers. This is definitely a field to follow.

If you are interested in the topic, I recommend this video.

Saturday, March 10, 2012

Open Letter for Research in Spain

As you have surely heard, Spain is facing a serious crisis in the context of a globalized market economy (yes, there used to be a time when economic crises were related to something more tangible, such as a serious drought or a plague, but now one can only blame abstract fluxes of financial speculation). The new government is preparing a new budget which is predicted to include the most dramatic cuts in our history. Researchers here, who have already been hit by previous cuts (see this letter), are now bracing for the worst.

In this context, an open letter has been put together by the Confederation of Spanish Scientific Societies, the Federation of Young Researchers and others. I recommend that you read it (some of the figures and data cited are very revealing) and, if you agree with it, sign it, as I just did.

 Open letter for research in Spain.

Sunday, March 4, 2012

Darwin's h-index

I guess most scientists are nowadays familiar with the term "h-index", which is a metric based on citations to your published articles. More specifically, the h-index is the largest number h such that h of your articles have at least h citations each. Given that this index is used by many funding agencies and by peers who evaluate you for a position or competitive grant, we all hope to see it grow year by year.
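For readers who prefer code to words, the definition can be written in a few lines of Python. This is only a minimal sketch: the function name `h_index` and the citation counts in the example are made up for illustration and are not taken from any real profile or from how any citation service computes the value.

```python
def h_index(citations):
    """Return the largest h such that h articles have at least h citations each."""
    # Rank the articles from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` articles all have at least `rank` citations
        else:
            break
    return h


# Hypothetical researcher with five articles.
example = [10, 8, 5, 4, 3]
print(h_index(example))  # 4: four articles have at least 4 citations each
```

Note that one extremely cited article barely moves the number: replacing the 10 above with 10,000 still gives an h-index of 4.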

Charles Darwin lived in completely different times. He had no need to apply for grants or positions every few years, and there was no system to track citations or to give a "number" to the supposed "impact" of his research. He has, nevertheless, been absorbed by the current metrics obsession and already has an h-index, computed by Google Scholar.

His magic number is 63. Will this change in any way our idea of how important Darwin's impact on science was? Or will it rather help us to put the h-index into context, and highlight the difficulty of measuring true impact?

Wednesday, February 22, 2012

Phylogenetic Tree Challenge in Encyclopedia Of Life

The Encyclopedia of Life initiative aims to provide an open, digital resource with comprehensive information about the diversity of life. It has recently opened a call for teams that can provide a phylogeny-aware organization of as many scientific names as possible. This text is from the call:

A prize is offered to the individual or team that can provide a very large, phylogenetically-organized set(s) of scientific names suitable for ingestion into the Encyclopedia of Life as an alternate browsing hierarchy.  


Among other factors, the total number of uniquely named nodes, node/leaf ratios and tree height may be used to compare entries so contestants should consider how they wish to trade off strict consensus versus other methods of reflecting the state of phylogenetic knowledge.
Problems to solve include 1) how to assign labels to unnamed nodes, 2) how to fill in gaps so that the set of taxa included is as comprehensive as possible, even if trees are not fully resolved or all taxa have not been analyzed, 3) how to handle competing hypotheses, 4) how to update the hierarchy at least annually.  
The winning submission must be available to EOL and others under an acceptable CC license if it is under copyright.  The tree need not be previously published in peer-reviewed form.
More information is available here.


Wednesday, February 8, 2012

Getting more complex and gaining.... nothing

The origin of complexity is a highly debated issue in biology. For instance, many functions in the cell are carried out by intricate macromolecular complexes formed by a multitude of subunits. When tracing the evolution of such complexes, as we did with mitochondrial Complex I, one often finds that the number of subunits has increased through time. However, the addition of subunits does not always seem to correlate with the acquisition of novel functions, which would provide a selective advantage for the increase in complexity. Can we think of a mechanism that promotes a trend towards increasing complexity in the absence of a selective advantage provided by a novel function?

A recent paper by Finnigan and colleagues shows a plausible mechanism and presents evidence that it may have been responsible for the acquisition of a novel subunit in fungal vacuolar ATPases (depicted below).

 These molecular machines, which pump protons across membranes, have a membrane ring (in green in the figure) formed by 6 units. In vertebrates two different subunits (originating from the duplication of an ancestral gene) form the 6-unit ring in a 1:5 stoichiometry. In fungi a more recent duplication brought about one more subunit type, so that the ring is formed by the products of three different genes in a 1:1:4 organization. Using ancestral sequence resurrection (I love that name!), a technique that consists of reconstructing the most likely ancestral sequences and then synthesizing them in the lab, they show that a single mutation acquired early in each paralogue was sufficient to make the two of them indispensable. Thus, such a model could explain a trend towards increasing complexity in multi-paralogue complexes (those in which some subunits derive from duplicated genes) without a requirement for an initial selective advantage.

 In a way, I see this model as a special type of sub-functionalization. That is, the two new paralogues would together perform the same function that was performed by the ancestral gene. In the absence of more examples we do not know how widespread this mechanism is. However, given that it requires only a few likely events and that it actually constitutes a "ratchet" (as noted by W Ford Doolittle), that is, once you gain that complexity you don't go back, one would expect it to have occurred in several of the many multi-paralogue complexes, at least in some lineages.

 Perhaps this could explain an interesting finding we made some years ago when looking at the evolution of the mitochondrial electron transport chain in fungi (mostly formed by multi-protein complexes): the amount of duplications in members of these complexes was of the same level as in other proteins. This is in contrast to the gene-dosage effect hypothesis, which states that complexes would tend to duplicate only when the stoichiometry is conserved (that is, when the whole complex duplicates, e.g. in whole genome duplications).

 Finally, another remark that I always make when seeing ancestral sequence resurrection working is that the fact that ancestral reconstructions display the expected biochemical activities (e.g. by complementing extant sequences) is an indication that the models of evolution we use are not that wrong after all.


Sunday, January 29, 2012

Interview with Nick Lane

As I reported in an earlier post, I had the opportunity to meet Nick Lane during the Spanish Evolutionary Society meeting. We had a very interesting discussion over a couple of beers around mitochondrial endosymbiosis and the origin of eukaryotes. Some days after the meeting, Andrés Moya, the President of the society, suggested that I interview him for the Society's Bulletin eVolución. You can find this interview translated into Spanish in the current issue of eVolución 7(1); however, I think the interview might be of interest to a broader audience and thus I paste here the original, English version.

TG- After your recent visit to Spain as an invited speaker to the III SESBE congress (Madrid, November 2011), what is your opinion about the field of Evolutionary Biology in Spain?
NL- Well, I thoroughly enjoyed the few talks I attended, but my Spanish is poor and I could hardly judge many of them; and unfortunately I missed much of the conference. But I liked the great range of themes that were being discussed. And in general I am impressed with a lot of evolutionary research going on in Spain. There is a tendency to consider comparative physiology in evolution more than there is in England, for example, and I find that a very insightful approach. One thing that has struck me over the years is that Spanish researchers are not cited as frequently as they ought to be. This does not reflect the quality of the research, but rather the US-dominated English-language citation bias.

TG- Your career has been quite unconventional. Can you summarize for our readers which have been the major steps in your career path?
NL- It sure has! I had a medical research background, and my PhD was on mitochondrial function and oxygen free radicals in transplanted organs. But I was getting nowhere with that, and couldn’t see a way of getting from there into what was really an interest for me: evolutionary biology. So I took to writing instead, for several independent agencies doing medical education for pharmaceutical companies. That was an eye opener, and I learnt to write clearly and quickly, but it was also a frustration. After quite a lot of hard work I finally got a contract to write Oxygen, which was initially conceived as a book about free radicals, mitochondria and medicine, but ended up reflecting my interests in evolutionary biology to a much greater extent. That was the beginning of a decade spent writing books on evolutionary biochemistry, drawing heavily on my background in bioenergetics but ranging widely over any material that interested me. It was fantastic fun but no way to make a living. And ultimately frustrating too, in that in writing on that scope, you can’t help but come up with new ideas, essentially a broad synthesis with gaps, that you sketch in with speculations, which can be reframed as testable hypotheses. That’s what drew me back into research – the frustrated desire to test some of these hypotheses.

TG- Thus, you have been active as a science writer, a researcher, and now you seem to combine both aspects. Do these two tasks reinforce or rather interfere with each other?
NL- Both. I think I’ve benefited tremendously as a researcher from the decade I spent thinking and writing. I now have a coherent set of hypotheses that are testable in one way or another – experimentally or by some kind of mathematical modeling, or just by empirical analysis of existing data. So I’m drawing heavily on this ‘credit’ now. At the same time it is hard to think synthetically or to write books while in research, there are so many demands on time. So on a daily basis, writing and research interfere with each other, but I think if you are able to focus on one or the other for periods then they can, and should, reinforce each other. The trick is to balance each so that they reinforce each other over time. I’m not sure I’ve mastered that trick yet, but it is my long term goal: for me, it is the best way to understand the most interesting evolutionary questions, and that is what I want to do.

TG- In your view, where does the main responsibility for communicating science to society lie (e.g. scientists, funding agencies, scientific societies, science journalists)?
NL- Good question. There is certainly a responsibility, but being responsible counts for nothing if nobody listens to what you have to say: as a writer, you must be interesting to be noticed at all. And society is rarely interested in responsible but boring views. So there is a balance that you have to wrestle with every sentence, between interest and accuracy. That’s another reason I’m happy to be back in research: to write accurately (in precise scientific language) is at least as much pleasure for me as to write interestingly. Frankly it is the questions themselves that interest me. I think that the real challenge in writing for the public is to find ways of phrasing questions in an interesting way, which draws attention to the problem, without sacrificing the accuracy. That is the ideal: responsible (boring) and interesting at the same time.

With respect to which group has the responsibility, I don't think that one group alone can be considered responsible for communicating science to general society. Each group can address different needs, and each has its own responsibility. Scientists are responsible for sculpting new ideas, for conveying the excitement and intellectual thrust of science. The best ideas in science are still driven by individuals with passion, insight and ingenuity, and there is nobody better to convey this intensity to the general reader, although it is rare. Journalists are responsible for balanced reporting, explaining ideas clearly and intelligibly, providing context for the reader, ideally some commentary from other scientists. It is unusual for journalists to drive the scientific agenda, but serious journalists have a broader perspective and can sometimes see things that scientists can't.

Scientific societies can provide very helpful consensus statements on difficult issues, from global warming to the effectiveness of chemotherapy. It's not really for them to give a sense of the cut and thrust of science, more the strength of the conclusions that emerge from the uncertainty.

Finally, funding agencies. In my view, funding agencies have a duty to explain to the public and to politicians that research is open-ended and unpredictable. Research that appears to have little immediate societal impact can have immense and unimagined benefits in the future. Most major scientific breakthroughs, with the greatest economic benefits, came from unexpected quarters, and could not have been anticipated by either the scientists themselves or the funders. This perspective is being lost in a political drive to justify spending by societal impact. As with so much, short-term political cycles are trumping long term good sense. It is up to funding agencies to explain why research should be funded on its own merits, without constant recourse to some hoped-for and probably illusory impact.      

TG- In one of your articles, written to commemorate the 150th anniversary of “The Origin of Species”, you discuss what Darwin would love to know about the origin of the eye if he were still alive. Darwin is credited with being the first to use a “tree of life” to describe the evolutionary relationships of species and their shared ancestry. What do you think he would love to know in this respect if he were still alive?
NL- Well I think he’d love what’s going on in microbial genomics. The picture that has emerged over the last couple of decades of lateral gene transfer and endosymbiosis in microbes is radically different to the idea of gene sequence divergence between populations. Having said that, I see all this as a juxtaposition to standard Neodarwinian population genetics. He would have loved that too, although it is old hat to us now; but given that Darwin knew nothing about genes, he would have been thrilled by the Neodarwinian synthesis, and what amounted to a genetic basis for a tree of life. All of this means that variation is more complex than any of us imagined; and in this sense, Darwin’s coyness on the mechanisms of variation was well placed: it really is wild and fascinating.

  TG- In one of your last books, you mention 10 major transitions in the evolution of life on earth. Which one of them is, according to you, the most enigmatic or difficult to explain?
NL- Consciousness, without a doubt. Frequently the origin of life and consciousness are put forward as the twin pinnacles, the two big unanswered questions in biology. I think we’re actually quite close to understanding the origin of life in conceptual terms, but I personally can’t understand consciousness well at all. I read a lot on the subject and came to the conclusion that nobody really does. We still can’t answer the simple question: how does the depolarization of a neuron give rise to a feeling or sensation of anything at all? They are two different languages, and we don’t seem to have any kind of Rosetta stone at the moment.

TG- Some of these transitions seem to have happened only once in the history of life. If they were so advantageous, why have they been restricted to a single lineage?
NL- I think each transition has to be taken on its own terms. These are tremendously difficult questions and you will find diametrically opposed answers to each question from very insightful researchers. The answers reflect temperament more than anything else. Christian de Duve actually wrote a book called ‘Singularities’, and my reading of that is that there isn’t a single answer that would apply to the origin of life, the origin of photosynthesis, the origin of the eukaryotic cell, the origin of animals, and the origin of consciousness. Obviously for some reason, each was improbable or it would have happened more than once (like eyes), but the reasons for improbability differ and are very dependent on context. In the case of eukaryotes, I would say their unique origin was based on an improbable endosymbiosis between prokaryotes, followed by a problematic reconciliation of selfish interests between two entities that had to live in intimate union. There were no advantages at all until they had come out of that tight bottleneck; on the contrary, all the advantages were with the bacteria that just kept on doing their bacterial thing. From that point of view, the difficult question is why did it happen at all?

TG- Some of your research interests concern very ancient events (e.g. the origin of eukaryotes, of life itself). This is a field in which different hypotheses are difficult to prove right or wrong given the difficulty of direct experimentation. What criteria do scientists in your area use to reach a consensus on how well supported the different scenarios are?
NL- There is a consensus on quite a lot: cell structure, behavior (phagocytosis or sex), genome sequences (albeit with disputes over methodology), the existence of introns in certain positions and so on. Where consensus breaks down is when different methods give different answers. That happens all the time. I’m actually focusing a lot of my attention now on the origin of life itself, because this seems to me to be more experimentally tractable: we can ask specific experimental questions that involve chemistry and thermodynamics, which are much more reliable than biology and genes, so although the event was the most ancient of all, it is not necessarily the most inaccessible. I think we’re making progress on many questions, but in the case of the origin of eukaryotes a lot of the evidence is oblique and disputable. The reasoning is often equivalent to historical reconstruction in that you need to weigh the evidence: there’s no doubt that it happened, and there’s plenty of evidence, it’s just that some of it is unreliable and some is irrelevant, so there’s plenty of scope for argument still.

TG- In this respect, what is the impact on your field of the ever-growing number of genome sequencing projects? Which species or environments would you like to see sampled in order to help answer important questions about the origin and evolution of complex life?
NL- Genome sequences have made a tremendous difference, the only trouble being that they tend to reflect pathogens or industrially interesting bugs, rather than those most relevant to, say, the origin of eukaryotes. I would love to see more genomes from anoxic or anaerobic deep ocean environments, or the deep hot biosphere. I’m especially interested in two questions: the variation in eukaryotic genomes, and the variation in mitochondrial genomes. There is a brilliant and bold hypothesis that the origin of the eukaryotic cell was an endosymbiosis between two prokaryotes, an archaeon host cell and an alpha-proteobacterium (or somesuch). The prediction is that all eukaryotes should have mitochondria or organelles derived from them like hydrogenosomes or mitosomes; and that in terms of mitochondrial genomes we should find more overlap between bacterial metabolic capacity and metabolically versatile mitochondria. This is a wonderful prediction because it is so easy to falsify, and yet all the genome sequencing so far has failed to disprove it. The places most likely to disprove – or prove – it are precisely those anaerobic environments that have been undersampled so far.

TG- Carbon has always been considered a hallmark of life on earth, but life (elsewhere) based on other elements (e.g. silicon) has been speculated about. You seem to favor the idea that oxygen was the molecule that enabled the appearance of complex life on earth. Could you speculate on the theoretical possibility of other molecules playing a similar role in other forms of life?
NL- I think it is most likely that life elsewhere would be constrained by much the same issues that constrain life here. I doubt very much that there will be silicon based life forms. There are two important properties of carbon: it is much better than silicon at organic chemistry; but equally important, it is available in the form of a gaseous oxide, a Lego brick if you will. There are no gaseous silicon oxides, only sand, which is vast and unwieldy in comparison. You can’t build a house on sand and you can’t build an organism from sand. My feeling is that not only is carbon especially useful, it is also more abundant than silicon. Likewise, water is more abundant than methane and a much better solvent (you can’t dissolve carbon chains of more than about 5 carbon atoms in methane). And so on. On the basis of usefulness and abundance, I would argue that life would mostly be carbon based. I would go further to argue that it is likely to require proton gradients over membranes for thermodynamic reasons. When I say that oxygen is necessary for complex life, I mean large active animals. I doubt that anything else could do the job: nothing else could accumulate to the appropriate level in an atmosphere and at the same time be sufficiently reactive to provide the power needed. So I’d say that in terms of their broad biochemistry, alien life won’t be all that different. In terms of morphology or the specifics of their biochemistry, they could be very different, of course.

TG- Are you already working on your next book? Can you tell us something about what it is about?
NL- I’m not writing yet, but I do have a contract… and it will be about everything I have talked about here. The origin of complex life, and why it was a unique event here on Earth.

Tuesday, January 24, 2012

RECOMB 2012 (Barcelona): one week left for early registration

 As I reported in an earlier post, RECOMB 2012 will be held in Barcelona and CRG's Bioinformatics and Genomics program is part of the local organizing committee.
 This post is a reminder that the deadline for early registration at a reduced rate is approaching: it expires on the 31st of January. More information here.

 See you there!

Friday, January 13, 2012

SMBE 2012 early registration deadline and symposium on orthology

 For those who don't know, the deadline for abstract submission to the next Society for Molecular Biology and Evolution meeting (Dublin, 23-26 June) is approaching. I am co-organizing a workshop on orthology/paralogy and function in collaboration with Marc Robinson-Rechavi, Matthew Hahn, and Iddo Friedberg. Find below an invitation to submit to SMBE2012 and more info on this workshop.

 Hope we can meet in Dublin. 

Dear colleague,

We invite you to submit an abstract to the symposium "The complex relationship between orthology, paralogy, and function" to take place at the meeting of the Society for Molecular Biology and Evolution in Dublin (23rd-26th June, 2012).

The deadline to submit an abstract is the 27th of January 2012, for more details please visit:

Symposium "The complex relationship between orthology, paralogy, and function"

Orthology and paralogy have been central concepts in molecular evolution since the distinction was first proposed by Fitch in 1970. A long standing interpretation of this distinction has been that orthologs would be more similar in function than paralogs. Until recently, this interpretation was rarely tested, and in fact rarely explicitly articulated in a testable manner. Yet it has been widely used, from undergraduate teaching, to the practical application of orthology searches for genome annotation. There has been a recent increase in research seeking to define and test this "ortholog conjecture". Notably, a recent paper (Nehrt et al. 2011, PLoS Comput. Biol.) has reported a higher functional similarity of paralogs than of orthologs. This paper has generated much attention and debate, while at the same time recent work on orthologs has shown the vitality and importance of this field to a broad range of applications and questions. Our symposium will feature speakers addressing the fundamental relationships between molecular evolution and biological function, focusing especially on the role of orthology and paralogy in modulating such relationships.

Confirmed speakers: Eugene V. Koonin, Jianzhi Zhang

If you have any questions regarding this symposium please do not hesitate to contact us:

Toni Gabaldon, Matthew Hahn, Iddo Friedberg, Marc Robinson-Rechavi

Tuesday, January 10, 2012

Diversity arises whenever, wherever, and at whatever rate is advantageous

 This is the conclusion from a recent paper from the group of Mark Pagel, in which they analyzed a dataset of body sizes of 3,185 extant mammals in a phylogenetic context.

  They modeled the evolution of body sizes across the  phylogeny using a Bayesian approach that allows evolutionary rates to vary at every branch. 

This provided them with an idea of where bursts of evolution (big shifts in size) had occurred. The main idea was to test a long-held hypothesis that the early radiation of mammals was accompanied by increased rates of body-size variation (i.e. that bursts in species diversity coincided with bursts in body-size change). This was explained by the idea that mammals expanded into a largely unoccupied niche which provided opportunities for diversification. When the niche was filled up, diversification and evolutionary rates decreased.
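To make the idea of branch-specific rates more concrete, here is a minimal sketch of my own. It is not the Bayesian method used in the paper. It assumes the simplest possible model, Brownian motion, under which the variance of a trait accumulates in proportion to branch length multiplied by the rate on that branch. The function name `tip_variance`, the rate scalars and the branch lengths are all hypothetical.

```python
def tip_variance(path, sigma2=1.0):
    """Expected trait variance at a tip under Brownian motion.

    `path` lists the branches from the root to the tip as
    (branch_length, rate_scalar) pairs; a scalar of 1.0 means the
    background rate and a scalar > 1.0 means a burst on that branch.
    """
    return sigma2 * sum(length * rate for length, rate in path)


# Hypothetical lineage with three branches (arbitrary time units).
constant_rate = [(10.0, 1.0), (5.0, 1.0), (5.0, 1.0)]
# Same lineage, with an 8-fold rate increase on the middle branch only.
with_burst = [(10.0, 1.0), (5.0, 8.0), (5.0, 1.0)]

print(tip_variance(constant_rate))  # 20.0
print(tip_variance(with_burst))     # 55.0
```

In a variable-rate analysis the scalars are not fixed by hand as they are here; they are estimated from the data, and branches whose estimated scalar is well above 1 are the candidates for a burst.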

  Results from this team are in stark contrast with such a view, since they see bursts at many different places in the phylogeny, which are uncoupled from the early radiation of mammals.

 Reading this paper was very useful to me since I was at the time preparing the evaluation of a PhD thesis by Victor Soria-Carrasco (see some related paper here) on, precisely, mammalian diversification. In the thesis they found that most mammalian orders showed a decline in the rate of diversification (in terms of the formation of new species), which may seem compatible with the idea of a niche being filled up. This highlights the importance of properly delimiting which evolutionary rates we refer to (sequence variation, variation in some morphological character, speciation rate...), since otherwise we may reach apparently different conclusions. Complicating the issue further, one does not know whether niche limitation may select for or against diversification.
 In any case it is comforting to see that the increasing amount of genetic, phylogenetic, and other types of data, as well as sophisticated models, enable us to explore such interesting issues at the edge between evolution, phylogenetics and ecology. I was really impressed by the works mentioned.