Treevolution: Biology through the evolutionary lens

Response to Late Mitochondrial Origin is Pure Artefact

2016-05-31T12:03:00.000-07:00

We recently published a study showing that alpha-proteobacterial-derived proteins in the Last Eukaryotic Common Ancestor (LECA) tend to have shorter phylogenetic distances to their bacterial counterparts than LECA proteins originating from other Bacteria or Archaea. We interpreted this as evidence suggesting a late acquisition of mitochondria by a host that already contained bacterial- and archaeal-derived protein families. Our work has been heavily criticized by William Martin (one of the main proponents of mito-early hypotheses) and colleagues. The critique was first submitted to Nature, reviewed by editors and independent reviewers, and eventually rejected. The authors have decided to publish a slightly modified version of the letter on bioRxiv. In my opinion the tone of the letter is unacceptable for an open scientific discussion. In any case, the bottom line is that their arguments do not support the claim that our results are artefactual, nor do they show how the purported artefact would produce the observed trend. For the sake of scientific discussion we have decided to publish our original response to their letter. We tried to post it on bioRxiv but it was declined because it "is a rebuttal to a criticism not a research paper". Therefore I have decided to post it here.

Martin et al. criticize several methodological aspects of our study. We first want to note that none of the points raised affects the core of our conclusions (i.e. that differences in stem lengths relate to the phylogenetic origin of LECA families, being shorter in bacterial-derived, and particularly alpha-proteobacterial-derived, families) because i) the observed relationships are independent of the clustering performed in Figure 1 of Pittis and Gabaldón (2016), and ii) their criticism focuses on a single comparison in a single dataset, whereas the differences are present across several datasets and approaches, including the very dataset from the authors mentioned in their letter (Ku et al. 2015), as we show below. Secondly, their interpretation of our stem length measurement, and the way they extrapolate it to branches subtending eukaryotic clades, is conceptually flawed, as we also demonstrate below. Thus none of their arguments compromises in any way the main conclusions of our article. We nevertheless want to discuss their points.

Contrary to what Martin et al. claim, we do not assume that the global distribution of stem lengths is normal. The claim that our statistical analyses are inappropriate is simply not true: we clearly explain all the methods used, and the tests performed to support the observed differences are all nonparametric, without any assumption of normality. In Figure 1 we did use a probabilistic clustering method that fits a Gaussian mixture model (a mixture of normal distributions), assuming multimodality in the data. Martin et al. show that a unimodal log-normal distribution would better fit the data when the number of parameters is penalized. Does this demonstrate that the underlying distribution is not a composite of five Gaussians? No, because when data are drawn from a mixture of five Gaussian distributions with the obtained parameters, in 81% of the cases a log-normal distribution would be (wrongly) preferred using the BIC criterion. Also, the fact that any randomly sampled log-normal distribution could be fitted by a mixture model is by no means a surprise. In fact any distribution of data can be fitted by a finite number of mixture components, and this is precisely why mixture models are commonly used as universal function approximators and as a tool to partition various kinds of data. Finally, overfitting is defined not by BIC inflation but by a lack of predictive power. Thus other criteria have to be considered when assessing whether a model provides a reasonable representation of the data. The use of the EM algorithm is justified as a method for partitioning the data because i) we may expect a composite of signals from a proteome (LECA) with at least two ancestral components (archaeal host and bacterial endosymbiont), and ii) prior studies have suggested that normalized branch length measurements such as the ones used here are approximately normal (Rasmussen and Kellis, 2007). The assumption of a unimodal distribution such as the one proposed by Martin et al. does not capture the expected mixed origins of a chimeric proteome and does not fit with the observation that differences in stem lengths relate to non-homogeneous phylogenetic origins. In any case our results are independent of this clustering exercise, as the differences in stem lengths are apparent when simply grouping the LECA families according to their sister clades (Fig. 2 and Extended Data Fig. 1b of Pittis and Gabaldón, 2016), or when using other ways of clustering the data such as equal binning (results not shown).
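
The BIC argument above can be explored with a small simulation. What follows is only a minimal sketch under stated assumptions, not the code used for our analyses: it uses a plain pure-Python EM for a one-dimensional Gaussian mixture, all function names are hypothetical, and the mixture components in the usage note are toy values rather than the five components fitted in the paper. It draws data from a known Gaussian mixture, fits both a single log-normal and a mixture, and compares their BIC.

```python
import math
import random


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; the model with the lower value is preferred."""
    return n_params * math.log(n_obs) - 2.0 * loglik


def lognormal_loglik(xs):
    """Maximised log-likelihood of a single log-normal (2 free parameters)."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    # At the maximum-likelihood estimates the quadratic term sums to n / 2.
    return -sum(logs) - 0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_em(xs, k, n_iter=100, min_var=1e-6):
    """Plain EM for a 1-D Gaussian mixture; returns the log-likelihood per iteration."""
    n = len(xs)
    ordered = sorted(xs)
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]  # start at quantiles
    variances = [var_all] * k
    weights = [1.0 / k] * k
    trace = []
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point.
        resp, loglik = [], 0.0
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        trace.append(loglik)
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, min_var)
    return trace


def sample_mixture(n, components, rng=random):
    """components: list of (weight, mean, sd). Non-positive draws are redrawn."""
    weights = [c[0] for c in components]
    out = []
    while len(out) < n:
        _, mu, sd = rng.choices(components, weights=weights)[0]
        x = rng.gauss(mu, sd)
        if x > 0:
            out.append(x)
    return out


def compare_bic(xs, k, n_iter=100):
    """BIC of a single log-normal versus a k-component Gaussian mixture (3k - 1 parameters)."""
    n = len(xs)
    return {
        "lognormal": bic(lognormal_loglik(xs), 2, n),
        "gmm": bic(fit_gmm_em(xs, k, n_iter)[-1], 3 * k - 1, n),
    }


def fraction_lognormal_preferred(components, n, reps, n_iter=100, rng=random):
    """Fraction of simulated mixture datasets for which BIC favours the log-normal."""
    k = len(components)
    wins = 0
    for _ in range(reps):
        scores = compare_bic(sample_mixture(n, components, rng), k, n_iter)
        wins += scores["lognormal"] < scores["gmm"]
    return wins / reps
```

For instance, `fraction_lognormal_preferred([(0.5, 1.0, 0.2), (0.5, 1.6, 0.3)], n=300, reps=20)` returns the fraction of simulated datasets in which the log-normal wins even though the data were generated from a mixture. The value depends on the chosen components, the sample size and the random seed, so it should not be expected to reproduce the 81% quoted above.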

Their purported extrapolation of our analyses to eukaryotic clades, and the dates they derive from it, are totally flawed and misleading. First of all, we explicitly say that we do not assume constant rates (i.e. a molecular clock), and our normalized branch length is a measurement that is proportional to time but multiplied by the ratio between the rates preceding and postdating LECA, so their timing exercise, providing date estimates, is completely ungrounded. Secondly, Martin et al. consider the normalized sl to yield arbitrary values, resulting in a log-normal distribution. This openly contradicts the observation that families of different prokaryotic origins show significant differences in sl and also in rsl values. All our analyses robustly show the opposite: there are differences, and these differences reflect the relative divergence times. The cases of the cyanobacterial signal in Archaeplastida (Extended Data Fig. 3, Pittis and Gabaldón 2016) and of the Lokiarchaeota signal in LECA (Extended Data Fig. 7, Pittis and Gabaldón 2016) nicely indicate the validity of the measurement. Expecting some extreme ebl values to reflect radical adaptations and fast rates in some lineages, we used the median because of its robustness with respect to extreme outliers (see Methods). We also tried excluding fast-evolving taxonomic groups from the calculations, without any change in our main results. None of these observations is explained by the interpretation of the data provided by Martin et al. Furthermore, Martin et al. show that the normalized branch lengths subtending each eukaryotic clade follow log-normal distributions, and conclude that this is natural variation for branches meant to represent a single time interval (e.g. the divergence of fungi from metazoans). By adopting this assumption they are surprisingly ignoring that eukaryotic families are also subject to differential gene loss and other processes, which would result in multiple underlying patterns in the subtending branches (e.g. the subtending branch of a fungal family that was lost in metazoans does not derive from the divergence between fungi and metazoans, but from the deeper divergence between fungi and other unikonts). This becomes apparent when controlling for the relationship of the normalized branch lengths with the phylogenetic affiliation of the sister branch, a key step in our analyses which they ignore. Indeed, applying an EM-based clustering to the eukaryotic clades and measuring enrichments in phylogenetic affiliations, as we did in our previous analysis (Pittis and Gabaldón, 2016), reveals major underlying distributions related to the nature of the sister group (Figure 1). Thus, in this case also, the variation of sl values, interpreted by the authors as “vividly documenting abundant branch length variation”, is clearly shown to carry the signal of different divergence times. So yes, the sl values in eukaryotic groups do imply phases of early and late divergence due to gene loss or other biological events, as they do in the case of LECA. Of note, this is a new, independent demonstration that variation in stem lengths relates to underlying variation in phylogenetic distribution, and it provides additional support for our approach.
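
The role of the median in the normalization discussed above can be illustrated with a toy calculation. This is a minimal sketch, assuming that sl is the raw stem length (rsl) divided by the median of the eukaryotic branch lengths (ebl) of the family; the function name and all numbers are hypothetical.

```python
import statistics


def normalized_stem_length(rsl, ebls):
    """Raw stem length divided by the median eukaryotic branch length (assumed definition)."""
    return rsl / statistics.median(ebls)


# Hypothetical family: three typical lineages and one very fast-evolving one.
ebls = [0.8, 1.0, 1.2, 9.0]
sl_median = normalized_stem_length(0.55, ebls)  # scaled by the median (1.1)
sl_mean = 0.55 / statistics.mean(ebls)          # scaled by the mean, inflated by the outlier
sl_without_outlier = normalized_stem_length(0.55, ebls[:3])
```

With the median, the single fast-evolving lineage shifts the normalized value only slightly (compare `sl_median` with `sl_without_outlier`), whereas scaling by the mean would shrink it several-fold.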

Figure 1 | Ascomycota stem length analysis. Different phylogenetic sister groups show significant differences in stem lengths according to their divergence times from Ascomycota. Gene losses in the sister group lineage can explain the alternative tree topologies and differences in estimated stem lengths.

Finally, Martin et al. focus their criticism on only one of our comparisons and on only one of the datasets used. For that dataset, they wrongly claim that we reused eukaryotic sequences in different trees. This is false. Given the multidomain nature of eukaryotic protein sequences, the source of that dataset (Powell et al. 2014) may assign a given protein to more than one orthologous cluster. However, we made sure we only used the orthologous sequence regions in a given analysis, thus never re-using a given eukaryotic sequence. Our analyses use standard filtering approaches, but they claim that statistical significance for one of our comparisons (alpha-proteobacterial versus other bacteria) is lost when applying additional ad hoc filtering on top of our previous filtering steps. We must note that even when applying their filters and using a permutation test like the one used in our paper, the alpha-proteobacterial sl values remain significantly lower than those of other bacteria (P=1e-2 when considering only families with eukaryotic sequence lengths >= 100, and P=3.7e-2 when considering only alignments with gaps <= 50%; 10⁶ permutations). The loss of significance in some of the tests when artificially reducing the data is unsurprising. We are focusing on very ancient events, so the signal we are measuring must necessarily be weak, and the number of LECA families that can be traced back to specific ancestries is limited. Indeed, the statistical significance using a Mann-Whitney U-test is often lost (more than 60%-70% of the time) when randomly reducing the data to sizes similar to those of their filtered dataset, which suggests that the mere reduction in size, rather than the particular additional filtering used, is having a major effect. This is why we made sure the signal was robust across different datasets, always using state-of-the-art filtering approaches. Given the suggestion by Martin et al. that a recent phylogenetic analysis of theirs (which appeared after we had submitted the paper) represents a more careful dataset (Ku et al., 2015), we repeated our analyses using this dataset, which confirmed our results (650 eukaryotic clades; archaeal vs bacterial families, P=1.2e-41, two-tailed Mann-Whitney U-test; α-proteobacterial families’ sl significantly smaller within bacterial families, P=4.7e-2, permutation test, 10⁶ permutations). Again, this result lends further support to our findings.
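
The two ideas used above, a permutation test and the effect of shrinking the dataset, can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the reported P values: the test statistic (difference in medians, one-sided) is an assumption, the function names are hypothetical, and the subsampling check reuses the permutation test in place of the Mann-Whitney U-test mentioned in the text, simply to keep the example free of external libraries.

```python
import random
import statistics


def perm_test_lower(a, b, n_perm=10000, rng=random):
    """One-sided permutation P value for median(a) being lower than median(b)."""
    observed = statistics.median(a) - statistics.median(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the two groups
        diff = statistics.median(pooled[:len(a)]) - statistics.median(pooled[len(a):])
        if diff <= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def fraction_significance_lost(a, b, size_a, size_b, reps=100, alpha=0.05,
                               n_perm=1000, rng=random):
    """Fraction of random subsamples of the given sizes that are no longer significant."""
    lost = 0
    for _ in range(reps):
        sub_a = rng.sample(a, size_a)
        sub_b = rng.sample(b, size_b)
        if perm_test_lower(sub_a, sub_b, n_perm, rng) >= alpha:
            lost += 1
    return lost / reps
```

Here `a` and `b` would be lists of sl values for the two groups being compared (for instance alpha-proteobacterial versus other bacterial families), and `size_a` and `size_b` the group sizes left after filtering. A high returned fraction indicates that the reduction in sample size alone is enough to lose significance.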

Altogether, we show that the criticisms raised by Martin et al. do not compromise the main results and conclusions of our paper. Furthermore, we would like to stress that the new dataset and analyses brought about by this discussion lend additional support to our approach and conclusions.

Ku, C. et al. Endosymbiotic origin and differential loss of eukaryotic genes. Nature 524, 427–432 (2015).
Rasmussen, M. D. & Kellis, M. Accurate gene-tree reconstruction by learning gene- and species-specific substitution rates across multiple complete genomes. Genome Res. 17, 1932–42 (2007).
Pittis, A. A. & Gabaldón, T. Late acquisition of mitochondria by a host with chimaeric prokaryotic ancestry. Nature 531, 101–4 (2016).
Powell, S. et al. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res. 42, D231–9 (2014).

A genetic cartography of humans

2012-11-01T03:40:00.003-07:00

The Phase I paper of the 1000 Genomes Project has been published in Nature. Like the completion of the first draft of the human genome sequence, this work constitutes a milestone on the path to understanding the complex relationships between genotype and phenotype in our species. When we had the first human sequence we had, for the first time, a broad view of the genetic constituents of our species, and there is no doubt that this has served to advance our understanding in many fields related to human biology and disease. What, then, is the significance of having 999 more genomes? I have been asked this question by several journalists in recent days.

If one would like to describe our species purely in genetic terms, a single genome could be a good approximation, but only that, an approximation. We know that we all differ from each other genetically, and that some of these differences explain part of the observable differences (the phenotype). What is the extent and nature of the genetic differences that currently exist, or that existed earlier, in the human population? Which of these differences are important in terms of phenotypic variability, including the propensity to suffer from certain diseases? What fraction of these differences have no important effect and can vary freely? None of these questions can be answered from the analysis of a single genome; only the comparison of a large set of genomes can give us a better idea of what the genome of our species is.

The analogy of a map has been used several times to illustrate how the genome sequence has helped us navigate the genome and has enabled dramatic improvements in how we address questions related to human biology. I think the analogy is very good, since a map in itself has only limited scientific value: it is, basically, a description. However, just as ancient maps dramatically affected the course of history, having these maps enables unanticipated scientific discoveries. These first 1000 (1092 to be exact) genomes constitute a first cartography of human genetic variability, providing detailed information on which mutations occur in different populations. This map is not complete, of course, but it offers a good level of resolution. The authors estimate that we now have a catalogue of more than 98% of the mutations that occur at a frequency of at least 1%. Continuing with the analogy, what we still miss are the fine details of the coastline, as if we were looking at it from very far away. This missing variability may be important, since variants involved in deleterious phenotypes (disease) are expected to occur at very low frequencies. Thus the effort to improve this cartography will continue, and 1500 additional genomes are planned within the consortium. In parallel, many other projects, and even some private individuals, are producing more individual genome sequences. It will be important to ensure that all this information ends up in public repositories, so that it can be efficiently exploited by the scientific community.

The 1000 Genomes paper is very descriptive but already shows some important results that have an impact on how we think about the relationships between genotypes and phenotypes. The authors report that an individual carries on average 200-300 variants that affect conserved residues in non-coding sequences, and even 2-4 that have been associated with disease in other studies. All individuals sequenced are healthy, and thus this result tells us about the capacity of the genome to tolerate mutations that may be deleterious in other genetic backgrounds. There is much to learn from this, and the 1000 genomes will be a useful resource for studies trying to associate genetic backgrounds with disease propensity. In addition, the genome sequences carry the footprints of the recent evolution of human populations, and the level of observable variability at a site can be informative about its potential functionality. Thus the possible applications of these data are many and, as I put it to a journalist, the main scientific discoveries enabled by this article are yet to come.

Finally, there is one important aspect to which journalists do not pay much attention. Putting together this project has been a gigantic effort and has required the development of new tools and algorithms to work with this massive amount of data. Only the coordinated efforts of many groups have made this possible. This comes at a time in which such tools are desperately needed, given the growing impact of individual genome sequencing in medicine and other fields. Just as an ambitious mission to bring a rover to Mars impacts scientific development beyond the particular purpose of the mission, the tools developed by the 1000 Genomes Project are already playing a role in hundreds of other genomics projects. Thus the merit of this big consortium project lies not only in its immediate scientific discoveries, at times disappointing because they are inevitably only descriptive, but also in its catalytic effect on a scientific field.

Can genomics save endangered species?

2012-09-22T04:56:00.001-07:00

Nowadays genomics is pervading many research fields in biology, and conservation biology is no longer an exception. The giant panda was perhaps the first organism selected for sequencing primarily because of its status as an endangered species. Since then, other species have been selected for sequencing in an effort to contribute to their conservation. To name a few, the California condor, the tiger, the Tasmanian devil and the Iberian lynx are also entering the genomic era. Our group is contributing to the efforts of sequencing and analyzing the genome of the Iberian lynx, an emblematic predator of our peninsula which has the dubious honor of being the most endangered feline species on the planet. With a population below 400, a fragmented and restricted distribution area and a dangerously low level of genetic diversity, its situation is rather critical. Two years ago a consortium of Spanish research groups joined forces to sequence this species' genome.

"Candiles" the sequenced Iberian Lynx male

I have been asked many times whether this effort will definitely save the species, or even whether the money would be better invested in other efforts. How can a genome help in saving an endangered species? Are we feeding unreasonable expectations about the possible role of genomics in species conservation? Although only time will tell whether such efforts will pay off, I consider that genomics can certainly provide a new, very useful angle on species conservation. In any case, genomics should be considered just another tool for species conservation, rather than the definitive solution. Species are endangered for various reasons, mostly territory loss and degradation, overexploitation, and alteration of their ecological networks. It is obvious that the main focus should be on fighting the causes that triggered the population declines and on creating the necessary conditions for the populations to recover safely. As a powerful tool to understand a species' biology, and as a way to investigate past and current population dynamics, a genome can greatly help in understanding some of the factors that may have been decisive in population decline. Having a reference genome opens the door to closer genetic monitoring of wild populations, not only because it enables the selection of new marker genes that can be sampled in many individuals but also because it paves the way for obtaining whole-genome population data through re-sequencing strategies. Indeed, our project already includes re-sequencing of additional individuals from the main fragmented territories occupied by the species.

Having this kind of data is key to understanding gene flow among the different populations, since it will provide a better picture of their gene pools. This will help to better plan crosses among captive individuals (mainly those with permanent injuries that cannot be successfully released into the wild) and future releases of their progeny. This will have a direct impact in the case of the Iberian lynx, where high levels of inbreeding and low genetic diversity expose fragmented populations to a higher rate of diseases with a genetic basis (particularly a renal disease) and reduce their potential to overcome infectious diseases. A better knowledge of the gene pool of both wild and captive populations will undoubtedly help in guiding strategies to help them recover. In addition, individuals and their territories could be tracked from materials such as faeces or hairs. Other applications may be more specific to a particular endangered species. For instance, in the Tasmanian devil, genomics has been used to track a transmissible cancer, a facial tumor disease that is transmitted by biting.

Tasmanian devil with transmissible facial tumor

Other applications of conservation genomics go beyond the sequencing of the endangered species itself, and involve monitoring, with similar genomics tools, important pathogens or symbionts of endangered species. Of course, all these efforts will be of little help if the causes that drove the decline are still around. Thus there is a growing number of promising applications of genomics to the conservation of endangered species, some of them already at work. I expect this field to grow fast in the coming years. As a concerned scientist, I am proud that my particular corner of expertise can contribute to the noble cause of helping to preserve the biodiversity of our planet.

Wrap-up of the orthology, paralogy, and function symposium at SMBE 2012

2012-06-28T01:31:00.000-07:00

I promised some people that I would write a short summary of the symposium that Matthew Hahn, Marc Robinson-Rechavi, Iddo Friedberg, and I co-organized at SMBE 2012. I particularly enjoyed the symposium, and the room was pretty full all the time despite running in parallel with other interesting sessions. I will just write an overall summary without going into too much detail on each of the talks, and at the end I will list a number of papers that were mentioned in the various talks. I have to clarify that this informal wrap-up only contains my own views and has not been agreed upon among the organizers. I invite any of the attendees to add comments highlighting important aspects that I may have missed.

I’ll start by providing a summary of how all this started... which was in a rather unusual way, I believe. Indeed, the idea of the symposium was born in the blogosphere, in Jonathan Eisen’s popular Tree of Life blog, where he invited Matthew Hahn to write a special guest post on the “history behind” his paper on testing the orthology conjecture. One of the conclusions of that paper was that paralogous sequences were more similar in function (and in expression patterns) than orthologs, which contradicted one of the major expectations (and assumptions) behind the theories of duplication-driven functional divergence and the strategies for inferring function from orthologous sequences. That paper had already caused a bit of turmoil in the orthology community (I remember this was a hot discussion during the last Quest for Orthologs meeting, at Cambridge), and several concerns were being raised about the suitability of comparing functional annotations from different species, and about the conclusions derived in the paper. Rather rapidly, several people commented on Matt’s post and a lively discussion started (more than 40 comments in total!). The discussion was so interesting that Marc Robinson-Rechavi suggested we should bring this scientific debate, in the form of a symposium, to one of the upcoming conferences, and that is how some of us started to work on this idea. For me it was the first time that I met the other organizers in person.

The symposium started with Eugene Koonin, who nicely introduced the topic of what conjectures could be implied by the definition of orthology, a purely evolutionary one as introduced by Walter Fitch in 1970. He then showed results from his lab indicating that the conjectures tend to hold, but that there may be exceptions. For instance, the conjecture that orthologs should be best reciprocal hits can be broken by accelerated evolution in one of the true orthologs. He then showed work from other groups (Sali, Sonnhammer) on the higher conservation of structure and domain architecture in orthologs as compared to paralogs. He criticized the use of GO terms by Hahn and others and argued that one should look at a variety of data on function to test the conjecture. He presented results from his own group which show higher conservation of expression across species. He concluded that the functional conjecture still holds, although he observed that the differences may not be spectacular. Katerina Guschanski was next, talking about changes in gene expression following segmental duplications in mammals. Her group has produced an impressive dataset of expression from different tissues in various mammal species. She used that set to ask whether duplication contributes more to divergence than time alone, and showed that levels of expression were decreasing in younger duplicates, with changes differing across tissues. She observed no differences between one-to-one orthologs and old duplicate pairs, and she also found no differences in tissue specificity between orthologs and paralogs. Next on stage was Nicholas Furnham, who presented new implementations in FunTree that allow exploring functional evolution on trees. He warned that EC classification is not univocal and that it can also pose problems for functional comparisons. They have developed “EC-Blast”, which directly measures distances between enzymatic reactions based on the molecular structures of substrates and products. Christophe Dessimoz presented results from his recent paper, in which they show important biases in GO term annotations: genes from the same species and families tend to be annotated with more similar terms because of experimental and author biases. When correcting for these biases, the conjecture still holds. However, he admitted that the differences were not very big, though still significant. Romain Studer came next. He measured selection and changes in structural stability in orthologs and duplicated genes. He showed that selected sites in paralogs tend to be more clustered in the structure than in orthologs; however, he observed no differences in the evolution of stability between orthologs and paralogs. He concluded that differences between paralogs may be smaller than previously thought.

After the coffee break Jianzhi Zhang told us about his work towards probing the orthology conjecture. After giving it a try, he gave up on using GO terms because of the many inconsistencies and biases observed. He thus turned to examining the conservation of protein-protein interactions (PPIs), using experimentally determined interactions in various yeast species. Unfortunately, the large number of interactions to test experimentally in duplicated proteins prevented him from showing a comparison of orthologs and paralogs in this talk. Nevertheless, he found that all PPIs tested for orthologs were conserved; even those that seemed not to be turned out to reflect possible errors in previous large-scale yeast two-hybrid experiments. Alex Nguyen also showed results on gene duplications in budding yeast. They focused on a more specific aspect of function: the presence of short conserved linear motifs in proteins. They found that these were more likely to disappear or diverge after the duplication event, consistent with neo- or sub-functionalization models. We moved to Drosophila with our next speaker, Lev Yampolsky, who exploited expression and genomic data from the 12 Drosophila genomes. They showed larger differences in rates of divergence in paralogs than in orthologs, and these differences were also more asymmetrical. They also found that these differences varied between fast- and slow-evolving families. Finally, they also found larger differences in paralogs in terms of expression. Then it was my turn, and I mainly showed our results on the comparison of expression patterns in human and mouse. Our experimental design differs from others in that, first, we use topological dating (not sequence divergence) to establish orthologs and paralogs of a similar age and, second, we always compared orthologs to inter-species paralogs to get rid of species-specific biases in the comparisons. Our results support a larger divergence of paralogs as compared to orthologs in tissue expression patterns. Thanks to our experimental design we could also establish that most of the differences between paralogs were gained shortly after the duplication, linking the duplication event to a large fraction of the divergence. Our last speaker was Paul Thomas, who gave an overview of what you can and cannot expect from GO annotations. He also showed progress on how the consortium is trying to model functional evolution through gene families, and how these models can help in the study of the relationship between orthology, paralogy and gene function.

Thus we had a diverse set of talks, most of them focusing on the comparison of different aspects of functional evolution (GO annotations, expression, functional motifs, interactions, divergence, structure) and using varying experimental designs and species. I would say one of the main conclusions is that GO (and even EC number) annotations can be misleading in our assessment of functional evolution. My personal view is that most talks showed results consistent with the conjecture, although the differences between paralogs and orthologs were sometimes small. Function can be described at multiple levels, and I would expect that functional divergence after duplication may affect only one or a few of these. Thus if an experimental design focuses on one such level, it may be expected to miss divergence at the others. In addition, designs that average over all levels will inevitably dilute small but important aspects of functional divergence. In conclusion, this is an exciting topic, and with the number and variety of groups that are now interested in it, I am sure that we will get closer and closer to understanding the complex relationships between orthology, paralogy and functional divergence.

Some links and papers mentioned during the symposium (I have probably missed some):

Abstracts from oral presentations in SMBE, including our symposium http://imgpublic.mci-group.com/ie/PCO/OralAbstracts_Final.pdf

The blog post that initiated this: http://phylogenomics.blogspot.com.es/2011/09/special-guest-post-discussion.html

Another post on the orthology conjecture

Announcement of our symposium

Altenhoff et al. Resolving the Ortholog Conjecture: Orthologs Tend to Be Weakly, but Significantly, More Similar in Function than Paralogs

FunTree: a resource for exploring the functional evolution of
structurally defined enzyme superfamilies.
Furnham N, Sillitoe I, Holliday GL, Cuff AL, Rahman SA, Laskowski RA,
Orengo CA, Thornton JM.
Nucleic Acids Res. 2012 Jan;40(Database issue):D776-82
http://nar.oxfordjournals.org/content/40/D1/D776.long

Brawand, D. et al. The evolution of gene expression levels in mammalian organs. URL

Forslund et al. Domain architecture conservation in orthologs

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3215765/

Huerta-Cepas et al. Evidence for short-time divergence and long-time conservation of tissue-specific expression after gene duplication.

Huerta-Cepas and Gabaldón Assigning duplication events to relative temporal scales in genome-wide studies.

Nehrt et al. Testing the Ortholog Conjecture with Comparative Functional Genomic Data from Mammals http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002073

Nguyen et al. Proteome-Wide Discovery of Evolutionary Conserved Sequences in Disordered Regions http://stke.sciencemag.org/cgi/content/abstract/sigtrans;5/215/rs1

Peterson et al. Evolutionary constraints on structural similarity in orthologs and paralogs

http://onlinelibrary.wiley.com/doi/10.1002/pro.143/full

Thomas et al. On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report

Large-scale analysis of orthologs and paralogs under covarion-like and
constant-but-different models of amino acid evolution.
Studer RA, Robinson-Rechavi M.
Mol Biol Evol. 2010 Nov;27(11):2618-27.
http://mbe.oxfordjournals.org/content/27/11/2618.short

How confident can we be that orthologs are similar, but paralogs differ?
Studer RA, Robinson-Rechavi M.
Trends Genet. 2009 May;25(5):210-6.
http://www.sciencedirect.com/science/article/pii/S0168952509000559

Pervasive positive selection on duplicated and nonduplicated vertebrate
protein coding genes.
Studer RA, Penel S, Duret L, Robinson-Rechavi M.
Genome Res. 2008 Sep;18(9):1393-402.
http://genome.cshlp.org/content/18/9/1393.short

Publicly available or not?

2012-06-01T04:11:00.003-07:00

I have always had the naive understanding that databases such as GenBank were public, and that one was free to do research on data accessed from there, and eventually publish the results. However, nothing seems to be as simple as that, since many of the genomes deposited there have not been published yet. I have experienced, and heard from many colleagues about, problematic situations regarding the use of genome data taken from public databases but yet to be published. Current guidelines are open to different interpretations, and different stakeholders (editors, reviewers, users, data producers) may have entirely different and conflicting views. With the current trend we will soon have more unpublished than published genomes in public databases, so I think it is worth re-assessing the policies. Here I share some views.

Policy guidelines regarding the use of genomic sequences prior to publication are available (see NHGRI rapid data release policy http://www.genome.gov/10506376), and set reasonable rules. For instance, data producers should deposit the data publicly and should produce a citable paper describing the source of the data within a short period of time. This could precede a full genome paper in which a more thorough analysis is presented. Users should not take the public data to publish an analysis focused on that genome, but this situation should not be prolonged too much. The underlying idea is to reserve the opportunity to describe the main characteristics and findings for the researchers who make the effort of sequencing, assembling, and annotating a genome, while ensuring that the data serves the advancement of science by allowing other groups to perform research on the genome data as soon as it is produced. However, there are many interpretations of which uses of the data should be allowed. Moreover, although indicative time-frames for the preferential exploitation of the data are given (e.g. 6 months), these are only indications. In the absence of clear-cut rules, the situation invites conflict. With the current flow of sequencing data, we will increasingly face the situation that data produced for public use and accessible through public databases is not associated with a paper, and thus it is unclear whether its use should require permission. In such situations there may be different interpretations of the existing rules: that of the leader of the sequencing project, that of the researcher who is accessing the data, that of the agency that financed the sequencing, and even that of the editors and reviewers of papers using available but unpublished data. Below I list some undesirable situations that highlight the contradictions of the current system.
These situations are not hypothetical but rather correspond to real cases that I have experienced or heard about from colleagues.

Users of public databases may inadvertently download unpublished data, especially when they do this at large scale. After all, they are using a public repository, and it is contradictory that public databases provide data that are not usable.
Most genome sequencing projects are financed using public money or by agencies that require that the data is made publicly available as soon as it is produced, but this leads to the situation above, making it difficult for sequencing project leaders to know what use is being made of their data.
Referees may specifically ask authors to use genomes that are in databases, or simply reject a paper because it does not use this or that “publicly available” genome in the comparative analyses. In addition referees or editors may ask for evidence of a specific permission to use unpublished data.
Authors willing to ask for the use of an unpublished genome may be required to explain the exact use of the data, which exposes their ideas to possible direct competitors.
Leaders of genome projects may feel entitled to ask for authorship in exchange for data that is available in public databases.
Leaders of genome projects may intentionally delay the publication of the genome paper to extend the period of preferential use. They may even decide to publish partial analysis before the genome paper.
Some unpublished genomes have been in public databases for several years, and still different interpretations are possible as to whether these data can be freely used.
Some genomes may never be published in the form of a genome paper, because they were sequenced with a very particular purpose.

In my opinion the current situation is too ambiguous, generates conflicts and ultimately jeopardizes the advance of science. We need clear rules, rather than guidelines, and below I propose four simple rules that would simplify the process.

Granting agencies and sequencing centers should specify a reasonable time-frame for preferential use (6-12 months) before the data is released. This should suffice to give the upper hand to the research team that is making the sequencing effort, but will also force them to focus on publishing a genome paper as soon as possible.
During this period, sequencing projects may announce the availability of the data for restricted use, through a specific repository that can be accessed only after a specific permission is granted. This will enable use of the data from time 0.
Data is released to the public repositories (at least in the form of bulk download) only after that period.
All data in public repositories should thus be free to be used for any purpose, regardless of whether a genome paper has been published.

Personally, as far as the activity in my lab is concerned, I have taken the decision that we will use any data that has been publicly deposited in GenBank for more than a year, for any purpose other than doing a “genome paper” (of course!). I think this is in perfect agreement with the NHGRI recommendations and will definitely save us time and worries.

Challenges in phylogenetic tree visualization

2012-03-25T10:20:00.000-07:00

I recently read an excellent review by Roderic Page on the challenges in phylogenetic tree representation and visualization. It provides an overview of existing software and tools (although he missed our ETE package; see the image below for an example of ETE's visualization features). The number and diversity of existing tools is overwhelming, but probably matches the diversity of interests and possible applications of phylogenetic trees. One may be interested in overlaying sequence information (see below), while others would be interested in displaying information on the geographical distribution of the species. Some may need to represent uncertainty and overlay different topologies, or use networks to represent transfers of genetic material. The possibilities are unlimited.

Most importantly, he mentions some of the challenges facing tree visualization software, such as the ability to represent huge trees and to allow interactive behavior with the user. In our group we have encountered such needs, and this is the reason behind implementing more visualization features in ETE. Fortunately, new technologies are offering new opportunities as well, and I enjoyed imagining the possibilities that 3D visualization and touchscreen technologies will provide to researchers. It is definitely a field to follow.

If you are interested in the topic, I recommend this video.

Open Letter for Research in Spain

2012-03-10T09:52:00.000-08:00

As you surely have heard, Spain is facing a serious crisis in the context of a globalized market economy (yes, there used to be a time when economic crises related to something more tangible, such as a serious drought or a plague, but now one can only blame abstract fluxes of financial speculation). The new government is preparing a new budget which is predicted to include the most dramatic cuts in our history. Researchers here, who have already been hit by previous cuts (see this letter), are now bracing for the worst.

In this context, an open letter has been put together by the Confederation of Spanish Scientific Societies, the Federation of Young Researchers and others. I recommend that you read it (some of the cited figures and data are very revealing) and, if you agree with it, sign it, as I just did.

Open letter for research in Spain.

Darwin's h-index

2012-03-04T00:26:00.002-08:00

I guess most scientists are nowadays familiar with the term "h-index", which is a metric based on citations to your published articles. More specifically, the h-index corresponds to the largest number h such that h of your articles have at least h citations each. Given that this index is used by many funding agencies and by peers who evaluate you for a position or competitive grant, we all hope to see it grow year by year.
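As a small illustration of the definition, here is a minimal Python sketch that computes an h-index from a list of per-article citation counts. The function name `h_index` and the citation counts are made up for the example; this is not how any particular citation database implements the metric.

```python
def h_index(citations):
    """Return the largest h such that h articles have at least h citations each."""
    # Rank articles from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            # The top `rank` articles all have at least `rank` citations.
            h = rank
        else:
            break
    return h


# Hypothetical citation counts, for illustration only.
example = [10, 8, 5, 4, 3]
print(h_index(example))  # prints 4
```

In this toy case four articles have at least four citations each, but there are not five articles with at least five, so the index is 4. Note that a single very highly cited article cannot raise the index on its own.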

Charles Darwin lived in completely different times: he had no need to apply for grants or positions every few years, and there was no system to track citations or give a "number" to the supposed "impact" of his research. He has nevertheless been absorbed by the current obsession with metrics and already has an h-index, computed by Google Scholar.

His magic number is 63. Will this change in any way our idea of how important Darwin's impact on science was? Or will it rather help us to put the h-index into context, and highlight the difficulty of measuring true impact?

Phylogenetic Tree Challenge in Encyclopedia Of Life

2012-02-22T00:06:00.001-08:00

The Encyclopedia of Life initiative aims to provide an open, digital resource offering comprehensive information about the diversity of life. It has recently opened a call for teams that can provide a phylogeny-aware organization of as many scientific names as possible. This text is from the call:

A prize is offered to the individual or team that can provide a very large, phylogenetically-organized set(s) of scientific names suitable for ingestion into the Encyclopedia of Life as an alternate browsing hierarchy.

[...]

Among other factors, the total number of uniquely named nodes, node/leaf ratios and tree height may be used to compare entries so contestants should consider how they wish to trade off strict consensus versus other methods of reflecting the state of phylogenetic knowledge.

Problems to solve include 1) how to assign labels to unnamed nodes, 2) how to fill in gaps so that the set of taxa included is as comprehensive as possible, even if trees are not fully resolved or all taxa have not been analyzed, 3) how to handle competing hypotheses, 4) how to update the hierarchy at least annually.

The winning submission must be available to EOL and others under an acceptable CC license if it is under copyright. The tree need not be previously published in peer-reviewed form.

and more information is available here.
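To make the comparison criteria mentioned in the call more concrete, here is a minimal Python sketch that computes them on a toy tree. Everything in it is an assumption made for illustration: the nested-dictionary tree representation, the helper names `node` and `tree_stats`, and the exact definitions used (node/leaf ratio as total nodes divided by leaves, tree height as the number of edges on the longest root-to-leaf path). The call itself does not specify how EOL would compute these values.

```python
def node(name=None, *children):
    """Build a toy tree node; name=None marks an unnamed node (hypothetical helper)."""
    return {"name": name, "children": list(children)}


def tree_stats(tree):
    """Count named nodes, nodes, leaves and height under the assumptions stated above."""
    names = set()
    counts = {"nodes": 0, "leaves": 0}

    def walk(current, depth):
        counts["nodes"] += 1
        if current["name"] is not None:
            names.add(current["name"])
        if not current["children"]:
            counts["leaves"] += 1
            return depth
        # Height is the depth of the deepest leaf below this node.
        return max(walk(child, depth + 1) for child in current["children"])

    height = walk(tree, 0)
    return {
        "uniquely_named_nodes": len(names),
        "nodes": counts["nodes"],
        "leaves": counts["leaves"],
        "node_leaf_ratio": counts["nodes"] / counts["leaves"],
        "height": height,
    }


# Toy example with one unnamed internal node (the rodent clade is left unlabeled).
toy = node(
    "Mammalia",
    node("Primates", node("Homo sapiens"), node("Pan troglodytes")),
    node(None, node("Mus musculus"), node("Rattus norvegicus")),
)
print(tree_stats(toy))
```

The unnamed internal node shows why the first problem listed in the call matters: it counts towards the node total but not towards the number of uniquely named nodes. Collapsing poorly supported nodes into polytomies (a stricter consensus) would lower both the node/leaf ratio and the tree height.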

Getting more complex and gaining.... nothing

2012-02-08T09:15:00.000-08:00

The origin of complexity is a highly debated issue in biology. For instance, many functions in the cell are carried out by intricate macro-molecular complexes formed of a multitude of subunits. When tracing the evolution of such complexes, as we did with mitochondrial Complex I, one often finds that the number of subunits has increased through time. However, the addition of subunits does not always seem to correlate with the acquisition of novel functions, which would provide a selective advantage for the increase in complexity. Can we think of a mechanism promoting a trend towards increasing complexity in the absence of a selective advantage provided by a novel function?

A recent paper by Finnigan and colleagues shows a plausible mechanism and presents evidence that this may have been responsible for the acquisition of a novel subunit in fungal vacuolar ATPases (depicted below).

These molecular machines, which pump protons across membranes, have a membrane ring (in green in the figure) formed by 6 units. In vertebrates, two different subunits (originating from the duplication of an ancestral gene) form the 6-unit ring in a 1:5 stoichiometry. In fungi, a more recent duplication brought about one more subunit type, so that the ring is formed by the products of three different genes in a 1:1:4 organization. Using ancestral sequence resurrection (I love that name!), a technique that consists of reconstructing the most likely ancestral sequences and then synthesizing them in the lab, they show that a single mutation acquired early in each paralogue was sufficient to make the two of them indispensable. Thus, such a model could explain a trend towards increasing complexity in multi-paralogue complexes (those in which some subunits derive from duplicated genes) without a requirement for an initial selective advantage.

In a way, I see this model as a special type of sub-functionalization. That is, the two new paralogues would together perform the same function that was performed by the ancestral gene. In the absence of more examples we do not know how widespread this mechanism is. However, given that it requires only a few likely events and that it actually constitutes a "ratchet" (as noted by W Ford Doolittle), that is, once you gain that complexity you don't go back, one would expect it to have occurred in several of the many multi-paralogue complexes, at least in some lineages.

Perhaps this could explain an interesting finding we made some years ago when looking at the evolution of the mitochondrial electron transport chain in fungi (mostly formed by multi-protein complexes): the amount of duplication in members of these complexes was at the same level as in other proteins. This is in contrast to the gene-dosage effect hypothesis, which states that complexes would tend to duplicate only when the stoichiometry is conserved (that is, when the whole complex duplicates, e.g. in whole genome duplications).

Finally, another remark that I always make when seeing ancestral sequence resurrection working is that the fact that ancestral reconstructions display the expected biochemical activities (e.g. by complementing extant sequences) is an indication that the models of evolution we use are not that wrong after all.

Interview with Nick Lane

2012-01-29T04:29:00.000-08:00

As I reported in an earlier post, I had the opportunity to meet Nick Lane during the Spanish Evolutionary Society meeting. We had a very interesting discussion over a couple of beers around mitochondrial endosymbiosis and the origin of eukaryotes. Some days after the meeting, Andrés Moya, the President of the society, suggested that I interview him for the Society's Bulletin eVolución. You can find this interview translated into Spanish in the current issue of eVolución 7(1); however, I think the interview might be of interest to a broader audience and thus I paste here the original, English version.

TG- After your recent visit to Spain as an invited speaker to the III SESBE congress (Madrid, November 2011), what is your opinion about the field of Evolutionary Biology in Spain?

NL- Well, I thoroughly enjoyed the few talks I attended, but my Spanish is poor and I could hardly judge many of them; and unfortunately I missed much of the conference. But I liked the great range of themes that were being discussed. And in general I am impressed with a lot of evolutionary research going on in Spain. There is a tendency to consider comparative physiology in evolution more than there is in England, for example, and I find that a very insightful approach. One thing that has struck me over the years is that Spanish researchers are not cited as frequently as they ought to be. This does not reflect the quality of the research, but rather the US-dominated English-language citation bias.

TG- Your career has been quite unconventional. Can you summarize for our readers which have been the major steps in your career path?

NL- It sure has! I had a medical research background, and my PhD was on mitochondrial function and oxygen free radicals in transplanted organs. But I was getting nowhere with that, and couldn’t see a way of getting from there into what was really an interest for me: evolutionary biology. So I took to writing instead, for several independent agencies doing medical education for pharmaceutical companies. That was an eye opener, and I learnt to write clearly and quickly, but it was also a frustration. After quite a lot of hard work I finally got a contract to write Oxygen, which was initially conceived as a book about free radicals, mitochondria and medicine, but ended up reflecting my interests in evolutionary biology to a much greater extent. That was the beginning of a decade spent writing books on evolutionary biochemistry, drawing heavily on my background in bioenergetics but ranging widely over any material that interested me. It was fantastic fun but no way to make a living. And ultimately frustrating too, in that in writing on that scope, you can’t help but come up with new ideas, essentially a broad synthesis with gaps, that you sketch in with speculations, which can be reframed as testable hypotheses. That’s what drew me back into research – the frustrated desire to test some of these hypotheses.

TG- Thus, you have been active as a science writer, a researcher, and now you seem to combine both aspects. Do these two tasks reinforce or rather interfere with each other?

NL- Both. I think I’ve benefited tremendously as a researcher from the decade I spent thinking and writing. I now have a coherent set of hypotheses that are testable in one way or another – experimentally or by some kind of mathematical modeling, or just by empirical analysis of existing data. So I’m drawing heavily on this ‘credit’ now. At the same time it is hard to think synthetically or to write books while in research, there are so many demands on time. So on a daily basis, writing and research interfere with each other, but I think if you are able to focus on one or the other for periods then they can, and should, reinforce each other. The trick is to balance each so that they reinforce each other over time. I’m not sure I’ve mastered that trick yet, but it is my long term goal: for me, it is the best way to understand the most interesting evolutionary questions, and that is what I want to do.

TG- In your view, where does the main responsibility for communicating science to society at large lie (e.g. scientists, funding agencies, scientific societies, science journalists, etc.)?

NL- Good question. There is certainly a responsibility, but being responsible counts for nothing if nobody listens to what you have to say: as a writer, you must be interesting to be noticed at all. And society is rarely interested in responsible but boring views. So there is a balance that you have to wrestle with every sentence, between interest and accuracy. That’s another reason I’m happy to be back in research: to write accurately (in precise scientific language) is at least as much pleasure for me as to write interestingly. Frankly it is the questions themselves that interest me. I think that the real challenge in writing for the public is to find ways of phrasing questions in an interesting way, which draws attention to the problem, without sacrificing the accuracy. That is the ideal: responsible (boring) and interesting at the same time.

With respect to which group has the responsibility, I don't think that one group alone can be considered responsible for communicating science to general society. Each group can address different needs, and each has its own responsibility. Scientists are responsible for sculpting new ideas, for conveying the excitement and intellectual thrust of science. The best ideas in science are still driven by individuals with passion, insight and ingenuity, and there is nobody better to convey this intensity to the general reader, although it is rare. Journalists are responsible for balanced reporting, explaining ideas clearly and intelligibly, providing context for the reader, ideally some commentary from other scientists. It is unusual for journalists to drive the scientific agenda, but serious journalists have a broader perspective and can sometimes see things that scientists can't.

Scientific societies can provide very helpful consensus statements on difficult issues, from global warming to the effectiveness of chemotherapy. It's not really for them to give a sense of the cut and thrust of science, more the strength of the conclusions that emerge from the uncertainty.

Finally, funding agencies. In my view, funding agencies have a duty to explain to the public and to politicians that research is open-ended and unpredictable. Research that appears to have little immediate societal impact can have immense and unimagined benefits in the future. Most major scientific breakthroughs, with the greatest economic benefits, came from unexpected quarters, and could not have been anticipated by either the scientists themselves or the funders. This perspective is being lost in a political drive to justify spending by societal impact. As with so much, short-term political cycles are trumping long term good sense. It is up to funding agencies to explain why research should be funded on its own merits, without constant recourse to some hoped-for and probably illusory impact.

TG- In one of your articles, written to commemorate the 150th anniversary of “The Origin of Species”, you discuss what Darwin would love to know about the origin of the eye if he were still alive. Darwin is credited with being the first to use a “tree of life” to describe the evolutionary relationships of species and their shared ancestry. What do you think he would love to know in this respect if he were still alive?

NL- Well I think he’d love what’s going on in microbial genomics. The picture that has emerged over the last couple of decades of lateral gene transfer and endosymbiosis in microbes is radically different to the idea of gene sequence divergence between populations. Having said that, I see all this as a juxtaposition to standard Neodarwinian population genetics. He would have loved that too, although it is old hat to us now; but given that Darwin knew nothing about genes, he would have been thrilled by the Neodarwinian synthesis, and what amounted to a genetic basis for a tree of life. All of this means that variation is more complex than any of us imagined; and in this sense, Darwin’s coyness on the mechanisms of variation was well placed: it really is wild and fascinating.

TG- In one of your last books, you mention 10 major transitions in the evolution of life on earth. Which one of them is, according to you, the most enigmatic or difficult to explain?

NL- Consciousness, without a doubt. Frequently the origin of life and consciousness are put forward as the twin pinnacles, the two big unanswered questions in biology. I think we’re actually quite close to understanding the origin of life in conceptual terms, but I personally can’t understand consciousness well at all. I read a lot on the subject and came to the conclusion that nobody really does. We still can’t answer the simple question: how does the depolarization of a neuron give rise to a feeling or sensation of anything at all? They are two different languages, and we don’t seem to have any kind of Rosetta stone at the moment.

TG- Some of these transitions seem to have happened only once in the history of life. If they were so advantageous, why have they been restricted to a single lineage?

NL- I think each transition has to be taken on its own terms. These are tremendously difficult questions and you will find diametrically opposed answers to each question from very insightful researchers. The answers reflect temperament more than anything else. Christian de Duve actually wrote a book called ‘Singularities’, and my reading of that is that there isn’t a single answer that would apply to the origin of life, the origin of photosynthesis, the origin of the eukaryotic cell, the origin of animals, and the origin of consciousness. Obviously for some reason, each was improbable or it would have happened more than once (like eyes), but the reasons for improbability differ and are very dependent on context. In the case of eukaryotes, I would say their unique origin was based on an improbable endosymbiosis between prokaryotes, followed by a problematic reconciliation of selfish interests between two entities that had to live in intimate union. There were no advantages at all until they had come out of that tight bottleneck; on the contrary, all the advantages were with the bacteria that just kept on doing their bacterial thing. From that point of view, the difficult question is why did it happen at all?

TG- Some of your research interests concern very ancient events (e.g. the origin of eukaryotes, of life itself). This is a field in which different hypotheses are difficult to prove right or wrong given the difficulty of direct experimentation. What criteria do scientists in your area use to reach a consensus on how well supported the different scenarios are?

NL- There is a consensus on quite a lot: cell structure, behavior (phagocytosis or sex) genome sequences (albeit with disputes over methodology), the existence of introns in certain positions and so on. Where consensus breaks down is when different methods give different answers. That happens all the time. I’m actually focusing a lot of my attention now on the origin of life itself, because this seems to me to be more experimentally tractable: we can ask specific experimental questions that involve chemistry and thermodynamics, which are much more reliable than biology and genes, so although the event was the most ancient of all, it is not necessarily the most inaccessible. I think we’re making progress on many questions, but in the case of the origin of eukaryotes a lot of the evidence is oblique and disputable. The reasoning is often equivalent to historical reconstruction in that you need to weigh the evidence: there’s no doubt that it happened, and there’s plenty of evidence, it’s just that some of it is unreliable and some is irrelevant, so there’s plenty of scope for argument still.

TG- In this respect, what is the impact on your field of the ever-growing number of genome sequencing projects? Which species or environments would you like to see sampled in order to help answer important questions about the origin and evolution of complex life?

NL- Genome sequences have made a tremendous difference, the only trouble being that they tend to reflect pathogens or industrially interesting bugs, rather than those most relevant to, say, the origin of eukaryotes. I would love to see more genomes from anoxic or anaerobic deep ocean environments, or the deep hot biosphere. I’m especially interested in two questions: the variation in eukaryotic genomes, and the variation in mitochondrial genomes. There is a brilliant and bold hypothesis that the origin of the eukaryotic cell was an endosymbiosis between two prokaryotes, an archaeon host cell and an alpha-proteobacterium (or somesuch). The prediction is that all eukaryotes should have mitochondria or organelles derived from them like hydrogenosomes or mitosomes; and that in terms of mitochondrial genomes we should find more overlap between bacterial metabolic capacity and metabolically versatile mitochondria. This is a wonderful prediction because it is so easy to falsify, and yet all the genome sequencing so far has failed to disprove it. The places most likely to disprove – or prove – it are precisely those anaerobic environments that have been undersampled so far.

TG- Carbon has always been considered a hallmark of life on earth, but life (elsewhere) based on other elements (e.g. silicon) has been speculated about. You seem to favor the idea that oxygen was the molecule that enabled the appearance of complex life on earth. Could you speculate on the theoretical possibility of other molecules playing a similar role in other forms of life?

NL- I think it is most likely that life elsewhere would be constrained by much the same issues that constrain life here. I doubt very much that there will be silicon based life forms. There are two important properties of carbon: it is much better than silicon at organic chemistry; but equally important, it is available in the form of a gaseous oxide, a Lego brick if you will. There are no gaseous silicon oxides, only sand, which is vast and unwieldy in comparison. You can’t build a house on sand and you can’t build an organism from sand. My feeling is that not only is carbon especially useful, it is also more abundant than silicon. Likewise, water is more abundant than methane and a much better solvent (you can’t dissolve carbon chains of more than about 5 carbon atoms in methane). And so on. On the basis of usefulness and abundance, I would argue that life would mostly be carbon based. I would go further to argue that it is likely to require proton gradients over membranes for thermodynamic reasons. When I say that oxygen is necessary for complex life, I mean large active animals. I doubt that anything else could do the job: nothing else could accumulate to the appropriate level in an atmosphere and at the same time be sufficiently reactive to provide the power needed. So I’d say that in terms of their broad biochemistry, alien life won’t be all that different. In terms of morphology or the specifics of their biochemistry, they could be very different, of course.

TG- Are you already working on your next book? Can you tell us something about what it is about?

NL- I’m not writing yet, but I do have a contract… and it will be about everything I have talked about here. The origin of complex life, and why it was a unique event here on Earth.

RECOMB 2012 (Barcelona): one week left for early registration

2012-01-24T00:19:00.000-08:00

As I reported in an earlier post, RECOMB 2012 will be held in Barcelona and CRG's Bioinformatics and Genomics program is part of the local organizing committee.
This post is a reminder that the deadline for early registration at a reduced rate is approaching: it expires on the 31st of January. More information here.

See you there!

SMBE 2012 early registration deadline and symposium on orthology

2012-01-13T09:09:00.000-08:00

For those who don't know, the deadline for abstract submission to the next Society for Molecular Biology and Evolution meeting (Dublin, 23-26 June) is approaching. I am co-organizing a workshop on orthology/paralogy and function in collaboration with Marc Robinson-Rechavi, Matthew Hahn, and Iddo Friedberg. Find below an invitation to submit to SMBE2012 and more info on this workshop.

Hope we can meet in Dublin.

Dear colleague,

We invite you to submit an abstract to the symposium "The complex relationship between orthology, paralogy, and function" to take place at the meeting of the Society for Molecular Biology and Evolution in Dublin (23rd-26th June, 2012).

The deadline to submit an abstract is the 27th of January 2012. For more details please visit:

http://www.smbe2012.org/scientific-content/call-for-abstracts.html

Symposium "The complex relationship between orthology, paralogy, and function"

Orthology and paralogy have been central concepts in molecular evolution since the distinction was first proposed by Fitch in 1970. A long-standing interpretation of this distinction has been that orthologs would be more similar in function than paralogs. Until recently, this interpretation was rarely tested, and in fact rarely explicitly articulated in a testable manner. Yet it has been widely used, from undergraduate teaching to the practical application of orthology searches for genome annotation. There has been a recent increase in research seeking to define and test this "ortholog conjecture". Notably, a recent paper (Nehrt et al. 2011, PLoS Comput. Biol.) has reported a higher functional similarity of paralogs than of orthologs. This paper has generated much attention and debate, while at the same time recent work on orthologs has shown the vitality and importance of this field to a broad range of applications and questions. Our symposium will feature speakers addressing the fundamental relationships between molecular evolution and biological function, focusing especially on the role of orthology and paralogy in modulating such relationships.

Confirmed speakers: Eugene V. Koonin, Jianzhi Zhang

If you have any questions regarding this symposium please do not hesitate to contact us:

Toni Gabaldon, Matthew Hahn, Iddo Friedberg, Marc Robinson-Rechavi

Diversity arises whenever, wherever, and at whatever rate is advantageous

2012-01-10T11:30:00.000-08:00

This is the conclusion from a recent paper from the group of Mark Pagel, in which they analyzed a dataset of body sizes of 3,185 extant mammals in a phylogenetic context.

They modeled the evolution of body sizes across the phylogeny using a Bayesian approach that allows evolutionary rates to vary at every branch.

This provided them with an idea of where bursts of evolution (big shifts in size) had occurred. The main idea was to test a long-held hypothesis that the early radiation of mammals was accompanied by increased rates of body-size variation (i.e. that a burst in species diversity coincided with a burst in body-size change). This was explained by the idea that mammals expanded into a largely unoccupied niche, which provided opportunities for diversification. When the niche filled up, diversification and evolutionary rates decreased.

Results from this team are in stark contrast with such a view, since they see bursts at many different places in the phylogeny, which are uncoupled from the early radiation of mammals.

Reading this paper was very useful to me, since at the time I was preparing the evaluation of a PhD thesis by Victor Soria-Carrasco (see some related paper here) on, precisely, mammalian diversification. In the thesis they found that most mammalian orders showed a decline in the rate of diversification (in terms of the formation of new species), which may seem compatible with the idea of a niche being filled up. This highlights the importance of properly specifying which evolutionary rates we refer to (sequence variation, variation in some morphological character, speciation rate...), since we may otherwise reach apparently contradictory conclusions. Complicating the issue further, one does not know whether niche limitation may select for or against diversification.

In any case it is comforting to see that the increasing amount of genetic, phylogenetic, and other types of data, as well as sophisticated models, enable us to explore such interesting issues at the interface between evolution, phylogenetics and ecology. I was really impressed by the works mentioned.

Sequencing species.... by the thousands

2011-12-10T08:12:00.000-08:00

When I was taking my first steps in the field of comparative genomics, there was not much to think about when deciding which genomic datasets to use: one would just take them all. With only a few dozen genomes, mostly bacterial, one could have everything at hand on the local disk, and only needed to update every couple of months by adding one or two more...

These times have definitely passed, and now the flow of newly sequenced genomes is... well, overwhelming (see figure below, taken from Genomes Online). This is both a blessing and a curse for those of us doing comparative genomics, since we have an unprecedented amount of data, which enables more resolution, but we are increasingly facing novel technical and analytical challenges.

Just to give a taste of the coming avalanche of genomes from different species (projects that sequence many genomes of a single species, such as the 1000 Genomes Project, are another story), I list here some of the projects I am aware of that aim at sequencing thousands of genomes from a given taxonomic group.

As expected, in this kind of project it is much easier to come up with a bold number than to actually define the list of species to be sequenced. At least this is what I can tell from my involvement in the i5K initiative, in which prioritising the species to be sequenced is not simple, since one usually wants to weigh different criteria (phylogenetic relevance; biological, economic, and clinical importance; etc.).

I'm sure I missed some and, in addition, there is a growing flow of genomes sequenced by independent groups, including my own modest group. One common weakness of these large- and small-scale initiatives is that they sometimes come with funding to cover the genome sequencing but do not account for the bioinformatics analyses needed to actually make sense of the data. With sequencing costs dropping and the potential analyses becoming more complex, the actual costs of sequencing projects will increasingly lie in the analysis beyond the assembly and annotation phases. As a result, many bioinformatics groups are stretching their resources to contribute to genomics projects without getting any specific funding.

In my opinion the planning of a sequencing project should account for all the downstream phases and their associated costs. With such an approach we may end up with a handful fewer genomes, but we will definitely learn more from them.

Watch the talks from the CRG Symposium: Computational Biology of Molecular Sequences.

2011-12-05T07:27:00.000-08:00

If you missed the opportunity to attend in person our recent symposium on "Computational Biology of Molecular Sequences" (see this past post), you can now watch the videos of the talks (read the message below).

*****************

Dear all,

All contents of the 10th CRG Annual Symposium on Computational Biology of Molecular Sequences, celebrated last 10th and 11th of November, are now available online.

Leading scientists in computational biology came together in Barcelona on the occasion of the tenth edition of the CRG Annual Symposium, which focused on computational biology of molecular sequences, organized by the Centre for Genomic Regulation (CRG). The auditorium of the Barcelona Biomedical Research Park (PRBB) hosted the event, celebrated from Thursday 10 to Friday 11 November 2011.

In the microsite you can find the inaugural video of the Symposium, videos of the talks, interviews with some of the speakers, participants and organizers of the event, and two summary videos that capture the major points of all sessions. Two articles that summarize the talks and news related to the field of computational biology of sequencing are also available.

We hope that these resources are useful for you!

Click here to visit the 10th CRG Annual Symposium web.

SESBE: Spanish Society for Evolutionary Biology

2011-12-03T08:46:00.000-08:00

Last week I went to Madrid to attend the 3rd congress of the Spanish Society for Evolutionary Biology (SESBE). This is a relatively new (7 years) society that embraces evolutionary biology as a whole, from palaeontology and systematics to evolutionary genomics and Darwinian medicine. Thus, the meetings are very varied and one can listen to talks on the most diverse topics, always with the common ground of evolutionary theory as a framework of analysis.

Due to other commitments, I could only stay two days, but it was worth it and I enjoyed most of the talks and, most of all, meeting colleagues from around Spain. I would highlight here the talk by Nick Lane on the evolution of eukaryotes and the role played by mitochondrial endosymbiosis. Nick, who is also a prolific writer of popular science books, gave a very nice talk that seduced the whole audience, including me. I had the opportunity to talk with him, and it was nice to discuss again the big theories on the evolution of eukaryotes, a big theme about which I am passionate.

This year, the SESBE elected a new board, on which I will serve as secretary. Not that I am very keen on holding such a position, but I was asked, and I think one should be prepared to contribute one's two cents to noble causes, such as that of this society: promoting the study of evolution and its transmission to society in our country.

XI Jornadas de Bioinformatica in Barcelona (23-25 January)

2011-11-20T00:01:00.000-08:00

A short note to spread the word on the joint Spanish and Portuguese Meeting on Bioinformatics. This is a yearly meeting that is gaining momentum every year, and it is a great opportunity to meet most groups doing bioinformatics in the region. Talks are in English and everybody is welcome to attend.

As in other years, the meeting has an associated regional (Spain, Portugal and North Africa) ISCB student symposium. This year the symposium is co-organized by Salvador Capella-Gutierrez, one of the members of my lab.

If you plan to submit a communication, there is time till the end of November.

See you there.

ALPHY 2012: French-Spanish meeting on Bioinformatics and Evolutionary Genomics (March 19 -21, Banyuls-sur-Mer)

2011-11-08T02:42:00.000-08:00

I am glad to announce ALPHY 2012, which for the first time is co-organized by French and Spanish researchers. I was very glad to be invited by my French colleagues to sit on the organizing committee. I think it is a great opportunity to bring together two communities with ample experience in phylogenetics-related research.

ALPHY is an annual meeting, organized in France since 1995, dedicated to the field of Bioinformatics and Comparative Genomics (ALPHY = ALignments and PHYlogeny). The main goal of this meeting is to promote informal exchanges in this highly multidisciplinary field, and to encourage young scientists to present their work. The official invitation follows, plus a very tempting picture of the location.

This year, ALPHY is co-organized by Spanish and French scientists, in the nice city of Banyuls. There will be two invited speakers (Henrik Kaessmann and Jose Castresana), and the program will be open to contributions for 20’ talks.
The registration to the meeting is free, but mandatory. Please use the link (top left of this page) to register. If you wish to present your work, submit your abstract in the registration form.
Important dates:

Deadline for abstract submission: January 10 2012
Deadline for registration : February 1st 2012

Hasta pronto – A bientôt – fins aviat - see you in Banyuls!

Sad news from CIPF: the rise and fall of the "flagship" of valencian research

2011-11-03T02:19:00.000-07:00

For those who don't know, I am originally from Valencia. There, one of the deepest traditions and the main festivity are the so-called "Falles", which in part consist of building huge temporary cardboard sculptures that are exhibited for little more than a week and then burned in a big fire. For some people it is hard to understand how so much time and money is invested in something that is then left to the flames.

Apparently, something similar is happening with a research centre!!!

The "Centro de Investigación Príncipe Felipe" was created in 2005 by the local Valencian government with the idea of making it the "flagship" of research in the region. It came with a strong investment from the regional and central governments and soon attracted many scientists. I was one of the seduced scientists: originally from Valencia, and at that time in the Netherlands, I was enthusiastic about a move aiming to put biomedical research in Valencia at the forefront.

Five years after its creation, the cuts started. The crisis had hit the Spanish economy and many local governments had big debts, particularly that of Valencia, which has been famous for investing in huge events such as the America's Cup or the Formula 1 competition. When things got complicated, research was seen as one of the most superfluous things in which a government could invest, and thus cuts were announced. This year the centre is firing 40% of its personnel, including PhD candidates in the middle of their PhDs. I guess many of the remaining researchers will leave this downsized centre for a better life elsewhere. The flagship is now sinking, "burned" after so much investment and effort; the comparison to our "Falles" is unavoidable.

The whole story is reported by Nature and by many articles in the Spanish press. As Juli Peretó reports, the local government is letting CIPF fall while it keeps investing in other types of events, such as an international golf tournament in Castelló, or increasing the funds for a motorbike circuit. This is most ironic, and deeply sad.

I just wish the best for my many ex-colleagues who are still at CIPF, and hope this is not the kind of science policy that the future government of Spain (which, according to polls, is likely to be formed by the same conservative party that now governs in Valencia) is planning.

Educational video on the Tree of Life

2011-11-01T03:46:00.000-07:00

In the blog of Jun-Hoe Lee, a former visiting student in my lab, I found this interesting video from Yale university on the Tree of Life and the efforts to reconstruct it.

I think it is a good piece of popular science communication and conveys the problem reasonably well. Of course, there are simplifications, and some important aspects, such as horizontal gene transfer, symbioses, and their effects on the tree, are not covered, but it provides an attractive and educational introduction to the problem of assembling the tree of life.

RECOMB 2012 (Barcelona)

2011-10-24T06:36:00.000-07:00

The next RECOMB meeting will be held in Barcelona. Our department is part of the local organizing committee and the list of confirmed speakers looks very promising.

Submission opened in September, and you still have time to submit papers until the end of the week. Do not miss the deadline.

Special BiB issue on "Orthology and Applications"

2011-09-24T10:32:00.000-07:00

A special issue on "Orthology and Applications" is out in the journal Briefings in Bioinformatics.

This special issue has been edited by Christophe Dessimoz and comprises a number of interesting papers including several comprehensive reviews and also original research articles. Some of the papers emerge from efforts on orthology benchmarking and standardization of datasets that were initiated during the first "Quest for Orthologs meeting" in 2009. See this letter reporting from that meeting. We contributed with an article reporting on the comparison of expression patterns between across-species orthologs and paralogs of a similar evolutionary age.

On the "orthology conjecture"

2011-09-21T09:04:00.000-07:00

Hi,

Jonathan Eisen has opened a thread in his blog to discuss the recent paper by Hahn and colleagues on the "ortholog conjecture". You can read more about the discussions raised by this paper here.

This is what I wrote, a text which I had to split in three pieces in Eisen's blog given the word limit for comments!!

Hi

I appreciate the effort by Matthew Hahn in explaining the story behind his paper on the so-called "ortholog conjecture" and in facing some of the criticism. This paper attracted my interest, as it did that of many others who work on or just use orthology. For instance, it was chosen by one of my postdocs for our "Journal Club" meeting, and it was discussed during our last "Quest for Orthologs" meeting in Cambridge. I think it is raising a necessary discussion, and therefore I think it is a good paper. This does not mean that I fully agree with the interpretation and conclusions ;-). I hope to modestly contribute to this debate with the following post.

I think one of the reasons this paper has caused so much debate is that the conclusions seem to challenge common practice (inferring function from orthologs), and could be interpreted as implying a need to change the strategies of genome annotation. I think, however, that one should interpret these results carefully before starting to annotate based on paralogous proteins. As I will discuss below, one of the problems is that we need to agree on what the conjecture is in order to agree on how to test it. I see three main points that can be a source of confusion: i) the issue of what is actually stated by this conjecture, ii) the issue of annotation, and iii) the issue of time.

i) What is the "ortholog conjecture"?
Or in other terms, when should we expect orthologs to be more likely to share function than paralogs? Always? Of course not. All of us would agree that two recently duplicated paralogs are likely to be more similar in function than two distant orthologs, so it is obvious that the conjecture is not simply "orthologs are more similar in function than paralogs". In reality the expectation that orthologs are more likely to be similar in function than paralogs, at least as I interpret it, is directly related to the effect that duplication has on functional divergence. If gene duplication has some effect on functional divergence (even if not in 100% of the cases), then, all other things being equal (divergence time, history of speciation/duplication events, except for the duplication defining the paralogs), one would expect orthologs to be more likely to conserve function.

I think this complexity is not well considered (by many authors, in general). Hahn refers to the famous review of orthology by Koonin (2005) as the source for the term "ortholog conjecture". However, in that paper the conjecture is always discussed within the context of genes across two particular species, whereas in Hahn's paper it is extended to other contexts as well. Thus, the proper context in which to test this conjecture is only between orthologs and between-species paralogs. As we can see, the red and purple lines in figure 2 of Hahn's paper do not show any clear difference.

Secondly, Koonin was very cautious in his paper, stating that he was referring to "equivalent functions" and not exactly the same "function", correctly implying that the functional contexts would be different in the two different species. This brings me to the next point.

ii) annotation
If the expectation of functional conservation of orthologs refers to a given pair of species, then it makes no sense to test that expectation between paralogs within the same species and orthologs in different species. We were interested in this issue and it took us some effort to control for this "species" influence on the comparison. If you are interested, you can read our paper on divergence of expression profiles between orthologs and paralogs (http://www.ncbi.nlm.nih.gov/pubmed/21515902).

As Hahn finds, and as was anticipated by Koonin in that review, there is a huge influence of the "species context", a big constraint on what fraction of the function is shared. Indeed I think it is the dominant signal in Hahn's paper. Why is that? One possibility is that the functional context determines the function, I agree. However, we should not discard biases in how the different communities working around a model species define processes and function, and also in the types of experiments that are usually done. For instance, experimental inference from KO mutants might be common in mouse, but I guess this is not the case in humans (!!). I think this may be having a big influence and might even be the dominant signal in Hahn's paper.

Finally, function has many levels and I expect subfunctionalization to mostly affect the lower (i.e. more specific) levels. Biases may also exist in the level of annotation between species or between families of different size (contributing more or less to the orthologs/paralogs class).

Microarray data are less likely to be subject to biases (although some may exist); at least they should be expected to be free of "human interpretation biases", and so Hahn and colleagues did well, in my opinion, in testing that dataset. It is important to note that for microarrays, and for orthologs versus between-species paralogs (which I think is the right frame for testing the conjecture), orthologs are more likely to share an expression context. This is compatible with what we found in the paper mentioned above, and compatible with the ortholog conjecture as stated by Koonin (across species).

iii) time
Finally, one aspect which I think is fundamental is the notion of "divergence time". Since paralogs can emerge at different time-scales, they comprise a heterogeneous set of protein pairs. Most comparisons of orthologs and paralogs (Hahn's as well) use sequence divergence as a proxy for time. However, this is only a poor estimate, especially when duplications (as here) are involved (we explored this issue in the past: http://www.ncbi.nlm.nih.gov/pubmed/21075746). This means that, for a given divergence time, paralogs may have larger sequence divergence than orthologs, or the other way around (if gene conversion is playing a role). Is the conjecture based on sequence divergence or on divergence time? I think the initial sense of using orthology to annotate across species is based on the notion of comparing things at the same evolutionary distance. Thus basing our conclusions on divergence times might not be the proper way of doing it.

CONCLUSIONS AND PROPOSAL FOR RE-STATEMENT

To conclude, and with the intention of going beyond this particular paper, I would finish by saying that the key to the problem lies in how we interpret the so-called "ortholog conjecture", or what our expectations are about how function evolves. What I get from re-reading Eugene Koonin's paper, and how I am using that "assumption" in my day-to-day work, is the following:

"Orthologs in two given species are more likely to share equivalent functions than paralogs between these two species"

Therefore the notion of "across the same pair of species" is important, and thus only part of the comparisons made by Hahn and colleagues could directly test this. Looking at the microarray and between-species comparison data, the conjecture may even hold true!!

I do think, however, that the conjecture as stated above is limited and does not capture the complexity of orthology relationships. Indeed we, and many other researchers, tune the confidence of the orthology-based annotation according to whether the orthologs are one-to-one, one-to-many or many-to-many, or even whether the orthologs are "super-orthologs" (with no duplication event in the lineages separating the two orthologs).

Since the underlying assumption of the ortholog conjecture is that duplication may (not necessarily always) promote functional shifts, many-to-many orthology relationships will tend to include orthologous pairs with different functions.

Thus I would re-state the conjecture (or expectation) as follows:

"In the absence of additional duplication events in the lineages separating them, two orthologous genes from two given species are more likely to share equivalent functions than two paralogs between these two species"

This would be a more conservative expectation, which is closer to the current use of orthology-based annotation that tends to identify one-to-one orthologs, rather than any type.

When duplications start appearing in subsequent lineages, thus creating one-to-many or many-to-many orthology relationships, the situation is less clear. Following the assumption that duplications may promote functional divergence, one could expand the conjecture with "the more duplications in the evolutionary history separating two genes, the lower the expectation that these two genes would share equivalent functions".
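As a minimal sketch of this expanded expectation (an illustration added here, not part of the original comment), the following Python snippet classifies two genes in a toy gene tree as orthologs or paralogs from the event at their last common ancestor, and counts the duplication nodes on the lineages separating them. The `Node` class, the event labels and the example tree are all hypothetical; in practice the events would have to be inferred from a reconstructed gene tree.

```python
class Node:
    """A node of a toy gene tree; the `event` label is an assumption of this sketch."""

    def __init__(self, name, event="leaf"):
        self.name = name
        self.event = event      # "speciation", "duplication" or "leaf"
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def ancestors(node):
    """Ancestors of `node`, ordered from its parent up to the root."""
    path = []
    while node.parent is not None:
        node = node.parent
        path.append(node)
    return path


def relationship(gene_a, gene_b):
    """Return (kind, n_dups) for two distinct leaves of the same tree.

    kind   : "orthologs" if the last common ancestor is a speciation node,
             "paralogs" if it is a duplication node.
    n_dups : duplication nodes on the two lineages below that ancestor.
    """
    ancestors_b = ancestors(gene_b)
    lca = next(n for n in ancestors(gene_a) if any(n is m for m in ancestors_b))
    kind = "orthologs" if lca.event == "speciation" else "paralogs"
    n_dups = 0
    for gene in (gene_a, gene_b):
        for node in ancestors(gene):
            if node is lca:
                break
            if node.event == "duplication":
                n_dups += 1
    return kind, n_dups


# Toy tree: an ancient duplication gives subfamilies A and B; each then splits
# by speciation into human and mouse, and the human A gene duplicates again.
root = Node("root", "duplication")
fam_a = root.add(Node("A", "speciation"))
fam_b = root.add(Node("B", "speciation"))
human_a_dup = fam_a.add(Node("human_A_dup", "duplication"))
human_a1 = human_a_dup.add(Node("human_A1"))
human_a2 = human_a_dup.add(Node("human_A2"))
mouse_a = fam_a.add(Node("mouse_A"))
human_b = fam_b.add(Node("human_B"))
mouse_b = fam_b.add(Node("mouse_B"))

print(relationship(human_b, mouse_b))   # one-to-one orthologs
print(relationship(human_a1, mouse_a))  # orthologs separated by a later duplication
print(relationship(human_a1, mouse_b))  # between-species paralogs
```

Under the re-stated conjecture, the first pair (orthologs with no intervening duplication) would carry the strongest expectation of equivalent function, and that expectation would weaken as the duplication count grows.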

I wrote this contribution on the fly, and surely there are ways of expressing this in more appropriate terms. In any case I hope I made clear the idea that the conjecture emerges from the notion of duplications causing functional shifts, and that our expectations will be clearer if expressed in those terms. This goes along the lines of what Jonathan Eisen mentioned about considering the whole phylogenetic story to annotate genes.

From this perspective, the really important hypothesis is that "duplications tend to promote functional shifts". I think this is based on solid grounds and has been tested intensively in the past.

Cheers,

Toni Gabaldón

http://treevolution.blogspot.com

CRG Symposium: Computational Biology of Molecular Sequences. 10-11 November

2011-09-14T07:48:00.000-07:00

Registration is open for the CRG symposium organized by our Bioinformatics and Genomics programme. This meeting will host internationally renowned scientists in the Bioinformatics field. Just to cite some: Smith, Tramontano, Ponting, Sankoff, Koonin, Bairoch, Brunak... Below you'll find the symposium overview and the complete list of speakers.

Advances in methods to sequence nucleic acids, coupled with more general advances in automation, robotization, and multiplexing, have resulted in the capacity to survey the phenomena of life in a global manner and with unprecedented resolution. As a result, Biology, traditionally an analytic science in which the natural world is dissected into its elemental components in order to be comprehended, is becoming a synthetic science, in which the phenomena of life are approached in a more systemic way. In parallel, Biology, a science in which human effort has been directed until very recently towards data acquisition, is increasingly becoming a discipline in which data is obtained with almost no human intervention, and the effort is being directed towards data analysis. Computational systems to store, analyze and model biological data have thus become an essential part of research in Biology. The connection between Biology and Computation, however, runs much deeper, as we are coming to realize that the unfolding of the instructions in the genome is, stricto sensu, a computation on the DNA sequence. Biology, thus, cannot be understood without Computation. The two-day CRG symposium on “Computational Biology of Molecular Sequences” will bring together renowned Computational Biologists from around the world, including both pioneers in the field and promising young scientists. Presentations, discussions and dialogue during the Symposium will contribute to surveying the status of a discipline that, at the intersection of Biology and Computation, will have an enormous impact on the world of the XXIst century.

Confirmed Speakers
Amos BAIROCH Swiss Institute of Bioinformatics (SIB) and University Geneva, Geneva CH
Mathieu BLANCHETTE McGill University, Montréal CA
Søren BRUNAK Technical University of Denmark, Kongens Lyngby DK
Philipp BUCHER Swiss Institute for Experimental Cancer Research (ISREC), Lausanne CH
Brendan FREY University of Toronto, Toronto CA
Mark GERSTEIN Yale University, New Haven US
Nick GOLDMAN European Bioinformatics Institute, Hinxton UK
Tim HUBBARD Wellcome Trust Sanger Institute, Hinxton UK
Eugene V. KOONIN National Center for Biotechnology Information, Bethesda US
Gene MYERS Janelia Farm Research Campus, Ashburn US
Chris PONTING University of Oxford, Oxford UK
David SANKOFF University of Ottawa, Ottawa CA
Ron SHAMIR Tel-Aviv University, Tel-Aviv IL
Temple F. SMITH BioMolecular Engineering Resource Center, Boston US
Terry SPEED Walter & Eliza Hall Institute of Medical Research, Parkville AU
Peter STADLER Universität Leipzig, Leipzig DE
Gary STORMO Washington University School of Medicine, Saint Louis US
Ana TRAMONTANO Sapienza University, Rome IT
Michele VENDRUSCOLO University of Cambridge, Cambridge UK
Martin VINGRON Max Planck Institute for Molecular Genetics, Berlin DE