Saturday, September 24, 2011

Special BiB issue on "Orthology and Applications"

A special issue on "Orthology and Applications" is out in the journal Briefings in Bioinformatics.

This special issue has been edited by Christophe Dessimoz and comprises a number of interesting papers, including several comprehensive reviews as well as original research articles. Some of the papers emerge from efforts on orthology benchmarking and standardization of datasets that were initiated during the first "Quest for Orthologs" meeting in 2009. See this letter reporting from that meeting. We contributed an article comparing expression patterns between across-species orthologs and paralogs of a similar evolutionary age.

Wednesday, September 21, 2011

On the "orthology conjecture"


Jonathan Eisen has opened a thread on his blog to discuss the recent paper by Hahn and colleagues on the "ortholog conjecture". You can read more about the discussions raised by this paper here.

This is what I wrote, a text which I had to split into three pieces on Eisen's blog given the word limit for comments!!


I appreciate the effort by Matthew Hahn in explaining the story behind his paper on the so-called "ortholog conjecture" and in facing some of the criticism. This paper attracted my interest, as it did that of many others who work on or just use orthology. For instance, it was chosen by one of my postdocs for our "Journal Club" meeting, and it was discussed during our last "Quest for Orthologs" meeting in Cambridge. I think it is raising a necessary discussion and therefore I think it is a good paper. This does not mean that I fully agree with the interpretation and conclusions ;-). I hope to modestly contribute to this debate with the following post.

I think one of the reasons this paper has caused so much debate is that the conclusions seem to challenge common practice (inferring function from orthologs), and could be interpreted as a call to change the strategies of genome annotation. I think, however, that one should interpret these results carefully before starting to annotate based on paralogous proteins. As I will discuss below, one of the problems is that we need to agree on what the conjecture is in order to agree on how to test it. I see three main points that can be a source of confusion: i) the issue of what is actually stated by this conjecture, ii) the issue of annotation, and iii) the issue of time.

i) What is the "ortholog conjecture"?
Or in other terms, when should we expect orthologs to be more likely to share function than paralogs? Always? Of course not. All of us would agree that two recently duplicated paralogs are likely to be more similar in function than two distant orthologs, so it is obvious that the conjecture is not simply "orthologs are more similar in function than paralogs". In reality the expectation that orthologs are more likely to be similar in function than paralogs, at least this is how I interpret it, is directly related to the effect that duplication has on functional divergence. If gene duplication has some effect on functional divergence (even if not in 100% of the cases), then, all other things being equal (divergence time, history of speciation/duplication events, except for the duplication defining the paralogs), one would expect orthologs to be more likely to conserve function.

I think this complexity is not well considered (by many authors, in general). Hahn refers to the famous review of orthology by Koonin (2005) as the source for the term "ortholog conjecture". However, in that paper this conjecture is always discussed within the context of genes across two particular species, whereas in Hahn's paper it is extended to other contexts as well. Thus, the proper context in which to test this conjecture is only between orthologs and between-species paralogs. As we can see, the red and purple lines in figure 2 of Hahn's paper do not show any clear difference.

Secondly, Koonin was very cautious in his paper, stating that he was referring to "equivalent functions" and not exactly the same "function", correctly implying that the functional contexts would be different in the two different species. This brings me to the next point.

ii) annotation
If the expectation of functional conservation of orthologs refers to a given pair of species, then it makes no sense to test that expectation by comparing paralogs within the same species against orthologs in different species. We were interested in this issue and it took us some effort to control for this "species" influence on the comparison. If you are interested, you can read our paper on divergence of expression profiles between orthologs and paralogs.

As Hahn finds, and as was anticipated by Koonin in that review, there is a huge influence of the "species context", a big constraint on what fraction of the function is shared. Indeed I think it is the dominant signal in Hahn's paper. Why is that? One possibility is that the functional context determines the function, I agree. However, we should not discard biases in how the different communities working around a model species define processes and function, and also in the type of experiments that are usually done. For instance, experimental inference from KO mutants might be common in mouse, but I guess this is not the case in humans (!!). I think this may be having a big influence and might even be the dominant signal in Hahn's paper.

Finally, function has many levels and I expect subfunctionalization to mostly affect the lower (i.e. more specific) levels. Biases may also exist in the level of annotation between species or between families of different size (contributing more or less to the orthologs/paralogs class).

Microarray data are less likely to be subject to biases (although some may exist); at least they should be expected to be free of "human interpretation biases", and so Hahn and colleagues did well, in my opinion, in testing that dataset. It is important to note that for microarrays, and for orthologs and between-species paralogs (which I think is the right frame for testing the conjecture), orthologs are more likely to share an expression context. This is compatible with what we found in the paper mentioned above, and compatible with the ortholog conjecture as stated by Koonin (across species).

iii) time
Finally, one aspect which I think is fundamental is the notion of "divergence time". Since paralogs can emerge at different time-scales, they comprise a heterogeneous set of protein pairs. Most comparisons of orthologs and paralogs (Hahn's as well) use sequence divergence as a proxy for time. However, this is only a poor estimate, especially when duplications (as here) are involved (we explored this issue in the past). This means that, for a given divergence time, paralogs may have larger sequence divergence than orthologs at the same divergence time, or the other way around (if gene conversion is playing a role). Is the conjecture based on sequence divergence or on divergence time? I think the initial sense of using orthology to annotate across species is based on the notion of comparing things at the same evolutionary distance. Thus basing our conclusions on divergence times might not be the proper way of doing it.


To conclude, and with the intention of going beyond this particular paper, I would finish by saying that the key to the problem lies in how we interpret the so-called "ortholog conjecture", or in what our expectations are about how function evolves. What I get from re-reading Eugene Koonin's paper, and how I am using that "assumption" in my day-to-day work, is the following:

"Orthologs in two given species are more likely to share equivalent functions than paralogs between these two species"

Therefore the notion of "across the same pair of species" is important and thus only part of the comparisons made by Hahn and colleagues could directly test this. Looking at the microarray and between-species comparison data, the conjecture may even hold true!!

I do think, however, that the conjecture as stated above is limited and does not capture the complexity of orthology relationships. Indeed we, and many other researchers, tune the confidence of orthology-based annotation according to whether the orthologs are one-to-one, one-to-many or many-to-many, or even whether the orthologs are "super-orthologs" (with no duplication event in the lineages separating the two orthologs).
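As a rough illustration of the one-to-one / one-to-many / many-to-many distinction, here is a minimal Python sketch. It assumes the orthology predictions between two species are available as a plain list of gene pairs; the function name and the gene identifiers are hypothetical, and the labelling is done per pair from the number of partners of each gene, which is a simplification of how orthology databases group genes.

```python
from collections import defaultdict


def classify_orthology(pairs):
    """Label each ortholog pair by relationship type.

    pairs: iterable of (gene_in_species_A, gene_in_species_B) tuples.
    Returns a dict mapping each pair to 'one-to-one', 'one-to-many',
    'many-to-one' or 'many-to-many' (read as: species A to species B).
    """
    pairs = list(pairs)
    partners_of_a = defaultdict(set)  # A gene -> its orthologs in species B
    partners_of_b = defaultdict(set)  # B gene -> its orthologs in species A
    for a, b in pairs:
        partners_of_a[a].add(b)
        partners_of_b[b].add(a)

    labels = {}
    for a, b in pairs:
        # How many A genes map to this B gene, and how many B genes to this A gene.
        a_side = "one" if len(partners_of_b[b]) == 1 else "many"
        b_side = "one" if len(partners_of_a[a]) == 1 else "many"
        labels[(a, b)] = f"{a_side}-to-{b_side}"
    return labels


# Hypothetical toy input: a2 was duplicated in species B after the speciation.
example = [("a1", "b1"), ("a2", "b2"), ("a2", "b3")]
for pair, label in classify_orthology(example).items():
    print(pair, label)
```

With labels like these, an annotation pipeline could, for instance, transfer function with higher confidence across one-to-one pairs than across the other classes.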

Since the underlying assumption of the ortholog conjecture is that duplication may (not necessarily always) promote functional shifts, many-to-many orthology relationships will tend to include orthologous pairs with different functions.

 Thus I would re-state the conjecture (or expectation) as follows:

 "In the absence of additional duplication events in the lineages separating them, two orthologous genes from two given species are more likely to share equivalent functions than two paralogs between these two species"

 This would be a more conservative expectation, which is closer to the current use of orthology-based annotation that tends to identify one-to-one orthologs, rather than any type.

When duplications start appearing in subsequent lineages, thus creating one-to-many or many-to-many orthology relationships, the situation is less clear. Following the assumption that duplications may promote functional divergence, one could expand the conjecture as "the more duplications in the evolutionary history separating two genes, the lower the expectation that these two genes would share equivalent functions".
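To make the idea of "duplications in the evolutionary history separating two genes" concrete, here is a small Python sketch. It assumes a rooted gene tree given as a child-to-parent dictionary, with every internal node already labelled as a speciation or a duplication; the data structures, function names and the toy tree are all hypothetical, and this is only one possible way to count such events.

```python
def path_to_root(node, parent):
    """Return the list [node, parent of node, ..., root]."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def compare_genes(gene_x, gene_y, parent, event):
    """Return (relationship, number of duplications separating the genes).

    parent: dict mapping each non-root node to its parent node.
    event:  dict mapping each internal node to 'speciation' or 'duplication'.
    """
    ancestors_x = path_to_root(gene_x, parent)
    ancestors_y = path_to_root(gene_y, parent)
    ancestors_y_set = set(ancestors_y)
    # Last common ancestor: first ancestor of x that is also an ancestor of y.
    lca = next(n for n in ancestors_x if n in ancestors_y_set)
    # Orthologs diverged at a speciation node, paralogs at a duplication node.
    relationship = "orthologs" if event[lca] == "speciation" else "paralogs"
    # Internal nodes on the path joining the two genes, including the LCA.
    on_path = (
        ancestors_x[1:ancestors_x.index(lca)]
        + ancestors_y[1:ancestors_y.index(lca)]
        + [lca]
    )
    n_duplications = sum(event[n] == "duplication" for n in on_path)
    return relationship, n_duplications


# Hypothetical toy gene tree: ((a1, a2)dupA, b1)n1, c1)root
parent = {"a1": "dupA", "a2": "dupA", "dupA": "n1", "b1": "n1", "n1": "root", "c1": "root"}
event = {"dupA": "duplication", "n1": "speciation", "root": "speciation"}

print(compare_genes("b1", "c1", parent, event))  # orthologs, no duplication in between
print(compare_genes("a1", "b1", parent, event))  # orthologs separated by one duplication
print(compare_genes("a1", "a2", parent, event))  # paralogs
```

Under the expanded conjecture, the pair with zero intervening duplications would carry the highest expectation of sharing equivalent functions.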

I wrote this contribution on the fly, and surely there are ways of expressing this in more appropriate terms. In any case I hope I made clear the idea that the conjecture emerges from the notion of duplications causing functional shifts and that our expectations will be clearer if expressed in those terms. This goes along the lines of what Jonathan Eisen mentioned about considering the whole phylogenetic story to annotate genes.

Under this perspective, the really important hypothesis is that "duplications tend to promote functional shifts". I think this is based on solid grounds and has been tested intensively in the past.


Toni Gabaldón

Wednesday, September 14, 2011

CRG Symposium: Computational Biology of Molecular Sequences. 10-11 November

Registration is open for the CRG symposium organized by our Bioinformatics and Genomics programme. This meeting will host internationally renowned scientists in the field of Bioinformatics. Just to cite some: Smith, Tramontano, Ponting, Sankoff, Koonin, Bairoch, Brunak... Below you'll find the symposium overview and the complete list of speakers.

Advances in methods to sequence nucleic acids, coupled with more general advances in automation, robotization, and multiplexing, have resulted in the capacity to survey the phenomena of life in a global manner and with unprecedented resolution. As a result, Biology, traditionally an analytic science in which the natural world is dissected into its elemental components in order to be comprehended, is becoming a synthetic science, in which the phenomena of life are approached in a more systemic way. In parallel, Biology, a science in which human effort has been directed until very recently towards data acquisition, is increasingly becoming a discipline in which data is obtained with almost no human intervention, and the effort is being directed towards data analysis. Computational systems to store, analyze and model biological data have thus become an essential part of research in Biology. The connection between Biology and Computation, however, runs much deeper, as we are coming to realize that the unfolding of the instructions in the genome is, stricto sensu, a computation on the DNA sequence. Biology, thus, cannot be understood without Computation. The two-day CRG symposium on “Computational Biology of Molecular Sequences” will bring together renowned Computational Biologists from around the world, including both pioneers in the field and promising young scientists. Presentations, discussions and dialogue during the Symposium will help survey the status of a discipline that, at the intersection of Biology and Computation, will have an enormous impact on the world of the XXIst century.
Confirmed Speakers
Amos BAIROCH Swiss Institute of Bioinformatics (SIB) and University Geneva, Geneva CH
Mathieu BLANCHETTE McGill University, Montréal CA
Søren BRUNAK Technical University of Denmark, Kongens Lyngby DK
Philipp BUCHER Swiss Institute for Experimental Cancer Research (ISREC), Lausanne CH
Brendan FREY University of Toronto, Toronto CA
Mark GERSTEIN Yale University, New Haven US
Nick GOLDMAN European Bioinformatics Institute, Hinxton UK
Tim HUBBARD Wellcome Trust Sanger Institute, Hinxton UK
Eugene V. KOONIN National Center for Biotechnology Information, Bethesda US
Gene MYERS Janelia Farm Research Campus, Ashburn US
Chris PONTING University of Oxford, Oxford UK
David SANKOFF University of Ottawa, Ottawa CA
Ron SHAMIR Tel-Aviv University, Tel-Aviv IL
Temple F. SMITH BioMolecular Engineering Resource Center, Boston US
Terry SPEED Walter & Eliza Hall Institute of Medical Research, Parkville AU
Peter STADLER Universität Leipzig, Leipzig DE
Gary STORMO Washington University School of Medicine, Saint Louis US
Ana TRAMONTANO Sapienza University, Rome IT
Michele VENDRUSCOLO University of Cambridge, Cambridge UK
Martin VINGRON Max Planck Institute for Molecular Genetics, Berlin DE

Monday, September 5, 2011

Article collection on the Tree Of Life in Biology Direct

The journal Biology Direct has initiated an article collection entitled "Beyond the tree of Life". The main focus of this series seems to be the challenges posed by evolutionary mechanisms, such as Horizontal Gene Transfer, that may blur or completely destroy the classical view of a bifurcating tree of life representing the evolutionary relationships of organisms, especially prokaryotes.

The issue is not new, but the current wealth of genomic data and the availability of new methodological approaches to measure and compare the evolutionary signals of thousands of protein families have prompted a revival of the debate on how strong tree-like and network-like signals are in the different domains of life.

The series started last June and is being updated regularly. There are already interesting articles from various authors including William Martin, Eric Bapteste, and Eugene Koonin.

A nice add-on, which is one of the features I like the most about Biology Direct, is that the reviewers' reports can be read along with the paper, thus offering a complementary view to the authors' interpretation of the data.

Thursday, September 1, 2011

A new journal for "big" Science

I have mixed feelings about the current proliferation of scientific journals. On the one hand I feel that it is a natural response to the increase in the number of researchers in the world and the growing specialization of science. Moreover, it serves to open up the publication system to the wider community and sometimes breaks up dangerous closed circles that monopolize access to publication in certain areas. On the other hand, as a researcher with broad interests and one who wants to follow the progress of my field, I feel overwhelmed. The times when browsing the tables of contents (TOCs) of a handful of journals was enough to identify almost all relevant papers are definitely over. Nowadays one needs complex literature-mining strategies to try to cope with the flow. It seems one also needs to search for potential new journals that may become the forum for papers relevant to one's research. I'm keen to use this blog to spread the word about new journals that are relevant to my field, as I did in the past. Now I am doing it again.

When I recently heard about the new BMC-based journal GigaScience I felt it would be worth keeping an eye on. As they say, "GigaScience aims to revolutionize data dissemination, organization, understanding, and use. An on-line open-access open-data journal, we publish 'big-data' studies from the entire spectrum of life and biomedical sciences."

The original idea of this journal is that it links standard publication with a database to store and search all associated data. I personally think this journal will fill an important gap and seems perfectly prepared to do so, judging by the editorial board, which includes many researchers from centres that are at the forefront of massive data production, such as BGI, Wellcome Trust, EBI and JCVI.

I am looking forward to the first articles to see direct examples of how effective this system is and how the database fits the needs of inherently diverse types of data, but at first glance it seems that this journal may meet the needs of upcoming studies on massive data such as those coming from genomics or systems biology.