Thursday, June 28, 2012

wrap-up of the orthology, paralogy, and function symposium at SMBE 2012

I promised some people that I would write a short summary of the symposium that Matthew Hahn, Marc Robinson-Rechavi, Iddo Friedberg, and I co-organized at SMBE 2012. I particularly enjoyed the symposium, and the room was pretty full the whole time, despite running in parallel with sessions on other interesting topics. I will just write an overall summary without going into too much detail on each of the talks, and at the end I will list a number of papers that were mentioned in the various talks. I have to clarify that this informal wrap-up only contains my own views and has not been agreed upon among the organizers. I invite any of the attendees to add comments highlighting important aspects that I may have missed.

I’ll start by providing a summary of how all this started, which was in a rather unusual way, I believe. Indeed the idea of the symposium was born in the blogosphere, in Jonathan Eisen’s popular Tree of Life blog, where he invited Matthew Hahn to write a special guest post on the “history behind” his paper on testing the orthology conjecture. One of the conclusions from that paper was that paralogous sequences were more similar in function (and in expression patterns) than orthologs, which contradicted one of the major expectations (and assumptions) behind the theories of duplication-driven functional divergence and the strategies for inferring functions from orthologous sequences. That paper had already caused a bit of turmoil in the orthology community (I remember this was a hot discussion during the last Quest for Orthologs meeting, at Cambridge), and several concerns were being raised about the suitability of comparing functional annotations from different species, and about the conclusions derived within the paper. Rather rapidly, several people commented on Matt’s post and a lively discussion started (more than 40 comments in total!). The discussion was so interesting that Marc Robinson-Rechavi suggested we should bring this scientific debate, in the form of a symposium, to one of the upcoming conferences, and that is how some of us started to work on this idea. For me, the symposium was the first time that I met the other organizers in person.

The symposium started with Eugene Koonin, who nicely introduced the topic of what conjectures could be implied by the definition of orthology, a purely evolutionary one as introduced by Walter Fitch in 1970. He then showed results from his lab indicating that these conjectures tend to hold, but that there may be exceptions. For instance, the conjecture that orthologs should be best reciprocal hits can be broken by accelerated evolution in one of the true orthologs. He then showed work from other groups (Sali, Sonnhammer) on the higher conservation of structure and domain architecture in orthologs as compared to paralogs. He criticized the use of GO terms by Hahn and others and argued that one should look at a variety of data on function to test the conjecture. He presented results from his own group which show higher conservation of expression across species. He concluded that the functional conjecture still holds, although he observed that the differences may not be spectacular.

Katerina Guschanski was next, talking about changes in gene expression following segmental duplications in mammals. They have produced an impressive dataset of expression from different tissues in various mammal species. She used that set to ask whether duplication contributes more to divergence than time alone, and showed that levels of expression were decreasing in younger duplicates, with changes differing across tissues. She observed no differences between one-to-one orthologs and old duplicate pairs, and she also found no differences in tissue specificity between orthologs and paralogs.

Next on stage was Nicholas Furnham, who presented new implementations in FunTree that allow exploring functional evolution on trees. He warned that EC classification is not unambiguous, which can also cause problems for functional comparisons. They have developed “EC-Blast”, which directly measures distances between enzymatic reactions based on the molecular structures of substrates and products.

Christophe Dessimoz presented results from his recent paper, in which they show important biases in GO term annotations: genes from the same species and families tend to be annotated with more similar terms because of experimental biases and author biases. When correcting for these biases, the conjecture still holds. However, he admitted that the differences were not very big, although still significant.

Romain Studer came next. He measured selection and changes in structural stability in orthologs and duplicated genes. He showed that selected sites tend to be more clustered in the structure in paralogs than in orthologs; however, he observed no differences in the evolution of stability between orthologs and paralogs. He concluded that differences between orthologs and paralogs may be smaller than previously thought.

After the coffee break Jianzhi Zhang told us about his work towards probing the orthology conjecture. After giving them a try, he gave up on using GO terms because of the many inconsistencies and biases observed. He thus turned to examining the conservation of protein-protein interactions (PPIs), using experimentally determined interactions in various yeast species. Unfortunately, the large number of interactions that remain to be tested experimentally in duplicated proteins prevented him from showing a comparison of orthologs and paralogs in this talk. Nevertheless, he found that all PPIs tested for orthologs were conserved; even the apparent exceptions could be traced to possible errors in previous large-scale yeast two-hybrid experiments.

Alex Nguyen also showed results on budding yeast gene duplications. They focused on a more specific aspect of function: the presence of short conserved linear motifs in proteins. They found that these were more likely to disappear or diverge after the duplication event, consistent with neo- or sub-functionalization models.

We moved to Drosophila with our next speaker, Lev Yampolsky, who exploited expression and genomic data from the 12 Drosophila genomes. They showed larger differences in rates of divergence in paralogs as compared to orthologs, and these differences were also more asymmetrical. They also found that these differences varied between fast- and slow-evolving families. Finally, they also found larger differences in expression in paralogs.

Then it was my turn, and I mainly showed our results on the comparison of expression patterns in human and mouse. Our experimental design is different from others in two respects: first, we use topological dating (not sequence divergence) to establish orthologs and paralogs of a similar age; second, we always compared orthologs to inter-species paralogs to get rid of species-specific biases in the comparisons. Our results support a larger divergence of paralogs as compared to orthologs in tissue expression patterns. Thanks to our experimental design we could also establish that most of the differences between paralogs were gained shortly after the duplication, linking the duplication event to a big fraction of the divergence.

Our last speaker was Paul Thomas, who gave an overview of what you can and cannot expect from GO annotations. He also showed progress on how the consortium is trying to model functional evolution through gene families, and how these models can help in the study of the relationship between orthology, paralogy and gene function.

Thus we had a diverse set of talks, most of them focusing on the comparison of different aspects of functional evolution (GO annotations, expression, functional motifs, interactions, divergence, structure) and using varying experimental designs and species. I would say one of the main conclusions is that GO annotations (and even EC numbers) can be misleading in our assessment of functional evolution. My personal view is that most talks showed results consistent with the conjecture, although the level of difference between paralogs and orthologs was sometimes small. Function can be described at multiple levels, and I would expect that functional divergence after duplication may affect only one or a few of these. Thus, if an experimental design focuses on one such level, it may be expected to miss divergence in the others. In addition, those designs that average over all levels will inevitably dilute small but important aspects of functional divergence. In conclusion, this is an exciting topic, and with the number and variety of groups that are now interested in it, I am sure that we will get closer and closer to understanding the complex relationships between orthology, paralogy and functional divergence.

Some links and papers mentioned during the symposium (I have probably missed some):

Abstracts from oral presentations in SMBE, including our symposium

Another post on the orthology conjecture 

Announcement of our symposium

FunTree: a resource for exploring the functional evolution of
structurally defined enzyme superfamilies.
Furnham N, Sillitoe I, Holliday GL, Cuff AL, Rahman SA, Laskowski RA,
Orengo CA, Thornton JM.
Nucleic Acids Res. 2012 Jan;40(Database issue):D776-82

Brawand et al. The evolution of gene expression levels in mammalian organs.

Forslund et al. Domain architecture conservation in orthologs.

Huerta-Cepas and Gabaldón. Assigning duplication events to relative temporal scales in genome-wide studies.

Nehrt et al. Testing the Ortholog Conjecture with Comparative Functional Genomic Data from Mammals.

Nguyen et al. Proteome-Wide Discovery of Evolutionary Conserved Sequences in Disordered Regions. Sci Signal. 5(215):rs1.

Peterson et al. Evolutionary constraints on structural similarity in orthologs and paralogs.

Thomas et al. On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report.

Large-scale analysis of orthologs and paralogs under covarion-like and
constant-but-different models of amino acid evolution.
Studer RA, Robinson-Rechavi M.
Mol Biol Evol. 2010 Nov;27(11):2618-27.

How confident can we be that orthologs are similar, but paralogs differ?
Studer RA, Robinson-Rechavi M.
Trends Genet. 2009 May;25(5):210-6.

Pervasive positive selection on duplicated and nonduplicated vertebrate
protein coding genes.
Studer RA, Penel S, Duret L, Robinson-Rechavi M.
Genome Res. 2008 Sep;18(9):1393-402.

Friday, June 1, 2012

Publicly available or not?

I have always had the naive understanding that databases such as GenBank were public, and that one was free to do research on data accessed from there, and eventually publish the results. However, nothing seems to be as simple as that, since many of the genomes deposited there have not been published yet. I have experienced myself, and heard from many colleagues about, problematic situations regarding the use of genome data taken from public databases but yet to be published. Current guidelines are open to different interpretations, and different stakeholders (editors, reviewers, users, data producers) may have entirely different and conflicting views. With the current trend we will soon have more unpublished than published genomes in public databases, so I think it is worth re-assessing the policies. Here I share some views.

Policy guidelines regarding the use of genomic sequences prior to publication are available (see the NHGRI rapid data release policy) and set reasonable rules. For instance, data producers should deposit the data publicly and should produce, within a short period of time, a paper that can be cited as the source of the data. This could precede a full genome paper in which a more thorough analysis is presented. Users should not take the public data to publish an analysis focused on that genome, but this restriction should not be prolonged too much. The underlying idea is to reserve the opportunity to describe the main characteristics and findings for the researchers who make the effort of sequencing, assembling, and annotating a genome, while ensuring that the data serves the advancement of science by allowing other groups to perform research on the genome data as soon as it is produced. However, there are many interpretations of what possible uses of the data should be allowed. Moreover, although indicative time-frames for the preferential exploitation of the data are given (e.g. 6 months), these are only indications. In the absence of clear-cut rules, the situation invites conflict. With the current flow of sequencing data, we will increasingly face the situation that data produced for public use and accessible through public databases is not associated with a paper, so that it is unclear whether its use should require permission. In such situations there may be different interpretations of the existing rules: that of the leader of the sequencing project, that of the researcher accessing the data, that of the agency that financed the sequencing, and even that of the editors and reviewers of papers using available but unpublished data. Below I list some undesirable situations that highlight the contradictions of the current system. These situations are not hypothetical but rather correspond to real cases that I experienced or heard about from colleagues.

  • Users of public databases may inadvertently download unpublished data, especially when they download data at large scale. After all, they are using a public repository, and it is contradictory that public databases provide data that are not usable.
  • Most genome sequencing projects are financed using public money or by agencies that require that the data is made publicly available as soon as it is produced, but this leads to the situation above, making it difficult for sequencing project leaders to know what use is being made of their data.
  • Referees may specifically ask authors to use genomes that are in databases, or simply reject a paper because it does not use this or that “publicly available” genome in the comparative analyses. In addition referees or editors may ask for evidence of a specific permission to use unpublished data.
  • Authors asking for permission to use an unpublished genome may be required to explain the exact use of the data, which exposes their ideas to possible direct competitors.
  • Leaders of genome projects may feel entitled to ask for authorship in exchange for data that is available in public databases.
  • Leaders of genome projects may intentionally delay the publication of the genome paper to extend the period of preferential use. They may even decide to publish partial analyses before the genome paper.
  • Some unpublished genomes have been in public databases for several years, and still different interpretations are possible as to whether these data can be freely used.
  • Some genomes may never be published in the form of a genome paper, because they were sequenced with a very particular purpose.

In my opinion the current situation is too ambiguous, generates conflicts and ultimately jeopardizes the advance of science. We need clear rules, rather than guidelines, and below I propose four simple rules that would simplify the process.

  • Granting agencies and sequencing centers should specify a reasonable time-frame for preferential use (6-12 months) before the data is released. This should suffice to give the upper hand to the research team that is making the sequencing effort, but will also force them to focus on publishing a genome paper as soon as possible.
  • During this period, sequencing projects may announce the availability of the data for restricted use, through a specific repository that can be accessed only after a specific permission is granted. This will enable use of the data from time 0.
  • Data is released to the public repositories (at least in the form of bulk download) only after that period.
  • All data in public repositories should thus be free to be used for any purpose, regardless of whether a genome paper is published.

Personally, as far as the activity in my lab is concerned, I have taken the decision that we will use any data that has been publicly deposited in GenBank for more than a year, for any purpose other than doing a “genome paper” (of course!). I think this is in perfect agreement with the NHGRI recommendations and will definitely save us time and worries.
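
For those who download genomes in bulk, a rule like this is easy to apply automatically. Below is a minimal sketch in Python, assuming you already know the public release date of each record and whether its genome paper is out. The record names and dates are made up for illustration, and `is_usable` is a hypothetical helper, not part of any GenBank tool.

```python
from datetime import date, timedelta

# Embargo length assumed from the one-year rule described above.
EMBARGO = timedelta(days=365)


def is_usable(release_date: date, today: date, published: bool = False) -> bool:
    """Return True if a record may be used under the one-year rule.

    A record is considered usable if its genome paper is already published,
    or if it has been publicly available for more than one year.
    """
    return published or (today - release_date) > EMBARGO


# Hypothetical records: (identifier, public release date, genome paper published?)
records = [
    ("genome_A", date(2010, 3, 15), False),
    ("genome_B", date(2012, 1, 10), False),
    ("genome_C", date(2012, 4, 2), True),
]

today = date(2012, 6, 1)
usable = [name for name, released, published in records
          if is_usable(released, today, published)]
print(usable)
```

Note that the sketch only checks the age of the data; it says nothing about the purpose of the analysis, which still has to be something other than a genome paper.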