Tuesday, May 31, 2016

Response to Late Mitochondrial Origin is Pure Artefact

We recently published a study showing that proteobacteria-derived proteins in the Last Eukaryotic Common Ancestor (LECA) tend to have shorter phylogenetic distances to their bacterial counterparts than LECA proteins originating from other Bacteria or Archaea. We interpreted this as evidence for a late acquisition of mitochondria by a host that already contained bacterial- and archaeal-derived protein families. Our work has been heavily criticized by William Martin -one of the main proponents of mito-early hypotheses- and colleagues. Their critique was first submitted to Nature, reviewed by editors and independent reviewers, and eventually rejected. The authors have decided to publish a slightly modified version of the letter on bioRxiv. In my opinion the tone of the letter is unacceptable for an open scientific discussion. In any case, the bottom line is that their arguments do not support the claim that our results are artefactual, nor do they show in which way the purported artefact would produce the observed trend. For the sake of scientific discussion we have decided to publish our original response to their letter. We tried to post it on bioRxiv but it was declined because it "is a rebuttal to a criticism, not a research paper". Therefore I have decided to post it here.

Martin et al. criticize several methodological aspects of our study. We first want to note that none of the points raised affects the core of our conclusions -i.e. that differences in stem lengths relate to the phylogenetic origin of LECA families, being shorter in bacterial-, and particularly alpha-proteobacterial-derived families- because i) the observed relationships are independent of the clustering performed in Figure 1 of Pittis and Gabaldón (2016), and ii) their criticism focuses on a single comparison of a single dataset, whereas the differences are present across several datasets and approaches, including the very dataset from the authors mentioned in their letter (Ku et al. 2015), as we show below. Secondly, their interpretation of our stem length measurement, and how they extrapolate it to branches sub-tending eukaryotic clades, is conceptually flawed, as we also demonstrate below. Thus none of their arguments in any way compromises the main conclusions of our article. We nevertheless want to discuss their points.

Contrary to what Martin et al. claim, we do not assume a normal distribution for the global distribution of stem lengths. The claim that our statistical analyses are inappropriate is simply not true: we clearly explain all the methods used, and the tests performed to support the observed differences are all nonparametric, without any assumption of normality. In Figure 1 we did use a probabilistic clustering method that fits a Gaussian mixture model, a mixture of normal distributions, assuming multimodality in the data. Martin et al. show that a unimodal log-normal distribution would better fit the data when the number of parameters is penalized. Does this demonstrate that the underlying distribution is not a composite of five Gaussians? No, because when data are drawn from a five-Gaussian mixture with the obtained parameters, in 81% of the cases a log-normal distribution would be (wrongly) preferred under the BIC criterion. Also, the fact that any randomly sampled log-normal distribution can be fitted by a mixture model is by no means a surprise. In fact any distribution of data can be fitted by a finite number of mixture components, and this is precisely why mixture models are commonly used as universal function approximators and as a tool to partition various kinds of data. Finally, the definition of overfitting is not BIC inflation but lack of predictive power, so other parameters have to be considered when assessing whether a model provides a reasonable representation of the data. The use of the EM algorithm is justified as a method for partitioning the data because i) we may expect a composite of signals in a proteome (LECA) with at least two ancestral components (archaeal host and bacterial endosymbiont), and ii) prior studies have suggested that normalized branch length measurements such as the ones used here are approximately normal (Rasmussen and Kellis, 2007). The assumption of a unimodal distribution such as the one proposed by Martin et al. does not capture the expected mixture of origins for a chimeric proteome and does not fit the observation that differences in stem lengths relate to non-homogeneous phylogenetic origins. In any case, our results are independent of this clustering exercise, as the differences in stem lengths are apparent when simply grouping the LECA families according to their sister clades (Fig. 2 and Extended Data Fig. 1b of Pittis and Gabaldón, 2016), or when using other ways of clustering the data, such as equal binning (results not shown).
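To make the model-selection argument concrete, here is a minimal, self-contained sketch of the kind of experiment described above: draw data from a five-component Gaussian mixture and compare the BIC of a log-normal fit against that of a refitted mixture. The component parameters are invented for illustration, not those estimated in our paper.

```python
# Hedged illustration only: the means/sds/weights below are made up
# for this sketch, NOT the parameters estimated in Pittis and
# Gabaldón (2016).
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def sample_mixture(n, means, sds, weights, rng):
    """Draw n values from a mixture of normal components."""
    comp = rng.choice(len(means), size=n, p=weights)
    return rng.normal(np.array(means)[comp], np.array(sds)[comp])

def bic_lognormal(x):
    """BIC of a log-normal fit (2 free parameters; location fixed at 0)."""
    shape, loc, scale = stats.lognorm.fit(x, floc=0)
    ll = stats.lognorm.logpdf(x, shape, loc, scale).sum()
    return 2 * np.log(len(x)) - 2 * ll

def bic_gmm(x, k=5):
    """BIC of a k-component Gaussian mixture fitted by EM."""
    x = x.reshape(-1, 1)
    return GaussianMixture(n_components=k, random_state=0).fit(x).bic(x)

x = sample_mixture(1000,
                   means=[0.2, 0.5, 0.9, 1.4, 2.0],
                   sds=[0.1, 0.15, 0.2, 0.25, 0.3],
                   weights=[0.3, 0.25, 0.2, 0.15, 0.1],
                   rng=rng)
x = x[x > 0]  # log-normal support requires positive values

print("BIC log-normal:", bic_lognormal(x))
print("BIC 5-Gaussian mixture:", bic_gmm(x))
# Even when the generating model IS the mixture, BIC's penalty on the
# mixture's extra parameters can make the log-normal the "preferred"
# model, which is the point made in the text.
```

Repeating this over many simulated replicates gives the fraction of cases in which the log-normal wins under BIC despite the mixture being the true generating model, which is the spirit of the 81% figure quoted above.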

Their purported extrapolation of our analyses to eukaryotic clades, and the dates they derive from it, is flawed and misleading. First of all, we explicitly state that we do not assume constant rates (i.e. a molecular clock): our normalized branch length is a measurement proportional to time but multiplied by the ratio between the rates preceding and postdating LECA, so their timing exercise, providing date estimates, is completely ungrounded. Secondly, Martin et al. consider the normalized sl to yield arbitrary values, resulting in a log-normal distribution. This openly contradicts the observation that families of different prokaryotic origins show significant differences in sl and also in rsl values. All our analyses robustly show the opposite: there are differences, and these differences reflect relative divergence times. The cases of the cyanobacterial signal in Archaeplastida (Extended Data Fig. 3, Pittis and Gabaldón 2016) and of the Lokiarchaeota signal in LECA (Extended Data Fig. 7, Pittis and Gabaldón 2016) nicely indicate the validity of the measurement. Expecting some extreme ebl values to reflect radical adaptations and fast rates in some lineages, we used the median because of its robustness to extreme outliers (see Methods). We also tried excluding fast-evolving taxonomic groups from the calculations, without any change in our main results. None of these observations is explained by the interpretation of the data provided by Martin et al. Furthermore, Martin et al. show that the normalized branch lengths sub-tending each eukaryotic clade follow log-normal distributions, and conclude that this observation demonstrates that this is natural variation for branches meant to represent a single time interval (e.g. the divergence of fungi from metazoans).
By adopting this assumption they are surprisingly ignoring that eukaryotic families are also subject to differential gene loss and other processes, which result in multiple underlying patterns in the sub-tending branches (i.e. the sub-tending branch of a fungal family that was lost in metazoans does not derive from the divergence between fungi and metazoans, but from the deeper divergence of fungi and other unikonts). This becomes apparent when controlling for the relationship between the normalized branch lengths and the phylogenetic affiliation of the sister branch -a key step in our analyses which they ignore. Indeed, applying EM-based clustering to the eukaryotic clades and measuring enrichments in phylogenetic affiliations, as we did in our previous analysis (Pittis and Gabaldón, 2016), reveals major underlying distributions related to the nature of the sister group (Figure 1). Thus, in this case too, the variation in sl values, interpreted by the authors as "vividly documenting abundant branch length variation", is clearly shown to carry the signal of different divergence times. So yes, the sl values in eukaryotic groups do imply phases of early and late divergence due to gene loss or other biological events, as they do in the case of LECA. Of note, this is a new, independent demonstration that variation in stem lengths relates to underlying variation in phylogenetic distribution, and it provides additional support to our approach.
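The clustering-plus-enrichment logic can be sketched as follows: partition stem length values with an EM-fitted Gaussian mixture, then test each component for enrichment in a sister-group affiliation. This is a toy Python example with simulated sl values and made-up sister-group labels; the actual analysis in Pittis and Gabaldón (2016) uses real data, selects the number of components, and differs in detail.

```python
# Toy data: two simulated divergence signals, each tagged with a
# hypothetical sister-group label ("close" vs "distant").
import numpy as np
from scipy.stats import fisher_exact
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

sl = np.concatenate([rng.normal(0.5, 0.1, 150),    # shorter stems
                     rng.normal(1.5, 0.2, 150)])   # longer stems
sister = np.array(["close"] * 150 + ["distant"] * 150)

# EM-based partitioning of the sl values into two components
gmm = GaussianMixture(n_components=2, random_state=0).fit(sl.reshape(-1, 1))
comp = gmm.predict(sl.reshape(-1, 1))

# Fisher's exact test for sister-group enrichment in each component
for k in range(2):
    in_k = comp == k
    table = [[int(np.sum(in_k & (sister == "close"))),
              int(np.sum(in_k & (sister == "distant")))],
             [int(np.sum(~in_k & (sister == "close"))),
              int(np.sum(~in_k & (sister == "distant")))]]
    odds, p = fisher_exact(table)
    print(f"component {k}: close={table[0][0]} distant={table[0][1]} P={p:.2e}")
```

When the underlying sl distributions really do differ by sister group, as simulated here, the mixture components come out strongly enriched for one label each, which is the pattern described for Figure 1.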

 Figure | Ascomycota stem length analysis. Different phylogenetic sister groups show
significant differences in stem lengths according to their divergence times from Ascomycota.
Gene losses in the sister group lineage can explain the alternative tree topologies and
differences in estimated stem lengths.

Finally, Martin et al. focus their criticism on only one of our comparisons and on only one of the datasets used. For that dataset, they wrongly claim that we reused eukaryotic sequences in different trees. This is false. Given the multidomain nature of eukaryotic protein sequences, the source of that dataset (Powell et al. 2014) may incorporate a given protein into more than one orthologous cluster. However, we made sure we only used the orthologous sequence regions in a given analysis, thus never re-using a given eukaryotic sequence. Our analyses use standard filtering approaches, but they claim that statistical significance for one of our comparisons (alpha-proteobacterial vs. other bacteria) is lost when applying additional ad hoc filters on top of our previous filtering steps. We must note that even applying their filters and using a permutation test like the one in our paper, the alpha-proteobacterial sl values remain significantly lower than those of other bacteria (P=1e-2 when keeping only families with eukaryotic sequence lengths >= 100, and P=3.7e-2 when keeping only alignments with <= 50% gaps; 10^6 permutations). The loss of significance in some of the tests when artificially reducing the data is unsurprising. We are focusing on very ancient events, the signal we are measuring must necessarily be weak, and the number of LECA families that can be traced back to specific ancestries is limited. Indeed, statistical significance under a Mann-Whitney U-test is often lost (>60%-70% of the time) when randomly reducing the data to sizes similar to those of their filtered dataset, which suggests that the mere effect of reducing the size, rather than the particular additional filters used, is having the major effect. This is why we made sure the signal was robust across different datasets, always using state-of-the-art filtering approaches. Given the suggestion by Martin et al.
that a recent phylogenetic analysis of theirs (which appeared after we had submitted our paper) represents a more careful dataset (Ku et al., 2015), we repeated our analyses using this dataset, which confirmed our results (650 eukaryotic clades; Archaeal vs. Bacterial families, P=1.2e-41, two-tailed Mann-Whitney U-test; and α-proteobacterial families' sl significantly smaller within Bacteria, P=4.7e-2, permutation test, 10^6 permutations). Again, this result lends further support to our findings.
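For readers who want to see the shape of such a test, below is a minimal sketch of a one-tailed permutation test on group medians. The data are toy stand-ins; the exact implementation, inputs, and tail conventions in our paper may differ.

```python
import numpy as np

def permutation_test_median(a, b, n_perm=10000, rng=None):
    """One-tailed P-value for median(a) being lower than median(b):
    shuffle the pooled values, recompute the difference in medians,
    and count permutations at least as extreme as the observed one."""
    if rng is None:
        rng = np.random.default_rng(0)
    pooled = np.concatenate([a, b])
    observed = np.median(a) - np.median(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = np.median(pooled[:len(a)]) - np.median(pooled[len(a):])
        if diff <= observed:
            count += 1
    # add-one correction avoids reporting P = 0
    return (count + 1) / (n_perm + 1)

# Toy stand-ins for alpha-proteobacterial vs other-bacterial sl values
rng = np.random.default_rng(1)
alpha_sl = rng.lognormal(mean=-0.3, sigma=0.4, size=80)
other_sl = rng.lognormal(mean=0.0, sigma=0.4, size=200)
print("P =", permutation_test_median(alpha_sl, other_sl))
```

Using the median as the test statistic mirrors the robustness-to-outliers argument made above: a few extreme sl values in either group barely move the statistic.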

Altogether, we show that the criticisms raised by Martin et al. do not compromise the main results and conclusions of our paper. Furthermore, we would like to stress that the new dataset and analyses brought about by this discussion lend additional support to our approach and conclusions.

  1. Ku, C. et al. Endosymbiotic origin and differential loss of eukaryotic genes. Nature 524, 427–432 (2015).
  2. Rasmussen, M. D. & Kellis, M. Accurate gene-tree reconstruction by learning gene- and species-specific substitution rates across multiple complete genomes. Genome Res. 17, 1932–42 (2007).
  3. Pittis, A. A. & Gabaldón, T. Late acquisition of mitochondria by a host with chimaeric prokaryotic ancestry. Nature 531, 101–4 (2016).
  4. Powell, S. et al. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res. 42, D231–239 (2014).

Thursday, November 1, 2012

A genetic cartography of humans

The Phase I paper of the 1000 Genomes Project has been published in Nature. Like the completion of the first draft of the human genome sequence, this work constitutes a milestone on the path to understanding the complex relationships between genotype and phenotype in our species. When we had the first human sequence we had, for the first time, a broad view of the genetic constituents of our species, and this has no doubt served to advance our understanding in many fields related to human biology and disease. What, then, is the significance of having 999 more genomes? I have been asked this question by several journalists in recent days.

If one wanted to describe our species purely in genetic terms, a single genome could be a good approximation, but only that: an approximation. We know that we all differ from each other genetically, and that some of these differences explain part of the observable differences (the phenotype). What is the extent and nature of the genetic differences that currently exist, or that existed earlier in the human population? Which of these differences are important in terms of phenotypic variability, including the propensity to suffer from certain diseases? What fraction of these differences have no important effect and can vary freely? None of these questions can be answered from the analysis of a single genome; only the comparison of a large set of genomes gives a better idea of what the genome of our species is.

The analogy of a map has been used several times to illustrate how the genome sequence has helped us navigate it and has enabled dramatic improvements in how we address questions related to human biology. I think the analogy is a good one: a map in itself has only limited scientific value, since it is, basically, a description. However, just as ancient maps dramatically affected the course of history, having such maps enables unanticipated scientific discoveries. These first 1000 genomes (1092, to be exact) constitute a first cartography of human genetic variability, providing detailed information on which mutations occur in different populations. This map is not complete, of course, but it offers a good level of resolution. The authors estimate that we now have a catalogue of more than 98% of the mutations that occur at a frequency of at least 1%. Continuing with the analogy, what we still miss are the specific details of the coastal areas: it is as if we were seeing them from very far away. This missing variability may be important, since variants involved in deleterious phenotypes (disease) are expected to occur at very low frequencies. Thus the effort to improve this cartography will continue, and 1500 additional genomes are planned within the consortium. In parallel, many other projects, and even some private individuals, are producing more individual genome sequences. It will be important to ensure that all this information ends up in public repositories, so that it can be efficiently exploited by the scientific community.

The 1000 Genomes paper is very descriptive, but it already shows some important results that affect how we think about the relationships between genotypes and phenotypes. The authors report that an individual carries on average 200-300 variants that affect conserved residues in non-coding sequences, and even 2-4 variants that have been associated with disease in other studies. All the individuals sequenced are healthy, so this result tells us about the plasticity of the genome: it can tolerate mutations that may be deleterious in other genetic backgrounds. There is much to learn from this, and the 1000 genomes will be a useful resource for studies trying to associate genetic backgrounds with disease propensity. In addition, genome sequences carry the footprints of the recent evolution of human populations, and the level of observable variability at a site can be informative of its potential functionality. Thus the possible applications of these data are many and, as I put it to a journalist, the main scientific discoveries enabled by this article are yet to come.

Finally, there is one important aspect to which journalists do not pay much attention. Putting together this project has been a gigantic effort and has required the development of new tools and algorithms to work with this massive amount of data. Only the coordinated efforts of many groups have made this possible. This comes at a time when such tools are desperately needed, given the growing impact of individual genome sequencing in medicine and other fields. Just as an ambitious mission to bring a rover to Mars impacts scientific development beyond the particular purpose of that mission, the tools developed by the 1000 Genomes Project are already playing a role in hundreds of other genomics projects. Thus the merit of this big consortium project lies not only in the immediate scientific discoveries -at times disappointing because they are inevitably only descriptive- but in their catalytic effect on a scientific field.


Saturday, September 22, 2012

Can genomics save endangered species?

Nowadays genomics is pervading many research fields in biology, and conservation biology is no longer an exception. The Giant panda was perhaps the first organism selected for sequencing primarily because of its status as an endangered species. Since then, other species have been selected for sequencing in an effort to contribute to their conservation. To name a few, the Californian condor, the Tiger, the Tasmanian devil and the Iberian lynx are also entering the genomic era. Our group is contributing to the effort of sequencing and analyzing the Iberian lynx genome, an emblematic predator of our peninsula which has the dubious honor of being the most endangered feline species on the planet. With a population below 400 individuals, a fragmented and restricted distribution area, and a dangerously low level of genetic diversity, its situation is rather critical. Two years ago a consortium of Spanish research groups joined forces to sequence this species' genome.

"Candiles", the sequenced Iberian lynx male

I have been asked many times whether this effort will definitively save the species, or even whether the money would not be better invested in other efforts. How can a genome help save an endangered species? Are we feeding unreasonable expectations about the possible role of genomics in species conservation? Although only time will tell whether such efforts pay off, I consider that genomics can certainly provide a new, very useful angle on species conservation. In any case, genomics should be considered just another tool for species conservation, rather than the definitive solution. Species are endangered for various reasons, mostly territory loss and degradation, overexploitation, and alteration of their ecological networks. Obviously, the main focus should be on fighting the causes that triggered the population drops and creating the conditions necessary for populations to recover safely. As a powerful tool for understanding a species' biology, and as a way to investigate past and current population dynamics, the availability of a genome can greatly help in understanding some of the factors that may have been decisive in a population's decline. Having a reference genome opens the door to closer genetic monitoring of wild populations, not only because it enables the selection of new marker genes that can be sampled in many individuals, but also because it paves the way for obtaining whole-genome-level population data through re-sequencing strategies. Indeed, our project already includes re-sequencing of additional individuals from the main fragmented territories occupied by the species.

Having this kind of data is key to understanding gene flow among the different populations, since it will provide a better picture of their genetic pools. This will help to better plan crosses among captive individuals -mainly those with permanent injuries that cannot be successfully released into the wild- and future releases of their progeny. This will have a direct impact in the case of the Iberian lynx, where high levels of inbreeding and low genetic diversity expose fragmented populations to a higher rate of diseases with a genetic basis (particularly a renal disease) and a reduced potential to overcome infectious diseases. A better knowledge of the genetic pools of both wild and captive populations will undoubtedly help guide strategies for their recovery. In addition, individuals and their territories could be tracked from materials such as faeces or hairs. Other applications may be specific to a particular endangered species: in the Tasmanian devil, for instance, genomics has been used to track a transmissible cancer, spread by biting, that causes a facial tumor disease.

Tasmanian devil with transmissible facial tumor

Other applications of conservation genomics go beyond the sequencing of the endangered species itself, such as the monitoring, using similar genomic tools, of important pathogens or symbionts of endangered species. Of course, all these efforts will be of little help if the causes that drove a species' decline are still around. Still, there is a growing number of promising applications of genomics to the conservation of endangered species, some of them already at work. I expect this field to grow fast in the coming years, and as a concerned scientist I am proud that my particular corner of expertise can contribute to the noble cause of helping to preserve the biodiversity of our planet.

Thursday, June 28, 2012

Wrap-up of the orthology, paralogy, and function symposium at SMBE 2012

I promised some people a short summary of the symposium that Matthew Hahn, Marc Robinson-Rechavi, Iddo Friedberg, and I co-organized at SMBE 2012. I particularly enjoyed the symposium, and the room was pretty full the whole time, despite running in parallel with other interesting topics. I will just write an overall summary without going into too much detail on each of the talks, and at the end I will list a number of papers that were mentioned in the various talks. I should clarify that this informal wrap-up contains only my own views and has not been agreed upon among the organizers. I invite any of the attendees to add comments highlighting important aspects that I may have missed.

I’ll start with a summary of how all this started... which was a rather unusual way, I believe. The idea for the symposium was born in the blogosphere, in Jonathan Eisen’s popular Tree of Life blog, where he invited Matthew Hahn to write a special guest post on the “history behind” his paper on testing the orthology conjecture. One of the conclusions of that paper was that paralogous sequences were more similar in function (and in expression patterns) than orthologs, which contradicted one of the major expectations (and assumptions) behind theories of duplication-driven functional divergence and behind strategies for inferring function from orthologous sequences. That paper had already caused a bit of a turmoil in the orthology community (I remember it was a hot topic of discussion during the last Quest for Orthologs meeting, in Cambridge), and several concerns were being raised about the suitability of comparing functional annotations from different species, and about the conclusions derived in the paper. Rather rapidly, several people commented on Matt’s post and a lively discussion started (more than 40 comments in total!). The discussion was so interesting that Marc Robinson-Rechavi suggested we should bring this scientific debate into the form of a symposium at one of the upcoming conferences, and that is how some of us started to work on this idea. It was the first time I had met the other organizers in person.

The symposium started with Eugene Koonin, who nicely introduced the question of what conjectures could be implied by the definition of orthology, a purely evolutionary one as introduced by Walter Fitch in 1970. He then showed results from his lab indicating that these conjectures tend to hold, but that there may be exceptions. For instance, the conjecture that orthologs should be best reciprocal hits can be broken by accelerated evolution in one of the true orthologs. He then showed work from other groups (Sali, Sonnhammer) on the higher conservation of structure and domain architecture in orthologs as compared to paralogs. He criticized the use of GO terms by Hahn and others and argued that one should look at a variety of data on function to test the conjecture. He presented results from his own group showing higher conservation of expression across species, and concluded that the functional conjecture still holds, although he noted that the differences may not be spectacular. Catherina Gushanski was next, talking about changes in gene expression following segmental duplications in mammals. They have produced an impressive dataset of expression from different tissues in various mammal species. She used that set to ask whether duplication contributes more to divergence than time alone, and showed that levels of expression decrease in younger duplicates and that the changes differ across tissues. She observed no differences between one-to-one orthologs and old duplicate pairs, and also found no differences in tissue specificity between orthologs and paralogs. Next on stage was Nicholas Furnham, who presented new implementations in FunTree that allow exploring functional evolution on trees. He warned that the EC classification is not univocal, which can also cause problems for functional comparisons. They have developed "EC-Blast", which directly measures distances between enzymatic reactions based on the molecular structures of substrates and products. Christophe Dessimoz presented results from his recent paper showing important biases in GO term annotations: genes from the same species and the same families tend to be annotated with more similar terms because of experimental and author biases. When correcting for these biases, the conjecture still holds, although he admitted that the differences, while significant, were not very big. Romain Studer came next. He measured selection and changes in structural stability in orthologs and duplicated genes. He showed that selected sites in paralogs tend to be more clustered in the structure than in orthologs; however, he observed no differences in the evolution of stability between orthologs and paralogs. He concluded that differences between paralogs may be smaller than previously thought.

After the coffee break, Jianzhi Zhang told us about his work toward probing the orthology conjecture. After giving it a try, he gave up on using GO terms because of the many inconsistencies and biases observed. He instead interrogated the conservation of protein-protein interactions, using experimentally determined interactions in various yeast species. Unfortunately, the many interactions that would need to be tested experimentally for duplicated proteins prevented him from showing a comparison of orthologs and paralogs in this talk. Nevertheless, he found that all PPIs tested for orthologs were conserved; even those that seemed not to be turned out to reflect possible errors in previous large-scale yeast two-hybrid experiments. Alex Nguyen also showed results on budding yeast gene duplications. They focused on a more specific aspect of function: the presence of short conserved linear motifs in proteins. They found that these were more likely to disappear or diverge after the duplication event, consistent with neo- or sub-functionalization models. We moved to Drosophila with our next speaker, Lev Yampolsky, who exploited expression and genomic data from the 12 Drosophila genomes. They showed larger differences in paralogs than in orthologs in rates of divergence, which were also more asymmetrical, and found that these differences varied between fast- and slow-evolving families. Finally, they also found larger differences in paralogs in terms of expression. Then it was my turn, and I mainly showed our results on the comparison of expression patterns in human and mouse. Our experimental design differs from others in that we use topological dating (not sequence divergence) to establish orthologs and paralogs of a similar age, and in that we always compared orthologs to inter-species paralogs, to remove species-specific biases from the comparisons. Our results support a larger divergence of paralogs than of orthologs in tissue expression patterns. Thanks to our experimental design we could also determine that most of the differences between paralogs were gained shortly after the duplication, linking the duplication event to a large fraction of the divergence. Our last speaker was Paul Thomas, who gave an overview of what you can and cannot expect from GO annotations. He also showed progress on how the consortium is trying to model functional evolution through gene families, and how these models can help in studying the relationship between orthology, paralogy and gene function.

Thus we had a diverse set of talks, most of them focusing on comparisons of different aspects of functional evolution (GO annotations, expression, functional motifs, interactions, divergence, structure) and using varying experimental designs and species. I would say one of the main conclusions is that GO (and even EC number) annotations can be misleading in our assessment of functional evolution. My personal view is that most talks showed results consistent with the conjecture, although the differences between paralogs and orthologs were sometimes small. Function can be described at multiple levels, and I would expect functional divergence after duplication to affect only one or a few of these. Thus an experimental design that focuses on one such level can be expected to miss divergence at the other ones, while designs that average over all levels will inevitably dilute small but important aspects of functional divergence. In conclusion, this is an exciting topic and, with the number and variety of groups now interested in it, I am sure we will come closer and closer to understanding the complex relationships between orthology, paralogy and functional divergence.

Some links and papers mentioned during the symposium (I have probably missed some):

Abstracts from oral presentations in SMBE, including our symposium http://imgpublic.mci-group.com/ie/PCO/OralAbstracts_Final.pdf

Another post on the orthology conjecture 

Announcement of our symposium

FunTree: a resource for exploring the functional evolution of
structurally defined enzyme superfamilies.
Furnham N, Sillitoe I, Holliday GL, Cuff AL, Rahman SA, Laskowski RA,
Orengo CA, Thornton JM.
Nucleic Acids Res. 2012 Jan;40(Database issue):D776-82

Brawand, D. et al. The evolution of gene expression levels in mammalian organs. URL

Forslund et al. Domain architecture conservation in orthologs

Huerta-Cepas and Gabaldón. Assigning duplication events to relative temporal scales in genome-wide studies.

Nehrt et al. Testing the Ortholog Conjecture with Comparative Functional Genomic Data from Mammals http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002073

Nguyen et al. Proteome-Wide Discovery of Evolutionary Conserved Sequences in Disordered Regions http://stke.sciencemag.org/cgi/content/abstract/sigtrans;5/215/rs1
Peterson et al. Evolutionary constraints on structural similarity in orthologs and paralogs

Thomas et al. On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report

Large-scale analysis of orthologs and paralogs under covarion-like and
constant-but-different models of amino acid evolution.
Studer RA, Robinson-Rechavi M.
Mol Biol Evol. 2010 Nov;27(11):2618-27.

How confident can we be that orthologs are similar, but paralogs differ?
Studer RA, Robinson-Rechavi M.
Trends Genet. 2009 May;25(5):210-6.

Pervasive positive selection on duplicated and nonduplicated vertebrate
protein coding genes.
Studer RA, Penel S, Duret L, Robinson-Rechavi M.
Genome Res. 2008 Sep;18(9):1393-402.

Friday, June 1, 2012

Publicly available or not?

I have always had the naive understanding that databases such as GenBank are public, and that one is free to do research on data accessed from there and eventually publish the results. However, nothing is quite that simple, since many of the genomes deposited there have not been published yet. I have myself experienced, and heard from many colleagues about, problematic situations regarding the use of genome data taken from public databases but not yet published. Current guidelines are open to different interpretations, and different stakeholders (editors, reviewers, users, data producers) may have entirely different and conflicting views. With the current trend we will soon have more unpublished than published genomes in public databases, so I think it is worth re-assessing the policies. Here I share some views.

Policy guidelines regarding the use of genomic sequences prior to publication are available (see the NHGRI rapid data release policy http://www.genome.gov/10506376), and set reasonable rules: for instance, that data producers should deposit the data publicly and should produce a citable paper describing the source of the data within a short period of time. This could precede a full genome paper in which a more thorough analysis is presented. Users should not take the public data to publish an analysis focused on that genome, but this situation should not be prolonged too long. The underlying idea is to reserve for the researchers who make the effort of sequencing, assembling, and annotating a genome the opportunity to describe its main characteristics and findings, while ensuring that the data serves the advancement of science by allowing other groups to perform research on it as soon as it is produced. However, there are many interpretations of which uses of the data should be allowed. Moreover, although indicative time-frames for the preferential exploitation of the data are given (e.g. 6 months), these are only indications. In the absence of clear-cut rules, the situation invites conflict. With the current flow of sequencing data, we will increasingly face the situation that data produced for public use and accessible through public databases is not associated with a paper, making it unclear whether its use requires permission. In such situations there may be different interpretations of the existing rules: that of the leader of the sequencing project, that of the researcher accessing the data, that of the agency that financed the sequencing, and even those of the editors and reviewers of papers using available but unpublished data. Below I list some undesirable situations that highlight the contradictions of the current system.
These situations are not hypothetical, but correspond to real cases that I have experienced or heard about from colleagues.

  • Users of public databases may inadvertently download unpublished data, especially when they do this at large scale. After all, they are using a public repository, and it is contradictory that public databases provide data that are not usable.
  • Most genome sequencing projects are financed with public money or by agencies that require the data to be made publicly available as soon as it is produced, but this leads to the situation above, making it difficult for sequencing project leaders to know what use is being made of their data.
  • Referees may specifically ask authors to use genomes that are in databases, or simply reject a paper because it does not use this or that “publicly available” genome in the comparative analyses. In addition referees or editors may ask for evidence of a specific permission to use unpublished data.
  • Authors wishing to ask permission to use an unpublished genome may be required to explain the exact use of the data, which exposes their ideas to potential direct competitors.
  • Leaders of genome projects may feel entitled to ask for authorship in exchange for data that is available in public databases.
  • Leaders of genome projects may intentionally delay the publication of the genome paper to extend the period of preferential use. They may even decide to publish partial analyses before the genome paper.
  • Some unpublished genomes have been in public databases for several years, and different interpretations are still possible as to whether these data can be freely used.
  • Some genomes may never be published in the form of a genome paper, because they were sequenced with a very particular purpose.
In my opinion the current situation is too ambiguous, generates conflicts and ultimately jeopardizes the advance of science. We need clear rules rather than guidelines, and below I propose four simple rules that would simplify the process.

  • Granting agencies and sequencing centers should specify a reasonable time-frame for preferential use (6-12 months) before the data is released. This should suffice to give the upper hand to the research team doing the sequencing effort, but will also force them to focus on publishing a genome paper as soon as possible.
  • During this period, sequencing projects may announce the availability of the data for restricted use, through a specific repository that can be accessed only after a specific permission is granted. This will enable use of the data from time 0.
  • Data is released to the public repositories (at least in the form of bulk download) only after that period.
  • All data in public repositories should thus be free to use for any purpose, regardless of whether a genome paper has been published.

Personally, as far as my lab's activity is concerned, I have decided that we will use any data that has been publicly deposited in GenBank for more than a year, for any purpose other than a “genome paper” (of course!). I think this is in perfect agreement with the NHGRI recommendations and will definitely save us time and worries.

Sunday, March 25, 2012

Challenges in phylogenetic tree visualization

I recently read an excellent review by Roderic Page on the challenges in phylogenetic tree representation and visualization. It provides an overview of existing software and tools (although he missed our ETE package; see the image below for an example of ETE's visualization features). The number and diversity of existing tools is overwhelming, but probably matches the diversity of interests and possible applications of phylogenetic trees. One researcher may be interested in overlaying sequence information (see below), while another may want to display information on the geographical distribution of the species. Some may need to represent uncertainty and overlay different topologies, or draw networks to represent transfers of genetic material; the possibilities are unlimited.

Most importantly, he mentions some of the challenges facing tree visualization software, such as the ability to represent huge trees and to allow interactive behavior with the user. In our group we have encountered such needs, and this is the reason behind implementing more visualization features in ETE. Fortunately, new technologies are offering new opportunities as well, and I enjoyed imagining the possibilities that 3D visualization and touchscreen technologies will provide to researchers. It is definitely a field to follow.
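At its core, any tree viewer (ETE included) must parse a tree description and perform a recursive layout. As a toy illustration of what that involves (a minimal sketch in plain Python, not ETE's actual API), the following parses a simple Newick string and prints an ASCII rendering:

```python
# Toy sketch: parse a minimal Newick string (labels only, no branch
# lengths) and print an ASCII tree. Illustrates the recursive layout
# step common to all tree viewers; node names here are made up.

def parse_newick(s):
    """Parse a Newick string into nested (name, children) tuples."""
    pos = 0
    def node():
        nonlocal pos
        children = []
        if s[pos] == '(':
            pos += 1                      # consume '('
            children.append(node())
            while s[pos] == ',':
                pos += 1                  # consume ',' between siblings
                children.append(node())
            pos += 1                      # consume ')'
        start = pos                       # read the (possibly empty) label
        while pos < len(s) and s[pos] not in '(),;':
            pos += 1
        return (s[start:pos], children)
    return node()

def ascii_tree(node):
    """Return a list of text lines drawing the tree, one node per line."""
    name, children = node
    lines = [name or "*"]
    for i, child in enumerate(children):
        last = (i == len(children) - 1)
        branch = "`-" if last else "|-"   # connector for the child itself
        pad = "  " if last else "| "      # continuation under the child
        sub = ascii_tree(child)
        lines.append(branch + sub[0])
        lines.extend(pad + l for l in sub[1:])
    return lines

tree = parse_newick("((Human,Chimp)Hominini,(Mouse,Rat)Murinae)Root;")
print("\n".join(ascii_tree(tree)))
# Root
# |-Hominini
# | |-Human
# | `-Chimp
# `-Murinae
#   |-Mouse
#   `-Rat
```

Real toolkits add branch-length scaling, circular layouts and interactive zooming on top of exactly this kind of recursion, which is where the scalability challenges Page discusses come in.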

If you are interested in the topic, I recommend this video.

Saturday, March 10, 2012

Open Letter for Research in Spain

As you surely have heard, Spain is facing a serious crisis in the context of a globalized market economy (yes, there used to be a time when economic crises related to something more tangible, such as a serious drought or a plague, but now one can only blame abstract flows of financial speculation). The new government is preparing a new budget which is predicted to include the most dramatic cuts in our history. Researchers here, who have already been hit by previous cuts (see this letter), are now bracing for the worst.

In this context, an open letter has been put together by the Confederation of Spanish Scientific Societies, the Federation of Young Researchers and others. I recommend reading it (some of the cited figures and data are very revealing), and if you agree with it, signing it, as I just did.

 Open letter for research in Spain.