Friday, June 1, 2012

Publicly available or not?

I have always had the naive understanding that databases such as GenBank were public, and that one was free to do research on data accessed from there and eventually publish the results. However, nothing seems to be as simple as that, since many of the genomes deposited there have not been published yet. I have experienced, and heard from many colleagues about, problematic situations regarding the use of genome data taken from public databases but not yet published. Current guidelines are open to different interpretations, and different stakeholders (editors, reviewers, users, data producers) may have entirely different and conflicting views. With the current trend we will soon have more unpublished than published genomes in public databases, so I think it is worth reassessing the policies. Here I share some views.

Policy guidelines regarding the use of genomic sequences prior to publication are available (see the NHGRI rapid data release policy) and set reasonable rules. For instance, data producers should deposit the data publicly and should produce a citable paper describing the source of the data within a short period of time. This could precede a full genome paper in which a more thorough analysis is presented. Users should not take the public data to publish an analysis focused on that genome, but this restriction should not be prolonged too much. The underlying idea is to reserve the opportunity to describe the main characteristics and findings for the researchers who make the effort of sequencing, assembling, and annotating a genome, while ensuring that the data serve the advancement of science by allowing other groups to perform research on the genome data as soon as they are produced. However, there are many interpretations of which uses of the data should be allowed. Moreover, although indicative time-frames for the preferential exploitation of the data are given (e.g. 6 months), these are only indications. In the absence of clear-cut rules, the situation invites conflict. With the current flow of sequencing data, we will increasingly face the situation that data produced for public use and accessible through public databases are not associated with a paper, so it is unclear whether their use requires permission. In such situations there may be different interpretations of the existing rules: that of the leader of the sequencing project, that of the researcher accessing the data, that of the agency that financed the sequencing, and even that of the editors and reviewers of papers using available but unpublished data. Below I list some undesirable situations that highlight the contradictions of the current system. These situations are not hypothetical but correspond to real cases that I experienced or heard about from colleagues.

  • Users of public databases may inadvertently download unpublished data, especially when they do this at large scale. After all, they are using a public repository, and it is contradictory that public databases provide data that are not usable.
  • Most genome sequencing projects are financed with public money or by agencies that require the data to be made publicly available as soon as they are produced, but this leads to the situation above, making it difficult for sequencing project leaders to know what use is being made of their data.
  • Referees may specifically ask authors to use genomes that are in databases, or simply reject a paper because it does not use this or that “publicly available” genome in the comparative analyses. In addition referees or editors may ask for evidence of a specific permission to use unpublished data.
  • Authors who ask for permission to use an unpublished genome may be required to explain exactly how they will use the data, which exposes their ideas to possible direct competitors.
  • Leaders of genome projects may feel entitled to ask for authorship in exchange for data that are available in public databases.
  • Leaders of genome projects may intentionally delay the publication of the genome paper to extend the period of preferential use. They may even decide to publish partial analyses before the genome paper.
  • Some unpublished genomes have been in public databases for several years, and different interpretations are still possible as to whether these data can be freely used.
  • Some genomes may never be published in the form of a genome paper, because they were sequenced with a very particular purpose.
In my opinion the current situation is too ambiguous, generates conflicts and ultimately jeopardizes the advance of science. We need clear rules, rather than guidelines, and below I propose four simple rules that would simplify the process.

  • Granting agencies and sequencing centers should specify a reasonable time-frame for preferential use (6-12 months) before the data are released. This should suffice to give the upper hand to the research team that is making the sequencing effort, but will also force them to focus on publishing a genome paper as soon as possible.
  • During this period, sequencing projects may announce the availability of the data for restricted use, through a specific repository that can be accessed only after a specific permission is granted. This will enable use of the data from time 0.
  • Data is released to the public repositories (at least in the form of bulk download) only after that period.
  • All data in public repositories should thus be free to be used for any purpose, regardless of whether a genome paper has been published.

Personally, as far as the activity in my lab is concerned, I have decided that we will use any data that have been publicly deposited in GenBank for more than a year, for any purpose other than doing a “genome paper” (of course!). I think this is in perfect agreement with the NHGRI recommendations and will definitely save us time and worries.
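
For anyone who wants to apply a similar one-year rule when downloading genomes in bulk, here is a minimal sketch in Python. It assumes you have already obtained a public deposit date for each record by some means; the function name and the record labels are hypothetical and only serve as illustration.

```python
from datetime import date


def public_for_over_a_year(deposit_date: date, today: date) -> bool:
    """Return True if a record has been public for more than one year.

    `deposit_date` is assumed to be the date the record became publicly
    available; how that date is obtained is outside this sketch.
    """
    try:
        cutoff = today.replace(year=today.year - 1)
    except ValueError:
        # `today` is 29 February and the previous year has no such day.
        cutoff = today.replace(year=today.year - 1, day=28)
    return deposit_date < cutoff


# Hypothetical records: the labels and dates are made up for illustration.
records = {
    "record_A": date(2010, 3, 15),
    "record_B": date(2012, 1, 10),
}
today = date(2012, 6, 1)
usable = [name for name, deposited in records.items()
          if public_for_over_a_year(deposited, today)]
```

With these made-up dates only `record_A` passes the filter. A record deposited exactly one year ago is not yet counted as "more than a year".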


  1. Thanks for an interesting post - this was an issue I came across previously but had not given much consideration to.

  2. Thanks for the post.

    As you mentioned, this situation is somehow becoming the standard for many studies using complete-genome data. So, I think it's time to sit down as a community and discuss new guidelines for producing, making publicly available, and using these data.

    With sequencing technologies getting cheaper, faster and more accurate, it will no longer make sense to publish a paper about the sequencing of a given genome, although there are exceptions. It will be valuable to publish a genome as a primary source of data for carrying out different analyses, but we can't stop scientific progress until papers including newly sequenced genomes get published, especially in those cases where public agencies have funded the project.

    Actually, I like the idea of promoting a revisit and update of the 2003 Fort Lauderdale Agreement on sequencing projects for its 10th anniversary.

    1. Good idea Wess. The situation has changed so much in ten years that the time is ripe for revisiting it. The 10th anniversary is next year. Who should initiate such a process?