I have always had the
naive understanding that databases such as GenBank were public, and
that one was free to do research on data accessed from there, and
eventually publish the results. However, things are not as simple
as that, since many of the genomes deposited there have not been
published yet. I have experienced myself, and heard about from many
colleagues, problematic situations regarding the use of genome data
taken from public databases but not yet published. Current
guidelines are open to different interpretations, and different
stakeholders (editors, reviewers, users, data producers) may have
entirely different and conflicting views. With the current trend we
will soon have more unpublished than published genomes in public
databases, so I think it is worth re-assessing the policies. Here I
share some views.
Policy guidelines
regarding the use of genomic sequences prior to publication are
available (see NHGRI rapid data release policy
http://www.genome.gov/10506376), and set reasonable rules. For
instance, data producers should deposit the data publicly and
should produce a citable paper for the source of the data within a
short period of time. This could precede a full genome paper in which
a more thorough analysis is presented. Users should not take the
public data to publish an analysis focused on that genome. But this
situation should not be prolonged too much. The underlying idea is to
reserve the opportunity to describe the main characteristics and
findings for the researchers who make the effort of sequencing,
assembling, and annotating a genome, while ensuring that the data
serves the advancement of science by allowing other groups to perform
research on the genome data as soon as it is produced. However, there
are many interpretations of which uses of the data should be
allowed. Moreover, although indicative time-frames for the
preferential exploitation of the data are given (e.g. 6 months),
these are only indications. In the absence of clear-cut rules, the
situation invites conflict. With the current flow of
sequencing data, we will increasingly face the situation that data
produced for public use and accessible through public databases is
not associated with a paper, and it is thus unclear whether its use
requires permission. In such situations there may be different
interpretations of the existing rules: that of the leader of the
sequencing project, that of the researcher who is accessing the
data, that of the agency that financed the sequencing, and even that
of the editors and reviewers of papers using available but
unpublished data. Below I list some undesirable situations that
highlight the contradictions of the current system. These situations
are not hypothetical but rather correspond to real cases that I
experienced or heard about from colleagues.
- Users of public databases may inadvertently download unpublished data, especially when they do this at large scale. After all, they are using a public repository, and it is contradictory that public databases provide data that are not usable.
- Most genome sequencing projects are financed using public money or by agencies that require that the data be made publicly available as soon as it is produced, but this leads to the situation above, making it difficult for sequencing project leaders to know what use is being made of their data.
- Referees may specifically ask authors to use genomes that are in databases, or simply reject a paper because it does not use this or that “publicly available” genome in the comparative analyses. In addition referees or editors may ask for evidence of a specific permission to use unpublished data.
- Authors willing to ask for permission to use an unpublished genome may be required to explain the exact use of the data, which exposes their ideas to possible direct competitors.
- Leaders of genome projects may feel entitled to ask for authorship in exchange for data that is available in public databases.
- Leaders of genome projects may intentionally delay the publication of the genome paper to extend the period of preferential use. They may even decide to publish partial analyses before the genome paper.
- Some unpublished genomes have been in public databases for several years, and still different interpretations are possible of whether these data can be freely used.
- Some genomes may never be published in the form of a genome paper, because they were sequenced with a very particular purpose.
In my opinion the current
situation is too ambiguous, generates conflicts and ultimately
jeopardizes the advance of science. We need clear rules, rather than
guidelines, and below I propose four simple rules that would simplify
the process.
- Granting agencies and sequencing centers should specify a reasonable time-frame for preferential use (6-12 months) before the data is released. This should suffice to give the upper hand to the research team that is making the sequencing effort, but will also force them to focus on publishing a genome paper as soon as possible.
- During this period, sequencing projects may announce the availability of the data for restricted use, through a specific repository that can be accessed only after a specific permission is granted. This will enable use of the data from time 0.
- Data is released to the public repositories (at least in the form of bulk download) only after that period.
- All data in public repositories should thus be free to be used for any purpose, regardless of whether a genome paper is published.
Personally, as far as the
activity in my lab is concerned, I have decided that we will
use any data that has been publicly deposited in GenBank for more than a year, for
any purpose other than doing a “genome paper” (of course!). I think this is in
perfect agreement with the NHGRI recommendations and will definitely
save us time and worries.
Thanks for an interesting post - this was an issue I came across previously but had not given much consideration to.
Thanks for the post.
As you mentioned, this situation is somehow becoming the standard for many studies using complete-genome data. So, I think it's time to sit down as a community and discuss new guidelines for producing, making publicly available and using this data.
With sequencing technologies getting cheaper, faster and more accurate, it will no longer make sense to publish a paper about the sequencing of a given genome, although there are exceptions. It will be valuable to publish a genome as a primary source of data for carrying out different analyses, but we can't stop scientific progress until the works including newly sequenced genomes get published, especially in those cases where public agencies have funded the project.
Actually, I like the idea of promoting a revisit and update of the Fort Lauderdale Agreement from 2003 on sequencing projects for its 10th anniversary.
Good idea Wess. The situation has changed so much in ten years that the time is ripe for revisiting it. The 10th anniversary is next year. Who should initiate such a process?