When I was taking my first steps in the field of comparative genomics, there was not much to think about when deciding which genomic datasets to use: one would just take them all. With only a few dozen genomes, mostly bacterial, one could have everything at hand on the local disk, and just needed to update it every couple of months by adding one or two more...
Those times have definitely passed, and now the flow of newly sequenced genomes is... well, overwhelming (see figure below, taken from Genomes Online). This is both a blessing and a curse for those of us doing comparative genomics: we have an unprecedented amount of data, which enables more resolution, but we are increasingly facing novel technical and analytical challenges.
Just to give a taste of the coming avalanche of genomes from different species (projects that sequence many genomes of a single species, such as the 1000 Genomes Project, are another story), here I list some of the projects I am aware of that aim at sequencing thousands of genomes from a given taxonomic group.
- i5K: 5000 arthropod genomes
- 1000 fungal genomes
- genome 10K: 10000 vertebrate genomes
- 10000 microbial genome project
As expected, in this kind of project it is far easier to come up with a bold number than to define the list of species that are actually going to be sequenced. At least this is what I can tell from my involvement in the i5K initiative, in which prioritising the species to be sequenced is not simple, since one usually wants to weigh different criteria (phylogenetic relevance; biological, economic, and clinical importance; etc.).
I'm sure I missed some, and, in addition, there is a growing flow of genomes sequenced by independent groups, including my own modest group. One common weakness of these large- and small-scale initiatives is that they sometimes come with funding to cover the genome sequencing but do not account for the bioinformatics analyses needed to actually make sense of the data. With sequencing costs dropping and the potential analyses becoming more complex, the actual costs of sequencing projects will increasingly lie in the analysis that goes beyond the assembly and annotation phases. As a result, many bioinformatics groups are stretching their resources to contribute to genomics projects without getting any specific funding.
In my opinion, the planning of a sequencing project should account for all the downstream phases and their associated costs. With such an approach we may end up with a handful fewer genomes, but we will definitely learn more from them.