We recently published a study showing that protobacterial derived proteins in the Last Eukaryotic Common Ancestor (LECA) show a tendency to have shorter phylogenetic distances to their bacterial counterparts as compared to LECA proteins originating from other Bacteria or Archaea. We interpreted this as evidence suggesting a late acquisition of mitochondria by a host which already contained bacterial and archeal-derived protein families. Our work has been heavily criticized by William Martin -one of the main proponents of mito-early hypotheses- and colleagues. The critic was first submitted to Nature, reviewed by editors and independent reviewers and eventually rejected. The authors have decided to publish a slightly modified version of the letter in BioRxiv. In my opinion the tone of the letter is unacceptable for an open scientific discussion. In any case the bottom line is that their arguments do not support the claim that our results are artefactual, nor they show in which way the purported artefact produces the observed trend. For the sake of scientific discussion we have decided to publish our original response to their letter. We tried to post it in BioRxiv but it was declined because "

*is a rebuttal to a criticism not a research paper*". Therefore I have decided to post it here.
Martin
et. al. criticize several methodological aspects of our study.
We
first want to note that none of the points raised affect the core of
our conclusions -

*i.e*. that differences in stem lengths relate to phylogenetic origin of LECA families so that they are shorter in bacterial, and particularly alpha-proteobacterial derived families- because the observed relationships i) are independent of the clustering performed in Figure 1 of Pittis and Gabaldón (2016), and ii) their criticism focuses on one single comparison of a single dataset but the differences are present across several datasets and approaches, including the very same dataset from the authors mentioned in their letter (Ku et. al. 2015), as we show below. Secondly, their interpretation of our stem length measurement and how they extrapolate to branches sub-tending eukaryotic clades is conceptually flawed, as we also demonstrate below. Thus none of their arguments compromise at any rate the main conclusions of our article. We nevertheless want to discuss their points.
Contrary
to what Martin et al. claim we do not assume a normal distribution of
the global distribution of stem lengths. The claim that our
statistical analyses are inappropriate is simply not true, we clearly
explain all the methods used, and the tests performed to support
observed differences are all nonparametric, without any assumption of
normality. In Figure 1 we did use a probabilistic clustering method
that fits a Gaussian mixture model, a mixture of normal
distributions, assuming multimodality in the data. Martin et. al.
show that a unimodal log-normal distribution would better fit the
data when the number of parameters is penalized. Does this
demonstrates that the underlying distribution is not a composite of
five gaussians?
No,
because when data are drawn from a five gaussian distributions with
the obtained parameters, in 81% of the cases a log-normal
distribution would be (wrongly) preferred using the BIC criterion.
Also, the fact that any randomly sampled log-normal distribution
could be fitted by a mixture model is by no means a surprise. In fact
any distribution of data could be fitted by a finite number of
mixture components, and this is precisely why these mixture models
are commonly used as universal function approximators and as a tool
to partition various kinds of data. Finally the definition of
overfitting is not BIC inflation but the lack of predictive power.
Thus other parameters have to be considered when assessing whether a
model provides a reasonable representation of the data. The use of
the EM algorithm is justified as a method for partitioning the data
because i) we may expect composite of signals from a proteome (LECA)
with at least two ancestral components (Archaeal host, and bacterial
endosymbiont), and ii) prior studies have suggested that normalized
branch lengths measurements as the ones used here to be approximately
normal (Rasmussen and Kellis, 2007). The assumtion of a unimodal
distribution such as the one proposed by Martin et. al. does not
capture the expected mixture origins for a chimeric proteome and does
not fit with the observation that differences in stem lengths relate
to non-homogeneous phylogenetic origins. In any case our results are
independent of this clustering exercise as the differences in stem
lengths are apparent when simply grouping the LECA families according
to their sister clades (Fig. 2 and Extended Data Fig. 1b of Pittis
and Gabaldón, 2016), or when using other forms of clustering the
data such as equal binning (results not shown).

Their
purported extrapolation of our analyses to eukaryotic clades and
their derived dates is totally flawed and misleading. First of all,
we explicitly say that we do not assume constant rates (

*i.e*. molecular clock), and our normalized branch length is a measurement that is proportional to time but multiplied by a ratio between the rate preceding and postdating LECA, so their timing exercise, providing date estimates, is completely ungrounded. Secondly, Martin et al. consider the normalized*sl*to yield arbitrary values, resulting in a log-normal distribution. This openly contradicts the observation that families of different prokaryotic origins show significant differences in*sl*and also*rsl*values. All our analyses robustly prove the opposite, there are differences and these differences reflect the relative divergence times. The cases of the cyanobacterial signal in Archaeplastida (Extended Data Fig. 3, Pittis and Gabaldón 2016) and of Lokiarchaeota signal in LECA (Extended Data Fig. 7, Pittis and Gabaldón 2016) nicely indicate the validity of the measurement. Expecting some extreme*ebl*values to reflect radical adaptations and fast rates of some lineages, we used the median because of its robustness with respect to extreme outliers (see Methods). We also tried not accounting for fast evolving taxonomic groups in the calculations, without any change in our main results. All these observations are not explained by the interpretation of the data provided by Martin. et. al. Furthermore, Martin et. al. show that the normalized branch lengths sub-tending each eukaryotic clade follow log-normal distributions, and conclude that this observation demonstrates that this is natural variation for branches meant to represent a single time interval (e.g. divergence of fungi from metazoans). By adopting this assumption they are surprisingly ignoring that eukaryotic families are also subject to differential gene loss and other processes, which would result in multiple underlying patterns of the sub-tending branches (i.e. the sub-tending branch of a fungal family, which was lost in metazoans does not derive from the divergence between fungi and metazoans, but from the deeper divergence of fungi and other unikonts). This becomes apparent when controlling for the relationship of the normalized branch lengths with the phylogenetic affiliation of the sister branch -a key step in our analyses which they ignore. Indeed applying to the eukaryotic clades an EM-based clustering and measuring enrichments in phylogenetic affiliations as we did in our previous analysis (Pittis and Gabaldón, 2016) reveals major underlying distributions related with the nature of the sister group (Figure 1). Thus, in this case also, the variation of*sl*values, interpreted by the authors as “vividly documenting abundant branch length variation”, is clearly shown to naturally carry the signal of different divergence times. So yes, the sl values in eukaryotic groups do**imply phases of early and late divergence times due to gene loss or other biological events, as they do in the case of LECA. Of note this is a new, independent demonstration that variation in stem lengths relate with underlying variation in phylogenetic distribution, and provides additional support to our approach.**
Figure | Ascomycota stem length analysis. Different phylogenetic sister groups show

significant differences in stem lengths according to their divergence times from Ascomycota.

Gene losses in the sister group lineage can explain the alternative tree topologies and

differences in estimated stem lengths.

significant differences in stem lengths according to their divergence times from Ascomycota.

Gene losses in the sister group lineage can explain the alternative tree topologies and

differences in estimated stem lengths.

Finally,
Martin et. al. Focus their criticism in only one of our comparisons
and on only one of the datasets used. For that dataset, they wrongly
claim that we reused eukaryotic sequences in the different tree. This
is false. Given the multidomain nature of eukaryotic protein
sequences, the source of that dataset (Powell et. al. 2014) may
incorporate a given protein to more than one orthologous cluster.
However we made sure we only used the orthologous sequence regions in
a given analysis, thus never re-using a given eukaryotic sequence.
Our analyses use standard filtering approaches but they claim that
statistical significance for one of our comparisons
(alpha-proteobacterial to other bacteria) is lost when applying
additional ad hoc filtering on top of our previous filtering steps.
We must note that even applying their filterings and using a
permutation test as the one used in our paper, the
alpha-proteobacterial sl values, remain significantly lower compared
to other bacteria (P=1e-2, accounting only for families with
eukaryotic sequence lengths >= 100 and P=3.7e-2, accounting only
for alignments with gaps <= 50%, 10

^{6}permutations). The loss of significance in some of the tests when artificially reducing the data is unsurprising. We are focusing on very ancient events and the signal we are measuring must be necessarily weak, and the number of LECA families that can be traced back to specific ancestries is limited. Indeed the statistical significance using a Mann-Whitney U-test is often lost (>60%-70% of the times) when randomly reducing the data to sizes similar to the resulting sizes in their filtered dataset, which suggest that the mere effect of reducing the size, rather than the particular additional filtering used is having a major effect. This is why we made sure the signal was robust across different datasets, always using state of the art filtering approaches. Given the suggestion by Martin et. al. that a recent phylogenetic analyses from them (which appeared after we had submitted the paper) represents a more careful dataset (Ku et al., 2015), we repeated our analyses using this dataset, which confirmed our results (650 eukaryotic clades, Archaeal vs Bacterial families,*P*=1.2e-41, two-tailed Mann-Whitney*U*-test and α-proteobacterial families’ sl significantly smaller within Bacterial,*P*=4.7e-2, permutation test, 10^{6}permutations). Again, this result lends further support to our findings.
Altogether, we show that the
criticisms raised by Martin et. al. do not compromise the main results
and conclusions of our paper. Furthermore, we would like to stress
that the new dataset and analyses brought about by this discussion
lend additional support to our approach and conclusions.

- Ku, C. et al. Endosymbiotic origin and differential loss of eukaryotic genes.
*Nature***524**, 427–432 (2015). - Rasmussen, M. D. & Kellis, M. Accurate gene-tree reconstruction by learning gene- and species-specific substitution rates across multiple complete genomes.
*Genome Res*.**17**, 1932–42 (2007). - Pittis, A. A. & Gabaldón, T. Late acquisition of mitochondria by a host with chimaeric prokaryotic ancestry.
*Nature***531**, 101–4 (2016). - Powell et. al. eggNOG v4.0: nested orthology inference across 3686 organisms.
*Nucleic Acids. Res.*42(Database issue):D231-9

Martin et al. model has 2 parameters. Yours has 14 parameters. The 2-parameter model fits the data better than the 14-parameter one. End of story. That's how science works: the most parsimonious model wins. The rest of your so-called rebuttal is merely sophism, similar in principle to medieval arguments on the number of strung angels one can thread through the eye of a needle. Sorry.

ReplyDeleteDan Graur

Sorry, but we are not modelling, nor using the estimated parameters. EM algorithm is used for clustering of the data. Same trend is observed when using equal-size binning or simply using the phylogenetic affiliation.

DeleteActually, by fitting your data to overlapping Gaussian distributions, you ARE modeling. The expectation–maximization (EM) algorithm does not guarantee the best fit. It is a heuristic. The fact IS that a simpler model with considerable fewer parameters was found that fits the data better according to the Akaike information criterion. You attribute the EM algorithm a power that it does not have.

DeleteI also should also say something about your making a big fuss about Martin's et al. rebuttal being rejected by Nature. First, the rebuttal was most probably sent to the same people who accepted your paper in the first place. Second, Nature only rarely publishes rebuttals. Since rebuttals are counted as papers, they tend to lower the citation index by lowering the quotation rate of bit the criticized article and the rebuttal.

Dan Graur

Answering the question of 'how many clusters are there' is notoriously difficult and AIC, BIC and other model fitting criteria can often give different results. Furthermore, there are a large number of possible parametric distributions to try...There won't be a simple answer to this problem. In my view, the mixture modeling part of this is not central to the argument. Pittis and Gabaldon showed that the branch leading to Archaea for archaeal origin proteins tends to be systematically larger than the branch leading to alpha-proteobacteria for alpha-origin proteins. This result holds regardless of anything that anyone has said. The question is why is this so.

DeleteA modified version of the response has been finally published as a preprint in BioRxive:

ReplyDeletehttp://biorxiv.org/content/early/2016/07/20/064873