Biologists often face the following problem: Given a set of taxa and
information about them-which can be morphological characters, corresponding
segments of biomolecular sequences, gene order in their whole genomes, etc. - ,
find the true phylogenetic tree.
Note that not every physical process allows to reconstruct its history,
therefore it is good luck if phylogeny can be reconstructed at all. The choice
of data and methods for phylogeny reconstruction determine the output, although
reasonable choices give similar results. I will discuss the sources of
disagreement about the available methods. If data type and method are agreed on,
we face a large-scale computing problem that has two complexity issues: the
amount of data it needs and the usual computational complexity.
Analytic investigation of the methods is made possible by models of
biomolecular sequence evolution. These models involve randomness.
A widely-studied model for generating binary sequences is to `evolve'
them on a tree according to a symmetric Markov process. This model is
known as the Cavender-Farris-Neyman model in phylogeny, and the
`symmetric binary channel' or the `symmetric 2-state Poisson model'
in other areas. The CFN model provides a simple model for the evolution
of purine-pyrimidine sequences. The significance of this simple model is,
that phenomena shown for the CFN model often extend to more realistic
models of sequence evolution.
The abstract phylogeny reconstruction problem is telling the true
(model) CFN tree from the generated sequences with high probability (whp).
We show that under the CFN model distinguishing the true tree from a false one whp,
using sequences generated on the true tree, is substantially `easier'
(in terms of the sequence length needed) than determining the true tree whp.