László Székely
((Mathematics Department, University of South Carolina)

Complexities of the Phylogeny Reconstruction Problem

Biologists often face the following problem: Given a set of taxa and information about them-which can be morphological characters, corresponding segments of biomolecular sequences, gene order in their whole genomes, etc. - , find the true phylogenetic tree.

Note that not every physical process allows to reconstruct its history, therefore it is good luck if phylogeny can be reconstructed at all. The choice of data and methods for phylogeny reconstruction determine the output, although reasonable choices give similar results. I will discuss the sources of disagreement about the available methods. If data type and method are agreed on, we face a large-scale computing problem that has two complexity issues: the amount of data it needs and the usual computational complexity.

Analytic investigation of the methods is made possible by models of biomolecular sequence evolution. These models involve randomness. A widely-studied model for generating binary sequences is to `evolve' them on a tree according to a symmetric Markov process. This model is known as the Cavender-Farris-Neyman model in phylogeny, and the `symmetric binary channel' or the `symmetric 2-state Poisson model' in other areas. The CFN model provides a simple model for the evolution of purine-pyrimidine sequences. The significance of this simple model is, that phenomena shown for the CFN model often extend to more realistic models of sequence evolution.

The abstract phylogeny reconstruction problem is telling the true (model) CFN tree from the generated sequences with high probability (whp). We show that under the CFN model distinguishing the true tree from a false one whp, using sequences generated on the true tree, is substantially `easier' (in terms of the sequence length needed) than determining the true tree whp.


Wednesday, January 18, 2006, 4:30pm, E1 Room 122

Last updated by pelsmajer@iit,edu on 01/26/06