Scalable Algorithms for Detecting Sequence Convergence during Genome Analysis
Interpreting the biological and medical impact of variants among patients and populations is a central problem in human genomics. Nearly all approaches to this problem rely on comparative methods to predict and identify deleterious variants. Because comparative methods depend on an accurate phylogeny, understanding the phylogenetic relationships among species is an important prerequisite to designing experiments in comparative biology, including model organism studies. Recently, we have shown that selection-driven sequence convergence (the parallel substitution of the same amino acid state in different species) can occur at a tremendous scale and that, when present, it positively misleads all known methods of phylogenetic reconstruction. At present, no general statistical procedure exists for reliably distinguishing random convergence from convergence that results from parallel selective pressures. To address this gap, we have developed a computational tool that rapidly estimates posterior probabilities of convergent substitution across entire phylogenies. We propose to combine this tool with models of site-specific selective constraint, which will allow us to estimate the exact probability of observed levels of excess non-neutral convergent evolution under a model of sitewise negative selection. This general approach will provide a powerful lens for identifying selection-driven convergence in vertebrate genomes at a large scale. We plan to apply it in a large-scale survey of sequence convergence across a set of approximately 100 vertebrate genomes.
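The abstract does not specify how the exact probability of excess convergence is computed. As a purely illustrative sketch, assume that sites evolve independently and that a null model of sitewise negative selection assigns each site i a probability p_i of showing a convergent substitution. Under those assumptions the number of convergent sites follows a Poisson-binomial distribution, and its exact upper tail can be obtained by dynamic programming. The function names and inputs below are hypothetical and are not taken from the tool described above.

```python
from typing import List, Sequence


def convergence_count_distribution(site_probs: Sequence[float]) -> List[float]:
    """Exact distribution of the number of convergent sites.

    Assumes sites are independent and that site i is convergent with
    null probability site_probs[i] (a Poisson-binomial model).
    Returns dist where dist[k] = P(exactly k convergent sites).
    """
    dist = [1.0]  # with zero sites, the count is 0 with probability 1
    for p in site_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each site probability must lie in [0, 1]")
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # this site is not convergent
            nxt[k + 1] += mass * p       # this site is convergent
        dist = nxt
    return dist


def excess_convergence_pvalue(site_probs: Sequence[float], observed: int) -> float:
    """P(count >= observed) under the independent-sites null model."""
    if observed <= 0:
        return 1.0
    dist = convergence_count_distribution(site_probs)
    # Slicing past the end yields an empty list, so the tail is 0.0
    # when more convergent sites are requested than sites exist.
    return min(1.0, sum(dist[observed:]))


if __name__ == "__main__":
    # Hypothetical null probabilities for four sites and two observed
    # convergent sites; the values are illustrative only.
    null_probs = [0.01, 0.02, 0.05, 0.01]
    print(excess_convergence_pvalue(null_probs, observed=2))
```

The dynamic program runs in time quadratic in the number of sites, so a genome-scale analysis would likely need an approximation or a per-gene decomposition; that choice is outside what the abstract describes.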