Xiaoqiu (See-ow-chew) Huang
- Introduction to the Design and Analysis of Algorithms: Com S 311
- Computational Techniques for Genome Assembly and Analysis: Com S 551
Xiaoqiu Huang, with his collaborators, has developed algorithms and software in four areas of bioinformatics: sequence alignment, sequence assembly, gene finding and tree construction. He is also interested in understanding the prevalence and implications of gene flow in the evolution of asexual filamentous fungal pathogens (see blow).
In sequence alignment, a linear-space algorithm was developed for computing k best local alignments between two sequences (Huang and Miller 1991), significantly reducing the quadratic space complexity of the Smith-Waterman algorithm. According to Michael Waterman, the Smith-Waterman local alignment algorithm was developed to address the weakness of the Needleman-Wunsch global alignment algorithm in comparing homologous genomic segments with conserved regions (such as exons) separated by divergent regions (such as introns). Although the Huang-Miller algorithm could be used to find many pairs of conserved regions (in separate, unordered alignments) between two genomic segments, a linear-space global alignment algorithm was designed to compute one alignment of genomic segments with all pairs of conserved regions in similarity blocks and all pairs of divergent regions in difference blocks (in the natural order), precisely showing the boundary between similarity and difference blocks (Huang and Chao 2003). An appropriate substitution matrix (e.g. at the maximum-likelihood evolutionary distance) was selected in alignment of protein sequences to suit their conservation level (Huang 2008). Multiple sets of alignment parameters with different levels of stringency were dynamically selected to suit various kinds of sequence regions with different levels of conservation (Huang and Brutlag 2007).
In sequence assembly, Huang was encouraged by Ross Overbeek in 1990 to develop a contig assembly program (CAP). As one of the first sequence assembly programs based on the overlap-layout-alignment approach, CAP was special in using a maximum similarity algorithm, instead of a minimum distance algorithm, for computing overlaps between sequences, in generating contigs in a decreasing order of overlap quality without the need to determine the orientation of sequences before the layout stage, and in using a special alignment algorithm for generating a multiple alignment of sequences in a contig (Huang 1992). Paired-end reads and base quality values were used in the well-known CAP3 program (Huang and Madan 1999). A whole-genome assembly program named PCAP was developed to handle million of Sanger sequences in assembly of large animal genomes (Huang et al. 2003). PCAP was used in several large whole-genome assembly projects such as chimpanzee (Nature 437: 69-87), chicken (Nature 432: 695-716), and platypus (Nature 453: 175-183). The de novo feature of the PCAP chimpanzee genome assembly was useful in a study (BMC Genomics 7: 15). A unique feature of PCAP was the introduction of a superword array in finding overlaps between sequences (Huang et al. 2006). Unlike suffix arrays, superword arrays allow base mismatches (Huang 2017).
In gene finding, a linear-space DNA-protein alignment algorithm was developed to find exon-intron structures in a genomic sequence (Huang and Zhang 1996). This algorithm, along with a linear-space DNA-cDNA alignment algorithm, was combined with fast sequence comparison methods into a sequence annotation package named AAT (Huang et al. 1997). AAT was used to find gene structures in the first plant genome project (Nature 402: 761-768). Genes and other functional elements are often located in C+G rich regions of the genome, which can be found with a local content algorithm (Huang 1994). A useful feature of this algorithm is that the lengths of regions found are solely determined by their contents, not by any user-provided window size.
In tree construction, a new formulation of phylogenetic reconstruction based on the maximum sequence similarity was described (Huang and Vingron 2009). A hierarchical phylogeny construction algorithm for managing a large number of taxa was proposed (Das and Huang 2019).
Horizontal theory of molecular evolution
After examining highly variable genomic regions in several asexual fungal pathogens such as Verticillium dahliae (Huang 2014), Fusarium virguliforme (Huang et al. 2016), and Fusarium oxysporum (Huang 2019), Huang (in 2019) proposed the horizontal theory of molecular evolution, suggesting that in a major eukaryotic lineage such as fungi, most of the variation within and between recently-emerged asexual pathogenic species are due to horizontal transfer of genetic variants that are beneficial under certain conditions, maintain genetic structure, or generate genetic variation (structural variants as well as nucleotide substitutions). Horizontal transfer refers to the movement of genetic information between genomes other than by the vertical transmission of DNA from parent to offspring. This occurs on a regular and frequent basis (ongoing) when pathogens carry supernumerary chromosomes flanked by complementary subtelomeres that are host-specific and contain helicase-like genes. Supernumerary chromosomes are enriched in transposons and pathogenicity genes associated with heterokaryotic incompatibility (het) genes. Core chromosomes are flanked by the same type of complementary subtelomeres as supernumerary chromosomes, allowing the exchange of genes between core and supernumerary chromosomes by homologous recombination over their highy similar subtelomeres. Pathogenicity genes in core chromosomes are maintained through their association with het genes, because a loss of a region containing pathogenicity and het genes can result in incompatible het interactions triggering local and surrounding cell death. Horizontal theory concerning structural variants complements sexual reproduction and the neutral theory of molecular evolution concerning nucleotide substitutions by providing alternative solutions to Muller's ratchet and Haldane's dilemma.
Huang X, Miller W. (1991)
A Time-Efficient, Linear-Space Local Similarity Algorithm. Advances in Applied Mathematics 12:337-357. DOI
Huang X. (1992)
A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps. Genomics 14: 18-25. DOI
Huang X. (1994)
An Algorithm for Identifying Regions of a DNA Sequence that Satisfy a Content Requirement. Bioinformatics 10: 219-225. DOI
Huang X, Zhang J. (1996)
Methods for Comparing a DNA Sequence with a Protein Sequence. Bioinformatics 12: 497-506. DOI
Huang X, Adams MD, Zhou H, Kerlavage AR. (1997)
A Tool for Analyzing and Annotating Genomic Sequences. Genomics 46: 37-45. DOI
Huang X, Madan A. (1999)
CAP3: A DNA Sequence Assembly Program. Genome Research 9: 868-877. DOI
Huang X, Chao K-M. (2003)
A Generalized Global Alignment Algorithm. Bioinformatics 19: 228-233. DOI
Huang X, Wang J, Aluru S, Yang S-P, Hillier L. (2003)
PCAP: A Whole-Genome Assembly Program. Genome Research 13: 2164-2170. DOI
Ye L, Huang X. (2005)
MAP2: Multiple Alignment of Syntenic Genomic Sequences. Nucleic Acids Research 33: 162-170. DOI
Huang X, Yang S-P, Chinwalla A, Hillier L, Minx P, Mardis E, Wilson R. (2006)
Application of a Superword Array in Genome Assembly. Nucleic Acids Research 34: 201-205. DOI
Huang X, Brutlag DL. (2007)
Dynamic Use of Multiple Parameter Sets in Sequence Alignment. Nucleic Acids Research 35: 678-686. DOI
Huang X. (2008)
Sequence Alignment with an Appropriate Substitution Matrix. Journal of Computational Biology 15: 129-138. DOI
Huang X, Vingron M. (2009)
Maximum Similarity: A New Formulation of Phylogenetic Reconstruction Journal of Computational Biology 16: 887-896. DOI
Huang X. (2014)
Horizontal Transfer Generates Genetic Variation in an Asexual Pathogen. PeerJ 2: e650. DOI
Huang X, Das A, Sahu BB, Srivastava SK, Leandro LF, O'Donnell K, Bhattacharyya MK. (2016)
Identification of Highly Variable Supernumerary Chromosome Segments in an Asexual Pathogen. PLoS ONE 11: e0158183. DOI
Huang X. (2017)
Sequence Assembly. In Keith JM. (ed.) Bioinformatics - Volume 1: Data, Sequence Analysis, and Evolution. Humana Press, Second Edition, 35-45. DOI
Das A, Huang X. (2019)
HPC: Hierarchical Phylogeny Construction. PLoS ONE 14: e0221357. DOI
We have developed a number of computer programs for analysis of DNA and protein sequences.
The programs below are used by scientists around the world in their research.
SIM: A local sequence alignment program,
MAP: A multiple sequence alignment program,
CAP3 and PCAP: Sequence and genome assembly programs (downloading),
AAT: A set of programs for finding genes in DNA sequences (downloading).