
Comparative genomics and genome size evolution

Comparative evolutionary genomics and domestication genomics are among the most active areas of research in the lab. Please check back here often for updates on our massive genomic resequencing effort, our comparative molecular evolutionary analyses, and our work on domestication genomics.

An additional remarkable feature of the cotton genus is that it exhibits some rather extraordinary genome size evolution.  Some of this is captured in the figure below, which shows the three-fold variation just among the diploids, from a low in the New World D-genome cottons to a high in the Australian K-genome species. This raises some obvious questions!  For example, how does this happen?  What parts of the genomes are growing and shrinking and by what mechanisms?  Also, notice that the allopolyploids, which contain two genomes (A and D), have a combined genome size that is smaller than the sum of their progenitors.  How does this genome down-sizing occur, and what are the consequences for the organisms?  

Comparative evolutionary genomics. We are addressing the foregoing questions using a number of comparative genomic sequencing technologies, all with collaborators in the US and abroad. In one international collaborative project (with J Udall and colleagues in the US, and with several colleagues in China), we are  comparing high-quality de novo genome assemblies among diploids that vary greatly in genome size (figure below), in an effort to understand the processes and patterns that generate such remarkable genome size variation. We also are collaboratively developing a Gossypium "pangenome", so that we might better understand the genes and genomic constituents that are universal at various levels within the genus, including for the genus as a whole, as well as those faster evolving or lineage-specific components. 

Gossypium phylogeny

In addition to genome size variation between species, we also are studying genome size variation within the domesticated species. Graduate student Emma Dostal recently has shown that there is a surprising amount of infraspecific genome size variation within G. hirsutum, G. barbadense, and G. herbaceum. These discoveries are allowing us to gain some insight into the many external and internal factors that might collectively shape genome size in plants:

Mechanisms of growth and reduction

We also have extended this work to the closest relatives of cotton, a small clade of two genera with a fantastically disjunct geographic distribution. Kokia is an island endemic genus of just several species, from Hawaii, and Gossypioides is its closest relative, from Madagascar (figure from Grover et al., 2017):

Map of the geographic distribution of Kokia and Gossypioides

As seen in the figure, the genomes of these two species are even smaller than the smallest in Gossypium, and by a lot!  With Associate Scientist Corrinne Grover, graduate student Justin Conover, and colleagues elsewhere in the US, we recently described these two "outgroup" genomes, and documented an extraordinary case of gene loss accompanying genome downsizing. At present, we have no understanding of how these two lineages managed to prune so many genes from their genomes, and yet survive until the present.

Genetic diversity and domestication genomics of the cultivated cotton species.

The evolutionary history of cotton (G. hirsutum, Upland cotton, and G. barbadense, Pima cotton) includes both polyploidization and independent domestication. In both of the domesticates, modern cultivars were derived from wild ancestors by ancient civilizations in the Yucatan Peninsula (G. hirsutum) and western Peru/Ecuador (G. barbadense). Moreover, in the Old World, two different diploid species of cotton, G. arboreum and G. herbaceum, were also independently domesticated. This parallel domestication process, resulting in convergent plant architecture and convergent long, strong, white cotton fibers, offers us a marvelous opportunity to understand the genetic targets of selection (in this case strong directional selection practiced by humans starting more than 5000 years ago), and more generally, how evolution can create, shape, and mold plant phenotypes. Several examples of this morphological transformation are shown below.

Parallel domestication of cotton

One can readily appreciate the differences between the wild forms of cotton and the modern varieties.  Here is a wild plant of G. hirsutum:

G. hirsutum race yucatanense in flower, 2017, from Georges Ano

And this is what humans have created from it!:

Cotton blooms

Here is another example, showing the transition from wild (left) to modern (right) Pima cotton (G. barbadense):

Wild and modern G. barbadense

A close-up picture of the fibers shows how selection has altered this single-celled epidermal trichome, in three different species (4th not shown):

Wild vs. domesticated cotton

We recently initiated a large genome resequencing project (led by Josh Udall, with Daojun Yuan and Thiru Ramaraj) funded by the National Science Foundation Plant Genome Program.  In this project, we are addressing biological questions of broad relevance to polyploid plant genomes, while at the same time generating detailed information on genomic and phylogeographic patterns of diversity in wild and domesticated cotton, advancing our capacity to utilize an expanded gene pool for cotton improvement.

To date, we have used whole genome sequencing to generate an average of 21.5X coverage of the 2.4 Gb cotton genome for a total of 633 accessions. We are using these data from polyploid cotton to address fundamental questions about polyploid plant genomes. For example, how is genetic diversity partitioned between the two subgenomes in an allopolyploid? How have humans modified these two subgenomes, and are the effects of directional selection the same in both? Have genes in the two subgenomes contributed equally to advanced, modern fiber phenotypes?

In addition, we are using the data to define both the portion and the proportion of the wild diversity that was captured during the domestication process. This will help us understand the present gene pool of modern cultivars, and how it has been shaped by selection and by interspecific gene flow between the two cultivated species.

PIs, Co-PIs, Collaborators and Senior Personnel

Last Name, First Name

Assoc. Prof. (China)
Post-doc appt. ISU
Assistant Scientist

Progress to Date

Based on a series of genetic diversity studies with RFLP, SSR, GBS and the Cotton Illumina SNP array (63K), we selected 735 representative accessions to sequence, including 441 for G. hirsutum, 182 for G. barbadense, 56 for G. arboreum, 23 for G. herbaceum and 33 others. High-quality DNA was extracted and PCR-free Illumina sequencing libraries were constructed.

To date, we have resequenced 735 accessions with 150-bp paired-end (PE) reads. For the tetraploid accessions, we have approximately 21.5X effective coverage of the 2.4 Gb genome (~51.6 Gb of clean data per sample). Based on the current amount of sequencing, 33.2 terabases (278.5 giga-reads) were produced for 643 tetraploid accessions. The proportions of bases with quality scores above 20 (Q20) and above 30 (Q30) are 97.28% and 93.30%, respectively.
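The per-sample and total sequence figures above can be cross-checked with simple arithmetic. The following is a minimal Python sketch, not part of the project's pipeline; the function names are illustrative, and it assumes the per-sample figure is in gigabases and that coverage is simply total bases divided by genome size.

```python
GENOME_SIZE_GB = 2.4   # tetraploid cotton genome size in gigabases (from the text)
COVERAGE = 21.5        # average effective coverage (from the text)
N_TETRAPLOID = 643     # tetraploid accessions sequenced (from the text)


def bases_per_sample_gb(coverage: float, genome_size_gb: float) -> float:
    """Clean data per sample in gigabases: coverage x genome size."""
    return coverage * genome_size_gb


def total_bases_tb(per_sample_gb: float, n_samples: int) -> float:
    """Total sequence across samples in terabases (1 Tb = 1000 Gb)."""
    return per_sample_gb * n_samples / 1000.0


per_sample = bases_per_sample_gb(COVERAGE, GENOME_SIZE_GB)
total = total_bases_tb(per_sample, N_TETRAPLOID)
print(f"{per_sample:.1f} Gb per sample, {total:.1f} Tb in total")
# prints "51.6 Gb per sample, 33.2 Tb in total"
```

Both values agree with the numbers reported for the 643 tetraploid accessions.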

A bioinformatics analysis pipeline was constructed, comprising read cleaning, read mapping, realignment of reads around indels, recalibration of base quality scores using confident SNPs, and creation of BAM and VCF files.

After removing samples with high missing rates and filtering loci (minor allele frequency, MAF, threshold of 0.1%; per-locus missing rate threshold of 50%), approximately 70 M SNPs were retained for diversity, genome evolution, and phylogenetic analyses.
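To make the locus filter concrete, here is a minimal Python sketch using NumPy. It is an illustration only, not the project's actual tooling. It assumes genotypes are stored as a samples x loci array of alternate-allele counts (0, 1, 2) with -1 for a missing call, and it reads the thresholds as "keep loci with MAF of at least 0.1% and a missing rate of at most 50%"; both the encoding and that reading are assumptions, and `filter_loci` is a hypothetical helper.

```python
import numpy as np


def filter_loci(genotypes: np.ndarray, min_maf: float = 0.001,
                max_missing: float = 0.5) -> np.ndarray:
    """Return a boolean mask of loci (columns) passing both filters.

    genotypes: samples x loci array of alternate-allele counts (0, 1, 2),
    with -1 marking a missing call (an encoding assumed for this sketch).
    """
    missing = genotypes < 0
    missing_rate = missing.mean(axis=0)

    called = (~missing).sum(axis=0)                      # called samples per locus
    alt_count = np.where(missing, 0, genotypes).sum(axis=0)
    # Alternate-allele frequency among called genotypes (two alleles each).
    with np.errstate(divide="ignore", invalid="ignore"):
        alt_freq = np.where(called > 0, alt_count / (2 * called), 0.0)
    maf = np.minimum(alt_freq, 1.0 - alt_freq)

    return (missing_rate <= max_missing) & (maf >= min_maf)


# Toy data (not project data): 4 samples x 3 loci.
toy = np.array([
    [0, 0, -1],
    [1, 0, -1],
    [2, 0, -1],
    [0, 0,  1],
])
mask = filter_loci(toy)
print(mask)
```

In the toy matrix, the first locus is polymorphic and fully called, the second is monomorphic, and the third is missing in three of four samples, so only the first locus passes.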

Population structure was studied using PCA, phylogenetic analyses and structure methods. We also assessed the amount of genetic diversity loss that accompanied domestication, identified regions of historical introgression between species, and identified the genes targeted during domestication. 
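As an illustration of the PCA step, the sketch below projects samples onto principal components computed from a centered genotype matrix with NumPy's SVD. It is a minimal example under stated assumptions, not the software used in the project: genotypes are alternate-allele counts (0, 1, 2) with no missing values, `genotype_pca` is a hypothetical helper, and the data are simulated.

```python
import numpy as np


def genotype_pca(genotypes: np.ndarray, n_components: int = 2):
    """Project samples onto the top principal components.

    genotypes: samples x loci array of alternate-allele counts (0, 1, 2)
    with no missing values (an assumption of this sketch).
    Returns (scores, explained_variance_ratio).
    """
    x = genotypes.astype(float)
    x -= x.mean(axis=0)                        # center each locus
    # SVD of the centered matrix; rows of vt are the principal axes.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


# Simulated toy data: two groups of three samples with contrasting
# allele frequencies (0.1 vs 0.9) at 50 loci.
rng = np.random.default_rng(0)
group_a = rng.binomial(2, 0.1, size=(3, 50))
group_b = rng.binomial(2, 0.9, size=(3, 50))
scores, explained = genotype_pca(np.vstack([group_a, group_b]))
print(scores[:, 0])   # PC1 separates the two toy groups by sign
print(explained)
```

With real data, samples from the same genetic group would be expected to cluster together along the leading components, which is how groups such as wild, landrace, and cultivar accessions can be distinguished.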


After quality control, 98.7% of the reads were processed through the pipeline. Of the clean reads, 91.4% were mapped to the reference genome, and 77.2% of the mapped reads had high mapping quality (MAPQ > 30). The sequence coverage profiles indicated that PCR-free library construction was successful.

In addition to our newly sequenced 643 accessions, 789 additional accessions were collected from elsewhere. In total, 1,432 tetraploid accessions were analyzed with our bioinformatics pipeline. 1,024 samples were retained following quality filtering, including 795 G. hirsutum, 201 G. barbadense and 28 other tetraploid samples (641 samples were from this project, 383 samples from public data).

The results of our analyses are pretty exciting. Based on PCA, phylogenetic tree analysis of SNPs, and Structure analysis, G. hirsutum is separated into 4 subgroups (wild, landrace 1, landrace 2 and cultivar), as seen in the following figure:

Population structure of G. hirsutum

Population structure of G. hirsutum. (A) A map illustrating where different landrace and wild accessions originated. (B) A rooted phylogenetic tree of G. hirsutum with outgroup accessions of G. mustelinum (brown). Four main groups of accessions are visible: wild (red), Landrace 1 (orange), Landrace 2 (orange), and cultivars (green). (C) PCA plot of germplasm (4-fold degenerate sites). (D) Structure analysis of all G. hirsutum accessions, as well as a zoomed section of the Structure plot to view the wild, Landrace 1, and Landrace 2 groups. All three analytical methods agree on the number of groups and indicate that modern cultivars largely arose from Landrace 2 (Central American) accessions.

We also did the same kind of analysis for G. barbadense, whose diversity can be grouped genetically into 3 classes (wild, landrace and cultivar), as shown here:

Population structure of G. barbadense

Population structure of G. barbadense. (A) A map illustrating where different landrace and wild accessions originated. (B) A rooted phylogenetic tree of G. barbadense with outgroup accessions of G. mustelinum (brown). Four main groups of accessions are visible: wild (red), olg-series (purple), Landrace (orange), and cultivars (green). (C) PCA plot of germplasm (4-fold degenerate sites). (D) Structure analysis of all of the G. barbadense accessions. All three analyses agree on the number of groups and indicate that modern cultivars largely arose from South America.

Stay tuned for a series of publications describing this work!  These papers will include detailed descriptions of the total genetic diversity in each species, how domestication and hybridization have shaped this diversity, the total loss of nucleotide diversity accompanying domestication, and the many domestication or selective sweeps that occurred in cotton genomes as a result of selection. This work also sets the foundation for understanding genes and genetic elements important for cotton improvement.
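For readers unfamiliar with the term, the loss of nucleotide diversity mentioned above is commonly expressed as one minus the ratio of diversity in cultivars to diversity in wild accessions. The Python sketch below shows one standard way to compute per-site nucleotide diversity for biallelic sites and the resulting loss. It is a hedged illustration: the helper names and the toy allele counts are hypothetical, and it is not necessarily the estimator used in the forthcoming papers.

```python
def site_pi(alt_count: int, n_chrom: int) -> float:
    """Unbiased per-site nucleotide diversity for a biallelic site:
    the probability that two chromosomes drawn without replacement differ."""
    if n_chrom < 2:
        raise ValueError("need at least two chromosomes")
    return 2.0 * alt_count * (n_chrom - alt_count) / (n_chrom * (n_chrom - 1))


def mean_pi(alt_counts, n_chrom: int, n_sites: int) -> float:
    """Average pi over n_sites; sites not listed in alt_counts are invariant."""
    return sum(site_pi(k, n_chrom) for k in alt_counts) / n_sites


def diversity_loss(pi_wild: float, pi_cultivar: float) -> float:
    """Fraction of wild diversity lost: 1 - pi_cultivar / pi_wild."""
    return 1.0 - pi_cultivar / pi_wild


# Hypothetical toy counts (not project data): 10 chromosomes, 1000 sites.
pi_wild = mean_pi([5, 4, 3, 5], n_chrom=10, n_sites=1000)
pi_cult = mean_pi([1, 1], n_chrom=10, n_sites=1000)
print(f"pi (wild) = {pi_wild:.5f}, pi (cultivar) = {pi_cult:.5f}")
print(f"fraction of diversity lost = {diversity_loss(pi_wild, pi_cult):.3f}")
```

In this toy case the cultivar sample has fewer segregating sites and rarer alleles, so most of the wild diversity is counted as lost.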