Higher-order correlations in Biology

1.  Sequence matching is improved by utilizing correlated structure information

This new project combines the vast body of sequence data with the large number of experimentally determined protein and RNA structures. The premise is that tightly packed, strongly interacting amino acids determine which substitutions are feasible, and that this interdependence means there are more acceptable substitutions than would presently be found by running BLAST. The approach enables the identification of functions for many human genes that presently have no assigned function and, in extensions, will permit reliable development of the connections between genes. Because proteins are the principal class of functional biomolecules, the substitutions they tolerate determine what substitutions are permitted in the genes themselves. Developing software and databases that capture the best ways to improve sequence matching will lead to a better understanding of gene functions related to all classes of disease. Many other fields, such as evolution, will also be impacted by these new tools. The structures provide a framework for performing 3-dimensional sequence matching, an innovative approach that provides immediate insight into function.

We will analyze and make available information about tightly packed, strongly interacting groups of amino acids, drawn from the large body of protein sequences and protein structures, to provide informed guidance for understanding gene and protein mutations. This information can assist in protein design and in the design of mutagenesis experiments by providing rational choices for which groups of residues to mutate within the context of densely packed proteins, and it will lead to more reliable conclusions regarding mechanisms.

2. Mutations can be identified to rescue function by using this type of data

Btk linker-kinase showing the residues in the most strongly correlated triplets.

In one of our most remarkable studies, we have developed a new way to identify additional mutations that can compensate for the effects of a deleterious mutation. This work is being submitted for publication. It involves searching through the vast sequence data for the most correlated parts of the sequences that include the site of the deleterious mutation. These parts are identified from the most correlated triplets of amino acid positions in the multiple sequence alignment. A small subset of such triplets is selected, and the sequence variants observed at their other positions are examined. Simulations of these sequence variants are carried out to learn which ones can restore normal dynamics to a protein structure carrying two mutations: the deleterious mutation and a mutation at one other site. Usually, these sites are distant from one another in the structure. While the compensating mutation itself restores the function, the approach could also be applied to identify a site where binding of a small molecule would restore the function. Such allosteric sites have not previously been identified and are not readily identifiable in other ways. By combining the sequence and structure data in this specific way, we are able to achieve this result.

This use of higher-order correlations in biology is, in general, very important, and our group is unique in studying them. High packing density in protein structures and throughout biology means that such correlations can be significantly more informative than the pairwise correlations that others more commonly investigate. The key information is the big data in the related sequences. In the application described above, the multiple sequence alignment for the kinases had over 31,000 sequences. Such a large set of sequences readily allows for the extraction of correlated triplets.
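As an illustration of what scoring correlated triplets in a multiple sequence alignment can look like, the following is a minimal sketch only. It assumes total correlation (the sum of the three single-column entropies minus the joint entropy) as the triplet score; the text above does not name the measure actually used, and the function names and the toy alignment are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(symbols):
    """Shannon entropy (in bits) of a list of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def total_correlation(msa, i, j, k):
    """H(i) + H(j) + H(k) - H(i, j, k) for three alignment columns.

    Assumption: this score stands in for whatever triplet measure
    the actual study uses.
    """
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    col_k = [seq[k] for seq in msa]
    joint = list(zip(col_i, col_j, col_k))
    return entropy(col_i) + entropy(col_j) + entropy(col_k) - entropy(joint)


def top_triplets(msa, site, n=5):
    """Rank all column triplets that include `site`, highest score first."""
    length = len(msa[0])
    others = [p for p in range(length) if p != site]
    scored = [
        (total_correlation(msa, site, a, b), (site, a, b))
        for a, b in combinations(others, 2)
    ]
    scored.sort(reverse=True)
    return scored[:n]


# Toy alignment invented for illustration: columns 0, 1 and 2 co-vary
# perfectly, while column 3 is fully conserved.
toy_msa = ["AKDG", "AKDG", "VELG", "VELG"]
best = top_triplets(toy_msa, site=0, n=1)
```

Here `site` plays the role of the position of the deleterious mutation, so only triplets containing that position are ranked. With a real alignment of tens of thousands of sequences, the exhaustive pair enumeration would need to be restricted or vectorized, and finite-sample corrections to the entropies would likely matter.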

3. Protein mutations

The advent of cheap genome sequencing means that patient genomes are now available, but existing methods do not provide a clear way to distinguish the preponderance of mutations that have no adverse effects from those that are critical and related to disease. At the moment, this lack of effective methods for separating neutral sequence variants from deleterious ones is a major obstacle to the development of genome-sequence-based medicine. The ability to make this distinction will come from utilizing all of the knowledge of the effects of amino acid variants on structures and mechanisms, which requires an intensive effort to connect sequence with protein mechanism. Deep knowledge of protein structure is critical to comprehend function, to understand the molecular mechanisms of disease, and to evaluate protein mutants more reliably. In the practice of precision genome-based medicine, it is critical to identify the important disease mutants. Proteins are the workforce of the cell, so progress in genomics should include protein information. We are utilizing structure, stability, and dynamics to enable a reliable distinction between mutants that maintain function and those expected to be dysfunctional.

4. Detecting Remote Protein Homologs

Data mining the ESM1b large protein language model is leading to the detection of remote homologs with sequence identity as low as 10%. This yields newly inferred functions in many cases. Validation is carried out by comparing the corresponding structures.
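To make the 10% figure concrete, here is a minimal sketch of one common way to compute percent sequence identity for a pair of already-aligned sequences. The convention used (identical residues divided by the number of columns where neither sequence has a gap) is an assumption; the text above does not state which definition applies, and other conventions give different values. The function name and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: the denominator is the number of gap-free columns;
    other definitions divide by the alignment or sequence length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Invented example: four gap-free columns, three of them identical.
identity = percent_identity("MKT-LV", "MRTAL-")
```

At identities near 10%, a score like this cannot by itself separate homologs from unrelated pairs, which is why detection relies on the language model and validation on structure comparison.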

Funding Organization: 

Principal Investigator(s): Robert L. Jernigan