Important New Findings from the Use of Large Protein Language Models

  • As in Any Research Project, Results Must Be Validated to Avoid Errors (Hallucinations)
  • Predicted Structures Are Now Sufficiently Reliable to Be Used for Validation
  • Treating the Embeddings from the Language Model Layers as Data and Training Downstream Methods on Them Is a Successful Way to Tackle a Wide Range of Protein Problems (see the recent publication: Kilinc M, Jia K, Jernigan RL. Major advances in protein function assignment by remote homolog detection with protein language models - A review. Curr Opin Struct Biol. 2025 Feb;90:102984. doi: 10.1016/j.sbi.2025.102984. Epub 2025 Jan 27. PMID: 39864241; PMCID: PMC12168796).

    Two graphs: ESM-2 Layer Performances and ESM-1 Layer Performances

    Selecting the Layers for Best Performance. 

    ESM1b per-layer performance with DCT (discrete cosine transform) compression on the PFAM dataset, showing that the last layer of a protein language model (pLM) is not the best layer for homolog detection. Instead, the models peak around the three-quarter point of the layer stack, after which performance drops. See Kilinc M, Jia K, Jernigan RL. Major advances in protein function assignment by remote homolog detection with protein language models - A review. Curr Opin Struct Biol. 2025 Feb;90:102984. doi: 10.1016/j.sbi.2025.102984. Epub 2025 Jan 27. PMID: 39864241; PMCID: PMC12168796.
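
    The idea of compressing one layer's embedding with a DCT can be sketched as follows. This is a minimal illustration, not the procedure used in the cited review: it assumes the per-residue embedding of a chosen layer is already available as an (L, D) array (random placeholder data stands in for real pLM output here), and the function names and the 8 x 32 coefficient block are hypothetical choices made for the example.

```python
import numpy as np

def dct2_matrix(n: int, k: int) -> np.ndarray:
    """First k rows of the orthonormal DCT-II basis for length-n signals."""
    i = np.arange(n)
    f = np.arange(k)[:, None]
    basis = np.cos(np.pi * (2 * i + 1) * f / (2 * n)) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)  # DC row gets the extra 1/sqrt(2) factor
    return basis  # shape (k, n)

def compress_layer(emb: np.ndarray, n_res: int = 8, n_dim: int = 32) -> np.ndarray:
    """Compress an (L, D) per-residue embedding to a fixed-length vector.

    A 2D DCT is applied along the residue and feature axes and only the
    low-frequency n_res x n_dim block is kept (sizes are illustrative).
    Proteins shorter than n_res are zero-padded in coefficient space.
    """
    length, dim = emb.shape
    kr, kd = min(n_res, length), min(n_dim, dim)
    coeffs = dct2_matrix(length, kr) @ emb @ dct2_matrix(dim, kd).T
    out = np.zeros((n_res, n_dim))
    out[:kr, :kd] = coeffs
    return out.ravel()

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two compressed fingerprints."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder data: in practice these would be the outputs of one
# intermediate layer of a pLM for two proteins of different lengths.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(120, 64))
emb_b = rng.normal(size=(95, 64))

fp_a = compress_layer(emb_a)
fp_b = compress_layer(emb_b)
print(fp_a.shape, fp_b.shape)  # both fingerprints have the same length
print(cosine_similarity(fp_a, fp_b))
```

    Because every protein maps to a vector of the same length regardless of sequence length, the comparison can be repeated with the output of each layer in turn, and the layer that gives the best homolog detection on a benchmark can then be selected.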

    The Protein Language Models Have Digested Large Sequence Sets That Provide Reliable Information about the Relationships among Genes/Proteins

  • We Are Discovering Many Important Things, Including Large Numbers of Paralogs, Leading to Important New Conclusions, Such as That Regulation Is Significantly More Complex than Previously Realized.