BY Diksha Batra ’26
Staff Writer
Proteins are commonly associated with foods such as meats, lentils and eggs. While they are crucial for building muscles, they are also involved in numerous other life processes. According to Science in the News, Kerry Geiler-Samerotte, a 2011 doctoral graduate from the Harvard Department of Organismic and Evolutionary Biology, said there are about “20,000 to over 100,000 unique types of proteins within a typical human cell.” She added that “each [protein] expertly performs a specific task, and knowing the structure of a protein is extremely helpful.”
Understanding a protein’s structure requires knowledge of the folding process. The basic units of proteins are amino acids. In human cells, there are 20 amino acids, and these amino acids are arranged into different sequences. A protein passes through four levels of folding before it can function properly.
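As a rough illustration of how 20 building blocks can yield so many different proteins, here is a minimal Python sketch. It assumes the standard one-letter codes for the 20 amino acids; the function names and the example sequence are hypothetical and chosen only for illustration.

```python
# One-letter codes for the 20 standard amino acids.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def is_valid_sequence(sequence):
    """Return True if the sequence is non-empty and uses only the 20 standard letters."""
    return len(sequence) > 0 and all(
        residue in STANDARD_AMINO_ACIDS for residue in sequence
    )


def possible_sequences(length):
    """Count the distinct sequences of a given length: 20 choices at each position."""
    return len(STANDARD_AMINO_ACIDS) ** length


# Hypothetical short sequence, not a real protein.
print(is_valid_sequence("MKTAYIA"))
print(possible_sequences(3))
```

Even a chain of only three amino acids can be written in 8,000 different ways, which hints at why the number of possible proteins is so large.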
According to Geiler-Samerotte, “protein [chains] coil up into slinky-like formations called ‘alpha helices,’ while other regions fold into zigzag patterns called ‘beta sheets,’ which resemble the folds of a paper fan.” In the next step of folding, according to Geiler-Samerotte, these helices and sheets “interact to form more complex structures.”
Protein folding is extensively studied. Dr. Ardala Breda noted in her co-authored book “Bioinformatics in Tropical Disease Research: A Practical and Case-Study Approach” that understanding this process will help “answer questions such as why we have cancer, why we grow old, why we get sick, how can we find cures for many diseases, why life as we know it has evolved in this way and on this planet and not anywhere else, at least for the moment.”
The recent advent of A.I. models, such as ChatGPT, allows for the generation of new content based on user input. These models rely on neural networks, the technique behind what is now called deep learning. Neural networks, according to Larry Hardesty from the Massachusetts Institute of Technology, are a means “in which a computer learns to perform some task by analyzing training examples.” While ChatGPT is one of the most well-known mainstream models, AlphaFold and ESMFold, the model behind the ESM Metagenomic Atlas, are two of the most popular models in the world of biophysics.
An article in the scientific journal Science, titled “Evolutionary-scale prediction of atomic-level protein structure with a language model,” compares the two models. According to the article, “protein structure and function can ... be inferred from the patterns in sequences,” a key concept in computational structure prediction.
The authors add that computer models “have the potential to learn patterns in protein sequences across evolution.” These models, including language models, contain parameters. Parameters, according to Sean Michael Kerner, are “variables present in the model on which [an AI] was trained that can be used to infer new content.”
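To make the idea of parameters more concrete, here is a minimal Python sketch that counts the adjustable numbers (weights and biases) in a small fully connected neural network. The layer sizes are hypothetical and far smaller than those of AlphaFold or ESMFold; the sketch only shows what is being counted when a model is described by its number of parameters.

```python
def count_parameters(layer_sizes):
    """Count the weights and biases in a fully connected network.

    layer_sizes lists the number of units in each layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-to-output connection
        total += n_out         # one bias per output unit
    return total


# Hypothetical toy network: 20 inputs (one per amino acid type),
# 64 hidden units and 3 outputs.
toy_layers = [20, 64, 3]
print(count_parameters(toy_layers))  # 1539
```

During training, each of these numbers is adjusted so that the model's predictions better match the examples it is shown.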
For both models, the input is the primary sequence, which is the sequence of amino acids linked together to form a polypeptide chain. AlphaFold, released by Google DeepMind, is known for highly accurate structure predictions but slow performance. According to the European Bioinformatics Institute, the “training data for AlphaFold2 came from the Protein Data Bank,” which is a free database of experimentally determined macromolecular structures.
When determining the structure of a protein from its amino acid sequence, AlphaFold “aligns it to the sequences of other similar proteins,” creating a multiple sequence alignment (MSA). The MSA is then fed into a neural network, which “[compares] and [analyzes] the sequences of similar proteins from different organisms” and outputs the 3D structure of the protein. When AlphaFold entered a competition called CASP, an experimental test of protein structure prediction, it “outperformed all the other entrants by a wide margin,” according to the European Bioinformatics Institute.
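A multiple sequence alignment stacks related sequences so that matching positions line up in columns. The following Python sketch measures how conserved each column is in a tiny alignment. The four sequences are made up for illustration, and this is not AlphaFold's actual algorithm; it only shows the kind of pattern an alignment exposes to a model.

```python
from collections import Counter


def column_conservation(alignment):
    """Return (most common residue, its frequency) for each alignment column."""
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*alignment):  # iterate over columns, not rows
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result


# Hypothetical, already-aligned sequences; "-" marks a gap.
toy_msa = [
    "MKTAYIA",
    "MKSAYIA",
    "MRTA-IA",
    "MKTAYLA",
]

for position, (residue, frequency) in enumerate(column_conservation(toy_msa), start=1):
    print(position, residue, frequency)
```

Columns where every sequence shares the same residue (here positions 1, 4 and 7) are fully conserved, while the others vary between the related sequences.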
On the other hand, ESMFold relies on a language model that has already been trained on a vast number of protein sequences, so it can predict a structure directly from a single sequence, bypassing the need for an MSA. The article in Science states that “[ESMFold is] six times faster in predicting the structure compared to [AlphaFold],” and on “shorter sequences the [speed improves] up to 60 [times].” Even though it is efficient, according to Nature Methods, “ESMFold does not quite meet the accuracy of AlphaFold.”
Scientists are developing artificial intelligence models that are advancing the process of protein structure determination, making it faster and more accurate and opening up new possibilities for future study.