Progress in protein structure prediction, as seen in CASP15
In December 2020, the results of AlphaFold version 2 were presented at CASP14, sparking a revolution in protein structure prediction. For the first time, a purely computational method could challenge experimental accuracy in predicting the structure of single protein domains. The code was released in the summer of 2021, and since then AlphaFold version 2 has been shown to accurately predict the structures of most ordered proteins and many protein-protein interactions. Its release also sparked an explosion of development and improvement in AI-based methods for predicting protein complexes and disordered regions, and for protein design. This review discusses some of the inventions and advancements sparked by the release of AlphaFold.
Protein structure prediction is often divided into two categories: template-based and de novo. Starting in 1969, methods were developed to predict protein structure by copying coordinates from experimentally determined structures. Methodological advances since those earliest attempts, together with the growth of sequence and structure databases, have enabled increasingly accurate modelling of proteins.
De novo modelling traditionally means modelling a protein without using a template, although the strict definition has blurred over the last two decades due to the use of fragments. Methods such as Rosetta, I-TASSER, and FragFold employ fragments of varying sizes, which may originate from homologous protein structures or from unrelated ones, to predict protein structures. The Rosetta program, developed by the Baker group, has been particularly successful in this field. Still, the fragment-based approach has limited applicability to general protein structure prediction.
In the 1990s, it was proposed that protein structure could be predicted using co-evolutionary signals in multiple sequence alignments, but the performance of these methods was minimal, with predictions only slightly better than random. In 1999, it was suggested that contact predictions could be made more accurate by separating direct from indirect contacts using a trainable model; a similar idea had been proposed previously. These papers were largely ignored by the community, and the methods might have been more successful had more protein families contained a sufficient number of members.
At CASP13, DeepMind introduced the first version of AlphaFold. Its network was deeper than earlier attempts and predicted distance probabilities instead of binary contacts, giving sufficient accuracy for a simple steepest-descent protocol to generate protein structures. Still, only a tiny fraction of proteins without templates could be predicted at experimental accuracy, and several academic groups quickly reproduced the performance of AlphaFold v1.
At CASP14, DeepMind presented AlphaFold v2.0, and the results were impressive: the average GDT_TS was close to 90 for individual domains, in principle as good as experimental structures. The AlphaFold papers were published in the summer of 2021; the first described the method, while a second described an extensive database of predicted structures, recently extended to cover virtually all proteins in UniProt.
DeepMind released the source code of AlphaFold under an open-source license so that the community could test and extend the method, and the release spurred a jump in development speed. AlphaFold consists of two main stages: the first is a stack of 48 Evoformer blocks, and the second is the structure module. Both stages contain important innovations, inspired by previous academic contributions, and AlphaFold as a whole can be seen as a remarkable engineering feat.
Thanks to a detailed presentation of the algorithm at the CASP conference and the release of the code as open-source in June 2021, AlphaFold sparked tremendous activity in the entire community, which also expanded rapidly.
Many labs have built directly on the AlphaFold paper and code.
An essential tool is ColabFold, which runs as a Jupyter notebook on Google's Colab platform. It uses MMseqs2 for sequence searching, making model building extremely rapid, and its straightforward, easy-to-use interface demonstrates the importance of open science.
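As a minimal sketch of how ColabFold is typically driven outside the notebook, assuming the `colabfold_batch` command-line entry point; the `--num-models` flag name is an assumption to verify against the installed version:

```python
def colabfold_cmd(fasta_path, out_dir, num_models=5):
    """Assemble a colabfold_batch invocation as an argument list.
    The --num-models flag name is an assumption; check
    `colabfold_batch --help` for the installed version."""
    return ["colabfold_batch", "--num-models", str(num_models),
            fasta_path, out_dir]
```

The resulting list can be handed to `subprocess.run`; MSA generation via the MMseqs2 server happens inside ColabFold itself.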
Because the code was available, a large set of protein structures could be solved experimentally, with the models being of sufficient quality for molecular replacement.
One of the most impressive results was the model of the nuclear pore complex.
AlphaFold was developed to predict the structure of a single protein chain.
It is easy to trick the program into predicting the structure of dimers or higher multimers.
Two ways to trick the program are to connect the chains with a poly-glycine linker or to offset the residue numbering between chains.
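Both tricks can be sketched in a few lines; the helper names are hypothetical, and the actual plumbing of these values into AlphaFold's feature dictionary (its `residue_index` feature) is omitted:

```python
import numpy as np

def merge_chains_with_gap(chain_seqs, gap=200):
    """Concatenate chains into one 'single-chain' input, offsetting the
    residue index by a large gap at each chain break so that the model
    treats the chains as unconnected."""
    sequence = "".join(chain_seqs)
    residue_index = []
    offset = 0
    for seq in chain_seqs:
        residue_index.extend(range(offset, offset + len(seq)))
        offset += len(seq) + gap  # the gap signals a chain break
    return sequence, np.array(residue_index, dtype=np.int32)

def merge_chains_with_linker(chain_seqs, linker_len=20):
    """Alternative trick: join chains with a flexible poly-glycine linker."""
    return ("G" * linker_len).join(chain_seqs)
```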
The first attempts at predicting dimer structures had limited accuracy.
The performance increased significantly when "paired" alignments linking two or more chains were used.
AlphaFold can be used to study and design protein-peptide interactions.
Paired alignments identify pairs of proteins that interact by looking at orthologs or first hits in complete genomes while ignoring paralogs.
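A minimal sketch of such pairing, assuming each per-chain MSA is a list of (species, aligned sequence) hits sorted by decreasing alignment score; the function name is hypothetical:

```python
def pair_msas(msa_a, msa_b):
    """Pair two per-chain MSAs by organism, keeping only the first
    (best-scoring) hit per species so that paralogs are ignored.
    Each MSA is a list of (species, aligned_sequence) tuples, assumed
    sorted by decreasing alignment score."""
    best_a, best_b = {}, {}
    for species, seq in msa_a:
        best_a.setdefault(species, seq)  # keep first (best) hit only
    for species, seq in msa_b:
        best_b.setdefault(species, seq)
    # concatenate the two chains' sequences for species present in both
    return [best_a[sp] + best_b[sp] for sp in best_a if sp in best_b]
```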
Predicting interactions is important because protein-protein interactions are key to understanding molecular functions.
AlphaFold has learned to recognize generalizable features of protein structure; it is not just a lookup table memorizing all of the PDB.
DeepMind also developed a version of AlphaFold specifically for predicting the structure of complexes. The first version, released in November 2021, had a tendency to permit overlapping chains. The second version, released in April 2022, performs slightly better than the hacks built on the original AlphaFold and can accurately predict the structure of about half of all complexes with up to six chains. It is also feasible to predict the structure of larger complexes or to modify predicted subcomplexes, and an updated, retrained version has since been released.
The release of the AlphaFold code and its detailed description prompted significant reproduction efforts.
RoseTTAFold was published concurrently with AlphaFold, and the public description of AlphaFold clearly inspired that work.
RoseTTAFold includes novelties, such as utilizing a three-track network, with one being an SE(3) representation of the coordinates.
Initially, RoseTTAFold's performance was not as strong as AlphaFold's, and it appeared not to have learned structural features to the same degree, as it could not predict protein-protein interactions.
Later versions of RoseTTAFold compete with AlphaFold in accuracy and have facilitated predicting proteins in complex with RNA or DNA.
AlphaFold's ideas can also be applied to RNA structure prediction, but such attempts did not yield remarkable results in CASP15.
OpenFold and UniFold are clones of AlphaFold implemented in a different AI framework (PyTorch); both were released approximately one year after the introduction of AlphaFold.
AlphaFold uses a multiple sequence alignment to predict the structure of a single protein.
In some rare cases, the MSA is unnecessary.
Predictions without an MSA or with a very shallow MSA are generally significantly worse.
A proposed way to improve predictions for such proteins is to use a protein language model, i.e. a model trained to predict masked residues of a sequence from the surrounding sequence.
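The masked-prediction objective behind such language models can be illustrated with a toy sketch; `predict_proba` here is a hypothetical stand-in for a trained model:

```python
import math

def mlm_loss(predict_proba, sequence, mask_positions):
    """Masked-language-model objective sketch: hide residues one at a
    time and score the model on recovering them. `predict_proba` is a
    hypothetical callable taking (masked_sequence, position) and
    returning a dict mapping residue -> probability."""
    loss = 0.0
    for i in mask_positions:
        masked = sequence[:i] + "X" + sequence[i + 1:]
        p = predict_proba(masked, i).get(sequence[i], 1e-9)
        loss -= math.log(p)  # negative log-likelihood of the true residue
    return loss / len(mask_positions)
```

Training drives this loss down, which forces the model to internalize the statistical constraints that, at scale, appear to encode structural information.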
OmegaFold and ESMFold are two implementations that resemble AlphaFold but do not use an MSA. Whether these models truly work from a single sequence is questionable: performance is significantly worse for orphan proteins without many homologs in the sequence databases, suggesting that the language models have, in effect, memorized the information in the MSAs.
ESMFold is computationally efficient and has been used to predict the structures of all proteins in an extensive metagenomics database.
At CASP15, ESMFold and OmegaFold performed significantly worse than AlphaFold.
Two years after AlphaFold's revolution in CASP14, CASP15 was held.
Most groups used AlphaFold in some form during CASP15.
The term "AlphaFold" appears on average 1.4 times per page in the CASP15 abstract book.
The standard AlphaFold protocol performed better than over half of the groups.
A few groups showed significant improvements over the standard AlphaFold method for individual proteins and protein assemblies.
Five groups were selected to present at CASP15, describing how they outperformed AlphaFold.
There were three main ways to improve upon standard AlphaFold: improving template usage, increasing sampling through alternative MSA generation methods, and modifying AlphaFold to use dropout.
The best model, as ranked by its estimated quality, was then selected as the final prediction; this generates increased sampling directly from AlphaFold in an efficient manner.
An alternative way to create structural diversity from AlphaFold is to enable dropout at inference time or to disable MSA pairing; some groups used dropout to generate thousands of models per target.
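The sample-many-then-rank idea can be sketched as follows; `predict` is a hypothetical callable (e.g. an AlphaFold run with inference-time dropout enabled, so different seeds give different models) returning a model and its self-estimated confidence:

```python
def sample_models(predict, sequence, n_samples=100, seed0=0):
    """Increased-sampling sketch: generate many stochastic models and
    keep the one with the best self-estimated quality.
    `predict(sequence, seed)` is a hypothetical callable returning a
    (model, confidence) pair."""
    best_model, best_conf = None, float("-inf")
    for i in range(n_samples):
        model, conf = predict(sequence, seed=seed0 + i)
        if conf > best_conf:
            best_model, best_conf = model, conf
    return best_model, best_conf
```

In practice each call is a full AlphaFold inference, so the loop is run in parallel across GPUs.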
Some groups used predicted distances and other constraints from AlphaFold models in their methods.
Several groups used improved template-searching methods.
Some groups used HHsearch instead of hmmsearch to identify templates.
Alternative modeling or rigid-body protein docking protocols were utilized when templates were identified.
The added value of these alternative methods over vanilla AlphaFold was minimal.
Alternative scoring functions were employed to identify the best models, including the ranking confidence built into AlphaFold, scoring functions based on Voronoi surfaces, and deep-learning-based scoring functions.
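Several of these ranking approaches start from AlphaFold's own confidence output. A minimal sketch of how AlphaFold-Multimer's built-in ranking combines the interface (ipTM) and overall (pTM) predicted TM-scores; the 0.8/0.2 weighting follows the AlphaFold-Multimer paper:

```python
def ranking_confidence(iptm, ptm):
    """AlphaFold-Multimer-style model confidence: a weighted mix of the
    interface (ipTM) and overall (pTM) predicted TM-scores."""
    return 0.8 * iptm + 0.2 * ptm

def rank_models(models):
    """Sort models (dicts with 'iptm' and 'ptm' keys) best-first."""
    return sorted(models,
                  key=lambda m: ranking_confidence(m["iptm"], m["ptm"]),
                  reverse=True)
```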
Since its release in 2021, AlphaFold has been used in hundreds of papers, and the field has shown steady progress. CASP15 revealed that running default AlphaFold on challenging multichain targets yields an average TM-score of about 0.7; primarily through increased sampling, this can be raised to about 0.8. For the majority of proteins and protein complexes, AlphaFold can generate a model close to experimental quality.
AE was funded by Vetenskapsrådet Grant No. 202103979 and by the Knut and Alice Wallenberg Foundation. Computations and data handling were enabled by the Berzelius supercomputing resource (Grant No. SNIC 2021/5-297), provided by the National Supercomputer Centre at Linköping University, the Knut and Alice Wallenberg Foundation, and SNIC.
Yang J., Anishchenko I., Park H., Peng Z., Ovchinnikov S., and Baker D. found that using predicted inter-residue orientations improves protein structure prediction. Their work was published in 'Proceedings of the National Academy of Sciences' in 2020.
Xu J., McPartlon M., and Li J. showed that deep learning, independent of co-evolution information, can enhance protein structure prediction. Their research appeared in 'Nature Machine Intelligence' in 2021.
The AlphaFold-Multimer paper discusses retraining AlphaFold to enhance multimer structure predictions.
As of April 2023, three versions of AlphaFold-Multimer have been released: v2.1, v2.2, and v2.3.
Version 2.1, released in December 2021, encountered issues with clashes in disordered regions.
Version 2.2, released in April 2022, addressed and fixed these problems.
Version 2.3, released in December 2022, underwent a complete retraining process and exhibited improved performance.
Wensi Zhu, Aditi Shenoy, Petras Kundrotas and Arne Elofsson evaluated AlphaFold-Multimer's ability to predict multi-chain protein complexes.
AlphaFold-Multimer and FoldDock can accurately model dimers, but their performance on larger complexes is unclear.
The authors analysed AlphaFold-Multimer's performance on a homology-reduced dataset of homo- and heteromeric protein complexes.
They highlighted the differences between the pairwise and multi-interface evaluation of chains within a multimer.
They described why certain complexes perform well on one metric (e.g. TM-score) but poorly on another (e.g. DockQ).
They proposed a new score, Predicted DockQ version 2 (pDockQ2), to estimate the quality of each interface in a multimer.
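The original pDockQ, which pDockQ2 refines to score each interface separately, maps interface confidence and size to an estimated DockQ through a fitted sigmoid. A sketch with the coefficients published for the original score (treat the exact constants as an assumption to verify against the paper):

```python
import math

def pdockq(mean_if_plddt, n_if_contacts):
    """Original pDockQ score (Bryant et al.): a sigmoid of
    x = <interface plDDT> * log(number of interface contacts).
    pDockQ2 extends this idea to estimate the quality of each
    interface in a multimer."""
    if n_if_contacts == 0:
        return 0.0  # no interface, no score
    x = mean_if_plddt * math.log(n_if_contacts)
    return 0.724 / (1 + math.exp(-0.052 * (x - 152.611))) + 0.018
```

Higher interface plDDT and larger interfaces both push the score up, mirroring how DockQ behaves on native-like models.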
They modelled protein complexes (from CORUM) and identified two highly confident structures that do not have sequence homology to any existing structures.
All scripts, models, and data used to perform the analysis in this study are freely available at https://gitlab.com/ElofssonLab/afm-benchmark.
Patrick Bryant, Gabriele Pozzati, Wensi Zhu, Aditi Shenoy, Petras Kundrotas, and Arne Elofsson are the authors of the paper. The paper's title is "Predicting the structure of large protein complexes using AlphaFold and Monte Carlo tree search." The paper was published in Nature Communications in 2022. The paper presents a method for predicting the structure of very large protein complexes using a sequential assembly strategy. AlphaFold can predict the structure of single- and multiple-chain proteins with high accuracy, but accuracy decreases with the number of chains. The size of protein complexes that can be predicted is limited by the available GPU memory. The structure of large complexes can be predicted by starting from predictions of subcomponents. The authors assembled 91 out of 175 complexes with 10-30 chains from predicted subcomponents using Monte Carlo tree search, with a median TM-score of 0.51. There were 30 highly accurate complexes (TM-score ≥0.8, 33% of complete assemblies). A scoring function, mpDockQ, was created to distinguish if assemblies are complete and predict their accuracy. Complexes containing symmetry are accurately assembled, while asymmetrical complexes remain challenging. The method is freely available as a Colab notebook. The paper includes an assembly principle for the acetoacetyl-CoA thiolase/HMG-CoA synthase complex (complex 6ESQ). The structure of all interacting chains is predicted by protein sequences from each chain and the interaction network. An assembly path is constructed using the predictions as a guide, and one new chain is added through a network edge in each step, resulting in a sequential construction of the complex. The paper also describes the Monte Carlo tree search process, which starts from a node (subcomplex) and selects a new node based on the previously backpropagated scores. 
A complete assembly process is simulated by adding nodes randomly until an entire complex is assembled or a stop is reached due to too much overlap. The complex is scored, and the score is backpropagated to all previous nodes, yielding support for the previous selections. The final result is a path containing all chains that are most likely to result in high-scoring complexes. The paper compares the performance of AlphaFold-multimer version 2 (AFM) and the FoldDock protocol using AlphaFold (AF) for predicting pairwise interactions, finding that AFM models are slightly better. The paper also discusses the limitations of AlphaFold for predicting protein complexes with 10-30 chains and presents a graph-traversal algorithm that excludes overlapping interactions, making it possible to assemble large protein complexes in a stepwise fashion. The authors analyze the success rates of assembly using either AFM or FoldDock and examine the use of predicted subcomponents of native dimeric or trimeric sub-components. The paper then presents the final protocol based on all possible trimeric subcomponents without assuming knowledge of the interactions. The authors extracted 175 high-resolution non-redundant complexes from the PDB with more than nine chains and no nucleic acids or interactions from different organisms to analyze the possibility of assembling these protein complexes. The paper includes an example of the assembly of complex 6ESQ using subcomponents predicted with AFM, starting from two dimers (AC and CH) and creating the trimer ACH through superposition. The paper discusses the issue of limited conformational sampling in dimers during assembly due to incorrect or missing interfaces in some dimers and suggests predicting trimeric interactions to generate alternative interfaces. Using all native trimeric interactions, the authors assembled complexes with a median TM-score of 0.80 for FoldDock and 0.74 for AFM. 
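The selection, expansion, simulation, and backpropagation loop described above can be sketched in miniature. Here `score_fn` is a stand-in for mpDockQ and the interaction network is a plain adjacency dict; the real method operates on predicted 3D subcomplexes and rejects assemblies with too much overlap:

```python
import math
import random

class Node:
    """A node holds a subcomplex (a set of chain ids) plus MCTS statistics."""
    def __init__(self, chains, parent=None):
        self.chains = frozenset(chains)
        self.parent = parent
        self.children = {}   # chain id added -> child Node
        self.visits = 0
        self.value = 0.0     # sum of backpropagated scores

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts_assemble(all_chains, edges, score_fn, n_iter=200, seed=0):
    rng = random.Random(seed)
    root = Node({sorted(all_chains)[0]})

    def neighbours(chains):
        # chains reachable through one interaction-network edge
        return {p for c in chains for p in edges.get(c, ()) if p not in chains}

    for _ in range(n_iter):
        node = root
        # selection: descend through fully expanded nodes by UCB
        while node.children and not (neighbours(node.chains) - set(node.children)):
            node = max(node.children.values(), key=Node.ucb)
        # expansion: add one new chain through a network edge
        fresh = sorted(neighbours(node.chains) - set(node.children))
        if fresh:
            chain = rng.choice(fresh)
            node.children[chain] = Node(node.chains | {chain}, parent=node)
            node = node.children[chain]
        # simulation: add chains randomly until complete or stuck
        chains = set(node.chains)
        while chains != set(all_chains):
            nb = sorted(neighbours(chains))
            if not nb:
                break
            chains.add(rng.choice(nb))
        score = score_fn(chains)
        # backpropagation: support earlier selections with the score
        while node is not None:
            node.visits += 1
            node.value += score
            node = node.parent
    return root
```

The most-visited path from the root then gives the assembly order most likely to yield a high-scoring complex.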
The paper analyzes the possibility of distinguishing when a complex is assembled to completion and has a high TM-score, finding that the mpDockQ score can be used for this purpose. The TM-score distribution of assembled complexes using all possible trimeric subcomponents from FoldDock is presented, and the impact of the type of symmetry of the complexes and the accuracy of the subcomponents is discussed. The paper analyzes aspects affecting the assembly of protein complexes, including the kingdom, the number of total chains, the oligomeric type, the number of effective sequences (Neff), the subcomponent accuracy, the type of symmetry, and the interface accuracy between predicted and assembled interfaces. Bacteria is the most abundant kingdom and has the highest fraction of complete assemblies (29/85) with a median TM-score of 0.85. Most complete assemblies have fewer chains and are of homomeric type, and the TM-scores are higher for complexes with higher average Neff values. The average TM-score of the subcomponents provides a good explanation of when an assembled complex is accurate. The symmetry of the complexes is also significant, with dihedral symmetry being the most common and having the highest number of complete assemblies. Asymmetric complexes have low TM-scores, suggesting that only symmetrical complexes can be assembled successfully using subcomponents and MCTS. The paper compares the performance of AFM and FoldDock for predicting the structure of protein complexes with 4-9 chains, finding that FoldDock models outperform AFM overall. The paper also compares the performance of FoldDock and AFM with Multi-LZerD and Haddock, finding that Haddock completed 77 complexes with a median TM-score of 0.29, while Multi-LZerD was unable to complete any complex in the dataset.
Gustaf Ahdritz, Nazim Bouatta, Sachin Kadyan, Qinghui Xia, William Gerecke, Timothy J O’Donnell, Daniel Berenberg, Ian Fisk, Niccolò Zanichelli, Bo Zhang, Arkadiusz Nowaczynski, Bei Wang, Marta M Stepniewska-Dziubinska, Shang Zhang, Adegoke Ojewole, Murat Efe Guney, Stella Biderman, Andrew M Watkins, Stephen Ra, Pablo Ribalta Lorenzo, Lucas Nivon, Brian Weitzner, Yih-En Andrew Ban, Peter K Sorger, Emad Mostaque, Zhao Zhang, Richard Bonneau, and Mohammed AlQuraishi published a paper in bioRxiv.
AlphaFold2 can predict protein structures with high accuracy but lacks the code and data required to train new models. OpenFold is a fast, memory-efficient, and trainable reimplementation of AlphaFold2 that matches it in accuracy, is robust at generalizing, and was observed to learn spatial dimensions sequentially during training. Alongside it, OpenProteinSet, an open-source corpus of more than 16 million MSAs, is the largest public database of protein multiple sequence alignments.
Uni-Fold, created by Ziyao Li, Xuyang Liu, Weijie Chen, Fan Shen, Hangrui Bi, Guolin Ke, and Linfeng Zhang, is an open-source platform for developing protein folding models beyond AlphaFold, presented as a solution to the lack of training utilities in AlphaFold's open-source code. Built on the distributed PyTorch framework Uni-Core, Uni-Fold reimplements AlphaFold and AlphaFold-Multimer in PyTorch, achieving equivalent or better accuracy than AlphaFold, about 2.2 times faster training under a similar hardware configuration, and approximately 2% higher TM-score than AlphaFold-Multimer. It is the only open-source repository that supports both training and inference of multimeric protein models, offers the highest efficiency among existing AlphaFold implementations, and supports fast prediction of large symmetric complexes with UF-Symmetry.
Uni-Fold is an ongoing project aimed at developing better protein folding models. The code, modified from the open-source code of DeepMind's AlphaFold v2.0, is licensed under the permissive Apache License, Version 2.0, and the model parameters are available under the Creative Commons Attribution 4.0 International license. Uni-Fold supports training on a single node without MPI and on multiple GPUs with MPI, inference from features.pkl and FASTA files, and generation of MSAs with MMseqs2.
Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, and Alexander Rives published "Language models of protein sequences at the scale of evolution enable accurate structure prediction" in bioRxiv on 20 July 2022 (DOI: https://doi.org/10.1101/2022.07.20.500902).
The paper's abstract is: "Large language models have recently been shown to develop emergent capabilities with scale, going beyond simple pattern matching to perform higher level reasoning and generate lifelike images and text. While language models trained on protein sequences have been studied at a smaller scale, little is known about what they learn about biology as they are scaled up. In this work we train models up to 15 billion parameters, the largest language models of proteins to be evaluated to date. We find that as models are scaled they learn information enabling the prediction of the three-dimensional structure of a protein at the resolution of individual atoms. We present ESMFold for high accuracy end-to-end atomic level structure prediction directly from the individual sequence of a protein. ESMFold has similar accuracy to AlphaFold2 and RoseTTAFold for sequences with low perplexity that are well understood by the language model. ESMFold inference is an order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic proteins in practical timescales."