doi.bio/casp15_progress

Progress at protein structure prediction, as seen in CASP15

Arne Elofsson

  1. About ten years ago, the idea of separating direct from indirect correlations between positions in a multiple sequence alignment was rediscovered.
  2. Other methods were also proposed.
  3. These methods made it possible to predict the structure of many proteins.
  4. They were limited to large protein families.
  5. The accuracy of the resulting models was limited.
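To make the notion of correlated alignment positions concrete, the following minimal sketch computes the mutual information between two columns of a toy alignment. It is only the raw correlation signal, not a direct-coupling method, and the alignment and function name are invented for illustration.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Hypothetical four-sequence alignment: columns 0 and 1 co-vary perfectly,
# column 2 is conserved, column 3 varies independently of column 0.
toy_msa = ["AKLE", "AKLD", "GRLE", "GRLD"]

print(mutual_information(toy_msa, 0, 1))  # co-varying pair
print(mutual_information(toy_msa, 0, 3))  # independent pair
```

Raw mutual information cannot tell whether two columns co-vary because the residues are in contact or because both are coupled to a third position. Separating those two cases is the step the methods in the list above added.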

At CASP13, DeepMind introduced AlphaFold (version 1). The network architecture used by AlphaFold was deeper than earlier attempts. The network predicted distance probabilities instead of contacts. This resulted in sufficient accuracy to use a simple steepest descent methodology to predict protein structures. Only a tiny fraction of proteins without templates could be predicted at experimental accuracies. Several academic groups quickly reproduced the performance of AlphaFold v1.
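The idea of folding by steepest descent on predicted distances can be sketched as follows. This is a strongly simplified illustration, not AlphaFold v1 itself: it assumes a single target distance per residue pair instead of a full distance probability distribution, and the five points, step size, and function names are made up for the example.

```python
import math
import random


def pair_loss(coords, target):
    """Sum of squared differences between model and target distances."""
    n = len(coords)
    return sum((math.dist(coords[i], coords[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def steepest_descent(coords, target, steps=2000, rate=0.01):
    """Move every point along the negative gradient of pair_loss."""
    coords = [list(p) for p in coords]
    n = len(coords)
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                d = math.dist(coords[i], coords[j])
                if d == 0.0:
                    continue  # gradient is undefined for coincident points
                factor = 2.0 * (d - target[i][j]) / d
                for k in range(3):
                    g = factor * (coords[i][k] - coords[j][k])
                    grads[i][k] += g
                    grads[j][k] -= g
        for i in range(n):
            for k in range(3):
                coords[i][k] -= rate * grads[i][k]
    return coords


# Hypothetical target: distances measured on five made-up points (in Å).
true_points = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0),
               (0.0, 3.8, 0.0), (1.9, 1.9, 3.0)]
target = [[math.dist(p, q) for q in true_points] for p in true_points]

rng = random.Random(0)
start = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in true_points]
final = steepest_descent(start, target)
print(f"loss before: {pair_loss(start, target):.3f}, "
      f"after: {pair_loss(final, target):.3f}")
```

Because distances are unchanged by mirroring, a distance-only objective like this one cannot distinguish a structure from its mirror image.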

DeepMind presented AlphaFold v2.0 at CASP14. The results were impressive: the average GDT_TS was close to 90 for individual domains, so the models were, in principle, as good as experimental structures. In the summer of 2021, the AlphaFold papers were published. The first paper described the method. The second described an extensive database of predicted structures, which has recently been extended to cover virtually all proteins in UniProt.
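For readers unfamiliar with the GDT_TS score, the sketch below shows how it is composed: the percentage of residues within 1, 2, 4, and 8 Å of their reference position, averaged over the four cutoffs. It assumes that model and reference are already superimposed and matched residue by residue, whereas the real score searches over superpositions. The coordinates are made up.

```python
import math


def gdt_ts(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS (0-100) for pre-superimposed, residue-matched coordinates."""
    distances = [math.dist(m, r) for m, r in zip(model, reference)]
    fractions = [sum(d <= cutoff for d in distances) / len(distances)
                 for cutoff in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical four-residue example: the model deviates from the reference
# by 0.5, 1.5, 3.0 and 9.0 Å along the y axis.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]

print(gdt_ts(model, reference))
```

In this example the fractions are 1/4, 2/4, 3/4 and 3/4, so the score is 56.25. A value close to 90 therefore means that almost all residues lie within 1 to 2 Å of the experimental positions.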

DeepMind released the source code of AlphaFold under an open-source license, so the community can test and extend the method. The release spurred a jump in development speed. AlphaFold consists of two main stages: the 48 Evoformer blocks and the structure module. Both stages contain important innovations, inspired by previous academic contributions. AlphaFold can be seen as an engineering feat.

  1. A learnt representation of the raw MSA is used as an input to the Evoformer.
  2. A pair representation is also fed into the Evoformer.
  3. The Evoformer is a two-track network.
  4. The Evoformer uses row-wise and column-wise self-attention to update the MSA representation.
  5. An outer product is used to update the pair representation.
  6. Triangle updates are a key innovation, capable of learning to maintain the triangle inequality.
  7. The output of one Evoformer block is fed into the next.
  8. The final pair representation is fed into the structure module, alongside the representation of the target sequence.
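Two of the steps listed above, the outer-product update and the triangle update, can be illustrated with a deliberately reduced sketch. It assumes a single scalar channel per entry and omits the learned projections, gating, and normalisation of the real Evoformer. The function names are chosen for this example and are not taken from the AlphaFold code.

```python
def outer_product_mean(msa):
    """Pair update from the MSA track: mean over sequences of m[s][i] * m[s][j]."""
    n_seq, n_res = len(msa), len(msa[0])
    return [[sum(row[i] * row[j] for row in msa) / n_seq
             for j in range(n_res)] for i in range(n_res)]


def triangle_update_outgoing(pair):
    """Update edge (i, j) from the two edges (i, k) and (j, k) of each triangle.

    Assumption: the raw pair values stand in for the gated linear
    projections that a trained network would apply first.
    """
    n = len(pair)
    return [[pair[i][j] + sum(pair[i][k] * pair[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]


# Hypothetical MSA representation: 2 sequences x 2 residues, one channel.
msa_repr = [[1.0, 2.0], [3.0, 4.0]]
pair_from_msa = outer_product_mean(msa_repr)

# Hypothetical pair representation for the same two residues.
pair_repr = [[0.0, 1.0], [1.0, 0.0]]
pair_updated = triangle_update_outgoing(pair_repr)

print(pair_from_msa)
print(pair_updated)
```

The point of the triangle update is that edge (i, j) is never updated in isolation: it always sees the other two edges of every triangle i, j, k. This is what allows the network to learn geometric consistency such as d_ij ≤ d_ik + d_kj.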

Thanks to a detailed presentation of the algorithm at the CASP conference and the release of the code as open-source in June 2021, AlphaFold sparked tremendous activity in the entire community, which also expanded rapidly.

Many labs use the AlphaFold paper and code.

An essential tool is ColabFold, which runs as a Jupyter notebook on Google's Colab platform.

MMseqs2 is used for sequence searching, making model-building extremely rapid.

The interface is straightforward and easy to use, demonstrating the importance of open science.

DeepMind developed a version of AlphaFold specifically for predicting the structure of complexes. The first version was released in November 2021. This initial version tended to produce models in which chains overlap. The second version was released in April 2022. It performs slightly better than hacks using the original AlphaFold and can accurately predict the structure of about half of all complexes with up to 6 chains. It is feasible to predict the structure of larger complexes or to modify predicted subcomplexes. An updated and retrained version has since been released.

Since its release in 2021, AlphaFold has been used in hundreds of papers, and the field has shown steady progress. CASP15 showed that default AlphaFold reaches an average TM-score of about 0.7 on challenging multichain targets, and that improvements based primarily on increased sampling can raise this to about 0.8. For the majority of proteins and protein complexes, AlphaFold can generate a model close to experimental quality.
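To put TM-scores of 0.7 and 0.8 into context, the sketch below evaluates the standard TM-score formula, in which each aligned residue contributes 1 / (1 + (d / d0)^2) and d0 = 1.24 (L - 15)^(1/3) - 1.8 depends on the target length L. It assumes that the per-residue distances come from a fixed superposition, whereas the real score maximises over superpositions. The lower bound of 0.5 Å on d0 is a guard added in this sketch for short chains.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in Å) of the TM-score."""
    if target_length <= 15:
        return 0.5  # guard used in this sketch; the formula is undefined here
    return max(1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(distances, target_length):
    """TM-score for given per-residue distances (Å) after superposition."""
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Hypothetical 100-residue target.
length = 100
perfect = tm_score([0.0] * length, length)
all_at_d0 = tm_score([tm_d0(length)] * length, length)

print(perfect, all_at_d0)
```

A residue that lies exactly d0 from its reference position contributes one half, so a score of 0.5 corresponds roughly to a typical deviation of d0. Residues missing from the model contribute nothing, because the sum is divided by the target length.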

AE was funded by Vetenskapsrådet (Grant No. 202103979) and by the Knut and Alice Wallenberg Foundation. Computations and data handling were enabled by the supercomputing resource Berzelius, which is provided by the National Supercomputer Centre at Linköping University, the Knut and Alice Wallenberg Foundation, and SNIC (grant SNIC 2021/5-297).

Dmitry A. Afonnikov, Yu. V. Kondrakhin, I. I. Titov, and N. A. Kolchanov (all at the Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Lavrentieva ave., 10, Novosibirsk, 630090, Russia) wrote 'Detecting direct correlation between positions in multiple alignment of amino-acid sequences'. It appeared in 'Computer science and biology. Genome informatics: Function, structure, phylogeny', edited by Frishman D. and Mewes H. W.

Yang J., Anishchenko I., Park H., Peng Z., Ovchinnikov S., and Baker D. found that using predicted inter-residue orientations improves protein structure prediction. Their work was published in 'Proceedings of the National Academy of Sciences' in 2020.

Xu J., McPartlon M., and Li J. showed that deep learning, independent of co-evolution information, can enhance protein structure prediction. Their research appeared in 'Nature Machine Intelligence' in 2021.

The AlphaFold-Multimer paper discusses retraining AlphaFold to enhance multimer structure predictions.

As of April 2023, three versions of AlphaFold-Multimer have been released: v2.1, v2.2, and v2.3.

Version 2.1, released in December 2021, encountered issues with clashes in disordered regions.

Version 2.2, released in April 2022, addressed and fixed these problems.

Version 2.3, released in December 2022, underwent a complete retraining process and exhibited improved performance.

Wensi Zhu, Aditi Shenoy, Petras Kundrotas and Arne Elofsson evaluated AlphaFold-Multimer's ability to predict multi-chain protein complexes.

AlphaFold-Multimer and FoldDock can accurately model dimers, but their performance on larger complexes is unclear.

The authors analysed AlphaFold-Multimer's performance on a homology-reduced dataset of homo- and heteromeric protein complexes.

They highlighted the differences between the pairwise and multi-interface evaluation of chains within a multimer.

They described why certain complexes perform well on one metric (e.g. TM-score) but poorly on another (e.g. DockQ).

They proposed a new score, Predicted DockQ version 2 (pDockQ2), to estimate the quality of each interface in a multimer.

They modelled protein complexes (from CORUM) and identified two highly confident structures that do not have sequence homology to any existing structures.

All scripts, models, and data used to perform the analysis in this study are freely available at https://gitlab.com/ElofssonLab/afm-benchmark.

Patrick Bryant, Gabriele Pozzati, Wensi Zhu, Aditi Shenoy, Petras Kundrotas, and Arne Elofsson published "Predicting the structure of large protein complexes using AlphaFold and Monte Carlo tree search" in Nature Communications in 2022. The paper presents a method for predicting the structure of very large protein complexes using a sequential assembly strategy. AlphaFold can predict the structure of single- and multiple-chain proteins with high accuracy, but accuracy decreases with the number of chains, and the size of the complexes that can be predicted is limited by the available GPU memory. The structure of large complexes can instead be predicted by starting from predictions of subcomponents.

The authors assembled 91 out of 175 complexes with 10-30 chains from predicted subcomponents using Monte Carlo tree search, with a median TM-score of 0.51. There were 30 highly accurate complexes (TM-score ≥0.8, 33% of complete assemblies). A scoring function, mpDockQ, was created to distinguish whether assemblies are complete and to predict their accuracy. Complexes containing symmetry are accurately assembled, while asymmetrical complexes remain challenging. The method is freely available as a Colab notebook.

The paper illustrates the assembly principle with the acetoacetyl-CoA thiolase/HMG-CoA synthase complex (complex 6ESQ). The structures of all interacting chains are predicted from the protein sequence of each chain and the interaction network. An assembly path is constructed using the predictions as a guide, and one new chain is added through a network edge in each step, resulting in a sequential construction of the complex.

The Monte Carlo tree search starts from a node (subcomplex) and selects a new node based on the previously backpropagated scores. A complete assembly process is then simulated by adding nodes randomly until an entire complex is assembled or a stop is reached due to too much overlap. The complex is scored, and the score is backpropagated to all previous nodes, yielding support for the previous selections. The final result is a path containing all chains that is most likely to result in a high-scoring complex.

For predicting pairwise interactions, the paper compares AlphaFold-Multimer version 2 (AFM) with the FoldDock protocol using AlphaFold (AF) and finds that AFM models are slightly better. It also discusses the limitations of AlphaFold for predicting protein complexes with 10-30 chains and presents a graph-traversal algorithm that excludes overlapping interactions, making it possible to assemble large protein complexes in a stepwise fashion. The authors analyze the success rates of assembly using either AFM or FoldDock and examine the use of predictions of native dimeric or trimeric subcomponents. The final protocol is based on all possible trimeric subcomponents, without assuming knowledge of the interactions.

To analyze the possibility of assembling protein complexes, the authors extracted 175 high-resolution non-redundant complexes from the PDB with more than nine chains and no nucleic acids or interactions from different organisms. As an example, complex 6ESQ is assembled from subcomponents predicted with AFM, starting from two dimers (AC and CH) and creating the trimer ACH through superposition. Conformational sampling during assembly from dimers is limited because some dimers have incorrect or missing interfaces, and the paper suggests predicting trimeric interactions to generate alternative interfaces. Using all native trimeric interactions, the authors assembled complexes with a median TM-score of 0.80 for FoldDock and 0.74 for AFM.

The paper finds that the mpDockQ score can be used to distinguish when a complex is assembled to completion and has a high TM-score. It presents the TM-score distribution of complexes assembled using all possible trimeric subcomponents from FoldDock and discusses the impact of the symmetry type of the complexes and the accuracy of the subcomponents. It also analyzes aspects affecting the assembly of protein complexes: the kingdom, the total number of chains, the oligomeric type, the number of effective sequences (Neff), the subcomponent accuracy, the type of symmetry, and the interface accuracy between predicted and assembled interfaces.

Bacteria is the most abundant kingdom and has the highest fraction of complete assemblies (29/85), with a median TM-score of 0.85. Most complete assemblies have fewer chains and are of homomeric type, and the TM-scores are higher for complexes with higher average Neff values. The average TM-score of the subcomponents provides a good explanation of when an assembled complex is accurate. The symmetry of the complexes is also significant, with dihedral symmetry being the most common and having the highest number of complete assemblies. Asymmetric complexes have low TM-scores, suggesting that only symmetrical complexes can be assembled successfully using subcomponents and MCTS.

For protein complexes with 4-9 chains, the paper compares AFM and FoldDock and finds that FoldDock models outperform AFM overall. It also compares FoldDock and AFM with Multi-LZerD and Haddock: Haddock completed 77 complexes with a median TM-score of 0.29, while Multi-LZerD was unable to complete any complex in the dataset.
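The Monte Carlo tree search described in this summary (selection, random simulation, scoring, backpropagation) can be sketched on a toy interaction network. Everything below is hypothetical and much simpler than the published method. The chains, edge scores, and scoring function are invented, structures are never superimposed, and the stop on overlapping chains is left out.

```python
import math
import random

# Hypothetical interaction network: edge -> interface quality in [0, 1].
EDGES = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.2, ("C", "D"): 0.7}
CHAINS = sorted({chain for edge in EDGES for chain in edge})


def edge_score(a, b):
    return EDGES.get((a, b), EDGES.get((b, a)))


def candidates(path):
    """Chains not yet placed that share a network edge with a placed chain."""
    return [c for c in CHAINS if c not in path
            and any(edge_score(c, p) is not None for p in path)]


def assembly_score(path):
    """Toy score: best edge used to attach each chain, averaged over all chains."""
    total = sum(max(edge_score(path[i], p) or 0.0 for p in path[:i])
                for i in range(1, len(path)))
    return total / (len(CHAINS) - 1)


class Node:
    def __init__(self, path, parent=None):
        self.path, self.parent = path, parent
        self.children, self.visits, self.total = {}, 0, 0.0


def ucb1(parent, child):
    return (child.total / child.visits
            + math.sqrt(2.0 * math.log(parent.visits) / child.visits))


def search(start, iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node((start,))
    for _ in range(iterations):
        node = root
        while True:  # selection: descend while every option has been tried
            options = candidates(node.path)
            untried = [c for c in options if c not in node.children]
            if untried or not options:
                break
            node = max(node.children.values(), key=lambda ch: ucb1(node, ch))
        if untried:  # expansion: add one new chain through a network edge
            chain = rng.choice(untried)
            node.children[chain] = Node(node.path + (chain,), parent=node)
            node = node.children[chain]
        rollout = list(node.path)  # simulation: add chains at random
        while candidates(rollout):
            rollout.append(rng.choice(candidates(rollout)))
        value = assembly_score(rollout)
        while node is not None:  # backpropagation to all previous nodes
            node.visits += 1
            node.total += value
            node = node.parent
    node = root
    while node.children:  # read out the most-visited path
        node = max(node.children.values(), key=lambda ch: ch.visits)
    return list(node.path)


best_path = search("A")
print(best_path, assembly_score(best_path))
```

In this toy network, attaching C directly to A uses the weak A-C edge, so the search learns to add B first and then reach C through the stronger B-C edge.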

Gustaf Ahdritz, Nazim Bouatta, Sachin Kadyan, Qinghui Xia, William Gerecke, Timothy J O’Donnell, Daniel Berenberg, Ian Fisk, Niccolò Zanichelli, Bo Zhang, Arkadiusz Nowaczynski, Bei Wang, Marta M Stepniewska-Dziubinska, Shang Zhang, Adegoke Ojewole, Murat Efe Guney, Stella Biderman, Andrew M Watkins, Stephen Ra, Pablo Ribalta Lorenzo, Lucas Nivon, Brian Weitzner, Yih-En Andrew Ban, Peter K Sorger, Emad Mostaque, Zhao Zhang, Richard Bonneau, and Mohammed AlQuraishi published a paper in bioRxiv.

AlphaFold2 can predict protein structures with high accuracy.

The public AlphaFold2 release lacks the code and data required to train new models.

OpenFold is a fast, memory-efficient, and trainable implementation of AlphaFold2.

OpenProteinSet is the largest public database of protein multiple sequence alignments.

OpenFold matches AlphaFold2 in accuracy.

OpenFold is robust at generalizing.

OpenFold learns spatial dimensions sequentially.

OpenProteinSet is an open-source corpus of more than 16 million MSAs.









