Alphafold: (Nearly) One Year On

During 2020’s CASP14 challenge (Critical Assessment of Structure Prediction), Google’s DeepMind showcased Alphafold2, their revised protein folding model, and to say it was impressive is an understatement. The model won the competition with a median Global Distance Test (GDT) score of 92.4 (Figure 1), where a score above 90 is considered competitive with experimentally determined structures, beating the more than 100 other teams that entered [1]. The announcement sent shockwaves through the biomedical community, with many articles stating that ‘the protein folding problem has been solved’, and set a new benchmark in the field of predictive protein modelling. If this is true, it has huge scientific ramifications. However, a caveat soon emerged: Google had not open-sourced their code and, as of March 2021, had not announced any intention to do so. This raised several questions about Google’s intentions for this powerful new technology. How would it be licensed? Would it be a paid Application Programming Interface (API) service? Would it be exclusive to paying pharmaceutical companies or corporate collaborators? These fears were laid to rest in July 2021, when DeepMind released the source code for Alphafold2 under the Apache 2.0 licence, allowing anyone to use the technology. This post explores how Alphafold2 and the field of predictive protein modelling have changed since then.

Figure 1: Chart showing the performance of the Alphafold2 model during the previous CASP events highlighting the advances in the GDT scores.
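
To make the GDT score more concrete, here is a minimal Python sketch of the commonly used GDT_TS variant: the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues whose predicted position lies within that cutoff of the experimental position. It assumes the per-residue distances have already been measured after a single superposition of the two structures; the full assessment searches over many superpositions to maximise each percentage, which this sketch does not attempt. The function name `gdt_ts` and the example distances are illustrative only.

```python
from typing import Sequence

# Distance cutoffs (in angstroms) used by the GDT_TS variant of the score.
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(distances: Sequence[float]) -> float:
    """Return a GDT_TS-style score (0-100) from per-residue distances.

    `distances` holds, for each residue, the distance in angstroms between
    its predicted and experimental position after superposition.
    Assumption: one fixed superposition, not the optimal one per cutoff.
    """
    if len(distances) == 0:
        raise ValueError("at least one residue distance is required")
    n_residues = len(distances)
    # Fraction of residues falling within each cutoff.
    fractions = [
        sum(1 for d in distances if d <= cutoff) / n_residues
        for cutoff in GDT_TS_CUTOFFS
    ]
    # The score is the mean of the four fractions, expressed as a percentage.
    return 100.0 * sum(fractions) / len(fractions)


if __name__ == "__main__":
    # Hypothetical distances for a five-residue toy example.
    print(round(gdt_ts([0.5, 1.5, 3.0, 6.0, 10.0]), 1))
```

Under this definition, a score above 90 means that nearly every residue sits within a few ångströms of its experimentally determined position.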

The open-source nature of Alphafold2’s software has led to a Cambrian explosion of development in how to modify, deploy, and use the model. This has been particularly useful to the average user because of the sizeable resources initially needed to run it, e.g. 2 TB of homology data and access to powerful, expensive graphics processing units. While obtainable, this is still a large requirement outside of dedicated computational biology groups. Not only that, but the code was not in an easily deployable format. Early on, bioinformatics groups worked to simplify the environment for end-users by deploying the model on Google Colab as ColabFold [2]. Here, users have access to a remote compute cluster and to homology data that is accessible over an API. Consequently, anyone could now run the Alphafold2 model on the protein sequence of their choice, allowing for much faster experimentation.

Case studies

Databases

In July, when the Alphafold2 article was published in Nature [3] (Jumper et al., 2021), it was shortly followed by an associated publication on the structural prediction of the human proteome [4] (Tunyasuvunakool et al., 2021), the data from which have been collated into a user-friendly database called AlphafoldDB [5]. The release of the human proteome to AlphafoldDB was accompanied by a further 20 key proteomes, including Mus musculus, Drosophila melanogaster, and Plasmodium falciparum. The database was established through a collaboration between Google’s DeepMind and EMBL’s European Bioinformatics Institute, and showcases the possibility of predictively modelling the entire proteomes of organisms and collating the data in an easy-to-use environment. Following this initial release, the goal is to expand the database to contain representative predicted structures for the entire UniRef90 library, representing over 100 million non-redundant sequences. As of January 2022, an additional 27 proteomes have been curated, including 17 proteomes from organisms responsible for neglected tropical diseases afflicting up to 1 billion people globally.

Structural search engines

The rapid expansion of protein models generated by Alphafold2 has created another issue: how to structurally compare these models to existing structures? Currently, when a new protein structure is solved, it can be compared against the existing structural database through servers such as Dali [6] or PDBeFold [7] (Krissinel et al., 2004). This process uses secondary structure matching to identify structurally related proteins. It is particularly useful for uncharacterised proteins, where the protein sequence alone is not sufficient to classify the protein functionally. With Alphafold2 models, this process has gained an extra dimension. Not only can we predictively model, for example, an uncharacterised protein and pass the prediction through these structural comparison pipelines, but the vast number of models being generated by Alphafold2 can itself be used as a library of structures to search against. This may prove extremely useful for grouping and classifying structurally orphan proteins (proteins with no structural relationship in the database). Servers that can leverage the new Alphafold2 models can therefore function as structural search engines, allowing for this rapid comparison. An example in development is Foldseek [8], which expands on the existing structural data and combines it with the predicted proteomes generated by Alphafold2.

Multimers

During CASP14, Alphafold2 was only entered into the regular protein folding category, i.e., single polypeptides, and was not optimised to predict protein complexes; upon release, the model was designed to parse only a single protein sequence. Nevertheless, this did not stop people from experimenting. Users soon discovered that adding a lengthy poly-G linker between two sequences could bypass this limitation by exploiting how the paired homology modelling is interpreted, ‘tricking’ the model into folding two proteins and docking them together. Despite the model never having been trained on data from protein complexes, it performed surprisingly well [9]. Furthermore, the same approach can be used on overlapping, truncated fragments of interacting subunits, folding them in sections to reduce the computational load before stitching each predicted sub-complex into the entire assembly. Subsequently, this community effort encouraged DeepMind to re-train the Alphafold2 model to specifically take complexes into account, producing Alphafold-multimer, in which two or more sequences can be passed to the model for structural prediction (Figure 2) [10]. This is a good example of how the open-source nature of the system is driving its development.

Figure 2 – Multimeric prediction of the interferon gamma signalling complex consisting of A2B2C2 generating a hexameric assembly with a TM-score of 97.4 [10].
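
The poly-G linker trick described above amounts to simple string manipulation on the input sequences. The following Python sketch shows one way it might be done; it is not the code used by the community notebooks. The function name `fuse_with_glycine_linker` is hypothetical, and the default linker length of 30 residues is an arbitrary assumption, since the text only says the linker was ‘lengthy’.

```python
# One-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def fuse_with_glycine_linker(chain_a: str, chain_b: str,
                             linker_length: int = 30) -> str:
    """Join two protein sequences into one using a poly-glycine linker.

    The fused sequence can be given to a single-chain predictor so that
    both proteins are folded in one run.
    Assumption: linker_length=30 is illustrative, not a recommended value.
    """
    if linker_length < 1:
        raise ValueError("linker_length must be at least 1")
    chains = []
    for name, chain in (("chain_a", chain_a), ("chain_b", chain_b)):
        chain = chain.strip().upper()
        # Reject empty input and any letter outside the standard 20.
        if not chain or set(chain) - STANDARD_AMINO_ACIDS:
            raise ValueError(
                f"{name} must be a non-empty one-letter amino acid sequence"
            )
        chains.append(chain)
    return chains[0] + "G" * linker_length + chains[1]


if __name__ == "__main__":
    # Two short, made-up peptide sequences for demonstration only.
    print(fuse_with_glycine_linker("MKTAYIAK", "AVLDERW", linker_length=10))
```

In the resulting model, the linker residues would be ignored when interpreting the predicted interface. With Alphafold-multimer this workaround is no longer needed, as separate sequences can be passed to the model directly.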

CryoEM docking

The development of cryoEM in recent years has advanced at such a pace that solving large structural systems has become commonplace in the published literature, and with each new EM structure the size limit keeps expanding. This creates its own problem: these large assemblies consist of an ever-growing number of protein subunits, each of which must be accurately modelled into the electron density maps. A classical approach would be to solve the structures of individual domains by X-ray crystallography, dock these models into the EM maps, and then modify the final EM models to reflect the maps. This presents an obvious issue: each domain must either have already been solved or be capable of being solved, which is not always the case. Enter Alphafold2, which can rapidly generate these smaller domain structures to be docked into the EM maps and modified where appropriate. This is particularly useful when trying to model the correct register of amino acids in secondary structural elements, e.g., β-strands, where it can be easy to deviate from the correct path! This hybrid approach of combining Alphafold2 with EM docking has been elegantly demonstrated in the recent release of the structure of the human nuclear pore complex [11], a 120 MDa complex consisting of over 100 protein components (Figure 3). For perspective, that is over 15x the size of the ribosome!

https://twitter.com/jankosinski/status/1453530872427253765?lang=en-GB [12]

Figure 3 – Animation of the human nuclear pore complex, highlighting its complexity. The animation displays the inherent flexibility that the pore can exhibit.

Molecular replacement

The most common source of phasing in X-ray crystallography is molecular replacement, the process of taking the phase information from a related protein structure and applying it to the experimental data collected from the new protein crystal. The obvious limitation of this method is that a sufficiently similar protein model or model ensemble must already have been solved! Molecular replacement pipelines such as MrBUMP [13] have been developed to automatically find, prepare, and test candidate models, in the hope that one is similar enough to yield a solution. Alphafold2, in essence, streamlines this process by presenting a single model to be tested, and with striking success: for the first time, structural data submitted to CASP were solved through molecular replacement using the very models generated for the test itself [14]! This has a profound effect on structure determination pipelines, with the potential to shift the largest barrier in protein structure determination away from solving the phase problem and towards generating suitable protein crystals. Automated molecular replacement pipelines are already being modified to take advantage of Alphafold2-generated models, speeding up the molecular replacement process and providing more robust initial models that need only minor model building [13].

Weird and wacky

Beyond the conventional applications Alphafold2 was designed for, some more weird and wacky ideas have come about from the model’s plasticity, including ‘predicting’ the fold of scientists’ favourite poems and songs, albeit with the caveat that the text must be converted to a form of English without the letters B, J, O, U, X, and Z, as they are not represented in the single-letter code for amino acids! Here is an example of the opening verse from Queen’s ‘Bohemian Rhapsody’ modelled using Alphafold2 (Figure 4).

Figure 4 – Alphafold2 model of the opening verse of Queen’s ‘Bohemian Rhapsody’.
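
For anyone wanting to try this with their own favourite verse, the conversion can be sketched in a few lines of Python. This is an illustrative guess at the preprocessing, not the procedure used to produce Figure 4: it simply upper-cases the text and drops every character that is not one of the 20 standard single-letter amino acid codes, which removes B, J, O, U, X, and Z along with spaces and punctuation. The function name `text_to_sequence` is hypothetical.

```python
# One-letter codes of the 20 standard amino acids; B, J, O, U, X and Z
# are the six letters of the alphabet that are absent from this set.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def text_to_sequence(text: str) -> str:
    """Convert free text into a pseudo-protein sequence.

    Assumption: unsupported letters are dropped rather than substituted,
    and whitespace, digits and punctuation are removed as well.
    """
    return "".join(ch for ch in text.upper() if ch in STANDARD_AMINO_ACIDS)


if __name__ == "__main__":
    print(text_to_sequence("Hello, world!"))  # prints HELLWRLD
```

The resulting string can then be submitted to the model like any other sequence, although the prediction has no biological meaning.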

Limitations

For all of the strengths Alphafold2 has displayed, it is not without its limitations. Alphafold2 has shown that it can predict the structures of proteins that contain defined secondary structure; however, it is significantly limited in regions of disorder. This limits the model’s coverage of known proteins, as it is predicted that up to 10 % of all proteins are inherently disordered and ~ 40 % of eukaryotic proteins have a loop of > 50 residues that is thought to be disordered [15]. Furthermore, proteins often require additional co-factors and/or metal ions to complete their tertiary structure. The model is currently not trained on these, so they will not be taken into account during the modelling process.

While Alphafold2’s release on July 15 was accompanied by considerable media attention, with not only the open-source release of the code but also a description of the methods in a Nature article [3], it is less well known that the Baker academic group published a structure prediction model of their own on the same day, this time in the journal Science. RoseTTAFold is a deep-learning algorithm similar to Alphafold2 in both methodology and performance [16]. Upon release, RoseTTAFold had some advantages over Alphafold2, including a lower resource requirement and direct incorporation into the group’s Robetta server, allowing simple and rapid use by researchers through a web interface. Despite this, the release of RoseTTAFold is often overshadowed, no doubt due to DeepMind’s marketing approach.

Further possibilities and the future

For everything that has been achieved by Alphafold2 this past year, there are numerous routes still to be explored in its future development. One of these is Isomorphic Labs [17], another spin-out of Alphabet. Isomorphic Labs plans to leverage Alphafold2 to reimagine and accelerate the drug development pipeline, a process traditionally reliant on experimental modelling of protein:drug complexes, and to model the fundamental mechanisms of life. They will be joining companies like Insitro and Exscientia in the race to use artificial intelligence to drive drug development.

Alphafold2 is undoubtedly an impressive breakthrough in the use of deep learning in the field of structural biology. It has already led to some impressive discoveries, as described above, and has a prosperous future that will surely lead to huge advances in biology and medicine. However, “The age of information has not yet become the age of understanding” (from ‘The Sovereign Individual’). The ability of Alphafold2 to generate a model for whatever protein sequence you provide, irrespective of context, raises questions about the ease with which these models can be produced and about how this information will be disseminated to, and understood by, an audience broader than structural biologists, who by and large have until now been the gatekeepers of generating protein models through the meticulous analysis of structural data. There is a real chance that these newly predicted models will be over-interpreted and, without a deep understanding of and respect for what protein models can provide, act as a false foundation on which future work is built. Ultimately, it will require synergistic development between structural biology and the rapidly developing field of computational biology to truly leverage this new technology to the best of its potential.

Written by Dr Sam Dix.

References

1 – Deepmind article on Alphafold. https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology.

2 – Mirdita, M., Schütze, K., Moriwaki, Y., Heo, L., Ovchinnikov, S., & Steinegger, M. (2022). ColabFold – Making protein folding accessible to all. BioRxiv, 2021.08.15.456425.

3 – Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), 583–589.

4 – Tunyasuvunakool, K., Adler, J., Wu, Z., Green, T., Zielinski, M., Žídek, A., Bridgland, A., Cowie, A., Meyer, C., Laydon, A., Velankar, S., Kleywegt, G. J., Bateman, A., Evans, R., Pritzel, A., Figurnov, M., Ronneberger, O., Bates, R., Kohl, S. A. A., … Hassabis, D. (2021). Highly accurate protein structure prediction for the human proteome. Nature, 596(7873), 590–596.

5 – AlphafoldDB. https://alphafold.ebi.ac.uk/.

6 – Holm, L., & Rosenström, P. (2010). Dali server: conservation mapping in 3D. Nucleic Acids Research, 38, W545–W549.

7 – Krissinel, E., & Henrick, K. (2004). Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallographica Section D-Biological Crystallography, 60, 2256–2268.

8 – Kempen, M. van, Kim, S. S., Tumescheit, C., Mirdita, M., Söding, J., & Steinegger, M. (2022). Foldseek: fast and accurate protein structure search. BioRxiv, 2022.02.07.479398.

9 – Sergey Ovchinnikov (@sokrypton). Twitter post: Surprisingly, works quite well, even when the sequences from each gene are not paired! Seems learned to pair sequences!? “Artifact” jackhammer MSAs (that tends to split domains into separate sequences)? Colab notebook. https://twitter.com/sokrypton/status/1417628739828162565.

10 – Evans, R., O’Neill, M., Pritzel, A., Antropova, N., Senior, A., Green, T., Žídek, A., Bates, R., Blackwell, S., Yim, J., Ronneberger, O., Bodenstein, S., Zielinski, M., Bridgland, A., Potapenko, A., Cowie, A., Tunyasuvunakool, K., Jain, R., Clancy, E., … Hassabis, D. (2022). Protein complex prediction with AlphaFold-Multimer. BioRxiv, 2021.10.04.463034.

11 – Mosalaganti, S., Obarska-Kosinska, A., Siggel, M., Turonova, B., Zimmerli, C. E., Buczak, K., Schmidt, F. H., Margiotta, E., Mackmull, M.T., Hagen, W., Hummer, G., Beck, M., & Kosinski, J. (2021). Artificial intelligence reveals nuclear pore complexity. BioRxiv, 2021.10.26.465776.

12 – Jan Kosinski (@jankosinski). Twitter post: We combined AlphaFold and cryoEM to build a new model of the nuclear pore complex, the largest complex in the human cell! The structure covered by the model is 15x bigger than the human ribosome and 2x larger than old nuclear pore models. https://twitter.com/jankosinski/status/1453530872427253765?lang=en-GB.

13 – Keegan, R. M., & Winn, M. D. (2007). Automated search-model discovery and preparation for structure solution by molecular replacement. Acta Crystallographica Section D: Biological Crystallography, 63(4), 447–457.

14 – McCoy, A. J., Sammito, M. D., & Read, R. J. (2022). Implications of AlphaFold2 for crystallographic phasing by molecular replacement. Acta Crystallographica. Section D, Structural Biology, 78(1), 1–13.

15 – Tompa, P. (2002). Intrinsically unstructured proteins. Trends in Biochemical Sciences, 27(10), 527–533.

16 – Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G. R., Wang, J., Cong, Q., Kinch, L. N., Dustin Schaeffer, R., Millán, C., Park, H., Adams, C., Glassman, C. R., DeGiovanni, A., Pereira, J. H., Rodrigues, A. v., van Dijk, A. A., Ebrecht, A. C., … Baker, D. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557), 871–876.

17 – Isomorphic Labs. https://www.isomorphiclabs.com/
