AlphaFold 2

Daniel Lim (St Catharine’s). September 22, 2021.

The protein folding problem

Proteins are often described as the building blocks of life, the foundation on which almost all biological processes rely. Composed of chains of amino acids, proteins have been recognized as a distinct class of biological molecules since 1838. The first complete amino acid sequence of a protein, that of insulin, was determined by Frederick Sanger, earning him the 1958 Nobel Prize in Chemistry. Four years later, Max Perutz and John Kendrew received the 1962 Nobel Prize in Chemistry for their work in determining the first atomic structures of proteins using X-ray crystallography. Today, the structures of around 100,000 unique proteins have been determined, but this barely scratches the surface of the billions of protein sequences known to us.

The function of a protein is heavily dependent on its unique 3D structure. When a protein is synthesized, it starts as a chain of amino acids which rapidly folds into its specific three-dimensional conformation based on the chemistry of the amino acid sequence. The 20 different types of amino acids interact with each other, some repelling and some attracting one another, forming curls, loops and pleated sheets in an act of spontaneous yet intricate origami. Only in this folded state can the protein carry out its function properly.

In 1972, biochemist Christian Anfinsen postulated that all the information needed for a protein to assume its final conformation was encoded in its amino acid sequence. However, a major hurdle is the tremendous number of potential configurations that a protein could have, estimated to be 10^300 for a typical protein. Checking them all by brute force would take longer than the age of the universe. The challenge of determining the 3D structure of a protein solely from its amino acid sequence is known as the protein folding problem.

Most protein structures known today were found by experimental techniques such as X-ray crystallography and nuclear magnetic resonance, but these techniques can take years of painstaking trial and error and require expensive specialised equipment. A method to computationally predict a protein’s 3D structure would drastically reduce the resources needed to determine it, giving us a better understanding of the protein’s function and mechanism.

The search for a solution

To monitor progress and evaluate the accuracy of new prediction techniques, professors John Moult and Krzysztof Fidelis founded the Critical Assessment of protein Structure Prediction (CASP) in 1994. CASP is both a global community forum and a biennial assessment used as the gold standard for structure prediction methods. The assessment uses proteins whose structures have only very recently been experimentally determined, and so are not yet published. Participants must predict the structures of these proteins, and their predictions are compared against the experimental data. Each prediction is then given a score based on the global distance test (GDT), which ranges from 0 to 100, with 100 indicating complete agreement.
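The idea behind the GDT score can be sketched in a few lines of Python. The version below is a simplification: it assumes the predicted and experimental structures have already been superimposed and uses one point per amino acid, whereas the real test also searches for the best superposition. The function name and the toy coordinates are made up for illustration.

```python
import math


def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS score between 0 and 100.

    For each distance cutoff (in Angstroms), find the fraction of residues
    whose predicted position lies within that cutoff of the experimental
    position, then average the fractions over all cutoffs.
    Assumes both structures are already superimposed.
    """
    if not predicted or len(predicted) != len(experimental):
        raise ValueError("structures must be non-empty and of equal length")
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical four-residue example: errors of 0.5, 1.5, 3.0 and 10.0 Angstroms.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]
print(gdt_ts(predicted, experimental))  # 56.25
```

In this toy case one residue is within 1 Angstrom, two within 2, and three within both 4 and 8, so the score is the average of 25, 50, 75 and 75.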

In 2018, DeepMind Technologies, a subsidiary of Google’s parent company Alphabet, entered CASP13 with AlphaFold. DeepMind used machine learning to train its neural networks to predict two properties of the protein: the distances between pairs of amino acids and the angles between the bonds connecting those amino acids. These results were then aggregated and compared with known protein structures to construct a hypothetical 3D structure. This model was improved through gradient descent, a technique used in machine learning to make incremental improvements. On the most difficult targets, those in the free-modelling category, AlphaFold achieved a median GDT score of 58.9, and it placed first in the overall ranking.
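Gradient descent is easier to picture with a toy example. The sketch below is not DeepMind’s method. It only illustrates the general technique: points in space are nudged, step by step, until the distances between them match a set of target distances. The function names, learning rate and target distances are all assumptions chosen for illustration.

```python
import math


def distance_loss(coords, targets):
    """Sum of squared differences between current and target pair distances."""
    return sum(
        (math.dist(coords[i], coords[j]) - d) ** 2
        for (i, j), d in targets.items()
    )


def gradient_descent(coords, targets, learning_rate=0.05, steps=500):
    """Repeatedly move each point a small step downhill on distance_loss.

    Assumes no two paired points ever coincide (distance zero).
    """
    coords = [list(p) for p in coords]
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in coords]
        for (i, j), d in targets.items():
            r = math.dist(coords[i], coords[j])
            scale = 2.0 * (r - d) / r  # gradient of (r - d)^2 is scale * (p_i - p_j)
            for k in range(3):
                g = scale * (coords[i][k] - coords[j][k])
                grads[i][k] += g
                grads[j][k] -= g
        for point, grad in zip(coords, grads):
            for k in range(3):
                point[k] -= learning_rate * grad[k]
    return coords


# Hypothetical example: three points and three target distances (in Angstroms).
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
targets = {(0, 1): 3.8, (1, 2): 3.8, (0, 2): 6.0}
result = gradient_descent(start, targets)
print("loss before:", distance_loss(start, targets))
print("loss after: ", distance_loss(result, targets))
```

Each step only makes a small correction, but the corrections add up, which is the sense in which gradient descent makes incremental improvements.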

Building on the success of this machine learning system, DeepMind continued work on AlphaFold. In 2020, DeepMind entered AlphaFold 2 at CASP14. Much like its predecessor, AlphaFold 2 placed first overall, but this time it achieved an unprecedented median GDT score of 92.4, far surpassing any other group, let alone any previous results. AlphaFold 2’s predicted structures deviated from the experimental structures by an average of about 1.6 Angstroms, comparable to the width of an atom. In one case, a research group studying a phage tail protein noticed that AlphaFold 2’s model agreed closely with their experimental model except in one segment. Upon reviewing their analysis, they realized they had made a mistake in their experimental interpretation and corrected it.

So how does it work?

AlphaFold 2 differs significantly from the original AlphaFold, and its system can be roughly divided into three parts.

First, from the amino acid sequence, AlphaFold 2 constructs a multiple sequence alignment. It searches for similar sequences found in other living organisms and lines them up, which reveals which parts of the sequence are more likely to mutate and which are more likely to be conserved. Conserved regions are likely to be more integral to the protein’s structure, as mutations or alterations there would be detrimental to the protein’s function, and thus to the organism as a whole. AlphaFold 2 also identifies proteins that potentially have similar structures to the given protein, and uses them as templates to create an initial hypothesis of the structure.
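To illustrate what conservation in an alignment looks like, here is a minimal sketch that scores each column of a tiny, already aligned set of sequences. It is not how AlphaFold 2 processes its alignment, which is fed into the neural network directly. The sequences and the function name are invented for the example.

```python
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column from 0 to 1.

    The score is the share of non-gap residues in the column that match
    the most common residue there. Gaps are written as '-'.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(residues))
    return scores


# Hypothetical aligned sequences in one-letter amino acid code.
alignment = [
    "MKTAYIA",
    "MKSAYLA",
    "MRTAY-A",
    "MKTGYVA",
]
for position, score in enumerate(column_conservation(alignment), start=1):
    print(position, round(score, 2))
```

Positions 1, 5 and 7 are identical in every sequence and score 1.0, while position 6 varies freely and scores lowest, hinting that it tolerates mutation.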

Next, the multiple sequence alignment and templates are passed through a transformer, which identifies which pieces of information are more important. This transformer, dubbed the Evoformer, is the key novelty of AlphaFold 2. The Evoformer allows the neural network to focus its attention on specific parts of the input, greatly improving the performance overall. Both the multiple sequence alignment and templates are processed together, exchanging information and improving the predicted structure. For instance, if the multiple sequence alignment suggests amino acids A and B are closely related, this information is used to modify the template. Afterwards, the modified template might suggest that, given A and B are close, amino acids C and D have a high chance of being related as well. This is then passed back to the multiple sequence alignment. The end result of this is a detailed model of the various interactions between the amino acids in the sequence.
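The attention mechanism at the heart of any transformer can be shown with a small, generic example. The sketch below implements plain scaled dot-product attention in pure Python. It is not the Evoformer itself, whose layers are far more elaborate, and the vectors used are arbitrary numbers chosen for illustration.

```python
import math


def softmax(values):
    """Turn a list of scores into positive weights that sum to 1."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention.

    For each query, score every key by its dot product with the query,
    convert the scores into weights, and return the weighted average of
    the values along with the weights themselves.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs, all_weights


# Hypothetical example: one query that points the same way as the first key.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
outputs, weights = attention([[3.0, 0.0]], keys, values)
print("weights:", [round(w, 3) for w in weights[0]])
print("output: ", outputs[0])
```

The first key receives most of the weight because it matches the query best, so the output is pulled towards the first value. This is the sense in which attention lets a network focus on the most relevant parts of its input.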

Lastly, this model is processed by the structure module, which constructs a final, three-dimensional model. Each amino acid is modelled as a triangle, reflecting the three backbone atoms (the nitrogen, the alpha carbon and the carbonyl carbon) of each amino acid. The final product is a long list of Cartesian coordinates giving the position of each atom in the protein, which can be rendered into a 3D model.
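One common way to describe where such a triangle sits and how it is oriented is to build a local set of axes from its three corners. The sketch below does this with the Gram-Schmidt procedure. It is offered as an illustration of the geometry rather than a description of AlphaFold 2’s code, and the atom coordinates are made up.

```python
import math


def subtract(a, b):
    return [a[k] - b[k] for k in range(3)]


def dot(a, b):
    return sum(a[k] * b[k] for k in range(3))


def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def normalise(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def backbone_frame(n, ca, c):
    """Origin and three perpendicular unit axes for one backbone triangle.

    The origin is the alpha carbon. The first axis points from the alpha
    carbon to the carbonyl carbon, the second lies in the plane of the
    triangle on the nitrogen's side, and the third is perpendicular to both.
    Assumes the three atoms are not collinear.
    """
    e1 = normalise(subtract(c, ca))
    to_n = subtract(n, ca)
    along = dot(to_n, e1)
    e2 = normalise([to_n[k] - along * e1[k] for k in range(3)])
    e3 = cross(e1, e2)
    return ca, (e1, e2, e3)


# Hypothetical coordinates (in Angstroms) for N, C-alpha and C.
origin, axes = backbone_frame((-0.5, 1.4, 0.0), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0))
print("origin:", origin)
for axis in axes:
    print("axis:  ", [round(x, 3) for x in axis])
```

With an origin and three axes per amino acid, the whole chain can be described as a series of rigid triangles, each of which can be moved and rotated independently before the atom coordinates are read off.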

So where do we go from here?

The significance of this breakthrough cannot be overstated. AlphaFold 2’s ability to predict protein structures massively reduces the time and cost needed to determine the basic structure of many proteins. Although its predictions are not perfect, researchers will now be able to obtain some structural information in a matter of days rather than years. This will be especially helpful for proteins that are difficult to crystallise and thus to analyse experimentally, such as membrane proteins. We are already seeing its impact, as it has predicted several protein structures of the SARS-CoV-2 virus, including ORF3a, whose structure was previously unknown.

Of course, AlphaFold 2 is not without its limitations. It relies on existing sequence databases to construct its multiple sequence alignment, and so it may be less effective for proteins with few known related sequences. This can be a problem for protein families with large variation, such as antibodies. However, it is possible to augment the existing model with other methods of structure prediction.

DeepMind has worked together with the European Bioinformatics Institute to launch the AlphaFold Protein Structure Database, claimed to be the most complete and accurate database of human protein structures. The database is freely accessible online, and you can look up almost any protein you can think of. You can also access the original AlphaFold 2 code on GitHub, though it requires a bit more processing power than most standard laptops can deliver.

AlphaFold 2’s effects are just starting to surface, with projects under way to advance research into lifesaving drugs with the Drugs for Neglected Diseases Initiative and to engineer enzymes that recycle single-use plastics with the Centre for Enzyme Innovation. DeepMind’s work represents a breakthrough in molecular and computational biology, and we will likely see its impact for years to come.

Daniel is a first-year undergraduate at St. Catharine’s College studying Natural Sciences (Biology).
