AlphaFold 2

Daniel Lim (St Catharine’s). September 22, 2021.

The protein folding problem

Proteins are well known as the building blocks of life, the foundation on which almost all biological interactions rely. Composed of chains of amino acids, proteins have been recognized as a distinct class of biological molecules since 1838. The first complete amino acid sequence of a protein, that of insulin, was determined by Frederick Sanger, earning him the 1958 Nobel Prize in Chemistry. Four years later, Max Perutz and John Kendrew earned the 1962 Nobel Prize in Chemistry for their work in determining the first atomic structures of proteins using X-ray crystallography. Today, the structures of around 100,000 unique proteins have been determined, but this barely scratches the surface of the billions of protein sequences known to us.

The function of a protein is heavily dependent on its unique 3D structure. When a protein is synthesized, it starts as a chain of amino acids which rapidly folds into its specific three-dimensional conformation based on the chemistry of the amino acid sequence. The 20 different types of amino acids interact with each other, some repelling and some attracting one another, forming curls, loops and pleated sheets in an act of spontaneous yet intricate origami. Only in this folded state can the protein carry out its function properly.

In 1972, biochemist Christian Anfinsen postulated that all the information needed for a protein to assume its final conformation was encoded in its amino acid sequence. However, a major hurdle is the tremendous number of potential configurations that a protein could have, estimated to be 10^300 for a typical protein, which would require longer than the age of the universe to calculate by brute force. Determining the 3D structure of a protein solely from its amino acid sequence is thus known as the protein folding problem.
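To get a feel for the scale of that number, here is a back-of-the-envelope sketch in Python. The sampling rate and the age of the universe used below are assumptions chosen purely for illustration; only the 10^300 figure comes from the text above.

```python
# Rough estimate of how long a brute-force search over conformations would take.
# Assumed values (illustrative only): the sampling rate and the age of the universe.
CONFIGURATIONS = 10 ** 300          # estimate quoted in the text
SAMPLES_PER_SECOND = 10 ** 13       # assumption: one conformation every 0.1 picoseconds
AGE_OF_UNIVERSE_SECONDS = 4.3e17    # assumption: roughly 13.8 billion years


def brute_force_seconds(configurations: int, rate: int) -> int:
    """Seconds needed to visit every configuration once at the given rate."""
    return configurations // rate


seconds_needed = brute_force_seconds(CONFIGURATIONS, SAMPLES_PER_SECOND)
universe_ages = seconds_needed / AGE_OF_UNIVERSE_SECONDS

print(f"Search time: about 10^{len(str(seconds_needed)) - 1} seconds")
print(f"That is about {universe_ages:.1e} times the age of the universe")
```

Even with a far more generous sampling rate, the exponent barely moves, which is why exhaustive search is not an option.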

Most protein structures known today were found by experimental techniques such as X-ray crystallography and nuclear magnetic resonance, but these techniques can take years of painstaking trial and error and require expensive specialised equipment. A method to computationally predict a protein’s 3D structure would drastically reduce the resources needed to determine protein structures, giving us a better understanding of a given protein’s function and mechanism.

The search for a solution

To monitor progress and evaluate the accuracy of new prediction techniques, professors John Moult and Krzysztof Fidelis founded the Critical Assessment of protein Structure Prediction (CASP) in 1994. CASP is both a global community forum and a biennial assessment used as the gold standard for structure prediction methods. The assessment uses proteins whose structures have only very recently been experimentally determined, and so are not yet published. Participants must predict the structures of these proteins, and their predictions are compared against the experimental data. Each prediction is then given a global distance test (GDT) score, which ranges up to 100 for complete accuracy.
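As a rough illustration of how such a score can be computed, the sketch below implements a simplified version of the commonly used GDT_TS variant: the percentage of residues lying within 1, 2, 4 and 8 Angstroms of their experimental positions, averaged over the four cutoffs. It assumes the two structures have already been superimposed, whereas the real assessment searches over many superpositions. The function name and the toy coordinates are made up for this example.

```python
import math

# Distance cutoffs (in Angstroms) used by the GDT_TS variant of the score.
CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(predicted, experimental):
    """Simplified GDT_TS for two already-superimposed lists of (x, y, z) positions."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Distance between each predicted residue and its experimental counterpart.
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    # Fraction of residues within each cutoff, then averaged and scaled to 0-100.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in CUTOFFS]
    return 100.0 * sum(fractions) / len(CUTOFFS)


# Toy example: four residues, with prediction errors of 0.5, 1.5, 3.0 and 9.0 Angstroms.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 9.0, 0.0)]

print(gdt_ts(predicted, experimental))    # 56.25
print(gdt_ts(experimental, experimental)) # 100.0
```

Using several cutoffs means a prediction is rewarded for being roughly right everywhere, rather than being heavily penalised for one badly placed loop.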

In 2018, DeepMind Technologies, a subsidiary of Google, entered CASP13 with AlphaFold. DeepMind used machine learning to train its neural networks to predict two properties of the protein: the distance between pairs of amino acids and the angles between the bonds connecting those amino acids. These results were then aggregated and compared with known protein structures to construct a hypothetical 3D structure. This structure was then refined through gradient descent, an optimisation technique that repeatedly makes small adjustments to reduce the error. Across the different assessed proteins, AlphaFold achieved a median GDT score of 58.9, placing first in the overall ranking.
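The idea of refining a structure by gradient descent can be shown with a toy example. The sketch below is not DeepMind's code: it simply nudges a handful of 3D points until their pairwise distances match a set of target distances. The function names, the learning rate and the four-point "protein" are all assumptions made for illustration.

```python
import math
import random


def distance_loss(coords, targets):
    """Sum of squared differences between current and target pairwise distances."""
    return sum((math.dist(coords[i], coords[j]) - d) ** 2
               for (i, j), d in targets.items())


def gradient_step(coords, targets, learning_rate=0.01):
    """One gradient-descent step on distance_loss; returns the new coordinates."""
    grads = [[0.0, 0.0, 0.0] for _ in coords]
    for (i, j), d in targets.items():
        current = math.dist(coords[i], coords[j])
        if current == 0.0:
            continue  # the gradient is undefined when two points coincide
        scale = 2.0 * (current - d) / current
        for axis in range(3):
            delta = coords[i][axis] - coords[j][axis]
            grads[i][axis] += scale * delta
            grads[j][axis] -= scale * delta
    return [[c - learning_rate * g for c, g in zip(point, grad)]
            for point, grad in zip(coords, grads)]


# Target distances are taken from a known square, so a perfect solution exists.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
targets = {(i, j): math.dist(reference[i], reference[j])
           for i in range(len(reference)) for j in range(i + 1, len(reference))}

random.seed(0)  # fixed seed so the run is repeatable
coords = [[random.uniform(-5.0, 5.0) for _ in range(3)] for _ in reference]
initial_loss = distance_loss(coords, targets)
for _ in range(500):
    coords = gradient_step(coords, targets)
final_loss = distance_loss(coords, targets)

print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Each step moves every point a little in the direction that reduces the mismatch, which is the "incremental improvement" at the heart of the technique.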

Based on the success of the machine learning system, DeepMind continued work on AlphaFold. In 2020, DeepMind entered AlphaFold 2 at CASP14. Much like its predecessor, AlphaFold 2 placed first overall, but this time it achieved an unprecedented median GDT score of 92.4, far surpassing every other group and all previous results. AlphaFold 2’s predicted structures deviated from the actual structures by an average of 1.6 Angstroms, comparable to the width of an atom. In one case, a research group studying a phage tail protein noticed that AlphaFold 2’s model agreed closely with their experimental model except in one segment, and upon reviewing the analysis they realized they had made a mistake in their experimental interpretation and corrected it.

So how does it work?

AlphaFold 2 differs significantly from the original AlphaFold, and its system can be roughly divided into three parts.

Firstly, from the amino acid sequence data, AlphaFold 2 constructs a multiple sequence alignment. This means that it identifies similar sequences in living organisms and lines them up, revealing which parts of the sequence are more likely to mutate and which are more likely to be conserved. Conserved regions are likely to be more integral to the protein structure, as mutations or alterations to them would be detrimental to protein function, and thus to the organism as a whole. AlphaFold 2 also identifies proteins that potentially have similar structures to the given protein, and uses them as templates to create an initial hypothesis of the structure.
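To make the idea of conservation concrete, the sketch below scores each column of a tiny, made-up alignment by the fraction of sequences that share the most common amino acid at that position. The sequences and the scoring rule are illustrative assumptions, not AlphaFold 2's actual inputs or method.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each aligned column."""
    if not alignment or len({len(seq) for seq in alignment}) != 1:
        raise ValueError("alignment must contain equal-length sequences")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # "-" marks a gap, so skip it
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Hypothetical aligned sequences (one letter per amino acid, "-" for a gap).
toy_alignment = [
    "MKTAYIA",
    "MKSAYLA",
    "MRTAY-A",
    "MKTGYIA",
]

scores = column_conservation(toy_alignment)
print(scores)  # [1.0, 0.75, 0.75, 0.75, 1.0, 0.5, 1.0]
```

Columns scoring 1.0 never change across the toy sequences, hinting that those positions matter most, while the column scoring 0.5 tolerates variation.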

Next, the multiple sequence alignment and templates are passed through a transformer, which identifies which pieces of information are more important. This transformer, dubbed the Evoformer, is the key novelty of AlphaFold 2. The Evoformer allows the neural network to focus its attention on specific parts of the input, greatly improving the performance overall. Both the multiple sequence alignment and templates are processed together, exchanging information and improving the predicted structure. For instance, if the multiple sequence alignment suggests amino acids A and B are closely related, this information is used to modify the template. Afterwards, the modified template might suggest that, given A and B are close, amino acids C and D have a high chance of being related as well. This is then passed back to the multiple sequence alignment. The end result of this is a detailed model of the various interactions between the amino acids in the sequence.
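The "attention" at the core of a transformer can be sketched in a few lines. The example below is a generic scaled dot-product attention written in plain Python, not the Evoformer itself, which is far more elaborate. The function names and the three toy feature vectors are assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: each query receives a weighted mix of the values."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by the feature dimension.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors, one component at a time.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy features for three positions; positions 0 and 2 are deliberately similar.
features = [[2.0, 0.0], [0.0, 2.0], [1.9, 0.1]]
outputs, weights = attention(features, features, features)

for position, row in enumerate(weights):
    print(position, [round(w, 2) for w in row])
```

Position 0 ends up paying most attention to itself and to the similar position 2, and little to position 1, which is how the network learns to focus on the parts of the input that are most relevant to each other.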

Lastly, this model is processed by the structure module, which constructs a final, three-dimensional model. Each amino acid is modelled as a triangle, reflecting the three backbone atoms (the nitrogen, the alpha carbon and the carbonyl carbon) of each amino acid. The final product is a long list of Cartesian coordinates giving the position of each atom of the protein, which can be rendered into a 3D model.
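One way to picture the triangle is as a small coordinate frame attached to each amino acid. The sketch below takes the three atoms to be the backbone nitrogen (N), alpha carbon (CA) and carbonyl carbon (C), and builds an origin and three perpendicular axes from them. It is a simplified illustration with hypothetical function names and made-up coordinates, not the structure module's actual code.

```python
import math


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def residue_frame(n, ca, c):
    """Origin and orthonormal axes describing one residue's backbone triangle."""
    x_axis = normalise(subtract(c, ca))  # points from CA towards C
    to_n = subtract(n, ca)
    # Remove the part of CA->N that lies along x, leaving a perpendicular y axis.
    y_axis = normalise(subtract(to_n, [dot(to_n, x_axis) * x for x in x_axis]))
    z_axis = cross(x_axis, y_axis)       # perpendicular to the triangle's plane
    return {"origin": list(ca), "axes": [x_axis, y_axis, z_axis]}


def to_global(frame, local_point):
    """Convert a point expressed in the residue's frame into Cartesian coordinates."""
    return [frame["origin"][k] + sum(local_point[i] * frame["axes"][i][k] for i in range(3))
            for k in range(3)]


# Illustrative coordinates (in Angstroms) for one residue's N, CA and C atoms.
n_atom, ca_atom, c_atom = (0.475, 3.363, 3.0), (1.0, 2.0, 3.0), (2.526, 2.0, 3.0)
frame = residue_frame(n_atom, ca_atom, c_atom)
print(frame["axes"])
print(to_global(frame, [1.0, 0.0, 0.0]))  # a point 1 Angstrom along the CA->C direction
```

Describing each residue by a position and an orientation lets a model move and rotate whole triangles, then read off the final list of atom coordinates at the end.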

So where do we go from here?

The significance of this breakthrough cannot be overstated. AlphaFold 2’s ability to predict protein structures massively reduces the time and cost needed to determine the basic structure of certain proteins. Although its predictions are not perfect, researchers will now be able to obtain some structural information in a matter of days rather than years. This will be especially helpful for proteins that are difficult to crystallize and thus analyse experimentally, such as membrane proteins. We are already seeing its impact, as it has predicted several protein structures of the SARS-CoV-2 virus, including ORF3a, whose structure was previously unknown.

Of course, AlphaFold 2 is not without its limitations. It relies on existing amino acid databases to construct its multiple sequence alignment, and so it may be ineffective for proteins where these databases are shallow. This can be a problem for protein families with large variation, such as antibodies. However, it is possible to augment the existing model with other methods of structure prediction.

DeepMind has worked together with the European Bioinformatics Institute to launch the AlphaFold Protein Structure Database, claimed to be the most complete and accurate database for human protein structures. You can access it here, and look up almost any protein you can think of. You can also access the original AlphaFold 2 code on GitHub, though it requires a bit more processing power than most standard laptops can deliver.

AlphaFold 2’s effects are just starting to surface, with projects such as advancing research in lifesaving drugs with the Drugs for Neglected Diseases Initiative and engineering enzymes to recycle single-use plastics with the Centre for Enzyme Innovation. DeepMind’s work represents a breakthrough in molecular and computational biology, and we will likely see its impacts in years to come.

Daniel is a first-year undergraduate at St. Catharine’s College studying Natural Sciences (Biology).

SciSoc Spotlight Issue 23 – Prof. Marian Holness

13 May 2021. Prof. Marian Holness is with the Department of Earth Sciences. A PDF version of this Issue is available here.

Research focus: Igneous petrology

My research is concentrated on understanding the processes which occur during the melting and solidification of rocks – these include the formation and segregation of crustal melts, and the evolution of the crystal mush forming at the margins of cooling magma chambers. I approach these problems by starting with detailed field observation and sample collection, with careful microstructural observations using microscopes (both optical and electron) coupled with geochemical analysis to decode rock history.

What made you decide to pursue research?

I have always been interested in pattern-finding, and understanding why things are the way they are. I decided I wanted to be a scientist when I was 14 and have stuck with it ever since. A scientist is essentially “who I am”…even if I weren’t doing research in a university, I would be puzzling things out and trying to work out why things are the way they are. I am lucky in that I have found my niche, with lots of engaging problems to work on and the opportunity to get outside and visit interesting places while doing that.

One piece of advice…

The main thing is to find out what you’re good at. It took me a while to realise that my skills lie in observation, rather than numerical descriptions. I was lucky to find interesting and important problems to work on that are fundamentally grounded in seeing what is in front of our eyes. Find your superpower and then work out how best to use it!

SciSoc Spotlight Issue 25 – Dr. Jenny Zhang

27 May 2021. Dr. Jenny Zhang was with the Yusuf Hamied Department of Chemistry, School of Physical Sciences. A PDF version of this Issue is available here.

Research focus: Photosynthesis, electrochemistry, bioenergy

We are fascinated by the chemistry occurring within photosynthesis, an important process that sustains life on Earth as we know it. In particular, we analyse how solar energy is harvested by photosynthetic machineries to move electrons around for the breaking and making of bonds. We wish to better understand the nature of the electron movements so that we can eventually ‘re-wire’ photosynthesis using chemical approaches for bespoke purposes – such as the generation of renewable bioenergy or the creation of novel biosensors.

What made you decide to pursue research?

I loved how research challenges me to be my most curious, rational and creative self. Being at the forefront of discovery is also pretty addictive!

One piece of advice…

Don’t go for the easy wins; challenge yourself to master something difficult and turn it into your superpower.

SciSoc Spotlight Issue 22 – Dr Chiara Giorio

8 May 2021. Dr Chiara Giorio is with the Yusuf Hamied Department of Chemistry. A PDF version of this Issue is available here.

Research focus: Atmospheric chemistry

Air pollution causes 7 million deaths per year worldwide. One of the most concerning air pollutants is particulate matter (small dust suspended in the air). We study the atmospheric processes that can modify the composition of particulate matter during its lifetime in the atmosphere, and we aim to understand the link between composition and toxicity. We look at the molecular mechanism by which particulate matter can cause lung inflammation or diseases such as Alzheimer’s.

What made you decide to pursue research?

I have always been interested in understanding the natural environment, promoting good practices to preserve it, and solving environmental issues for the benefit of society in general. Academia gives you the freedom to follow your inner passion in ways that no other environment can.

One piece of advice…

For many years I thought I was not good enough to do research; my main drivers were curiosity and passion. I am now a lecturer and I have my own research group. Over the years I have grown and I have learnt that persistence is more important than talent, and that you need to work hard, look for opportunities to grow and be open to taking the chances that come your way. My advice is to follow your passion and never give up when you face adversity.

Myself at the botanical garden here in Cambridge

Me, in my lab during the first lockdown in 2020, collaborating on a research project that looks for methods to clean ambulances with ozone to decrease turnaround time after transporting COVID patients.

Me inside a maritime container hosting instrumentation, during a cruise in the Mediterranean Sea (#peacetime cruise) in 2017, looking at how the composition of the atmosphere changes when dust from the Sahara desert is transported and deposited in the sea, providing more nutrients for phytoplankton species.

SciSoc Spotlight Issue 24 – Dr. Sam Troughton

20 May 2021. Dr. Sam Troughton was with Keronite International. A PDF version of this Issue is available here.

Research focus: Plasma Electrolytic Oxidation (PEO) Coatings

PEO coatings are produced on lightweight metals in an aqueous bath of eco-friendly chemicals under applied potentials of hundreds of volts. This generates extremely hot but small and short-lived plasma discharges on the surface, which create a super-hard protective oxide coating. Working in an industrial R&D environment means you are usually working on multiple projects simultaneously, so it is hard to summarise everything, but I’m mostly focused on optimising the process to achieve ultra wear-resistant or corrosion-resistant coatings.

What made you decide to pursue research?

Having a variety of projects and being able to see the results of your research put directly into use is one of the main reasons I decided to move into industry after completing my PhD in Materials Science at Cambridge. I really enjoy having several different projects to work on, and having various time scales for them to run – some can be very short turnaround times of just a few weeks which allows you to see results put into action very rapidly, whereas others require years of careful research. Another big draw for me was being able to interact with many different companies and being able to see what ideas are being worked on at the cutting edge of space, aerospace, automotive, and manufacturing industries.

One piece of advice…

Do something you enjoy and talk to lots of people about it, both in your own department and other departments. Other people can often give you great ideas or help you solve a problem from a different perspective, and they may have suggestions of things to try that you haven’t heard about yet.
