BERTology Meets Biology: Interpreting Attention in Protein Language Models (Paper Explained)

Proteins are the workhorses of almost all cellular functions and a core component of life. But despite their versatility, all proteins are built as sequences of the same 20 amino acids. These sequences can be analyzed with tools from NLP. This paper investigates the attention mechanism of a BERT model that has been trained on protein sequence data and discovers that the language model has implicitly learned non-trivial higher-order biological properties of proteins.
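
(If you want to poke at this yourself: below is a minimal sketch of pulling attention maps out of a pretrained protein language model. It assumes the Hugging Face checkpoint Rostlab/prot_bert rather than the exact model analyzed in the paper, so treat it as an illustration, not the authors' pipeline.)

```python
# Minimal sketch: extract per-layer, per-head attention from a protein BERT.
# Assumes the Hugging Face checkpoint "Rostlab/prot_bert" (illustrative choice, not the paper's exact model).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert", output_attentions=True)
model.eval()

# This tokenizer expects space-separated single-letter amino acid codes.
sequence = "M K T A Y I A K Q R"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq_len, seq_len) tensor per layer.
attention = torch.stack(outputs.attentions)  # (layers, batch, heads, seq_len, seq_len)
print(attention.shape)  # inspect which heads attend to which residue pairs
```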

OUTLINE:
0:00 - Intro & Overview
1:40 - From DNA to Proteins
5:20 - BERT for Amino Acid Sequences
8:50 - The Structure of Proteins
12:40 - Investigating Biological Properties by Inspecting BERT
17:45 - Amino Acid Substitution
24:55 - Contact Maps
30:15 - Binding Sites
33:45 - Linear Probes
35:25 - Conclusion & Comments

Abstract:
Transformer architectures have proven to learn useful representations for protein classification and generation tasks. However, these representations present challenges in interpretability. Through the lens of attention, we analyze the inner workings of the Transformer and explore how the model discerns structural and functional properties of proteins. We show that attention (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We also present a three-dimensional visualization of the interaction between attention and protein structure. Our findings align with known biological processes and provide a tool to aid discovery in protein engineering and synthetic biology. The code for visualization and analysis is available at this https URL.

Authors: Jesse Vig, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani

Comments

Well done, Yannic!

Overall, the whole video is very descriptive; however, I want to mention that the 3D conformation of proteins is NOT determined by molecular simulations but by physical experimental methods (e.g., X-ray crystallography and cryo-EM). These physical methods are limited because either you cannot use X-ray crystallography at all for a specific protein, or the alternative, like cryo-EM, is just too expensive. As of right now, the number of known protein sequences has exploded relative to the number of solved structures thanks to sequencing technology, so there remains a plethora of protein sequences without corresponding physical structures. A huge endeavor in the scientific community is to predict structure from the protein sequence alone, now that huge datasets (e.g., the Protein Data Bank, UniProt) and powerful models like BERT have emerged.

niksapraljak

Great stuff as always! I'm curious how much time it takes you every day to produce this much high-quality content.

julianke

That is funny! I just started using transformers to address a "similar" problem using proteins. I think one of the reasons the model can't predict the binding sites is post-translational modifications. This process happens when other proteins attach modified amino acid versions or sugars to a protein's structure. These modifications can totally change the protein folding and affect the binding site positions.

bioinfolucas

Hey, I can't find the Figure 2 you are showing in the paper itself. Do you have a different resource?

hagardolev

Thanks, this explanation saved my life in understanding this paper!!! Could it be interesting to do another video on the latest 2021 version of the paper?

chiararodella

How can I encode epitope sequences for a binary classification task? I tried ProtTrans embeddings but the accuracy is quite low.

inamulhaq
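
(A quick baseline sketch for questions like the one above, with random placeholder arrays standing in for real ProtTrans features: mean-pool the per-residue embeddings into one vector per epitope and fit a simple classifier before trying anything heavier.)

```python
# Sketch of a simple baseline for sequence-level classification from per-residue embeddings.
# "embeddings" and "labels" below are random placeholders standing in for real ProtTrans features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(length, 1024)) for length in (12, 9, 15, 11)]  # one (length, dim) array per epitope
labels = np.array([0, 1, 1, 0])  # binary labels, e.g. binder vs. non-binder

X = np.stack([e.mean(axis=0) for e in embeddings])  # mean-pool over residues -> one vector per sequence
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, labels, cv=2))  # sanity-check accuracy before tuning anything fancier
```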

How are the proteins encoded so that they can be consumed by the neural network? Is there a Word2Vec/GloVe for proteins?

Are all proteins linear? How do we encode non-linear proteins?

Good paper. The next step would be to gradient-descend backwards through the learnt model to generate proteins which meet some criteria.

herp_derpingson
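
(A rough answer to the encoding question above: protein language models typically treat each amino acid as a token, so the encoding is essentially a character-level vocabulary lookup feeding a learned embedding table, much like word IDs in Word2Vec. A toy sketch, not the paper's exact pipeline:)

```python
# Toy illustration: character-level tokenization of an amino acid sequence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
vocab = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # index 0 reserved for padding

def encode(sequence: str) -> list[int]:
    """Map a one-letter amino acid string to integer token ids."""
    return [vocab[aa] for aa in sequence]

print(encode("MKTAYIAKQR"))  # these ids index a learned embedding table inside the model
```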

But aren't they trying to predict the contact maps? In the equation with the attention they seem to be adding this f(i, j) as an input; is that part of the training data? 25:45 what is this alignment they speak of?

samanthaqiu
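
(My reading of that part of the paper, offered tentatively: f(i, j) is an indicator computed from known structures at analysis time, e.g. "residues i and j are in contact", not an input to training; the alignment statistic is simply the share of attention mass that lands on pairs where f is 1. A toy sketch with made-up arrays:)

```python
# Toy sketch of the attention/contact alignment statistic as I understand it: the fraction of
# total attention weight that falls on residue pairs which are in contact in the solved structure.
import numpy as np

def attention_alignment(attn, contact):
    """attn: (len, len) attention weights for one head; contact: (len, len) 0/1 contact map."""
    return float((attn * contact).sum() / attn.sum())

rng = np.random.default_rng(0)
attn = rng.random((10, 10))                           # placeholder attention map
contact = (rng.random((10, 10)) > 0.9).astype(float)  # placeholder contact map
print(attention_alignment(attn, contact))
```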

I guess there will be quite a lot of similar papers coming soon (for example, chromosome close interactions, RNA-DNA interactions, ORF identification, CRISPR gRNA design/evaluation, ...).

ec

Too bad that you are not doing more RL, your videos are so good.

christianleininger

You should be a professor at every university.

allessandroable

Please upload videos for ViLBERT, VisualBERT and VisualBERT COCO.

KoliHarshad

With the same naivety, a language model could predict any written program's output - this can't be done.

dmitrysamoylenko

Almost nailed the pronunciations. The Chinese name is pronounced tsai-ming shjong, pretty close.

minhuang

An unrelated question that keeps bugging me: why do you mention your "Attention is all you need" video in every video you produce? Are you trying to push the number of views there to a maximum, or am I just seeing patterns where there are none? In any case, your videos are as great as always, it just keeps tripping me up in every one of them :D

alexanderchebykin

This has no applications; the three-dimensional structure and post-translational modifications are the state of the art of protein research. Sequence alone is not interesting anymore.

hiramcoriarodriguez