BERTology Meets Biology: Interpreting Attention in Protein Language Models (Paper Explained)

Proteins are the workhorses of almost all cellular functions and a core component of life. But despite their versatility, all proteins are built as sequences of the same 20 amino acids. These sequences can be analyzed with tools from NLP. This paper investigates the attention mechanism of a BERT model that has been trained on protein sequence data and discovers that the language model has implicitly learned non-trivial higher-order biological properties of proteins.
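
(If you want to poke at this yourself: below is a minimal sketch of pulling attention maps out of a pretrained protein language model. It assumes the Hugging Face checkpoint Rostlab/prot_bert rather than the exact model analyzed in the paper, so treat it as an illustration, not the authors' pipeline.)

```python
# Minimal sketch: extract per-layer, per-head attention from a protein BERT.
# Assumes the Hugging Face checkpoint "Rostlab/prot_bert" (illustrative choice, not the paper's exact model).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert", output_attentions=True)
model.eval()

# This tokenizer expects space-separated single-letter amino acid codes.
sequence = "M K T A Y I A K Q R"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq_len, seq_len) tensor per layer.
attention = torch.stack(outputs.attentions)  # (layers, batch, heads, seq_len, seq_len)
print(attention.shape)  # inspect which heads attend to which residue pairs
```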

OUTLINE:
0:00 - Intro & Overview
1:40 - From DNA to Proteins
5:20 - BERT for Amino Acid Sequences
8:50 - The Structure of Proteins
12:40 - Investigating Biological Properties by Inspecting BERT
17:45 - Amino Acid Substitution
24:55 - Contact Maps
30:15 - Binding Sites
33:45 - Linear Probes
35:25 - Conclusion & Comments

Abstract:
Transformer architectures have proven to learn useful representations for protein classification and generation tasks. However, these representations present challenges in interpretability. Through the lens of attention, we analyze the inner workings of the Transformer and explore how the model discerns structural and functional properties of proteins. We show that attention (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We also present a three-dimensional visualization of the interaction between attention and protein structure. Our findings align with known biological processes and provide a tool to aid discovery in protein engineering and synthetic biology. The code for visualization and analysis is available at this https URL.

Authors: Jesse Vig, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani

Comments

Well done, Yannic!

Overall, the whole video is very descriptive; however, I want to mention that the 3D conformation of proteins is NOT determined by molecular simulations but by physical experimental methods (e.g., X-ray crystallography and cryo-EM). These physical methods are limited because either you cannot use X-ray crystallography at all for a specific protein, or the alternative, like cryo-EM, is just too expensive. As of right now, the number of known protein sequences has exploded relative to the number of solved structures thanks to sequencing technology, so there remains a plethora of protein sequences without corresponding physical structures. A huge endeavor in the scientific community is to predict structure from the protein sequence alone, now that huge datasets (e.g., the Protein Data Bank, UniProt) and powerful models like BERT have emerged.

niksapraljak

Great stuff as always! I'm curious how much time it takes you every day to produce this much high-quality content.

julianke

That is funny! I just started using transformers to address a "similar" problem using proteins. I think one of the reasons the model can't predict the binding sites is post-translational modifications. This process happens when other proteins attach modified amino acid versions or sugars to a protein's structure. These modifications can totally change the protein folding and affect the binding site positions.

bioinfolucas

Hey, I can't find the Figure 2 you are showing in the paper itself. Do you have a different resource?

hagardolev

Thanks, this explanation saved my life in understanding this paper!!! Could it be interesting to do another video on the latest 2021 version of the paper?

chiararodella

How can I encode epitope sequences for a binary classification task? I tried ProtTrans embeddings but the accuracy is quite low.

inamulhaq
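
(A quick baseline sketch for questions like the one above, with random placeholder arrays standing in for real ProtTrans features: mean-pool the per-residue embeddings into one vector per epitope and fit a simple classifier before trying anything heavier.)

```python
# Sketch of a simple baseline for sequence-level classification from per-residue embeddings.
# "embeddings" and "labels" below are random placeholders standing in for real ProtTrans features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(length, 1024)) for length in (12, 9, 15, 11)]  # one (length, dim) array per epitope
labels = np.array([0, 1, 1, 0])  # binary labels, e.g. binder vs. non-binder

X = np.stack([e.mean(axis=0) for e in embeddings])  # mean-pool over residues -> one vector per sequence
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, labels, cv=2))  # sanity-check accuracy before tuning anything fancier
```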

How are the proteins encoded so that they can be consumed by the neural network? Is there a Word2Vec/GloVe for proteins?

Are all proteins linear? How do we encode non-linear proteins?

Good paper. The next step would be to gradient-descend backwards through the learnt model to generate proteins which meet some criteria.

herp_derpingson
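
(A rough answer to the encoding question above: protein language models typically treat each amino acid as a token, so the encoding is essentially a character-level vocabulary lookup feeding a learned embedding table, much like word IDs in Word2Vec. A toy sketch, not the paper's exact pipeline:)

```python
# Toy illustration: character-level tokenization of an amino acid sequence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
vocab = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # index 0 reserved for padding

def encode(sequence: str) -> list[int]:
    """Map a one-letter amino acid string to integer token ids."""
    return [vocab[aa] for aa in sequence]

print(encode("MKTAYIAKQR"))  # these ids index a learned embedding table inside the model
```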

But aren't they trying to predict the contact maps? In the equation with the attention they seem to be adding this f(i, j) as an input; is that part of the training data? 25:45 what is this alignment they speak of?

samanthaqiu
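
(My reading of that part of the paper, offered tentatively: f(i, j) is an indicator computed from known structures at analysis time, e.g. "residues i and j are in contact", not an input to training; the alignment statistic is simply the share of attention mass that lands on pairs where f is 1. A toy sketch with made-up arrays:)

```python
# Toy sketch of the attention/contact alignment statistic as I understand it: the fraction of
# total attention weight that falls on residue pairs which are in contact in the solved structure.
import numpy as np

def attention_alignment(attn, contact):
    """attn: (len, len) attention weights for one head; contact: (len, len) 0/1 contact map."""
    return float((attn * contact).sum() / attn.sum())

rng = np.random.default_rng(0)
attn = rng.random((10, 10))                           # placeholder attention map
contact = (rng.random((10, 10)) > 0.9).astype(float)  # placeholder contact map
print(attention_alignment(attn, contact))
```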

I guess there will be quite a lot of similar papers coming soon (for example, chromosome close interactions, RNA-DNA interactions, ORF identification, CRISPR gRNA design/evaluation, ...).

ec

Too bad that you are not doing more RL, your videos are so good.

christianleininger

You should be a professor at every university.

allessandroable

Please upload videos for ViLBERT, VisualBERT and VisualBERT COCO.

KoliHarshad

With the same naivety, a language model could predict any written program's output - this can't be done.

dmitrysamoylenko

Almost nailed the pronunciations. The Chinese name is pronounced tsai-ming shjong, pretty close.

minhuang

An unrelated question that keeps bugging me: why do you mention your "Attention is all you need" video in every video you produce? Are you trying to push the number of views there to a maximum, or am I just seeing patterns where there are none? In any case, your videos are as great as always, it just keeps tripping me up in every one of them :D

alexanderchebykin

This has no applications; the three-dimensional structure and post-translational modifications are the state of the art of protein research. Sequence alone is not interesting anymore.

hiramcoriarodriguez