To start, let’s establish what a graph is. A graph represents the relations (edges) between a collection of entities (nodes).
To further describe each node, edge or the entire graph, we can store information in each of these pieces of the graph.
We can additionally specialize graphs by associating directionality to edges (directed, undirected).
A GNN is an optimizable transformation on all attributes of the graph (nodes, edges, global-context) that preserves graph symmetries (permutation invariances).
GNNs adopt a “graph-in, graph-out” architecture meaning that these model types accept a graph as input, with information loaded into its nodes, edges and global-context, and progressively transform these embeddings, without changing the connectivity of the input graph.
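A minimal sketch of the "graph-in, graph-out" idea described above, assuming a toy container of my own (`Graph` and `scale_attributes` are hypothetical names, not from any library). The transformation here is a fixed scaling rather than a learned one; the point is only that attributes change while connectivity stays the same.

```python
from dataclasses import dataclass


@dataclass
class Graph:
    """Toy container: attributes on nodes, edges, and the whole graph."""
    node_attrs: list      # one feature vector (list of floats) per node
    edges: list           # (source, target) pairs of node indices
    edge_attrs: list      # one feature vector per edge
    global_attrs: list    # graph-level context vector
    directed: bool = False

    def neighbors(self, node):
        """Nodes reachable from `node` along one edge."""
        out = []
        for src, dst in self.edges:
            if src == node:
                out.append(dst)
            elif not self.directed and dst == node:
                out.append(src)
        return out


def scale_attributes(g, factor):
    """Graph-in, graph-out: every attribute is transformed, edges are copied."""
    return Graph(
        node_attrs=[[factor * v for v in row] for row in g.node_attrs],
        edges=list(g.edges),
        edge_attrs=[[factor * v for v in row] for row in g.edge_attrs],
        global_attrs=[factor * v for v in g.global_attrs],
        directed=g.directed,
    )


g = Graph(
    node_attrs=[[1.0], [2.0], [3.0]],
    edges=[(0, 1), (1, 2)],
    edge_attrs=[[0.5], [0.25]],
    global_attrs=[10.0],
)
h = scale_attributes(g, 2.0)
```

In a real GNN the scaling would be replaced by learned functions, but the output would still be a graph with the same `edges`.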
The design space for our GNN has many levers that can customize the model:
Other types of graphs
The Challenges of Computation on Graphs:
Lack of consistent structure: Graphs are extremely flexible mathematical models, but this means they lack consistent structure across instances. Consider the task of predicting whether a given chemical molecule is toxic; the following issues quickly become apparent:
Representing graphs in a format that can be computed over is non-trivial, and the final representation chosen often depends significantly on the actual problem.
Node-Order Equivariance: nodes have no canonical order, e.g. the same graph can be labelled in two different ways, and the result should not depend on the labelling.
Scalability: Graphs can be really large! Think about social networks like Facebook and Twitter, with user counts in the hundreds of millions to billions. Operating on data this large is not easy. Luckily, most naturally occurring graphs are sparse.
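Two of the challenges above can be illustrated with a small sketch, under the assumption of an undirected toy graph (all function names here are my own). An adjacency list stores only the edges that exist, which is what makes sparsity useful, and a label-independent summary such as the sorted degree list stays the same when the nodes are renamed.

```python
from collections import defaultdict


def to_adjacency_list(edges):
    """Undirected adjacency list: memory grows with the number of edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def relabel(edges, mapping):
    """Rename nodes according to `mapping` (old label -> new label)."""
    return [(mapping[u], mapping[v]) for u, v in edges]


def degree_multiset(edges):
    """Sorted node degrees: a summary that ignores how nodes are labelled."""
    adj = to_adjacency_list(edges)
    return sorted(len(nbrs) for nbrs in adj.values())


# A 4-node path graph: 0 - 1 - 2 - 3
edges = [(0, 1), (1, 2), (2, 3)]
# The same graph with its node labels shuffled.
shuffled = relabel(edges, {0: 2, 1: 0, 2: 3, 3: 1})

adj = to_adjacency_list(edges)
stored = sum(len(nbrs) for nbrs in adj.values())  # two entries per undirected edge
dense = len(adj) ** 2                             # entries in a dense adjacency matrix
```

The two labellings produce different adjacency lists but the same degree summary, and the adjacency list already stores fewer entries than the dense matrix on this tiny graph.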
Operations in a GNN:
Embedding Computation: Message-passing forms the backbone of many GNN architectures today.
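A minimal sketch of one message-passing round, assuming sum aggregation over undirected neighbours and a fixed linear update. The parameters `self_weight` and `nbr_weight` are hypothetical stand-ins for what a real GNN would learn; this is one simple variant, not a specific published architecture.

```python
def message_passing_step(node_feats, edges, self_weight=1.0, nbr_weight=1.0):
    """One round: each node sums its neighbours' features, then updates.

    node_feats: list of equal-length feature vectors, one per node.
    edges: undirected (u, v) pairs; connectivity is left unchanged.
    """
    dim = len(node_feats[0])
    # Aggregate: sum of incoming messages (here the raw neighbour features).
    agg = [[0.0] * dim for _ in node_feats]
    for u, v in edges:
        for k in range(dim):
            agg[u][k] += node_feats[v][k]
            agg[v][k] += node_feats[u][k]
    # Update: combine each node's own features with the aggregate.
    return [
        [self_weight * h[k] + nbr_weight * a[k] for k in range(dim)]
        for h, a in zip(node_feats, agg)
    ]


feats = [[1.0], [2.0], [4.0]]
edges = [(0, 1), (1, 2)]           # path graph 0 - 1 - 2
updated = message_passing_step(feats, edges)
twice = message_passing_step(updated, edges)
```

Stacking rounds lets information travel further: after the second call, node 0 has been influenced by node 2 even though they share no edge.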
- imputed direction
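ESM-1b: The first paper, "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences", introduces ESM-1b, the team's state-of-the-art Transformer-based protein language model, which can predict a protein's structure, function, and other properties directly from its amino acid sequence.
ESM-MSA-1b: The second paper, "MSA Transformer", improves on ESM-1b by changing the model input from a single protein sequence to an MSA matrix and adding two kinds of axial attention (row and column) to the Transformer. For each site, the influence of its sequence (row) and of its alignment position (column) is computed separately, taking full advantage of the two-dimensional input.
ESM-1v: The third paper, "Language models enable zero-shot prediction of the effects of mutations on protein function", proposes ESM-1v, which has the same architecture as ESM-1b but is pretrained on UR90 (ESM-1b was pretrained on UR50). ESM-1v is a general-purpose protein language model capable of zero-shot prediction of protein function: once pretrained, the model can be applied to a variety of specific problems, and a particular protein prediction task (for example, one targeting a specific protein family) can be solved directly without additional training. The mechanism that gives the protein language model its zero-shot ability is pretraining on a protein database containing a vast amount of evolutionary information. When the database covers enough sequences with enough diversity (large and diverse), the model can learn sequence patterns spanning the entire evolutionary tree, so it is likely to have already learned, during pretraining, the sequence patterns of the family it will be applied to, and no further training is needed at transfer time.
ESM-IF1: The fourth paper, "Learning inverse folding from millions of predicted structures".
ESM-Fold: The fifth paper, "Language models of protein sequences at the scale of evolution enable accurate structure prediction".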
| Shorthand | esm.pretrained. | Dataset | Description |
|---|---|---|---|
| ESM-1b | esm1b_t33_650M_UR50S() | UR50, 12m seqs | SOTA general-purpose protein language model. Can be used to predict structure, function and other protein properties directly from individual sequences. Released with Rives et al. 2019 (Dec 2020 update). |
| ESM-MSA-1b | esm_msa1b_t12_100M_UR50S() | UR50 + MSA | MSA Transformer language model. Can be used to extract embeddings from an MSA. Enables SOTA inference of structure. Released with Rao et al. 2021 (ICML'21 version, June 2021). |
| ESM-1v | esm1v_t33_650M_UR90S_1() ... esm1v_t33_650M_UR90S_5() | UR90 | Language model specialized for prediction of variant effects. Enables SOTA zero-shot prediction of the functional effects of sequence variations. Same architecture as ESM-1b, but trained on UniRef90. Released with Meier et al. 2021. |
| ESM-IF1 | esm_if1_gvp4_t16_142M_UR50() | CATH + UR50 | Inverse folding model. Can be used to design sequences for given structures, or to predict functional effects of sequence variation for given structures. Enables SOTA fixed backbone sequence design. Released with Hsu et al. 2022. |
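| Model name | Input data type | Generality |
|---|---|---|
| ESM-1b | single sequence | family-specific |
| ESM-MSA-1b | MSA | few-shot |
| ESM-1v | single sequence | zero-shot |

Labels in gene-mutation datasets come from clinical observation and are generally qualitative categories of int dtype (pathogenic, benign, uncertain), without a precise score.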
Evaluation metric of unsupervised method:
- Given a protein sequence $x$ of length $n$, with final hidden representation $(h_1, \dots, h_n)$, we define the embedding of the sequence to be a vector $e$ which is the average of the hidden representations across the positions in the sequence: $e = \frac{1}{n} \sum_{i=1}^{n} h_i$
- We can compare the similarity of two protein sequences, $x$ and $x'$, having embeddings $e$ and $e'$, using a metric in the embedding space.
- We evaluate the L2 distance $\lVert e - e' \rVert_2$ and the cosine distance $1 - \hat{e} \cdot \hat{e}'$, where $\hat{e} = e / \lVert e \rVert_2$. Additionally we evaluated the L2 distance after projecting the $e$ vectors to the unit sphere.
- In the protein dataset, each amino acid sequence has a probability/score of float dtype obtained from a deep mutational scanning experiment, representing the effect of the mutation on protein function/activity. The score can even be > 1, meaning that the mutation helps increase protein activity.
- To fine-tune the model to predict the effect of changing a single amino acid or combination of amino acids we regress the scaled mutational effect with: $y = \sum_{t} \log p_t(x_t^{mt}) - \log p_t(x_t^{wt})$
- Where $x_t^{mt}$ is the mutated amino acid at position $t$, and $x_t^{wt}$ is the wildtype amino acid. The sum runs over the indices of the mutated positions.
- As an evaluation metric, we report the Spearman $\rho$ (rank correlation coefficient) between the model’s predictions and experimentally measured values. Each column consists of all sequences in the test dataset.
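The quantities in the list above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' code: `hidden` and `log_probs` are hypothetical stand-ins for model outputs, all function names are my own, and the Spearman helper uses the textbook formula that is only valid when there are no tied values.

```python
import math


def mean_embedding(hidden):
    """Average per-position hidden vectors into one sequence embedding."""
    n, dim = len(hidden), len(hidden[0])
    return [sum(h[k] for h in hidden) / n for k in range(dim)]


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def unit(a):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a]


def cosine_distance(a, b):
    return 1.0 - sum(x * y for x, y in zip(unit(a), unit(b)))


def mutation_score(log_probs, wildtype, mutant):
    """Sum of log p(mutant) - log p(wildtype) over the mutated positions.

    log_probs: one dict per position mapping amino acid -> log-probability.
    """
    return sum(
        log_probs[t][mt] - log_probs[t][wt]
        for t, (wt, mt) in enumerate(zip(wildtype, mutant))
        if wt != mt
    )


def ranks(values):
    """1-based ranks; assumes no ties, which keeps the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation via the no-ties formula."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


e = mean_embedding([[1.0, 0.0], [3.0, 4.0]])          # toy hidden states
toy_log_probs = [
    {"A": math.log(0.5), "G": math.log(0.25)},
    {"L": math.log(0.9), "V": math.log(0.1)},
]
score = mutation_score(toy_log_probs, "AL", "GL")      # single mutation A -> G
rho = spearman([0.1, 0.4, 0.9], [1.0, 2.0, 3.0])       # predictions vs. measurements
```

For real data with tied scores, a library routine such as `scipy.stats.spearmanr` would be the safer choice than the no-ties formula.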
Model modification:
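- Both aim to use a large and diverse dataset so that the model learns enough sequence patterns during pre-training that, even without fine-tuning, the zero-shot/unsupervised approach achieves better performance than SOTA/traditional methods;
- For the zero-shot setting, the paper proposes four methods of scoring the effects of mutations using the model, similar to the unsupervised approach in our task3:
- Both compare the difference between the predicted probabilities that the pre-trained model produces for the wild type and mutant type sequences
- ESM-1v takes as input wt/mt sequences with or without [token mask for mutant position], and compares the log-odds distance of the output probability at the mutant position
- Our input is the wt/mt sequences without [token mask for mutant position], and we compare the KL divergence of the output probability at the acceptor/donor position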
$$\sum_{t \in T} \log p(x_t = x_t^{mt} \mid x_{\setminus T}) - \log p(x_t = x_t^{wt} \mid x_{\setminus T})$$
Here the sum is over the mutated positions, and the sequence input to the model is masked at every mutated position.
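The two comparisons discussed above (log-odds at masked mutant positions versus KL divergence at another site on unmasked input) can be contrasted in a toy sketch. Everything here is an assumption for illustration: `toy_model` is a hand-written stand-in for a protein language model, `ALPHABET` and `MASK` are made up, and `site` is a hypothetical index standing in for an acceptor/donor position.

```python
import math

ALPHABET = "ACDE"   # tiny toy alphabet, not the real 20 amino acids
MASK = "#"          # hypothetical mask symbol


def toy_model(tokens):
    """Stand-in for a language model: one probability dict per position.

    A visible token gets extra weight at its own position, and every
    position also leans towards the residue of its visible left neighbour.
    """
    out = []
    for i, tok in enumerate(tokens):
        weights = {a: 1.0 for a in ALPHABET}
        if tok != MASK:
            weights[tok] += 5.0
        if i > 0 and tokens[i - 1] != MASK:
            weights[tokens[i - 1]] += 2.0
        total = sum(weights.values())
        out.append({a: w / total for a, w in weights.items()})
    return out


def masked_marginal(wt, mt, model=toy_model):
    """Mask every mutated position, then sum log p(mt) - log p(wt) there."""
    positions = [t for t, (a, b) in enumerate(zip(wt, mt)) if a != b]
    masked = [MASK if t in positions else a for t, a in enumerate(wt)]
    probs = model(masked)
    return sum(
        math.log(probs[t][mt[t]]) - math.log(probs[t][wt[t]])
        for t in positions
    )


def kl_at_site(wt, mt, site, model=toy_model):
    """KL(p_wt || p_mt) between the unmasked outputs at one chosen site."""
    p = model(list(wt))[site]
    q = model(list(mt))[site]
    return sum(p[a] * math.log(p[a] / q[a]) for a in ALPHABET)


wt, mt = "ACD", "AAD"                  # single mutation C -> A at position 1
log_odds = masked_marginal(wt, mt)     # scored at the masked mutant position
kl = kl_at_site(wt, mt, site=2)        # scored at a neighbouring, unmutated site
```

The first score looks only at the mutated position with the mutation hidden from the model; the second measures how much the mutation shifts the model's output somewhere else in the sequence.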
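Predict protein sequence based on protein structure through auto-regressive training: the protein sequence is predicted from the protein backbone coordinates (the coordinates of three backbone atoms per amino acid: N, Cα, C).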