Contents

Transformer

The emergence of the Transformer was the key factor behind the qualitative leap in pre-trained models; by comparison, its predecessor, the LSTM, cannot capture longer-range semantic information.

2017 Attention is All You Need


Encoder and Decoder Stacks

Encoder

Decoder

Attention

Given $W^Q, W^K, W^V$:

$$Q = X_q W^Q,\qquad K = X_k W^K,\qquad V = X_v W^V$$

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\odot \text{attention mask}+f(\text{padding mask})\right)V$$
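As a minimal sketch of the formula above (not a reference implementation), masked scaled dot-product attention can be written in NumPy as follows; the function name, the mask conventions (1 = keep, 0 = block) and the constant `-1e4` are assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, attention_mask=None, padding_mask=None):
    """Q: [n_q, d_k], K: [n_k, d_k], V: [n_k, d_v].
    attention_mask: [n_q, n_k], 1 = may attend, 0 = blocked (e.g. a causal mask).
    padding_mask:   [n_k],      1 = real token, 0 = padding token."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # [n_q, n_k]
    if attention_mask is not None:
        # zero out blocked scores and push them towards -inf, as in the formula
        scores = scores * attention_mask + (-1e4) * (1.0 - attention_mask)
    if padding_mask is not None:
        scores = scores + (-1e4) * (1.0 - padding_mask)    # f(padding mask)
    return softmax(scores, axis=-1) @ V                    # [n_q, d_v]
```

With both masks absent this reduces to the vanilla attention of the original paper.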

Why self-attention

As side benefit, self-attention could yield more interpretable models. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.


Compare with RNN

A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece and byte-pair representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position.
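As a rough, illustrative calculation: with sentence length $n = 50$ and representation dimension $d = 512$, a self-attention layer does on the order of $n^2 d \approx 1.3\times10^6$ multiply-adds per layer, while a recurrent layer does on the order of $n d^2 \approx 1.3\times10^7$, i.e. roughly ten times more, precisely because $n < d$.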

Compare with CNN

A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ in the case of dilated convolutions, increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of k.
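For a concrete (illustrative) example: with $n = 1024$ and kernel width $k = 3$, contiguous convolutions need on the order of $n/k \approx 342$ stacked layers to connect every pair of positions, whereas dilated convolutions need only about $\log_3 1024 \approx 6.3$ layers.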

2019 BERT

BERT-style models are autoencoding language models: they randomly mask out some tokens and, during training, predict those tokens from the surrounding context so as to maximize the prediction probability.

Embeddings

Encoder stack

BERT uses only a stack of Transformer encoder blocks, without a sequence mask, so the attention mask equals the padding mask;

The sequence mask exists to prevent earlier tokens from seeing future tokens, so when no sequence mask is used, self-attention is naturally bidirectional;

In contrast to BERT, GPT-1 uses Transformer decoder blocks and applies both a sequence mask and a padding mask, so GPT-1 is unidirectional;

This is also what the BERT paper points out.

Attention layer

Given $W^Q, W^K, W^V$:

$$Q = X_p W^Q,\qquad K = X_p W^K,\qquad V = X_p W^V$$

$$\mathrm{Attention}(Q,K,V)=\mathrm{dropout}\!\left[\mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}+f(\text{padding mask})\right)\right]V$$

where $f(\text{padding mask}) = -10000\,(1-\text{padding mask})$ and

$$\text{padding mask}=\begin{cases}1, & \text{real token}\\ 0, & \text{padding token}\end{cases}$$

so padding positions get a score of $-10000$ and receive (approximately) zero attention weight after the softmax, while real tokens keep their original scores.
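A small NumPy sketch of this BERT-style attention with the padding mask and attention-probability dropout (the dropout handling and the random generator argument are illustrative assumptions, not the reference implementation):

```python
import numpy as np

def bert_attention(X, W_q, W_k, W_v, padding_mask, drop_rate=0.1, rng=None):
    """X: [seq_len, hidden]; padding_mask: [seq_len], 1 = real token, 0 = padding."""
    d_k = W_q.shape[-1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_k)                        # [seq_len, seq_len]
    scores = scores + (-10000.0) * (1.0 - padding_mask)    # f(padding mask), broadcast over keys
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)     # softmax over the key axis
    if rng is not None:                                    # dropout on the attention probabilities
        keep = (rng.random(weights.shape) >= drop_rate).astype(weights.dtype)
        weights = weights * keep / (1.0 - drop_rate)
    return weights @ V                                     # [seq_len, d_v]
```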


Training

The embeddings and input encoding are key, for example the positional embeddings that are introduced.

In essence, BERT runs self-supervised learning on top of a massive corpus to learn good feature representations for tokens; "self-supervised learning" here means supervised learning run on data without human annotation. For a later, specific NLP task, it is enough to fine-tune the model on top of the pre-trained model obtained by this self-supervised learning.

BERT training therefore consists of two stages:

Pre-training:

Fine-tuning


In contrast to denoising auto-encoders, we only predict the masked words rather than reconstructing the entire input.
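As an illustration of "only predict the masked words", here is a toy sketch of the masking step; the 15% rate and the 80/10/10 replacement split follow the BERT paper, while the token strings and vocabulary are made up:

```python
import random

MASK, VOCAB = "[MASK]", ["the", "cat", "sat", "on", "mat"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None except at masked positions."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # loss is computed only here
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```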

Eleven downstream tasks

GPT

GPT, short for Generative Pre-Training, is, as the name suggests, a pre-trained model.

Before GPT appeared, the common way to use pre-training was word2vec, i.e. learning representations of words. After GPT, the common approach became pre-training the entire network and then improving a specific task via fine-tuning.

This kind of model works well because, after each new token is produced, it is appended to the previously generated sequence, and that sequence becomes the model's input for the next step. This mechanism is called auto-regression, and it is also a key idea behind the strong performance of RNN models.
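A schematic of the auto-regressive loop described above; `model` is a placeholder for any next-token predictor, not a real API:

```python
def generate(model, prompt_ids, max_new_tokens=20, eos_id=None):
    """Greedy auto-regressive decoding: each new token is appended to the
    sequence and fed back in as part of the next step's input."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = model(ids)          # predict the next token from the full prefix
        ids.append(next_id)           # the prediction becomes part of the new input
        if eos_id is not None and next_id == eos_id:
            break
    return ids
```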

Embedding

Decoder stack

Our model largely follows the original transformer work [62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads).


Attention layer

Input: $h_0$. Given $W_i^Q, W_i^K, W_i^V$ for layer $i$:

$$Q_i = X_p W_i^Q,\qquad K_i = X_p W_i^K,\qquad V_i = X_p W_i^V$$

$$\text{intermediate}=\mathrm{softmax}\!\left(\frac{Q_i\,[K_{i-1},K_i]^T}{\sqrt{d_k}}\odot \text{sequence mask}+f(\text{sequence mask})\right)[V_{i-1},V_i]$$

$$\mathrm{MaskAttention}_i(Q,K,V)=\mathrm{proj}(\text{intermediate})$$

where the sequence mask is a triangular matrix whose 1s fill the lower triangle, counting from the lower-right corner, and $f(\text{sequence mask}) = -10^{10}\,(1-\text{sequence mask})$.

Suppose $Q_i, K_i, V_i \in \mathbb{R}^{\text{seq\_length}\times 768}$. Then $Q_i[K_{i-1},K_i]^T \in \mathbb{R}^{\text{seq\_length}\times 2\,\text{seq\_length}}$, and after multiplying by the values we are back to $Q_i[K_{i-1},K_i]^T\,[V_{i-1},V_i] \in \mathbb{R}^{\text{seq\_length}\times 768}$.

After proj the mapping is still 768 → 768, equivalent to a dense layer; this projection step is one extra layer that GPT's attention has compared with BERT's attention.
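A NumPy sketch of this masked attention, including the concatenation with cached keys and values ($[K_{i-1}, K_i]$, $[V_{i-1}, V_i]$) and a sequence mask whose 1s fill the lower triangle counting from the lower-right corner; the shapes and the `W_proj` matrix standing in for proj are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def masked_attention(Q, K_new, V_new, W_proj, K_past=None, V_past=None):
    """Q, K_new, V_new: [q_len, d]; K_past, V_past: [past_len, d] cached from earlier steps."""
    K = K_new if K_past is None else np.concatenate([K_past, K_new], axis=0)  # [k_len, d]
    V = V_new if V_past is None else np.concatenate([V_past, V_new], axis=0)
    q_len, k_len, d_k = Q.shape[0], K.shape[0], Q.shape[-1]
    # lower-triangular mask aligned to the lower-right corner: the t-th query may
    # attend to all cached positions plus the first t+1 of the new positions
    seq_mask = np.tril(np.ones((q_len, k_len)), k=k_len - q_len)
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores * seq_mask + (-1e10) * (1.0 - seq_mask)   # f(sequence mask)
    return softmax(scores) @ V @ W_proj                       # [q_len, d] -> output projection
```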

Training

For the part learned on unlabeled data, we need to learn a language model, i.e. a model that predicts the next token from the preceding context, as in the formula: $L_1(\mathcal{U})=\sum_i \log P(u_i \mid u_{i-k},\ldots,u_{i-1};\Theta)$.

Once we have the Transformer-based model, for a specific task with input $x$ and output $y$, we add one more layer on top of the Transformer output: $P(y \mid x^1,\ldots,x^m)=\mathrm{softmax}(h_l^m W_y)$, which gives a new loss function $L_2=\sum_{(x,y)}\log P(y \mid x^1,\ldots,x^m)$.

Finally, when fine-tuning on a downstream task, the language-modeling objective is used as an auxiliary objective to strengthen the final result, and the total objective is defined as $L = L_1 + L_2$.
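A toy sketch of the combined objective during fine-tuning, written as losses to minimize (i.e. the negatives of $L_1$ and $L_2$ above); all inputs are assumed to be already-computed model probabilities:

```python
import numpy as np

def nll(probs, targets):
    """Negative log-likelihood; probs: [n, vocab], targets: [n] token indices."""
    return -np.log(probs[np.arange(len(targets)), targets] + 1e-12).sum()

def combined_loss(lm_probs, next_tokens, cls_probs, label):
    """lm_probs: next-token distributions from the language-model head.
    cls_probs: distribution from the extra softmax layer W_y on top of the transformer."""
    L1 = nll(lm_probs, next_tokens)             # auxiliary language-modeling term
    L2 = -np.log(cls_probs[label] + 1e-12)      # task term, -log P(y | x^1..x^m)
    return L1 + L2                              # total objective L = L1 + L2
```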

Fine-tuning only handles classification-style tasks.
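For illustration, a sketch of how different tasks can be serialized into a single token sequence with start/delimiter/extract tokens, in the spirit of GPT-1's input transformations; the token strings themselves are placeholders, not GPT-1's actual vocabulary:

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def format_classification(text):
    return [START] + text.split() + [EXTRACT]

def format_entailment(premise, hypothesis):
    # two-sentence tasks are joined with a delimiter token
    return [START] + premise.split() + [DELIM] + hypothesis.split() + [EXTRACT]

# the representation at the <e> position is fed to the added softmax layer W_y
print(format_entailment("a man is sleeping", "someone is awake"))
```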

GPT2

What does GPT-2 demonstrate? For a language model, text from different domains can be treated as separate tasks, and learning these tasks together amounts to multi-task learning. What is special is that these tasks are homogeneous, i.e. they share the same objective function, so they can be learned in a unified way. Enlarging the dataset is then equivalent to having the model learn over more domains, which further strengthens its generalization ability.

The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText.

Language Models are Unsupervised Multitask Learners; here "zero-shot" means no gradient updates, but the model is still guided by a prompt.

Dataset:

Tasks:

GPT3

Language Models are Few-Shot Learners

Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art finetuning approaches.

Dataset:


For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.
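A sketch of what "specified purely via text interaction" can look like: a few-shot prompt is just the task description, a handful of demonstrations, and the new query concatenated as plain text (the wording and the `=>` separator below are made up, not taken from the GPT-3 paper):

```python
def few_shot_prompt(task_description, demonstrations, query):
    """Build a few-shot prompt: no gradient updates, the examples are only text."""
    lines = [task_description]
    for x, y in demonstrations:
        lines.append(f"{x} => {y}")
    lines.append(f"{query} =>")      # the model is expected to continue with the answer
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French:",
    [("cheese", "fromage"), ("house", "maison")],
    "cat",
)
print(prompt)
```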


Reference