Table of Contents
Transformer
  2017 Attention is All You Need
    Encoder and Decoder Stacks
      Encoder
      Decoder
    Attention
    Why self-attention
      Compare with RNN
      Compare with CNN
2019 BERT
  Embeddings
  Encoder stack
  Attention layer
  Training
  Eleven downstream tasks
GPT
  Embedding
  Decoder stack
  Attention layer
  Training
GPT2
GPT3
Reference
The arrival of the Transformer was the key factor behind the qualitative leap in pre-trained models; by comparison, its predecessor, the LSTM, cannot capture longer-range semantic information.
Scaled dot-product attention
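As a reminder of how this works, below is a minimal NumPy sketch of scaled dot-product attention, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T / \sqrt{d_k})V$. The array shapes and random inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v); mask: (n_q, n_k) with 0 = blocked."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n_q, n_k) similarity scores
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)      # blocked positions get ~ -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                  # (n_q, d_v) weighted sum of values

# toy example: 3 query positions, 4 key/value positions, d_k = d_v = 8
rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(3, 8)),
                                   rng.normal(size=(4, 8)),
                                   rng.normal(size=(4, 8)))
print(out.shape)  # (3, 8)
```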
Multi-head attention: similar to grouped / depthwise-separable convolution
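To make the grouped-convolution analogy concrete, here is a hedged sketch of multi-head attention: the d_model channels are split into h independent heads (groups), each head runs scaled dot-product attention in its own d_model/h-dimensional subspace, and the concatenated result is mixed by an output projection W_O. Shapes follow the common d_model = 512, h = 8 configuration; the random weights are placeholders.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d_model); W_q/W_k/W_v/W_o: (d_model, d_model); h heads of size d_model // h."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # split channels into h groups, like grouped convolution
    Q = Q.reshape(n, h, d_head).transpose(1, 0, 2)        # (h, n, d_head)
    K = K.reshape(n, h, d_head).transpose(1, 0, 2)
    V = V.reshape(n, h, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    heads = weights @ V                                   # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # concatenate heads
    return concat @ W_o                                   # final mixing, (n, d_model)

d_model, h, n = 512, 8, 10
rng = np.random.default_rng(0)
W = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4)]
out = multi_head_attention(rng.normal(size=(n, d_model)), *W, h=h)
print(out.shape)  # (10, 512)
```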
Embeddings
Application of attention
Attention in the encoder: only the padding mask is used (the sequence mask is all ones), i.e. attention mask = padding mask;
In the decoder: the masked self-attention sub-layer applies both the padding mask and the sequence (causal) mask, i.e. attention mask = padding mask & sequence mask; the encoder-decoder attention again uses only the padding mask.
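A minimal sketch of how the two masks can be built and combined (the helper names are my own, not from the paper): the padding mask hides PAD positions, the sequence (causal) mask hides future positions, and the decoder's self-attention uses their intersection.

```python
import numpy as np

def padding_mask(token_ids, pad_id=0):
    """1 where the key position is a real token, 0 where it is padding."""
    keep = (token_ids != pad_id).astype(int)             # (n,)
    return np.tile(keep, (len(token_ids), 1))            # (n, n), same row for every query

def sequence_mask(n):
    """Lower-triangular causal mask: position i may only attend to positions <= i."""
    return np.tril(np.ones((n, n), dtype=int))

ids = np.array([5, 8, 3, 0, 0])                          # a length-5 sequence with 2 PAD tokens
enc_mask = padding_mask(ids)                             # encoder / BERT: padding mask only
dec_mask = padding_mask(ids) & sequence_mask(len(ids))   # decoder / GPT: padding AND causal
print(enc_mask)
print(dec_mask)
```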
As side benefit, self-attention could yield more interpretable models. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece and byte-pair representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position.
A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels, or O(log_k(n)) in the case of dilated convolutions, increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of k.
BERT-family models are autoencoding language models: they randomly mask some tokens and, during training, predict those tokens from the surrounding context so as to maximize the prediction probability.
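A hedged sketch of the standard BERT masking recipe (15% of tokens are selected; of those, 80% become [MASK], 10% become a random token, and 10% are left unchanged). The toy vocabulary and helper name are illustrative.

```python
import random

MASK, VOCAB = "[MASK]", ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_tokens(tokens, mask_prob=0.15):
    """Return the corrupted input and the labels (original token at masked positions, else None)."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # only these positions are predicted
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

random.seed(0)
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"]))
```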
[Figure: BERT input/output formats for downstream tasks such as classification and question answering]
BERT only uses a stack of the Transformer's encoder blocks and does not use a sequence mask, so attention mask = padding mask;
the purpose of the sequence mask is to prevent earlier tokens from seeing future tokens, so when no sequence mask is used, self-attention is naturally bidirectional;
by contrast, GPT-1 uses the Transformer's decoder blocks and applies both the sequence mask and the padding mask, so GPT-1 is unidirectional.
This is also the point made in the BERT paper.
The embedding / input encoding is key, for example the positional embedding that is introduced.
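BERT's input encoding sums three learned embeddings: token + segment + position. Below is a minimal sketch under BERT-base sizes (vocab 30,522, 2 segments, 512 positions, hidden size 768); the random weights stand in for learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)
V, SEG, POS, H = 30522, 2, 512, 768           # vocab, segments, max positions, hidden size
tok_emb = rng.normal(scale=0.02, size=(V, H))
seg_emb = rng.normal(scale=0.02, size=(SEG, H))
pos_emb = rng.normal(scale=0.02, size=(POS, H))

def embed(token_ids, segment_ids):
    """Input representation = token embedding + segment embedding + position embedding."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]   # (n, 768)

x = embed(np.array([101, 7592, 102]), np.array([0, 0, 0]))   # e.g. [CLS] hello [SEP]
print(x.shape)  # (3, 768)
```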
In essence, BERT runs a self-supervised learning method over a massive corpus to learn good feature representations for tokens; self-supervised learning here means supervised learning run on data without human annotation. For a specific downstream NLP task, we then simply fine-tune the model on top of the pre-trained model obtained via self-supervised learning.
BERT training therefore consists of two stages:
Pre-training:
Fine-tuning
In contrast to denoising auto-encoders, we only predict the masked words rather than reconstructing the entire input.
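In line with the quote above, the MLM loss is computed only at the masked positions, not over the whole sequence. A hedged cross-entropy sketch with illustrative shapes:

```python
import numpy as np

def masked_lm_loss(logits, labels, ignore_index=-100):
    """logits: (n, vocab); labels: (n,), with ignore_index at positions that were not masked."""
    keep = labels != ignore_index
    logits, labels = logits[keep], labels[keep]           # only masked positions contribute
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 100))                        # 6 tokens, vocabulary of 100
labels = np.array([-100, 42, -100, -100, 7, -100])        # only positions 1 and 4 were masked
print(masked_lm_loss(logits, labels))
```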
GLUE: General language understanding evaluation benchmark, eight classification tasks
SQuAD v1.1: Stanford Question Answering Dataset, a collection of 100k crowdsourced question/answer pairs. Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage (a span-prediction sketch follows this task list).
SQuAD v2.0: The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic.
SWAG: Given a sentence, the task is to choose the most plausible continuation among four choices.
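For SQuAD-style fine-tuning, BERT adds only a start vector S and an end vector E: each token representation T_i is scored by the dot products S·T_i and E·T_j, and the predicted answer is the span (i, j), j ≥ i, with the highest combined score. A hedged sketch with random stand-in representations:

```python
import numpy as np

def predict_span(T, S, E):
    """T: (n, 768) token representations; S, E: (768,) learned start/end vectors."""
    start_scores = T @ S                       # S · T_i for every token i
    end_scores = T @ E                         # E · T_j for every token j
    best, best_span = -np.inf, (0, 0)
    for i in range(len(T)):                    # pick the span (i, j), j >= i,
        for j in range(i, len(T)):             # maximizing start_scores[i] + end_scores[j]
            if start_scores[i] + end_scores[j] > best:
                best, best_span = start_scores[i] + end_scores[j], (i, j)
    return best_span

rng = np.random.default_rng(0)
print(predict_span(rng.normal(size=(20, 768)), rng.normal(size=768), rng.normal(size=768)))
```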
GPT, short for Generative Pre-Training, is, as the name suggests, a pre-trained model.
Before GPT appeared, the common way of using pre-training was word2vec, i.e. learning word representations. After GPT, the common pre-training approach became pre-training the whole network and then improving it for a specific task via fine-tuning.
The reason this kind of model works so well is that each newly generated token is appended to the previously generated sequence, and this sequence becomes the model's input for the next step. This mechanism is called auto-regression, and it is also one of the key ideas behind the strong performance of RNN models.
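A schematic of that auto-regressive loop, with a hypothetical `model` that returns a next-token distribution; greedy decoding is used purely for illustration.

```python
import numpy as np

def generate(model, prompt_ids, max_new_tokens=20, eos_id=None):
    """Greedy auto-regressive decoding: each new token is appended and fed back in."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = model(ids)                  # hypothetical: next-token distribution given context
        next_id = int(np.argmax(probs))     # greedy choice (sampling is also possible)
        ids.append(next_id)                 # the output becomes part of the next input
        if eos_id is not None and next_id == eos_id:
            break
    return ids

# toy stand-in model: always prefers (last token + 1) mod 10 over a 10-token vocabulary
toy_model = lambda ids: np.eye(10)[(ids[-1] + 1) % 10]
print(generate(toy_model, [3], max_new_tokens=5))   # [3, 4, 5, 6, 7, 8]
```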
Positional embedding: We used learned position embeddings instead of the sinusoidal version proposed in the original work. (1024 positions)
Token embedding: We used learned token embeddings. (vocabulary of 50,257 BPE tokens)
The input and output token embeddings are shared (weight tying).
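A hedged sketch of the embedding side of such a decoder-only model: learned token and position embeddings on the way in, and the same token-embedding matrix reused (transposed) to produce output logits, i.e. input/output weight sharing. Sizes follow the numbers in this section (50,257 tokens, 1,024 positions, 768-dimensional states); the random weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
V, P, H = 50257, 1024, 768
W_tok = rng.normal(scale=0.02, size=(V, H))   # learned token embeddings
W_pos = rng.normal(scale=0.02, size=(P, H))   # learned (not sinusoidal) position embeddings

def embed(token_ids):
    """Input representation: token embedding + learned position embedding."""
    return W_tok[token_ids] + W_pos[np.arange(len(token_ids))]           # (n, 768)

def logits(hidden):
    """Output projection reuses W_tok (weight tying): hidden states -> vocabulary logits."""
    return hidden @ W_tok.T                                              # (n, 50257)

h = embed(np.array([11, 257, 995]))      # pretend these went through the 12 decoder blocks
print(logits(h).shape)                   # (3, 50257)
```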
Our model largely follows the original transformer work [62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads).
Suppose $d_{model} = 768$ with $h = 12$ heads, so each head works in $d_k = d_v = 768 / 12 = 64$ dimensions; concatenating the 12 head outputs gives a 768-dimensional vector, and multiplying it by $W^O \in \mathbb{R}^{768 \times 768}$ is still a 768-to-768 mapping, equivalent to a dense layer. This projection step is the extra layer that GPT-attention has compared with BERT-attention.
For the learning-on-unlabeled-data part, we need to learn a language model, i.e. a model that predicts the next token given the preceding context, as the formula shows:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$
After obtaining the Transformer-based model, for a specific task with input x and output y, we add one more layer on top of the Transformer's output:

$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)$$
Finally, when fine-tuning on a downstream task, the language-model objective is used as an auxiliary objective to improve the final result, and the total loss is defined as

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

where $L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m)$ is the supervised task loss.
Fine-tuning only covers classification-style tasks.
So what does GPT-2 demonstrate? For a language model, text from different domains effectively constitutes an independent task each, and learning these tasks together amounts to multi-task learning. What is special is that these tasks are homogeneous, i.e. they share the same objective function, so they can be learned in a unified way. Enlarging the dataset therefore means the model has learned over more domains, i.e. its generalization ability is further strengthened.
The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText.
Language Models are Unsupervised Multitask Learners; zero-shot means there are no gradient updates, but the model is guided (conditioned) by a prompt.
Dataset: WebText
Tasks:
Language Models are Few-Shot Learners
Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art finetuning approaches.
Dataset:
For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.
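Since GPT-3 receives the task purely as text, "few-shot" just means packing K demonstrations plus the query into the prompt, with no gradient updates (K = 0 gives the zero-shot setting). A hedged, illustrative prompt builder; the English-to-French format below is only an example.

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, K solved examples, and the unsolved query into one prompt."""
    lines = [instruction, ""]
    for src, tgt in demonstrations:        # K in-context examples (K = 0 gives zero-shot)
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")            # the model is asked to continue this line
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
print(build_few_shot_prompt("Translate English to French:", demos, "peppermint"))
```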