Transformer
Contents:
  2017 Attention is All You Need
    Encoder and Decoder Stacks: Encoder, Decoder
    Attention
    Why self-attention: Compare with RNN, Compare with CNN
  2019 BERT
    Embeddings, Encoder stack, Attention layer, Training, Eleven downstream tasks
  GPT
    Embedding, Decoder stack, Attention layer, Training
  GPT2
  GPT3
  Reference
Scaled dot-product attention
Multi-head attention: similar to grouped / depthwise separable convolution
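To make the grouped-convolution analogy concrete, here is a minimal NumPy sketch of multi-head self-attention. The channels are split into `num_heads` slices and each slice attends independently, much as each group in a grouped convolution sees only its own channels. The function name, the weight names (`w_q`, `w_k`, `w_v`, `w_o`), and the toy sizes are assumptions made for illustration; biases, masking, and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (n, d_model); all weights: (d_model, d_model). Illustrative only."""
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (n, d_model) -> (num_heads, n, d_head): one channel slice per head.
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, n, n)
    heads = softmax(scores) @ v                          # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o                                  # mix heads back together

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 16, 4  # toy sizes, not from the paper
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

The final multiplication by `w_o` is what lets information flow between heads, playing a role comparable to the pointwise convolution that follows a depthwise convolution.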
Application of attention
Attention in the encoder: only the padding mask is used; the sequence mask is all ones (nothing is hidden), so attention mask = padding mask.
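A small NumPy sketch of scaled dot-product attention with a padding mask may help here. It assumes a single unbatched sequence and a mask convention of 1 = attend, 0 = blocked; the function names and the toy input are illustrative and not taken from any particular implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (seq_len, d_k). mask: (seq_len, seq_len), 1 = attend, 0 = blocked."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    if mask is not None:
        # Blocked positions get a very negative score, so softmax gives them ~0.
        scores = np.where(mask == 1, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Toy input: 4 positions, the last one is padding.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
is_real = np.array([1, 1, 1, 0])
padding_mask = np.tile(is_real, (4, 1))  # every query may attend only to real keys
out, weights = scaled_dot_product_attention(x, x, x, padding_mask)
```

In this sketch every row of `weights` puts zero mass on the padded key, while real tokens can attend to each other in both directions.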
As side benefit, self-attention could yield more interpretable models. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece and byte-pair representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position.
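The crossover at n = d can be checked with a back-of-the-envelope calculation. The sketch below uses the per-layer complexities listed in the paper (self-attention O(n^2 · d), recurrent O(n · d^2), restricted self-attention O(r · n · d)) with constant factors dropped; the specific sequence lengths and the neighborhood size are arbitrary choices for illustration.

```python
def self_attention_cost(n, d):
    """Full self-attention: every position attends to every position."""
    return n * n * d

def recurrent_cost(n, d):
    """Recurrent layer: one d x d state update per position."""
    return n * d * d

def restricted_attention_cost(n, d, r):
    """Self-attention limited to a neighborhood of size r per position."""
    return r * n * d

d = 512  # representation size of the base Transformer
short = (self_attention_cost(50, d), recurrent_cost(50, d))      # n < d
long = (self_attention_cost(4096, d), recurrent_cost(4096, d))   # n > d
restricted = restricted_attention_cost(4096, d, r=128)

print("n=50:   attention", short[0], "recurrent", short[1])
print("n=4096: attention", long[0], "recurrent", long[1])
print("n=4096, r=128: restricted attention", restricted)
```

These counts ignore parallelism: the recurrent layer also needs O(n) sequential steps, while the attention layer needs a constant number.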
A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels, or O(log_k(n)) in the case of dilated convolutions, increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of k.
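The two layer counts can be illustrated by tracking how the receptive field grows. This is a rough sketch under stated assumptions: stride 1, a receptive field that only has to span n positions (as with one-sided kernels), and dilation k^i at layer i for the dilated case. The function names are made up for this example.

```python
import math

def layers_contiguous(n, k):
    # With stride 1, each layer widens the receptive field by (k - 1):
    # after L layers it spans 1 + L * (k - 1) positions.
    return math.ceil((n - 1) / (k - 1))

def layers_dilated(n, k):
    # With dilation k**i at layer i, the receptive field is k**L after L layers.
    layers, field = 0, 1
    while field < n:
        field *= k
        layers += 1
    return layers

n, k = 1024, 3
contiguous = layers_contiguous(n, k)  # grows linearly in n / k
dilated = layers_dilated(n, k)        # grows logarithmically in n
```

For n = 1024 and k = 3 the contiguous stack needs hundreds of layers while the dilated stack needs fewer than ten, which is the gap between O(n/k) and O(log_k(n)).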
question answer
BERT uses only a stack of Transformer encoder modules and no sequence mask, so attention mask = padding mask.
The sequence mask prevents earlier tokens from seeing future tokens, so when it is not used, self-attention is naturally bidirectional.
In contrast to BERT, GPT-1 uses the Transformer decoder module and applies both a sequence mask and a padding mask, so GPT-1 is unidirectional.
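The difference between the two masking schemes can be made concrete with a short NumPy sketch. It assumes the convention 1 = may attend, 0 = blocked, with rows as queries and columns as keys; the helper names are illustrative.

```python
import numpy as np

def padding_mask(is_real):
    """Block padded keys for every query. is_real: 1 for real tokens, 0 for padding."""
    is_real = np.asarray(is_real)
    n = is_real.shape[0]
    return np.tile(is_real, (n, 1))

def sequence_mask(n):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=int))

is_real = [1, 1, 1, 0]  # three real tokens followed by one padding token

# Encoder-style (BERT): padding mask only, so attention is bidirectional.
encoder_mask = padding_mask(is_real)

# Decoder-style (GPT): padding mask combined with the sequence mask.
decoder_mask = padding_mask(is_real) * sequence_mask(len(is_real))
```

In `encoder_mask` the first token can already see the third one; in `decoder_mask` it can see only itself, which is what makes the decoder unidirectional.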
The embeddings and the input encoding are key, for example the positional embedding that is introduced.
In contrast to denoising auto-encoders, we only predict the masked words rather than reconstructing the entire input.
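A simplified sketch of this masked-word objective follows. It assumes the 15% masking rate reported in the BERT paper and, for brevity, always substitutes `[MASK]`, leaving out the paper's 80/10/10 replacement rule. The function name and the use of `None` to mark ignored positions are choices made for this example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, targets); targets is None wherever no prediction is made."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # the model sees the mask symbol
            targets.append(tok)   # and must predict the original word
        else:
            inputs.append(tok)
            targets.append(None)  # unmasked words contribute nothing to the loss
    return inputs, targets

tokens = "the man went to the store to buy milk".split()
inputs, targets = mask_tokens(tokens, seed=0)

# Only these positions would enter the loss, unlike a denoising
# auto-encoder, which would reconstruct every position.
loss_positions = [i for i, t in enumerate(targets) if t is not None]
```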
GLUE: General Language Understanding Evaluation benchmark, eight classification tasks
SQuAD v1.1: Stanford Question Answering Dataset, a collection of 100k crowdsourced question/answer pairs. Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.
SQuAD v2.0: The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic.
SWAG: Given a sentence, the task is to choose the most plausible continuation among four choices.
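GPT stands for Generative Pre-Training; as the name suggests, it is a pre-trained model.
This kind of model works well because each newly generated word is appended to the sequence of previously generated words, and that sequence becomes the model's input at the next step. This mechanism is called auto-regression, and it is also an important idea behind the strong performance of RNN models.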
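The auto-regressive loop can be sketched in a few lines. The "model" here is a hypothetical lookup table that maps the last token to the next one; a real language model would instead score the whole vocabulary given the full sequence. All names in the block are illustrative.

```python
def generate(next_token, prompt, max_new_tokens, eos="<eos>"):
    """Greedy auto-regressive decoding: feed each output back in as input."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(sequence)  # the model sees everything generated so far
        sequence.append(token)        # the new token becomes part of the next input
        if token == eos:
            break
    return sequence

# Hypothetical stand-in for a trained model: a fixed bigram table.
BIGRAMS = {"the": "cat", "cat": "sat", "sat": "<eos>"}

def toy_next_token(sequence):
    return BIGRAMS.get(sequence[-1], "<eos>")

result = generate(toy_next_token, ["the"], max_new_tokens=10)
print(result)  # ['the', 'cat', 'sat', '<eos>']
```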
Positional embedding: We used learned position embeddings instead of the sinusoidal version proposed in the original work. Table size: 1024 positions (the GPT-2 context length).
Token embedding: We used learned token embeddings. Table size: 50257 tokens (the GPT-2 vocabulary size).
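As a rough illustration, the input representation can be formed by looking up one row from each learned table and adding them. The sketch below uses small toy sizes so that it runs instantly; the sizes in these notes (50257 tokens, 1024 positions, 768 dimensions) would only change the table shapes. The random initialization stands in for trained weights, and the names are illustrative.

```python
import numpy as np

# Toy sizes for illustration; the notes cite 50257 tokens, 1024 positions, 768 dims.
vocab_size, max_positions, d_model = 100, 16, 8

rng = np.random.default_rng(0)
token_table = rng.normal(scale=0.02, size=(vocab_size, d_model))        # learned in training
position_table = rng.normal(scale=0.02, size=(max_positions, d_model))  # learned, not sinusoidal

def embed(token_ids):
    """Sum of token embedding and position embedding for each input position."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(token_ids.shape[0])
    return token_table[token_ids] + position_table[positions]

hidden = embed([5, 42, 7])  # shape (3, d_model)
```

Because the position table has a fixed number of rows, a model built this way cannot embed positions beyond `max_positions`.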
Our model largely follows the original transformer work [62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads).
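If [symbol lost in extraction], then [symbol lost in extraction]; after multiplying by [symbol lost in extraction], the result is still [symbol lost in extraction].
After [symbol lost in extraction], the mapping is still 768 to 768, which is equivalent to a dense layer; this projection is one more layer than BERT-attention has.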
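Finally, when fine-tuning on a downstream task, the language-modeling objective is used as an auxiliary objective to improve the final result. The total objective is defined as L3(C) = L2(C) + λ · L1(C), where L1 is the language-modeling objective, L2 is the supervised task objective, and λ is a weight.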
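A minimal sketch of this combined objective, assuming the form L3 = L2 + λ · L1 from the GPT paper with L1 the language-modeling term and L2 the supervised task term. Here both terms are written as losses to minimize (mean negative log-likelihood) rather than log-likelihoods to maximize, and the default λ = 0.5 is the value reported in the paper. The function names and the example probabilities are hypothetical.

```python
import math

def negative_log_likelihood(probs_of_correct):
    """Mean negative log-probability that the model assigned to the correct outputs."""
    return -sum(math.log(p) for p in probs_of_correct) / len(probs_of_correct)

def finetune_loss(task_probs, lm_probs, lam=0.5):
    """Total loss = task loss + lam * auxiliary language-modeling loss."""
    task_loss = negative_log_likelihood(task_probs)  # L2: supervised objective
    lm_loss = negative_log_likelihood(lm_probs)      # L1: next-token objective
    return task_loss + lam * lm_loss

# Hypothetical probabilities assigned to the correct label and the correct next tokens.
loss = finetune_loss(task_probs=[0.5], lm_probs=[0.25, 0.25], lam=0.5)
```

Setting `lam=0` recovers plain supervised fine-tuning, which makes the contribution of the auxiliary term easy to ablate.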
The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText.
Language Models are Unsupervised Multitask Learners. Zero-shot means that there are no gradient updates, but the model is still given guidance (a prompt).
Language Models are Few-Shot Learners
Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art finetuning approaches.
For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.