Contents

Compute complexity of self-attention

1. Compute complexity of matrix multiplication

If A, B are n × n matrices over a field, then their product AB is also an n × n matrix over that field, defined entrywise as

$$(AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

1.1 Schoolbook algorithm

The simplest approach to computing the product of two n × n matrices A and B is to compute the arithmetic expressions coming from the definition of matrix multiplication. In pseudocode:
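A minimal Python sketch standing in for the pseudocode (it simply evaluates the definition above, assuming the inputs are square matrices given as lists of lists):

```python
def schoolbook_matmul(A, B):
    """Multiply two n x n matrices by evaluating the definition entry by entry."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # (AB)_ij = sum over k of A_ik * B_kj:
            # n scalar multiplications and n - 1 additions per entry
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```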

This algorithm requires, in the worst case, $n^3$ multiplications of scalars and $n^3 - n^2$ additions for computing the product of two square $n \times n$ matrices. Its computational complexity is therefore $O(n^3)$.

Surprisingly, algorithms exist that provide better running times than this straightforward "schoolbook algorithm". The first to be discovered was Strassen's algorithm, devised by Volker Strassen in 1969 and often referred to as "fast matrix multiplication".[1] The optimal number of field operations needed to multiply two square n × n matrices up to constant factors is still unknown. This is a major open question in theoretical computer science.

As of December 2020, the matrix multiplication algorithm with best asymptotic complexity runs in $O(n^{2.3728596})$ time, given by Josh Alman and Virginia Vassilevska Williams.

1.2 Strassen's algorithm

Strassen's algorithm improves on naive matrix multiplication through a divide-and-conquer approach. The key observation is that multiplying two 2 × 2 matrices can be done with only 7 multiplications, instead of the usual 8 (at the expense of several additional addition and subtraction operations). This means that, treating the input n × n matrices as block 2 × 2 matrices, the task of multiplying n × n matrices can be reduced to 7 subproblems of multiplying n/2 × n/2 matrices. Applying this recursively gives an algorithm needing $O(n^{\log_2 7}) \approx O(n^{2.807})$ field operations.
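For illustration, here is a minimal Python sketch of that recursion (a sketch, assuming n is a power of two and using NumPy only for the block additions; a practical implementation would fall back to ordinary multiplication below some cutoff):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Sketch of Strassen's algorithm for n x n matrices, n a power of two."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # ordinary multiplication for small blocks
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # 7 recursive multiplications instead of 8
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # Reassemble the four blocks of the product
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

With 7 subproblems of size n/2 and $O(n^2)$ work for the block additions, the recurrence $T(n) = 7\,T(n/2) + O(n^2)$ solves to $O(n^{\log_2 7})$.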

Unlike algorithms with faster asymptotic complexity, Strassen's algorithm is used in practice. The numerical stability is reduced compared to the naive algorithm, but it is faster in cases where n > 100 or so and appears in several libraries, such as BLAS. It is very useful for large matrices over exact domains such as finite fields, where numerical stability is not an issue.

2. Compute complexity in self-attention

2.1 Definition of self-attention

In this blog, the self-attention layer consists of a point-wise feed-forward computation and the self-attention function. The compute complexity claimed in the Transformer paper covers only the self-attention function.

[Figure: self-attention]

[Figure: complexity]
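For reference, the self-attention function in question is the scaled dot-product attention of the Transformer paper (the figure above presumably showed the same formula):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are linear projections of the input $X$ and $d_k$ is the key dimension.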

2.2 Complexity calculations

Assume X is the input of self-attention, whose shape is (n, d), where n and d represent the number of tokens and the dimension of each token, respectively.
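As a rough sketch of where the two terms below come from (assuming d × d projection matrices for Q, K, V and a point-wise feed-forward layer of inner width d_ff = 4d; grouping the projections with the feed-forward term is my reading, not something spelled out in the paper):

```python
def self_attention_layer_cost(n, d, d_ff=None):
    """Rough multiply-count for one layer on an (n, d) input X.

    Assumes d x d projection matrices for Q, K, V and a point-wise
    feed-forward layer d -> d_ff -> d (d_ff defaults to 4 * d).
    """
    d_ff = 4 * d if d_ff is None else d_ff
    projections  = 3 * n * d * d      # Q = X W_Q, K = X W_K, V = X W_V        -> O(n d^2)
    scores       = n * n * d          # Q K^T: (n, d) times (d, n)             -> O(n^2 d)
    weighted_sum = n * n * d          # softmax(scores) V: (n, n) times (n, d) -> O(n^2 d)
    feed_forward = 2 * n * d * d_ff   # two point-wise linear maps             -> O(n d^2)
    return {
        "attention function (n^2 d)": scores + weighted_sum,
        "projections + feed-forward (n d^2)": projections + feed_forward,
    }

# Example with the "usual Google setting" quoted in Section 3:
print(self_attention_layer_cost(n=100, d=1000))
```

Under these assumptions, for $n \approx 100$ and $d \approx 1000$ the $O(nd^2)$ part dominates the $O(n^2 d)$ attention part, which is why it matters whether it is counted.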

Therefore, the total complexity of the layer is $O(nd^2 + n^2 d)$. But, as mentioned above, the compute complexity claimed in the paper covers only the self-attention function; the point-wise feed-forward complexity is not included.

So what Table 1 reports is strictly the attention mechanism; it is not the complexity of the full Transformer. The authors are well aware of the complexity of their model (I quote):

Separable convolutions [6], however, decrease the complexity considerably, to O(k·n·d + n·d^2). Even with k = n, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.

3. Personal view

Quote:

Now, to understand what Table 1 contains please keep in mind how most people scan papers: they read title, abstract, then look at figures and tables. Only then if the results were interesting, they read the paper more thoroughly. So, the main idea of the Attention is all you need paper was to replace the RNN layers completely with attention mechanism in seq2seq setting because RNNs were really slow to train. If you look at the Table 1 in this context, you see that it compares RNN, CNN and Attention and highlights the motivation for the paper: using Attention should have been beneficial over RNNs and CNNs. It should have been advantageous in 3 aspects: constant amount of calculation steps, constant amount of operations and lower computational complexity for usual Google setting, where n ~= 100 and d ~= 1000.