Reducing the high cost of training NLP models with SRU++
07 Aug 2022
This is a review of a paper I presented at the 집현전 (Jiphyeonjeon) NLP paper study group (advanced track). The paper is "When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute" (EMNLP 2021 Outstanding Paper, Tao Lei). The title of this post is taken from the author's blog post about the paper. Since I prepared the presentation slides in English, the rest of this post is in English as well.
TL;DR, Contribution
- This work validates the idea that attention and fast recurrence are complementary, and presents a self-attentive recurrent unit that achieves strong computational efficiency.
- The work builds upon SRU (Lei et al., 2018), a parallelizable RNN.
- Attention is incorporated into SRU by simply replacing the linear transformation of the input with a self-attention component.
- The proposed architecture is called SRU++.
- SRU++ exhibits strong modeling capacity and training efficiency.
- Measured by bits-per-character on the ENWIK8 dev set vs. GPU hours used for training, SRU++ obtains better BPC while using 1/8 of the resources.
Cost of training large language models
The size of recent models has increased enormously, growing to millions (or even billions) of parameters, along with a significant increase in financial cost.
Recurrence lacks parallelizability
- Forward and backward passes have O(sequence length) unparallelizable operations.
- GPUs can perform many independent computations at once!
- But future RNN hidden states can't be computed in full before past RNN hidden states have been computed. → RNN computation is dependent on state (time step).
- This inhibits training on very large datasets!
If not recurrence, then what? How about attention?
- Attention treats each word's representation as a query to access and incorporate information from a set of values.
- Maximum interaction distance: O(1), since all words interact at every layer! → attention is independent of state (time step). A toy sketch contrasting the two is shown below.
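To make this contrast concrete, here is a toy sketch (my own illustration, not from the paper; the shapes and the tanh cell are made up) of why the recurrence loop is sequential while self-attention is a single batched computation:

```python
import torch

L, d = 128, 64                       # toy sequence length and hidden size
x = torch.randn(L, d)                # input sequence
W, U = torch.randn(d, d), torch.randn(d, d)

# Recurrence: every step needs the previous hidden state,
# so the loop over the L time steps cannot be parallelized.
h = torch.zeros(d)
states = []
for t in range(L):
    h = torch.tanh(x[t] @ W + h @ U)
    states.append(h)

# Self-attention: all positions interact in one batched matrix product,
# with no dependence on the order in which positions are processed.
Q = K = V = x @ W
attn = torch.softmax(Q @ K.T / d ** 0.5, dim=-1) @ V
```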
Transformer as an efficient unit for language modeling
- The Transformer is built entirely upon self-attention and avoids the use of recurrence.
- The Transformer architecture was proposed to accelerate model training (through parallelization) and has become the predominant architecture in NLP.
Motivation : Author's questions
- Revisiting the architectural question: is attention all we need for modeling?
- If recurrence is not a compute bottleneck, can we find a better architecture?
Motivation : Answer
- Previous works have tackled the parallelization/speed problem of RNNs and proposed various fast recurrent networks.
- QRNN (Quasi-RNN)
- SRU (Simple Recurrent Units for Highly Parallelizable Recurrence)
→ These advances eliminate the need to avoid recurrence for the sake of training efficiency.
- Several recent works have achieved strong results by leveraging recurrence in conjunction with self-attention.
- SHA-LSTM (Single Headed Attention RNN: Stop Thinking With Your Head)
- TRANS-BLSTM
→ These results suggest that recurrence and attention are complementary for sequence modeling.
Background : SRU
- Its computation resembles other recurrent networks (e.g., LSTM, GRU).
- x[t]: input vector, c[t]: hidden state vector, f[t]: forget gate, r[t]: reset gate, h[t]: output state vector.
- The complete architecture decomposes into two sub-components (see the equations sketched below):
- A light recurrence, controlled by the forget gate f[t].
- A highway network (skip connection), controlled by the reset gate r[t].
- Initialization: zero mean and 1/d variance, via the uniform distribution $[-\sqrt{3/d}, +\sqrt{3/d}]$.
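For reference, the light recurrence and highway computation can be written as below (reconstructed from the SRU paper (Lei et al., 2018) as I recall it; the exact gate parameterization should be checked against the original):

$$
\begin{aligned}
f[t] &= \sigma(W_f\, x[t] + v_f \odot c[t-1] + b_f) \\
c[t] &= f[t] \odot c[t-1] + (1 - f[t]) \odot (W\, x[t]) \\
r[t] &= \sigma(W_r\, x[t] + v_r \odot c[t-1] + b_r) \\
h[t] &= r[t] \odot c[t] + (1 - r[t]) \odot x[t]
\end{aligned}
$$

The first two lines are the light recurrence, and the last two are the highway connection.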
Background: How is SRU parallelizable?
Two important code-level optimizations are performed to enhance the parallelism and speed of SRU.
- Given the input sequence, SRU combines the three matrix multiplications across all time steps into a single multiplication (a rough sketch of this batching is given below).
- The second optimization performs all element-wise operations in an efficient way (next slide).
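As a rough illustration of the first optimization (my own sketch with made-up shapes, not the authors' code), the projections for the cell input, forget gate, and reset gate over all time steps can be computed as one large matrix multiplication:

```python
import torch

L, B, d = 128, 16, 64                 # toy sequence length, batch size, hidden size
x = torch.randn(L, B, d)              # input sequence

# The three projection matrices (cell input, forget gate, reset gate)
# stacked into a single (d, 3d) weight matrix.
W_stacked = torch.randn(d, 3 * d)

# One big matmul over all time steps at once, instead of three
# separate per-time-step multiplications inside the recurrence loop.
U = x.reshape(L * B, d) @ W_stacked   # (L*B, 3d)
U = U.reshape(L, B, 3, d)             # split back into the three components
```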
The second optimization performs all element-wise operations in an efficient way.
- SRU implements all these element-wise product operations as a single CUDA kernel to accelerate computation.
- Note that each dimension of the hidden vectors is independent once U is computed (the state is still dependent across time steps).
- The computation can run in parallel across each hidden dimension.
- This is the pseudo code for the parallelized element-wise recurrence in SRU.
(Looking at the pseudo code, it loops over the sequence length with a for loop while computing over all hidden dimensions at once; a simplified sketch of this loop is given below.)
- (Similar to batch normalization) the parallelism is applied along the dimension axis.
- (Normalization itself has nothing to do with parallelization.)
(The comparison has nothing to do with batch normalization itself; I only used it to express that the computation is parallelized over the dimensions.)
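Below is a minimal PyTorch sketch of that element-wise recurrence (my own simplification of the paper's pseudo code, without the fused CUDA kernel; the highway/reset part is omitted and the gate parameterization is reduced for brevity):

```python
import torch

def sru_elementwise(U, v_f, b_f, c0):
    """Element-wise SRU recurrence: sequential over time steps,
    but every hidden dimension is updated independently."""
    L, B, _, d = U.shape              # U: pre-computed projections, (L, B, 3, d)
    c = c0                            # initial state, (B, d)
    states = []
    for t in range(L):                # the only sequential loop
        x_tilde, f_in = U[t, :, 0], U[t, :, 1]   # third slice (reset gate) omitted here
        f = torch.sigmoid(f_in + v_f * c + b_f)  # forget gate, computed per dimension
        c = f * c + (1 - f) * x_tilde            # light recurrence
        states.append(c)
    return torch.stack(states), c

# Toy usage with made-up shapes.
U = torch.randn(128, 16, 3, 64)
cs, c_last = sru_elementwise(U, torch.randn(64), torch.zeros(64), torch.zeros(16, 64))
```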
Model architecture: SRU++
- The self-attention block from the Transformer is used; it replaces the linear transformation of the input.
- Single-head self-attention (No multi-headed)
- Residual connection
- Layer normalization
- No positional encoding. (Since the recurrence already encodes positional information, no explicit positional encoding is needed.) A simplified sketch of an SRU++ layer is given below.
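The sketch below is my simplified reading of an SRU++ layer (not the authors' implementation; the projection sizes, residual placement, normalization details, and causal masking in the real code may differ): single-head attention produces the projections U that feed the element-wise recurrence shown earlier.

```python
import torch
import torch.nn as nn

class SRUppAttentionSketch(nn.Module):
    """Simplified sketch: single-head self-attention replaces the linear
    transformation that produces the recurrence inputs U."""
    def __init__(self, d, d_attn):
        super().__init__()
        self.q = nn.Linear(d, d_attn, bias=False)        # project down to d_attn
        self.k = nn.Linear(d_attn, d_attn, bias=False)
        self.v = nn.Linear(d_attn, d_attn, bias=False)
        self.norm = nn.LayerNorm(d_attn)
        self.out = nn.Linear(d_attn, 3 * d, bias=False)  # back up to the 3 projections
        self.scale = d_attn ** -0.5

    def forward(self, x):                                # x: (L, B, d)
        q = self.q(x)
        k, v = self.k(q), self.v(q)
        # Single-head attention; causal masking omitted for brevity.
        scores = torch.einsum("lbe,mbe->blm", q, k) * self.scale
        a = torch.einsum("blm,mbe->lbe", torch.softmax(scores, dim=-1), v)
        u = self.out(self.norm(q + a))                   # residual + layer norm
        L, B, _ = u.shape
        return u.view(L, B, 3, -1)                       # feeds the recurrence (see above)

# Toy usage: d : d' = 4 : 1, as in the experimental setup.
layer = SRUppAttentionSketch(d=2048, d_attn=512)
U = layer(torch.randn(8, 2, 2048))                       # (8, 2, 3, 2048)
```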
Experimental setup
- Datasets
- ENWIK8 : character-level language modeling dataset, 100M tokens, from Wikipedia
- WIKI-103 : word-level language modeling dataset, 100M tokens, from Wikipedia.
- BILLION WORD : largest language modeling dataset, 768M tokens; sentences are randomly shuffled.
- IWSLT'14 (De→En) : low-resource machine translation dataset, 170K translation pairs.
- Models
- Single-head attention, 10 SRU++ layers.
- d (model dimension) : d′ (attention dimension) = 4:1 (e.g., 2048:512).
- Optimization
- RAdam with default betas.
- Weight decay: 0.1, initial learning rate: 3e-4 (a minimal sketch of this setup is given below).
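Purely as an illustration (not the authors' training script; I am assuming PyTorch's torch.optim.RAdam and a placeholder model), the optimizer setup described above would look roughly like this:

```python
import torch

model = torch.nn.Linear(512, 512)     # placeholder for the actual SRU++ language model
optimizer = torch.optim.RAdam(
    model.parameters(),
    lr=3e-4,                          # initial learning rate
    betas=(0.9, 0.999),               # default betas
    weight_decay=0.1,
)
```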
Results
- Does recurrence improve upon attention-only model?
- How much attention is needed?
- Where to use attention?
- Does the ratio d:d′ matter? (d: model dimension, d′: attention dimension)
- Results for ENWIK8, WIKI-103, BILLION WORD, IWSLT’14(De→En)
- Inference speed
- Why does SRU++ reduce training cost in these experiments?
Result : Does recurrence improve upon attention-only models?
- We evaluate SRU++ on several language modeling benchmarks such as Enwik8 dataset.
- Compared to Transformer models such as Transformer-XL (41M parameters, 12 Transformer layers), SRU++ (42M parameters, 10 SRU++ layers) achieves similar results using only a fraction of the resources.
- Transformer-XL: 4 Nvidia 2080 Ti GPUs, about 4 days (≈360 GPU hours).
- SRU++ : 2 Nvidia 2080Ti GPUs (less GPU memory usage).
Result : How much attention is needed?
(It is easiest to think of this as using 10/k attention layers, i.e., an attention block every k-th layer; a toy sketch is given below.)
- k=1 : SRU++ model with attention in every layer.
- k=10 : attention used only in the last layer.
- We see that using 50% less attention (k = 2) achieves almost no increase in test BPC.
- Using only a single attention module (k = 10) leads to a marginal loss of 0.01 BPC but reduces the training time by 40%.
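As a toy illustration of the "attention every k layers" setting (my own sketch of how such a configuration could be expressed, not the paper's code):

```python
def attention_layer_flags(num_layers: int = 10, k: int = 2) -> list:
    """True for layers that keep the attention block: every k-th layer,
    so k=1 means attention everywhere and k=num_layers means only the last layer."""
    return [(i + 1) % k == 0 for i in range(num_layers)]

print(attention_layer_flags(10, 2))    # attention in layers 2, 4, 6, 8, 10
print(attention_layer_flags(10, 10))   # attention only in layer 10
```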
Result : Where to use attention?
(The first figure shows performance when self-attention is used in only one of the 10 layers; the second figure fixes attention in the 10th layer and shows performance as the position of one additional attention layer varies.)
- Applying attention in the first (bottom) layer achieves significantly worse results → likely due to the lack of positional information for the attention.
- Moreover, SRU++ consistently achieves worse results as the attention is moved to lower layers closer to the input embedding. In contrast, results are comparable once the attention is placed in a high enough layer.
- These observations suggest that the model should first learn local features before attention plays its most effective role at capturing long-range dependencies.
Result : Does the ratio d:d’ matter?
(d′ denotes the dimension of the attention layers.)
- A small value of d′ can reduce the amount of computation and the number of parameters used in attention layers but may limit the modeling capacity.
- Changing this ratio from 4 to a higher value gives better results. The best dev result is obtained with a ratio of 8.
Result : ENWIK8
- On this dataset as well, SRU++ achieved better performance with less training time.
Result : WIKI-103
- On this dataset as well, SRU++ achieved better performance with less training time.
Result : BILLION WORD
- To increase model capacity, they increased the model dimension.
- They also increased the number of training iterations and changed the learning rate.
- The Transformer baselines used 32 or 64 V100 GPUs.
- SRU++ used 8 GPUs.
Result : IWSLT’14(De→En)
- The base model is an 8-layer Transformer model containing 20M parameters.
Result : Inference speed
- We use a single V100 GPU for inference.
- Our large model runs at least 4.5x faster than all baseline models except Shortformer (Press et al., 2021).
- In addition, our model achieves 0.9-1.1 perplexity lower than Shortformer and runs 50% faster when using 2 attention layers (k = 5).
Result : Why does SRU++ reduce training cost in our experiments?
- First, combining attention and recurrence gives stronger modeling capacity.
- We also observe higher training efficiency, requiring fewer training steps and a smaller training batch size compared to several Transformer models.
- Finally, the model implementation is an important factor for computation saving. This implementation is highly efficient for two reasons.
- First, the fast recurrence operation of SRU is a reusable module that is already optimized for speed.
- Second, since the recurrence encodes positional information, simple single-head attention can be used and positional encoding removed.
- Advanced attention and positional encoding mechanisms can incur non-trivial computation overhead.
Result : Why does SRU++ reduce training cost in our experiments? (continued)
- Figure 5 (a) shows the average model forward time of a single batch. SRU++ runs 4-5x faster than the Transformer-XL implementation.
- Figure 5 (b) breaks down the computation and highlights the most time-consuming operations in both models.
Conclusion
- We present a recurrent architecture with optional built-in self-attention that achieves leading model capacity and training efficiency.
- We demonstrate that highly expressive and efficient models can be derived using a combination of attention and fast recurrence.
- Our results reaffirm the empirical observations that attention is not all we need, and can be complemented by other sequential modeling modules.