
Reducing the high cost of training NLP models with SRU++


This post is a review of a paper I presented in the latest cohort of the Jiphyeonjeon (집현전) NLP paper study group. The paper is When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute (EMNLP 2021 Outstanding Paper, Tao Lei). The title of this post is taken from the title of the author's blog post about the paper. Since I prepared the presentation slides in English, please understand that the rest of this post is in English.


TL;DR, Contribution


  • In this work, we validate this idea and present a self-attentive recurrent unit that achieves strong computational efficiency.
  • Our work builds upon the SRU (Lei et al., 2018), a parallelizable RNN.
  • Attention is incorporated into SRU by simply replacing the linear transformation of the input with a self-attentive component.
  • The proposed architecture is called SRU++.
  • SRU++ exhibits strong modeling capacity and training efficiency.


  • Bits-per-character on ENWIK8 dev set vs. GPU hours used for training. SRU++ obtains better BPC by using 1/8 of the resources.

Cost of training large language models

The size of recent models has increased enormously, growing to millions (or even billions) of parameters, along with a significant increase in financial cost.


Recurrence lacks parallelizability


  • Forward and backward passes have O(sequence length) unparallelizable operations.
  • GPUs can perform many independent computations at once!
  • But future RNN hidden states cannot be computed before past RNN hidden states have been computed. → an RNN depends on the previous state (time step)
  • This inhibits training on very large datasets!

If not recurrence, then what? How about attention?


  • Attention treats each word's representation as a query to access and incorporate information from a set of values.
  • Maximum interaction distance: O(1), since all words interact at every layer! → attention is independent of the state (time step); a minimal sketch follows below.
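As a quick refresher, here is a minimal NumPy sketch of scaled dot-product attention (my own illustration, not code from the paper; all names are mine):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (n, d) arrays. Every position attends to every other position,
    so there is no sequential dependency along the time axis."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                              # (n, n) pairwise interactions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over key positions
    return weights @ V                                          # weighted sum of values
```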

Transformer as an efficient unit for language modeling


  • The Transformer is built entirely upon self-attention and avoids the use of recurrence.
  • The Transformer architecture was proposed to accelerate model training (through parallelization) and has become the predominant architecture in NLP.

Motivation : Author’s question

  1. Revisiting the architectural question: Is attention all we need for modeling?

  2. If recurrence is not a compute bottleneck, can we find a better architecture?


Motivation : Answer

  1. Previous works have tackled the parallelization/speed problem of RNNs and proposed various fast recurrent networks.
    • QRNN (Quasi-RNN)
    • SRU (Simple Recurrent Units for Highly Parallelizable Recurrence)

    → These advances eliminate the need to avoid recurrence in exchange for training efficiency.

  2. Several recent works have achieved strong results by leveraging recurrence in conjunction with self-attention.
    • SHA-LSTM (Single Headed Attention RNN: Stop Thinking With Your Head)
    • TRANS-BLSTM

    → These results suggest that recurrence and attention are complementary for sequence modeling.


Background : SRU


  • The computation resembles other recurrent networks (e.g., LSTM, GRU).
  • x[t]: input vector, c[t]: hidden state vector, f[t]: forget gate, r[t]: reset gate, h[t]: output state vector
  • The complete architecture decomposes into two sub-components (see the equations below):
    1. Light recurrence, controlled by the forget gate f[t].
    2. A highway network (skip connection), controlled by the reset gate r[t].
  • Initialization: zero mean and 1/d variance, via the uniform distribution $[-\sqrt{3/d}, +\sqrt{3/d}]$
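For reference, the SRU equations as I transcribe them from Lei et al. (2018), with $\odot$ denoting element-wise product (notation may differ slightly from the slide):

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + v_f \odot c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot (W x_t) \\
r_t &= \sigma(W_r x_t + v_r \odot c_{t-1} + b_r) \\
h_t &= r_t \odot c_t + (1 - r_t) \odot x_t
\end{aligned}
$$

The first two lines are the light recurrence and the last two are the highway connection.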

Background: How is SRU parallelizable?

Two important code-level optimizations are performed to enhance the parallelism and speed of SRU.

  1. Given the input sequence, SRU combines the three matrix multiplications across all time steps into a single multiplication (see the sketch after this list).
  2. The second optimization performs all element-wise operations in an efficient way (next slide).
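A minimal NumPy sketch of the first optimization (my own illustration; the weight names and layout are assumptions, not the authors' code):

```python
import numpy as np

L, d = 128, 512                                   # sequence length, hidden size
X = np.random.randn(L, d)                         # input sequence
W, W_f, W_r = (np.random.randn(d, d) for _ in range(3))

# Optimization 1: fuse the three per-timestep projections into one large matrix
# multiplication over the whole sequence, so the GPU sees a single big GEMM.
W_all = np.concatenate([W, W_f, W_r], axis=1)     # (d, 3d)
U = X @ W_all                                     # (L, 3d), all time steps at once
```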

The second optimization performs all element-wise operations in an efficient way.

  • SRU implements all these element-wise product operations as a single CUDA kernel to accelerate computation.
  • Note that each dimension of the hidden vectors is independent once U is computed (the state is still sequential across time steps).
  • The computation can therefore run in parallel across the hidden dimensions.


  • This is the pseudo code for the parallelization in SRU.

(Looking at the pseudo code, it runs a for loop over the sequence length while computing all dimensions at once.)
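A rough Python reconstruction of that pseudo code (my own sketch, not the authors' CUDA kernel; the bias terms and zero initial state are assumptions):

```python
import numpy as np

def sru_elementwise(U, x, v_f, v_r, b_f, b_r):
    """U: (L, 3d) precomputed projections; x: (L, d) inputs for the highway connection."""
    L, d = x.shape
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    c = np.zeros(d)                            # initial hidden state (assumed zero)
    h = np.empty((L, d))
    for t in range(L):                         # sequential over time steps...
        u, u_f, u_r = U[t, :d], U[t, d:2*d], U[t, 2*d:]
        f = sigmoid(u_f + v_f * c + b_f)       # ...but purely element-wise over the d
        r = sigmoid(u_r + v_r * c + b_r)       #    hidden dimensions, which is why a single
        c = f * c + (1.0 - f) * u              #    fused CUDA kernel can parallelize them
        h[t] = r * c + (1.0 - r) * x[t]
    return h
```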


  • (As a loose analogy with batch normalization) the computation is parallelized along the dimension axis.
  • (Normalization itself has nothing to do with parallelization.)

(This has nothing to do with Batch Normalization; I only drew the analogy to express that the operations run in parallel across dimensions.)


Model architecture: SRU++


  • The self-attention block from the Transformer is used (a rough sketch follows after this list):
    • Single-head self-attention (no multi-head attention)
    • Residual connection
    • Layer normalization
    • No positional encoding (because the recurrent structure already encodes order, no positional information needs to be added)
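A rough PyTorch-style sketch of the idea, replacing SRU's linear projection with a small single-head attention block (this is my simplification; the exact parameterization in the paper and the authors' code differs):

```python
import torch
import torch.nn as nn

class SRUppProjection(nn.Module):
    """Sketch: compute the SRU projections U from a single-head attention block
    (d -> d' -> 3d) instead of a plain linear layer. My simplification, not the
    exact formulation from the paper."""
    def __init__(self, d, d_attn):
        super().__init__()
        self.q = nn.Linear(d, d_attn, bias=False)         # project down to attention dim d'
        self.k = nn.Linear(d_attn, d_attn, bias=False)
        self.v = nn.Linear(d_attn, d_attn, bias=False)
        self.norm = nn.LayerNorm(d_attn)
        self.out = nn.Linear(d_attn, 3 * d, bias=False)   # back up to the three SRU projections

    def forward(self, x):                                  # x: (L, B, d)
        q = self.q(x)
        k, v = self.k(q), self.v(q)
        scores = torch.einsum('lbe,mbe->blm', q, k) / (q.size(-1) ** 0.5)
        attn = torch.softmax(scores, dim=-1)               # single head, no positional encoding
        a = torch.einsum('blm,mbe->lbe', attn, v)
        return self.out(self.norm(q + a))                  # residual + layer norm, then feed SRU
```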

Experimental setup

  • Datasets
    • ENWIK8: character-level language modeling dataset, 100M tokens, from Wikipedia.
    • WIKI-103: word-level language modeling dataset, 100M tokens, from Wikipedia.
    • BILLION WORD: the largest language modeling dataset, 768M tokens; sentences are randomly shuffled.
    • IWSLT'14 (De→En): low-resource machine translation dataset, 170K translation pairs.
  • Models
    • Single-head attention, 10 SRU++ layers.
    • d (model dimension) : d' (attention dimension) = 4:1 (e.g., 2048:512)
  • Optimization
    • RAdam with the default betas.
    • Weight decay: 0.1, initial learning rate: 3e-4 (see the optimizer sketch after this list).
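For concreteness, the optimizer setup would look roughly like this in PyTorch (my sketch; `model` here is just a placeholder, and the learning-rate schedule from the paper is omitted):

```python
import torch

model = torch.nn.Linear(2048, 2048)       # placeholder; the real SRU++ model goes here

optimizer = torch.optim.RAdam(
    model.parameters(),
    lr=3e-4,                              # initial learning rate from the paper
    betas=(0.9, 0.999),                   # default betas
    weight_decay=0.1,                     # weight decay from the paper
)
```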

Results

  1. Does recurrence improve upon attention-only models?
  2. How much attention is needed?
  3. Where to use attention?
  4. Does the ratio d:d' matter? (d: model dimension, d': attention dimension)
  5. Results for ENWIK8, WIKI-103, BILLION WORD, IWSLT'14 (De→En)
  6. Inference speed
  7. Why does SRU++ reduce training cost in our experiments?

Result : Does recurrence improve upon attention-only models?


  • We evaluate SRU++ on several language modeling benchmarks such as Enwik8 dataset.
  • Compared to Transformer models such as Transformer-XL (41M parameters, 12 Transformer layers), SRU++ (42M parameters, 10 SRU++ layers) achieves similar results using only a fraction of the resources.
  • Transformer-XL: 4 Nvidia 2080 Ti GPUs, 4 days (360 hours).
  • SRU++: 2 Nvidia 2080 Ti GPUs (and less GPU memory usage).

Result : How much attention is needed?

(It is easiest to understand this as the model using 10 divided by k attention layers.)

  • k = 1: SRU++ model with attention in every layer.
  • k = 10: attention used only in the last layer.
  • Using 50% less attention (k = 2) results in almost no increase in test BPC.
  • Using only a single attention module (k = 10) leads to a marginal loss of 0.01 BPC but reduces the training time by 40% (illustrated in the sketch below).
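In other words, attention is enabled in every k-th layer counting from the bottom, roughly like this (my illustration; the exact placement convention in the code may differ):

```python
num_layers, k = 10, 2

# With k = 2, layers 2, 4, 6, 8, 10 get attention (5 layers); with k = 10,
# only the top layer does; with k = 1, every layer does.
use_attention = [(i + 1) % k == 0 for i in range(num_layers)]
print(use_attention)
```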

Result : Where to use attention?

(The first plot shows the performance when self-attention is used in only one of the 10 layers; the second plot shows the performance as a function of the position of one additional attention layer, with the attention in the 10th layer kept fixed.)

  • Applying attention in the first (bottom) layer gives significantly worse results. → lack of positional information for the attention
  • Moreover, SRU++ consistently achieves worse results as the attention is moved to lower layers closer to the input embedding. In contrast, results are comparable once the attention is placed in a high-enough layer.
  • These observations suggest that the model should first learn local features before attention plays its most effective role at capturing long-range dependencies.

Result : Does the ratio d:d’ matter?


(d’은 attention layer의 dimension을 의미합니다.)

  • A small value of d′ can reduce the amount of computation and the number of parameters used in attention layers but may limit the modeling capacity.
  • Changing this ratio from 4 to a higher value gives better results. The best dev result is obtained with a ratio of 8.

Result : ENWIK8


  • On this dataset as well, SRU++ achieves better performance with less training time.

Result : WIKI-103


  • On this dataset as well, SRU++ achieves better performance with less training time.

Result : BILLION WORD


  • To increase model capacity, they changed the model dimension.
  • They also increased the number of training iterations and changed the learning rate.
  • The Transformer baselines use 32 or 64 V100 GPUs.
  • SRU++ used 8 GPUs.

Result : IWSLT’14(De→En)


  • The base model is an 8-layer Transformer model containing 20M parameters.

Result : Inference speed


  • We use a single V100 GPU for inference.
  • Our large model runs at least 4.5x faster than all baseline models except Shortformer (Press et al., 2021).
  • In addition, our model achieves 0.9-1.1 lower perplexity than Shortformer and runs 50% faster when using 2 attention layers (k = 5).

Result : Why does SRU++ reduce training cost in our experiments

  1. First, combining attention and recurrence gives stronger modeling capacity.
  2. We also observe higher training efficiency, requiring fewer training steps and a smaller training batch size compared to several Transformer models.
  3. Finally, the model implementation is an important factor in saving computation. Our implementation is highly efficient for two reasons.
    • First, the fast recurrence operation of SRU is a reusable module that is already optimized for speed.
    • Second, since recurrence encodes positional information, we can use simple single-head attention and remove positional encoding.
  4. Advanced attention and positional encoding mechanisms can introduce non-trivial computation overhead.

Result : Why does SRU++ reduce training cost in our experiments


  • Figure 5 (a) shows the average model forward time of a single batch. SRU++ runs 4-5x faster than the Transformer-XL implementation.
  • Figure 5 (b) breaks down the computation and highlights the most time-consuming operations in both models.

Conclusion

  • We present a recurrent architecture with optional built-in self-attention that achieves leading model capacity and training efficiency.
  • We demonstrate that highly expressive and efficient models can be derived using a combination of attention and fast recurrence.
  • Our results reaffirm the empirical observations that attention is not all we need, and can be complemented by other sequential modeling modules.