Random Feature Attention

Abstract

Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA’s efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints.

1 Introduction

Transformer architectures (Vaswani et al., 2017) have achieved tremendous success on a variety of sequence modeling tasks (Radford et al., 2018; Parmar et al., 2018; Devlin et al., 2019; Parisotto et al., 2020, inter alia). Under the hood, the key component is attention (Bahdanau et al., 2015), which models pairwise interactions of the inputs, regardless of their distances from each other. This comes with quadratic time and memory costs, making transformers computationally expensive, especially for long sequences. A large body of research has been devoted to improving their time and memory efficiency (Tay et al., 2020c). Although better asymptotic complexity and prominent gains for long sequences have been achieved (Child et al., 2019; Beltagy et al., 2020, inter alia), in practice many existing approaches are less well-suited to moderate-length sequences: the additional computation steps required by some approaches can overshadow the time and memory they save (Kitaev et al., 2020; Roy et al., 2020, inter alia).

This work proposes random feature attention (RFA), an efficient attention variant that scales linearly in sequence length in terms of time and space, and achieves practical gains for both long and moderate-length sequences. RFA builds on a kernel perspective of softmax (Rawat et al., 2019). Using well-established random feature maps (Rahimi & Recht, 2007; Avron et al., 2016; §2), RFA approximates the dot-then-exponentiate function with a kernel trick (Hofmann et al., 2008): exp(x · y) ≈ φ(x) · φ(y). Inspired by its connections to gated recurrent neural networks (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) and fast weights (Schmidhuber, 1992), we further augment RFA with an optional gating mechanism, offering a straightforward way of learning with recency bias when locality is desired.

RFA and its gated variant ( §3) can be used as a drop-in substitute for the canonical softmax attention, and increase the number of parameters by less than 0.1%. We explore its applications in transformers on language modeling, machine translation, and long text classification ( §4). Our experiments show that RFA achieves comparable performance to vanilla transformer baselines in all tasks, while outperforming a recent related approach (Katharopoulos et al., 2020) . The gating mechanism proves particularly useful in language modeling: the gated variant of RFA outperforms the transformer baseline on WikiText-103. RFA shines in decoding, even for shorter sequences. In our head-to-head comparison on machine translation benchmarks, RFA decodes around 2× faster than a transformer baseline, without accuracy loss. Comparisons to several recent efficient transformer variants on three long text classification datasets show that RFA is competitive in terms of both accuracy and efficiency. Our analysis ( §5) shows that more significant time and memory efficiency improvements can be achieved for longer sequences: 12× decoding speedup with less than 10% of the memory for 2,048-length outputs.

2 Background

2.1 Attention In Sequence Modeling

The attention mechanism (Bahdanau et al., 2015) has been widely used in many sequence modeling tasks. Its dot-product variant is the key building block of the state-of-the-art transformer architectures (Vaswani et al., 2017). Let {q_t}_{t=1}^N denote a sequence of N query vectors that attend to sequences of M key and value vectors. At each timestep, attention linearly combines the values, weighted by the outputs of a softmax:

attn(q_t, {k_i}, {v_i}) = Σ_i (exp(q_t · k_i / τ) / Σ_j exp(q_t · k_j / τ)) v_i.   (1)

τ is the temperature hyperparameter determining how "flat" the softmax is (Hinton et al., 2015). Calculating attention for a single query takes O(M) time and space. For the full sequence of N queries the space amounts to O(MN). When the computation cannot be parallelized across the queries, e.g., in autoregressive decoding, the time complexity is quadratic in the sequence length.
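To make Eq. 1 concrete, the sketch below computes softmax attention for a single query in NumPy. The shapes, the temperature argument, and the max-subtraction for numerical stability are illustrative choices of ours, not prescriptions from the paper.

```python
import numpy as np

def softmax_attn(q, K, V, tau=1.0):
    """Softmax attention (Eq. 1) for one query q against keys K and values V.

    q: (d,), K: (M, d), V: (M, d_v). Returns a (d_v,) vector.
    """
    scores = K @ q / tau                   # q . k_i / tau for every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over the M keys
    return weights @ V                     # convex combination of the values

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
print(softmax_attn(q, K, V).shape)         # (4,)
```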

2.2 Random Feature Methods

The theoretical backbone of this work is the unbiased estimation of the Gaussian kernel by Rahimi & Recht (2007). Based on Bochner's theorem (Bochner, 1955), Rahimi & Recht (2007) proposed random Fourier features to approximate a desired shift-invariant kernel. The method nonlinearly transforms a pair of vectors x and y using a random feature map φ; the inner product between φ(x) and φ(y) approximates the kernel evaluation on x and y. More precisely:

Theorem 1 (Rahimi & Recht, 2007). Let φ: R^d → R^{2D} be a nonlinear transformation:

φ(x) = √(1/D) [sin(w_1 · x), …, sin(w_D · x), cos(w_1 · x), …, cos(w_D · x)]^⊤.   (2)

When the d-dimensional random vectors w_i are independently sampled from N(0, σ²I_d),

E_{w_i}[φ(x) · φ(y)] = exp(−‖x − y‖² / (2σ²)).   (3)

The variance of this estimate is inversely proportional to D (Appendix A.2; Yu et al., 2016).
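The sketch below instantiates the feature map of Eq. 2 and numerically checks the expectation in Eq. 3 for the special case σ = 1, so that w_i ∼ N(0, I_d); the large D is only there to make the Monte Carlo estimate tight.

```python
import numpy as np

def random_fourier_features(x, W):
    """phi(x) from Eq. 2: sin/cos projections of x, scaled by sqrt(1/D)."""
    D = W.shape[0]
    proj = W @ x                                    # w_i . x for i = 1..D
    return np.sqrt(1.0 / D) * np.concatenate([np.sin(proj), np.cos(proj)])

# Check Eq. 3 with sigma = 1: E[phi(x) . phi(y)] = exp(-||x - y||^2 / 2).
rng = np.random.default_rng(0)
d, D = 8, 4096
x, y = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(D, d))                         # w_i ~ N(0, I_d)
estimate = random_fourier_features(x, W) @ random_fourier_features(y, W)
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)
print(float(estimate), float(exact))                # close for large D
```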

Random feature methods have proved successful in speeding up kernel methods (Oliva et al., 2015; Avron et al., 2017; Sun, 2019, inter alia), and have more recently been used to efficiently approximate softmax (Rawat et al., 2019). In §3.1, we use them to derive an unbiased estimate of exp(⟨·, ·⟩) and then an efficient approximation to softmax attention.

3 Model

This section presents RFA (§3.1) and its gated variant (§3.2). In §3.3 we lay out several design choices and relate RFA to prior work. We close with a practical analysis of RFA's complexity (§3.4).

[Figure 1 appears here, with two computation graphs: (a) softmax attention and (b) random feature attention; see the caption below.]

3.1 Random Feature Attention

RFA builds on the unbiased estimate of exp(⟨·, ·⟩) from Theorem 1, which we begin with:

exp(x · y) = exp(‖x‖²/2) exp(‖y‖²/2) exp(−‖x − y‖²/2) ≈ exp(‖x‖²/2) exp(‖y‖²/2) φ(x) · φ(y).   (4)

The last line has no nonlinear interaction between φ(x) and φ(y), allowing for a linear time and space approximation of attention. For clarity, we assume the queries and keys are unit vectors.

attn(q_t, {k_i}, {v_i}) = Σ_i (exp(q_t · k_i / σ²) / Σ_j exp(q_t · k_j / σ²)) v_i ≈ (φ(q_t)^⊤ Σ_i φ(k_i) ⊗ v_i) / (φ(q_t) · Σ_j φ(k_j)) = RFA(q_t, {k_i}, {v_i}).   (5)

⊗ denotes the outer product between vectors, and σ² corresponds to the temperature term τ in Eq. 1.
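A minimal NumPy sketch of Eq. 5 over full sequences follows, with σ folded into the random projections (effectively σ = 1); the function names, shapes, and toy sizes are ours, not part of the paper's implementation.

```python
import numpy as np

def phi(X, W):
    """Row-wise random feature map: (n, d) -> (n, 2D)."""
    proj = X @ W.T
    return np.sqrt(1.0 / W.shape[0]) * np.concatenate(
        [np.sin(proj), np.cos(proj)], axis=-1)

def rfa(Q, K, V, W):
    """Random feature attention (Eq. 5); queries and keys assumed unit-norm."""
    phi_q, phi_k = phi(Q, W), phi(K, W)         # (N, 2D), (M, 2D)
    S = phi_k.T @ V                             # sum_i phi(k_i) (outer) v_i: (2D, d_v)
    z = phi_k.sum(axis=0)                       # sum_j phi(k_j): (2D,)
    return (phi_q @ S) / (phi_q @ z)[:, None]   # (N, d_v)

rng = np.random.default_rng(0)
d, D, N, M = 4, 256, 3, 5
W = rng.normal(size=(D, d))
Q = rng.normal(size=(N, d)); Q /= np.linalg.norm(Q, axis=-1, keepdims=True)
K = rng.normal(size=(M, d)); K /= np.linalg.norm(K, axis=-1, keepdims=True)
V = rng.normal(size=(M, d))
print(rfa(Q, K, V, W).shape)                    # (3, 4)
```

Note that S and z are computed once and shared across all N queries, which is where the linear complexity comes from.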

RFA can be used as a drop-in replacement for softmax attention.

(a) The input is revealed in full to cross attention and encoder self-attention. Here RFA calculates attention using Eq. 5. (b) In causal attention, RFA attends only to the prefix. This allows for a recurrent computation: the tuple (S_t ∈ R^{2D×d}, z_t ∈ R^{2D}) is used as the "hidden state" at time step t to keep track of the history, similar to those in RNNs. Then causal RFA is computed recurrently as

RFA(q_t, {k_i}_{i≤t}, {v_i}_{i≤t}) = (φ(q_t)^⊤ S_t) / (φ(q_t) · z_t),   where S_t = S_{t−1} + φ(k_t) ⊗ v_t and z_t = z_{t−1} + φ(k_t).   (6)

2D denotes the size of φ(·). Appendix A.1 summarizes the computation procedure of RFA, and Figure 1 compares it against softmax attention. Appendix A.3 derives causal RFA in detail.
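As a companion to Eq. 6, here is a sketch of the recurrence in NumPy; the explicit per-step loop highlights the constant-size state update (names and shapes are ours).

```python
import numpy as np

def causal_rfa(Q, K, V, W):
    """Causal RFA (Eq. 6): a left-to-right recurrence over the state (S_t, z_t).

    Q, K: (N, d); V: (N, d_v). Time and memory are linear in N.
    """
    D = W.shape[0]

    def phi(x):
        proj = W @ x
        return np.sqrt(1.0 / D) * np.concatenate([np.sin(proj), np.cos(proj)])

    S = np.zeros((2 * D, V.shape[1]))    # running sum of phi(k_i) (outer) v_i
    z = np.zeros(2 * D)                  # running sum of phi(k_i)
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):
        fk = phi(K[t])
        S = S + np.outer(fk, V[t])       # S_t = S_{t-1} + phi(k_t) (outer) v_t
        z = z + fk                       # z_t = z_{t-1} + phi(k_t)
        fq = phi(Q[t])
        out[t] = (fq @ S) / (fq @ z)     # attend only to the prefix up to t
    return out
```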

Figure 1: Computation graphs for softmax attention (left) and random feature attention (right). Here, we assume cross attention with source length M and target length N .

Analogously to the softmax attention, RFA has its multiheaded variant (Vaswani et al., 2017) . In our experiments we use causal RFA in a transformer language model ( §4.1), and both cross and causal RFA in the decoder of a sequence-to-sequence machine translation model.

3.2 RFA-Gate: Learning With Recency Bias

The canonical softmax attention does not explicitly model distance or locality. In learning problems where such an inductive bias is crucial (Ba et al., 2016; Parmar et al., 2018; Miconi et al., 2018; Li et al., 2019, inter alia), transformers rely heavily on positional encodings. In response, many approaches have been proposed, e.g., learning the attention spans (Sukhbaatar et al., 2019), and enhancing the attention computation with recurrent (Hao et al., 2019) or convolutional (Wu et al., 2019; Mohamed et al., 2019) components.

RFA faces the same issue, but its causal attention variant (Eq. 6) offers a straightforward way of learning with recency bias. We draw inspiration from its connections to RNNs, and augment RFA with a learned gating mechanism (Hochreiter & Schmidhuber, 1997; Cho et al., 2014; Peng et al., 2018 , inter alia):

g_t = sigmoid(w_g · x_t + b_g),   S_t = g_t S_{t−1} + (1 − g_t) φ(k_t) ⊗ v_t,   z_t = g_t z_{t−1} + (1 − g_t) φ(k_t).   (7)

w_g and b_g are learned parameters, and x_t is the input representation at timestep t. By multiplying the learned scalar gates 0 < g_t < 1 against the hidden state (S_t, z_t), history is exponentially decayed, favoring more recent context.
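Below is a sketch of one gated update, assuming the convex-combination form written in Eq. 7; w_g, b_g, and x_t follow the text, while the function names and the standalone sigmoid are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(S_prev, z_prev, phi_k, v, x, w_g, b_g):
    """One step of the gated recurrence (Eq. 7) with a scalar gate g_t."""
    g = sigmoid(w_g @ x + b_g)                        # 0 < g_t < 1, from the input x_t
    S = g * S_prev + (1.0 - g) * np.outer(phi_k, v)   # decayed history + new content
    z = g * z_prev + (1.0 - g) * phi_k
    return S, z
```

A gate close to 1 preserves the history almost unchanged, while a small gate aggressively overwrites it with the current timestep.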

The gating mechanism shows another benefit of RFA: it would otherwise be more difficult to build similar techniques into softmax attention, where there is no clear sense of "recurrence" (Appendix A.5). It proves useful in our language modeling experiments (§4.1).

3.3 Discussion

On query and key norms, and learned random feature variance. Eq. 5 assumes both the queries and keys are of norm 1. It therefore approximates a softmax attention that normalizes the queries and keys before multiplying them, and then scales the logits by dividing them by σ². Empirically, this normalization step scales down the logits (Vaswani et al., 2017) and enforces −1 ≤ q · k ≤ 1.

As a consequence, the softmax outputs would be "flattened" if not for σ, which can be set a priori as a hyperparameter (Yu et al., 2016; Avron et al., 2017; Sun, 2019, inter alia). Here we instead learn it from data with the reparameterization trick (Kingma & Welling, 2014):

w̃_i ∼ N(0, I_d),   w_i = σ ∘ w̃_i.   (8)

I_d is the d × d identity matrix, and ∘ denotes the elementwise product between vectors. The d-dimensional vector σ is learned, but the random vectors w̃_i are not.
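A minimal sketch of the reparameterization in Eq. 8: the randomness lives in the fixed draws w̃_i, and in a real model only σ would receive gradients; the sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 64, 128
sigma = np.ones(d)                    # learned d-dimensional vector (a trainable parameter
                                      # in practice; shown here only at its initialization)
w_tilde = rng.normal(size=(D, d))     # fixed samples from N(0, I_d), never updated
W = sigma[None, :] * w_tilde          # Eq. 8: w_i = sigma o w_tilde_i, elementwise
```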

This norm-1 constraint is not mandatory. Rather, we employ it for notational clarity and easier implementation. In preliminary experiments we find it has little impact on performance when σ is set properly or learned from data. Eq. 12 in Appendix A presents RFA without imposing it.

Going beyond the Gaussian kernel. More broadly, random feature methods can be applied to a family of shift-invariant kernels, the Gaussian kernel being one of them. In the same family, the order-1 arc-cosine kernel (Cho & Saul, 2009) can be approximated with the feature map φ_arccos(x) = √(1/D) [ReLU(w_1 · x), …, ReLU(w_D · x)]^⊤ (Alber et al., 2017). In our experiments, the Gaussian and arc-cosine variants achieve similar performance. This supplements the exploration of alternatives to softmax in attention (Tsai et al., 2019; Gao et al., 2019).
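For reference, a one-function NumPy sketch of the arc-cosine feature map above; unlike the sin/cos map of Eq. 2, it outputs D (rather than 2D) dimensions, and W stacks the random vectors w_i as rows.

```python
import numpy as np

def phi_arccos(x, W):
    """Order-1 arc-cosine feature map: ReLU projections scaled by sqrt(1/D)."""
    D = W.shape[0]
    return np.sqrt(1.0 / D) * np.maximum(W @ x, 0.0)   # ReLU(w_i . x), i = 1..D
```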

Relations to prior work. Katharopoulos et al. (2020) inspire the causal attention variant of RFA. They use a feature map based on the exponential linear unit activation (Clevert et al., 2016): elu(·) + 1. It significantly underperforms both the baseline and RFA in our controlled experiments, showing the importance of a properly chosen feature map. Random feature approximation of attention is also explored by a concurrent work (Choromanski et al., 2020), with applications in masked language modeling for proteins. They propose positive random features to approximate softmax, aiming for a lower variance in critical regions. RFA instead normalizes the queries and keys before random projection to reduce variance. Going beyond both, RFA establishes the benefits of random feature methods as a more universal substitute for softmax across all attention variants, facilitating its applications in, e.g., sequence-to-sequence learning.

There are interesting connections between gated RFA and fast weights (Schmidhuber, 1992; 1993; Ba et al., 2016; Miconi et al., 2018, inter alia). Emphasizing recent patterns, they learn a temporal memory to store history, similarly to Eq. 7. The main difference is that RFA additionally normalizes the output using φ(q_t) · z as in Eq. 6, a by-product of approximating softmax's partition function. It is intriguing to study the role of this normalization term, which we leave to future work.

3.4 Complexity Analysis

Time. Scaling linearly in the sequence length, RFA needs less computation (in terms of the number of operations) for long sequences. This implies a speedup wherever the quadratic-time softmax attention cannot be fully parallelized across time steps. More specifically:

• Significant speedup can be expected in autoregressive decoding, both conditional (e.g., machine translation) and unconditional (e.g., sampling from a language model). For example, a 1.9× speedup is achieved in our machine translation experiments ( §4.2), and more for longer sequences (e.g., 12× for 2,048-length ones; §5). A sketch of the constant-size decoding state is given after this list.
• Some applications (e.g., language modeling, text classification) reveal inputs to the model in full.⁷ When there are enough threads to parallelize softmax attention across time steps, hardly any speedup from RFA can be achieved; when there are not, typically for very long sequences (>1,000), substantial speed gains are possible. For example, RFA does not achieve any speedup when working with 512-length context ( §4.1), but achieves a 5.3× speedup with 4,000-length context ( §4.3).
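The sketch below shows why decoding time is linear: ungated causal RFA (Eq. 6) can be written as a scan whose carried state (S, z) has a fixed size, so each decoding step costs the same regardless of position. The function name and the ε in the denominator are ours; the feature maps are assumed precomputed.

```python
import jax
import jax.numpy as jnp

def rfa_causal_attention(phi_qs, phi_ks, vs, eps=1e-6):
    """Ungated causal RFA over a whole sequence as a linear-time scan.

    phi_qs, phi_ks: (N, 2D) random features of queries and keys
    vs:             (N, d)  values
    """
    two_D, d = phi_ks.shape[1], vs.shape[1]

    def step(carry, inputs):
        S, z = carry
        phi_q, phi_k, v = inputs
        S = S + jnp.outer(phi_k, v)                      # accumulate phi(k_t) outer v_t
        z = z + phi_k                                    # accumulate phi(k_t)
        out = (phi_q @ S) / (jnp.dot(phi_q, z) + eps)    # attention output at step t
        return (S, z), out

    init = (jnp.zeros((two_D, d)), jnp.zeros(two_D))
    _, outs = jax.lax.scan(step, init, (phi_qs, phi_ks, vs))
    return outs  # (N, d)
```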

Memory. Asymptotically, RFA has better memory efficiency than its softmax counterpart (linear vs. quadratic). To reach a more practical conclusion, we include in our analysis the cost of the feature maps. φ's memory overhead largely depends on its size D. For example, consider the cross attention of a decoder. RFA uses O(4D + 2Dd) space to store φ(q_t), Σ_i φ(k_i) ⊗ v_i, and Σ_i φ(k_i) (Eq. 5; line 12 of Algo. 2).⁸ In contrast, softmax cross attention stores the encoder outputs with O(Md) memory, with M being the source length. In this case RFA has a lower memory overhead when 2D ≪ M. Typically D should be no less than d for a reasonable approximation (Yu et al., 2016); in a transformer model, d is the size of an attention head, which is usually around 64 or 128 (Vaswani et al., 2017). This suggests that RFA can achieve significant memory savings with longer sequences, which is supported by our empirical analysis in §5. Further, using moderately sized feature maps is also desirable, so that their overhead does not overshadow the time and memory RFA saves. We experiment with D at d and 2d; the benefit of using D > 2d is marginal.
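A back-of-the-envelope instance of this comparison, using the counts from the paragraph above with illustrative values D = 128, d = 64, and source length M = 2,048:

```python
# Per-head cross-attention state (in floats) for one decoding step.
D, d, M = 128, 64, 2048
rfa_floats = 4 * D + 2 * D * d     # phi(q_t), sum_i phi(k_i) outer v_i, sum_i phi(k_i)
softmax_floats = M * d             # cached encoder outputs attended to at every step
print(rfa_floats, softmax_floats)  # 16896 vs. 131072; RFA wins whenever 2D << M
```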

To ensure fair comparisons, we use comparable implementations, tuning, and training procedures. All models use a 512 block size during both training and evaluation, i.e., they read as input a segment of 512 consecutive tokens, without access to the context from previous mini-batches. RFA variants use 64-dimensional random feature maps. We experiment with two model size settings, small (around 38M parameters) and big (around 242M parameters); they are described in Appendix B.1 along with other implementation details.

Results. Table 1 compares the models' performance in perplexity on the WikiText-103 development and test data. Both kernel variants of RFA, without gating, outperform φ_elu by more than 2.4 and 2.1 test perplexity for the small and big model respectively, confirming the benefits of the random feature approximation.¹⁰ Yet both underperform BASE, with RFA-Gaussian having a smaller gap.

Table 1: Language model perplexity (lower is better) on the WikiText-103 development and test sets. Bolded numbers outperform BASE.


Comparing RFA against its gated variants, a more than 1.8 perplexity improvement can be attributed to the gating mechanism, and the gap is larger for the small models. Notably, RFA-GATE-Gaussian outperforms BASE under both size settings by at least 1.2 perplexity. In general, RFA models with Gaussian feature maps outperform their arc-cosine counterparts.¹¹ From the analysis in §3.4 we would not expect a speedup from RFA models, nor do we see any in the experiments.¹²

Closing this section, we explore a "stateful" variant of RFA-GATE-Gaussian. It passes the last hidden state (S_t, z_t) to the next mini-batch during both training and evaluation, a technique commonly used in RNN language models (Merity et al., 2018). This is a consequence of RFA's RNN-style computation and is less straightforward to apply in vanilla transformer models.¹³ The last row of Table 1 shows that this brings a more than 1.5 test perplexity improvement.

4.2 Machine Translation

Datasets. We experiment with three standard machine translation datasets.

Table 2: Machine translation test set BLEU. The decoding speed (last column) is relative to BASE. All models are tested on a single TPU v2 accelerator, with batch size 32.
Table 3: Accuracy (higher is better) of different models on LO, IMDb, and AAN, along with their speed (higher is better) and peak memory consumption (lower is better) varying sequence lengths (1–4K). Speed and memory are evaluated on the IMDb dataset and relative to the transformer’s. Bold font indicates the best performance in each column, and underlined numbers outperform the transformer in accuracy. Transformer’s and previous works’ numbers are due to Tay et al. (2021).
Table 4: Time and space complexity comparisons between RFA and its softmax counterpart in a sequence-to-sequence attentive model, assuming an infinite amount of available threads. M and N denote the lengths of the source and target sequences respectively. Teacher forcing training (Williams & Zipser, 1989) and autoregressive decoding are assumed. Blue color indicates the cases where RFA asymptotically outperforms softmax attention.

• WMT14 EN-DE and EN-FR (Bojar et al., 2014). Our data split and preprocessing follow those of Vaswani et al. (2017). We share the source and target vocabularies within each language pair, with 32,768 byte pair encoding types (BPE; Sennrich et al., 2016).
• IWSLT14 DE-EN (Cettolo et al., 2014) is based on TED talks. The preprocessing follows prior work. Separate vocabularies of 9K/7K BPE types are used for the source and target.

Table 5 in Appendix B summarizes some statistics of the datasets.

Table 5: Some statistics for the datasets. WikiText-103 split sizes are in number of tokens, while others are in number of instances.

10 All models are trained for 150K steps; this could be part of the reason behind the suboptimal performance of φ_elu: it may need 3 times more gradient updates to reach performance similar to the softmax attention baseline (Katharopoulos et al., 2020).

11 We observe that the RFA Gaussian variants are more stable and easier to train than the arc-cosine ones as well as φ_elu. We conjecture that this is because the outputs of the Gaussian feature maps have an ℓ2-norm of 1, which can help stabilize training. To see why, note that sin²(x) + cos²(x) = cos(x − x) = 1.
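Spelling out footnote 11's claim, and assuming the sin/cos feature map with the √(1/D) normalizer used above, the derivation is one line:

$$
\|\phi(\mathbf{x})\|_2^2 \;=\; \frac{1}{D}\sum_{i=1}^{D}\left(\sin^2(\mathbf{w}_i \cdot \mathbf{x}) + \cos^2(\mathbf{w}_i \cdot \mathbf{x})\right) \;=\; \frac{1}{D}\cdot D \;=\; 1.
$$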

12 In fact, RFA trains around 15% slower than BASE due to the additional overhead from the feature maps.

13 Some transformer models use a text segment from the previous mini-batch as a prefix (Dai et al., 2019). Unlike RFA, this gives the model access to only a limited amount of context, and significantly increases the memory overhead.

Setting. We compare the RFA variants described in §4.1. They build on a BASE model that is our implementation of the base-sized transformer (Vaswani et al., 2017). All RFA models apply random feature attention in the decoder's cross and causal attention, but use softmax attention in the encoder. This setting yields the greatest decoding time and memory savings ( §3.4). We use D = 128/64 in cross/causal attention. RFA-GATE learns sigmoid gates in the decoder causal attention. The φ_elu baseline uses the same setting and applies its feature map in both decoder cross and causal attention, but not in the encoder. Further details are described in Appendix B.2.


Results. Table 2 compares the models' test set BLEU on three machine translation datasets. Overall, both the Gaussian and arc-cosine variants of RFA achieve performance similar to BASE on all three datasets, significantly outperforming Katharopoulos et al. (2020). Differently from the trends in the language modeling experiments, here the gating mechanism does not lead to substantial gains. Notably, all RFA variants decode more than 1.8× faster than BASE.

4.3 Long Text Classification

We further evaluate RFA's accuracy and efficiency when used as a text encoder on three NLP tasks from the recently proposed Long Range Arena benchmark (Tay et al., 2021), designed to evaluate efficient transformer variants on tasks that require processing long sequences.¹⁴

Experimental setting and datasets. We compare RFA against baselines on the following datasets:

• ListOps (LO; Nangia & Bowman, 2018) aims to diagnose the capability of modeling hierarchically structured data. Given a sequence of operations on single-digit integers, the model predicts the solution, also a single-digit integer. It is formulated as 10-way classification. We follow Tay et al. (2021) and consider sequences with 500-2,000 symbols.
• Character-level text classification with the IMDb movie review dataset (Maas et al., 2011). This is a binary sentiment classification task.
• Character-level document retrieval with the ACL Anthology Network (AAN; Radev et al., 2009) dataset. The model classifies whether there is a citation between a pair of papers.

To ensure fair comparisons, we implement RFA on top of the transformer baseline by Tay et al. (2021) , and closely follow their preprocessing, data split, model size, and training procedure. Speed and memory are evaluated on the IMDb dataset. For our RFA model, we use D = 64 for the IMDb dataset, and D = 128 for others. We refer the readers to Tay et al. (2021) for further details.

Results. From Table 3 we can see that RFA outperforms the transformer baseline on two out of the three datasets, achieving the best performance on IMDb with 66% accuracy. Averaging across three datasets, RFA outperforms the transformer by 0.3% accuracy, second only to Zaheer et al. (2020) with a 0.1% accuracy gap. In terms of time and memory efficiency, RFA is among the strongest. RFA speeds up over the transformer by 1.1-5.3×, varying by sequence length. Importantly, compared to the only two baselines that perform comparably to the baseline transformer model (Tay et al., 2020a; Zaheer et al., 2020) , RFA has a clear advantage in both speed and memory efficiency, and is the only model that is competitive in both accuracy and efficiency.

5 Analysis

Decoding time and memory varying by sequence length. §3.4 shows that RFA can potentially achieve more significant speedup and memory saving for longer sequences, which we now explore.

We use a simulated conditional generation experiment to compare RFA's sequence-to-sequence decoding speed and memory overhead against the baseline's. Here we assume the input and output sequences are of the same length. The compared models are of the same size as those described in §4.2, with 6-layer encoders and decoders. Other hyperparameters are summarized in Appendix B.2. All models are tested using greedy decoding with the same batch size of 16, on a TPU v2 accelerator. From Figures 2 (a) and (b) we observe clear trends. Varying the lengths, both RFA variants achieve consistent decoding speed with nearly constant memory overhead. In contrast, the baseline decodes slower for longer sequences, taking an increasing amount of memory. Notably, for 2,048-length sequences, RFA decodes around 12× faster than the baseline while using less than 10% of the memory. RFA-arccos slightly outperforms RFA-Gaussian in terms of speed and memory efficiency, because with the same D (as used here) φ_arccos is half the size of φ_Gaussian. These results suggest that RFA can be particularly useful in sequence-to-sequence tasks with longer sequences, e.g., document-level machine translation (Miculicich et al., 2018).

Notes on decoding speed. With a lower memory overhead, RFA can use a larger batch size than the baseline. As noted by Katharopoulos et al. (2020) and Kasai et al. (2021), if we had used mini-batches as large as the hardware allows, RFA could have achieved a more significant speed gain. Nonetheless, we control for batch size even though it is not the most favorable setting for RFA, since the conclusion translates better to common applications where one generates a single sequence at a time (e.g., instantaneous machine translation). For the softmax attention baseline, we cache previously computed query/key/value representations, which significantly improves its decoding speed (over not caching).

Further analysis results. RFA achieves comparable performance to softmax attention. Appendix C.3 empirically shows that this cannot be attributed to RFA learning a good approximation to softmax: when we train with one attention but evaluate with the other, the performance is hardly better than that of randomly-initialized, untrained models. Yet an RFA model initialized from a pretrained softmax transformer achieves decent training loss after a moderate number of finetuning steps (Appendix C.4). This suggests some potential applications, e.g., transferring knowledge from a pretrained transformer (e.g., GPT-3; Brown et al., 2020) to an RFA model that is more efficient to sample from.

6 Related Work

One motivation common to the following studies, shared by this work and the research discussed above, is to scale transformers to long sequences. Note that there are plenty of orthogonal choices for improving efficiency, such as weight sharing (Dehghani et al., 2019), quantization (Shen et al., 2020), knowledge distillation (Sanh et al., 2020), and adapters (Houlsby et al., 2019). For a detailed overview we refer the reader to Tay et al. (2020c).

Sparse attention patterns. The idea behind these methods is to limit the receptive field of the attention computation. It motivated earlier attempts at improving attention's efficiency and still receives considerable interest. The sparse patterns can be set a priori (Liu et al., 2018; Qiu et al., 2020; Ho et al., 2020; You et al., 2020, inter alia) or learned from data (Sukhbaatar et al., 2019; Roy et al., 2020, inter alia). For most of these approaches, it is yet to be empirically verified that they are suitable for large-scale sequence-to-sequence learning; few of them have recorded decoding speed benefits.

Compressed context. Another family of methods compresses the context along the timesteps so that the effective sequence length for attention computation is reduced. A related line of work stores past context in a memory module of limited size (Rae et al., 2020, inter alia), so that accessing longer history only moderately increases the overhead. Reminiscent of RNN language models, RFA attends beyond a fixed context window through a stateful computation, without increasing the time or memory overhead.

7 Conclusion

We presented random feature attention (RFA). It views softmax attention through the lens of kernel methods and approximates it with random feature methods. With an optional gating mechanism, RFA provides a straightforward way of learning with recency bias. RFA's time and space complexity is linear in the sequence length. We use RFA as a drop-in substitute for softmax attention in transformer models. On language modeling, machine translation, and long text classification benchmarks, RFA achieves comparable or better performance than strong baselines. In the machine translation experiment, RFA decodes twice as fast. Further time and memory efficiency improvements can be achieved for longer sequences.

During training, we sample a different random projection matrix for each attention head. Preliminary experiments suggest this performs better than using the same random projection throughout training (Table 6). Our conjecture is that this helps keep the attention heads from "over-committing" to any particular random projection (Peng et al., 2020). To avoid the overhead of sampling from a Gaussian during training, we do this offline: before training we construct a pool of random matrices (typically 200), and at each training step we draw from the pool. At test time each attention head uses the same random projection, since no accuracy benefit is observed from using different ones for different test instances.
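A minimal sketch of this offline projection pool, with function names and default sizes that are ours rather than the paper's:

```python
import jax

def build_projection_pool(key, pool_size=200, D=64, d=64):
    """Sample the pool of random matrices once, before training starts."""
    return jax.random.normal(key, (pool_size, D, d))

def draw_projection(pool, step_key):
    """At each training step, pick one matrix from the pool for an attention head."""
    idx = jax.random.randint(step_key, (), 0, pool.shape[0])
    return pool[idx]
```

Because the pool is fixed ahead of time, no Gaussian sampling happens inside the training loop; the per-step cost is a single index into a preallocated array.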

Table 6: WMT14 EN-DE development set performance varying the number of random matrices to sample from during training. No beam search or checkpoint averaging is used.

             Time                          Space
             Encode   Cross    Causal      Encode   Cross    Causal
Training
  softmax    O(M)     O(M)     O(N)        O(M²)    O(MN)    O(N²)
  RFA        O(M)     O(M)     O(N)        O(M)     O(M+N)   O(N)
Decoding
  softmax    O(M)     O(MN)    O(N²)       O(M²)    O(MN)    O(N²)
  RFA        O(M)     O(M+N)   O(N)        O(M)     O(M+N)   O(N)

Figure 2: Conditional decoding speed (left) and memory overhead (right) varying the output lengths. All models are tested on a single TPU v2 accelerator, with greedy decoding and batch size 16.

B.1 Language Modeling

We compare the models using two model size settings, summarized in Table 7. We use the fixed sinusoidal position embeddings of Vaswani et al. (2017). All models are trained for up to 150K gradient steps using the Adam optimizer (Kingma & Ba, 2015). No ℓ2-regularization is used. We apply early stopping based on development set perplexity. All models are trained using 16 TPU v3 accelerators, and tested using a single TPU v2 accelerator.

B.2 Machine Translation

We use the fixed sinusoidal position embeddings of Vaswani et al. (2017). For both the EN-DE and EN-FR experiments, we train the models using the Adam optimizer (β₁ = 0.1, β₂ = 0.98, ε = 10⁻⁹) for up to 350K gradient steps. We use a batch size of 1,024 instances for EN-DE and 4,096 for the much larger EN-FR dataset. The learning rate schedule follows that of Vaswani et al. (2017). Early stopping is applied based on development set BLEU. No ℓ2-regularization or gradient clipping is used. All models are trained using 16 TPU v3 accelerators, and tested using a single TPU v2 accelerator. Following standard practice, we average the 10 most recent checkpoints at test time. We evaluate the models using SacreBLEU (Post, 2018).¹⁶ A beam search with beam size 4 and length penalty 0.6 is used. Other hyperparameters are summarized in Table 8.

Figure 3 compares RFA's unconditional decoding speed and memory against the softmax attention baseline. The setting is the same as that in §5, except that here the models do not have an encoder. This experiment aims to simulate applications such as sampling from a language model.
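For completeness, the learning rate schedule referenced above is the inverse-square-root schedule of Vaswani et al. (2017); the warmup value in this sketch is illustrative and not taken from this paper:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Inverse-square-root schedule of Vaswani et al. (2017)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```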

Figure 3: Unconditional decoding speed (left) and memory overhead (right) varying the output lengths. All models are tested on a single TPU v2 accelerator, with greedy decoding and batch size 16.
Table 7: Hyperparameters used in the language modeling experiments.
Table 8: Hyperparameters used in the machine translation experiments.

C.2 Effect Of Random Feature Size

This section studies how the size of φ(·) affects performance. We experiment with RFA-Gaussian on WMT14 EN-DE (Table 9). When the size of φ(·) is too small (32 or 64 for cross attention, 32 for causal attention), training does not converge. We observe accuracy improvements from using sufficiently large random feature sizes (256 for cross attention and 128 for causal attention); going beyond that, the benefit is marginal.

C.3 Train And Evaluate With Different Attention Functions

RFA achieves comparable performance to its softmax counterpart. Does this imply that it learns a good approximation to the softmax attention? To answer this question, we consider:

(i) an RFA-Gaussian model initialized from a pretrained softmax-transformer; (ii) a softmax-transformer initialized from a pretrained RFA-Gaussian model.

If RFA's good performance could be attributed to learning a good approximation to softmax, both, without finetuning, should perform similarly to the pretrained models. However, this is not the case on IWSLT14 DE-EN. Both pretrained models achieve more than 35.2 development set BLEU. In contrast, (i) and (ii) respectively reach 2.3 and 1.1 BLEU without finetuning, hardly beating a randomly-initialized untrained model. This result aligns with the observation of Choromanski et al. (2020), and suggests that RFA does not perform well simply because it learns to imitate softmax attention's outputs.

C.4 Knowledge Transfer From Softmax Attention To Rfa

We first supplement the observation in Appendix C.3 by finetuning (i) on the same pretraining data. Figure 4 plots the learning curves. It takes RFA roughly 1,500 steps to reach a training loss similar to the pretrained model's. As a baseline, "RFA Reset" resets the multihead attention parameters (i.e., those for the query, key, value, and output projections) to randomly initialized ones. Its learning curve is similar to that of (i), suggesting that the pretrained multihead attention parameters are no more useful to RFA than randomly initialized ones. To further confirm this observation, "softmax Reset"

Figure 4: Finetuning an RFA-Gaussian model with its parameters initialized from a pretrained softmax-transformer. "Reset" indicates resetting the multihead attention parameters to randomly-initialized ones. The dashed line indicates the training loss of the pretrained model.

M = N in self-attention; they may differ, e.g., in the cross attention of a sequence-to-sequence model.

This can be achieved by ℓ2-normalizing the queries and keys. See §3.3 for a related discussion.

3 It is also sometimes called "decoder self-attention" or "autoregressive attention."

4 In multihead attention (Vaswani et al., 2017), k_t and v_t are calculated from x_t using learned affine transformations.

5 This departs from Eq. 2 by lifting the isotropic assumption imposed on the Gaussian distribution: note the difference between the vector σ in Eq. 8 and the scalar σ in Eq. 3. We find this improves performance in practice ( §4), even though the same result in Theorem 1 may not directly apply.

6 Apart from replacing the sinusoid functions with ReLU, it constructs w_i in the same way as Eq. 8.

7 A causal mask is usually used to prevent the model from accessing future tokens in language models.

8 RFA never constructs the M × 2D × d tensor [φ(k_i) ⊗ v_i]_i, but sequentially processes the sequence.

9 This gating technique is specific to RFA variants, in the sense that it is less intuitive to apply it in BASE.

https://github.com/google-research/long-range-arena

https://github.com/google/jax.

https://github.com/mjpost/sacrebleu

Table 9: WMT14 EN-DE development set performance of RFA-Gaussian (the size of φ is 2D; §2.2) varying the random feature sizes. N/A indicates training does not converge. No beam search or checkpoint averaging is used.