Go To:

Paper Title Paper Authors Table Of Contents Abstract References
Home
Report a problem with this paper

Robust Navigation with Language Pretraining and Stochastic Sampling

Authors

  • Xiujun Li
  • Chunyuan Li
  • Qiaolin Xia
  • Yonatan Bisk
  • A. Çelikyilmaz
  • Jianfeng Gao
  • Noah A. Smith
  • Yejin Choi
  • EMNLP
  • 2019
  • View in Semantic Scholar

Abstract

Core to the vision-and-language navigation (VLN) challenge is building robust instruction representations and action decoding schemes, which can generalize well to previously unseen instructions and environments. In this paper, we report two simple but highly effective methods to address these challenges and lead to a new state-of-the-art performance. First, we adapt large-scale pretrained language models to learn text representations that generalize better to previously unseen instructions. Second, we propose a stochastic sampling scheme to reduce the considerable gap between the expert actions in training and sampled actions in test, so that the agent can learn to correct its own mistakes during long sequential action decoding. Combining the two techniques, we achieve a new state of the art on the Room-to-Room benchmark with 6% absolute gain over the previous best result (47% -> 53%) on the Success Rate weighted by Path Length metric.

1 Introduction

The vision-and-language navigation (VLN) task, learning to navigate in visual environments based on natural language instructions, has attracted interest throughout the artificial intelligence research community (Hemachandra et al., 2015; Anderson et al., 2018; Chen et al., 2019; Savva et al., 2019) . It fosters research on multimodal representations and reinforcement learning, and serves as a test bed for many real-world applications such as in-home robots.

In the recent Room-to-Room (R2R) VLN challenge (Anderson et al., 2018) , most state-of-theart methods are developed based on an encoderdecoder framework (Cho et al., 2014; Sutskever et al., 2014) , where a natural language instruction is represented as a sequence of words, and a navigation trajectory as a sequence of actions, enhanced with attention (Anderson et al., 2018; Wang et al., 2019; Fried et al., 2018; Ma et al., 2019a) . Two important components are shared by all VLN agents: (i) an Instruction Encoder that employs a language model (LM) for instruction understanding; and (ii) an Action Decoder, where an appropriate sequence-level training scheme is required for sequential decision-making. Each component faces its own challenges (see Figure 1) .

Figure 1: Two challenges in VLN.

The first challenge is generalizing grounded natural language instruction understanding from seen to unseen environments. Specifically, in the R2R task, only 69% of bigrams are shared between training and evaluation. 1 Existing work leverages pretrained GloVe embeddings (Pennington et al., 2014) to help generalize. In computer vision, it has been shown that large-scale models pretrained on ImageNet can transfer the knowledge to downstream applications (Yosinski et al., 2014) , thus improving generalization. Comparable language-based transfer learning has not been shown for instruction understanding in VLN.

The second challenge is exposure bias (Ranzato et al., 2016) for the action decoder, due to the discrepancy between training and inference. This problem is common to many tasks where decoding is needed, including text generation, abstractive summarization, and machine translation (Ben- Table 1 : N-grams instruction overlap statistics between validation seen and unseen environments. gio et al., 2015) . Two widely used training strategies are student-forcing and teacher-forcing (described in detail in Section 2.2). It is well-known that the sequence length determines which training strategy is more effective. In the VLN literature, student-forcing has been widely used, as early work (Anderson et al., 2018) used long trajectories (up to 20 steps) with a simple discrete action space. Most recent work, however, has relied on a panoramic action space (Fried et al., 2018) in which most trajectories are only up to seven steps long. In such cases, teacher-forcing is preferable (Tan et al., 2019) . Neither strategy is perfect: teacher-forcing has exposure bias, while studentforcing's random actions can cause an agent to deviate far from the correct path, rendering the original instruction invalid. 2 To tackle these challenges, we have developed two techniques to enable the agent to navigate more efficiently. For the first challenge, we leverage the recent large-scale pretrained language models, BERT (Devlin et al., 2019) and GPT (Radford et al., 2018) , to improve the agent's robustness in unseen environments. We show that large-scale language-only pretraining improves generalization in grounded environments. For the second challenge, we propose a stochastic sampling scheme to balance teacher-forcing and student-forcing during training, so that the agent can recover from its own mistakes at inference time. As a result of combining both techniques, on the R2R benchmark test set, our agent (PRESS) 3 achieves 53% on SPL, an absolute 6% gain over the current state of the art.

Table 1: N-grams instruction overlap statistics between validation seen and unseen environments.

2 Method

In the VLN task, instructions are represented as a set X = {x i } M i=1 of M instructions per trajectory.

2 To compensate, beam search is often used to improve success rates. Recent work, e.g., using search strategies (Ke et al., 2019) or progress monitors (Ma et al., 2019b) , has focused on mitigating the cost of computing top-k rollouts.

3 PRETRAINED LMS AND STOCHASTIC SAMPLING s t < l a t e x i t s h a 1 _ b a s e 6 4 = " y Z 0 1 Y k z d Y a b 6 K 7 w k a 8 O r e o s h B 5 Y = " > A

A A B 8 H i c b V D L S g N B E O y N r x h f U Y 9 e B o P g K e y K o M e g F 4 8 R z E O S E G Y n s 8 m Q m d l l p l c I S 7 7 C i w d F v P o 5 3 v w b J 8 k e N L G g o a j q p r s r T K S w 6 P v f X m F t f W N z q 7 h d 2 t n d 2 z 8 o H x 4 1 b Z w a x h s s l r F p h 9 R y K T R v o E D J 2 4 n h V I W S t 8 L x 7 c x v P X F j R a w f c J L w n q J D L S L B K D r p M e u G E b H T P v b L F b / q z 0 F W S Z C T C u S o 9 8 t f 3 U H M U s U 1 M k m t 7 Q R + g r 2 M G h R M 8 m m p m 1 q e U D a m Q 9 5 x V F P F b S + b H z w l Z 0 4 Z k C g 2 r j S S u f p 7 I q P K 2 o k K X a e i O L L L 3 k z 8 z + u k G F 3 3 M q G T F L l m i 0 V R K g n G Z P Y 9 G Q j D G c q J I 5 Q Z 4 W 4 l b E Q N Z

e g y K r k Q g u W X V 0 n z o h r 4 1 e D + s l K 7 y e M o w g m c w j k E c A U 1 u I M 6 N I C B g m d 4 h T f P e C / e u / e x a C 1 4 + c w x / I H 3 + Q O 2 6 5 B W < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " y Z 0 1 Y k z d Y a b 6 K 7 w k a 8 O r e o s h B 5 Y = " > A

A A B 8 H i c b V D L S g N B E O y N r x h f U Y 9 e B o P g K e y K o M e g F 4 8 R z E O S E G Y n s 8 m Q m d l l p l c I S 7 7 C i w d F v P o 5 3 v w b J 8 k e N L G g o a j q p r s r T K S w 6 P v f X m F t f W N z q 7 h d 2 t n d 2 z 8 o H x 4 1 b Z w a x h s s l r F p h 9 R y K T R v o E D J 2 4 n h V I W S t 8 L x 7 c x v P X F j R a w f c J L w n q J D L S L B K D r p M e u G E b H T P v b L F b / q z 0 F W S Z C T C u S o 9 8 t f 3 U H M U s U 1 M k m t 7 Q R + g r 2 M G h R M 8 m m p m 1 q e U D a m Q 9 5 x V F P F b S + b H z w l Z 0 4 Z k C g 2 r j S S u f p 7 I q P K 2 o k K X a e i O L L L 3 k z 8 z + u k G F 3 3 M q G T F L l m i 0 V R K g n G Z P Y 9 G Q j D G c q J I 5 Q Z 4 W 4 l b E Q N Z

e g y K r k Q g u W X V 0 n z o h r 4 1 e D + s l K 7 y e M o w g m c w j k E c A U 1 u I M 6 N I C B g m d 4 h T f P e C / e u / e x a C 1 4 + c w x / I H 3 + Q O 2 6 5 B W < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " y Z 0 1 Y k z d Y a b 6 K 7 w k a 8 O r e o s h B 5 Y = " > A

A A B 8 H i c b V D L S g N B E O y N r x h f U Y 9 e B o P g K e y K o M e g F 4 8 R z E O S E G Y n s 8 m Q m d l l p l c I S 7 7 C i w d F v P o 5 3 v w b J 8 k e N L G g o a j q p r s r T K S w 6 P v f X m F t f W N z q 7 h d 2 t n d 2 z 8 o H x 4 1 b Z w a x h s s l r F p h 9 R y K T R v o E D J 2 4 n h V I W S t 8 L x 7 c x v P X F j R a w f c J L w n q J D L S L B K D r p M e u G E b H T P v b L F b / q z 0 F W S Z C T C u S o 9 8 t f 3 U H M U s U 1 M k m t 7 Q R + g r 2 M G h R M 8 m m p m 1 q e U D a m Q 9 5 x V F P F b S + b H z w l Z 0 4 Z k C g 2 r j S S u f p 7 I q P K 2 o k K X a e i O L L L 3 k z 8 z + u k G F 3 3 M q G T F L l m i 0 V R K g n G Z P Y 9 G Q j D G c q J I 5 Q Z 4 W 4 l b E Q N Z

e g y K r k Q g u W X V 0 n z o h r 4 1 e D + s l K 7 y e M o w g m c w j k E c A U 1 u I M 6 N I C B g m d 4 h T f P e C / e u / e x a C 1 4 + c w x / I H 3 + Q O 2 6 5 B W < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " y Z 0 1 Y k z d Y a b 6 K 7 w k a 8 O r e o s h B 5 Y = " > A

A A B 8 H i c b V D L S g N B E O y N r x h f U Y 9 e B o P g K e y K o M e g F 4 8 R z E O S E G Y n s 8 m Q m d l l p l c I S 7 7 C i w d F v P o 5 3 v w b J 8 k e N L G g o a j q p r s r T K S w 6 P v f X m F t f W N z q 7 h d 2 t n d 2 z 8 o H x 4 1 b Z w a x h s s l r F p h 9 R y K T R v o E D J 2 4 n h V I W S t 8 L x 7 c x v P X F j R a w f c J L w n q J D L S L B K D r p M e u G E b H T P v b L F b / q z 0 F W S Z C T C u S o 9 8 t f 3 U H M U s U 1 M k m t 7 Q R + g r 2 M G h R M 8 m m p m 1 q e U D a m Q 9 5 x V F P F b S + b H z w l Z 0 4 Z k C g 2 r j S S u f p 7 I q P K 2 o k K X a e i O L L L 3 k z 8 z + u k G F 3 3 M q G T F L l m i 0 V R K g n G Z P Y 9 G Q j D G c q J I 5 Q Z 4 W 4 l b E Q N Z e g y K r k Q g u W X V

v 3 / Y d d s Z 2 c v T X E S P X l h a U U = " > A A A B 8 H i c b V A 9 S w N B E J 2 L X z F + R S 1 t F o N g F e 5 E 0 D J o Y x n B J E o S w t 5 m L 1 m y u 3 f s z g n x y K + w s V D E 1 p 9 j 5 7 9 x k 1 y h i Q 8 G H u / N M D M v T K S w 6 P v f X m F l d W 1 9 o 7

h Z 2 t r e 2 d 0 r 7 x 8 0 b Z w a x h s s l r G 5 D 6 n l U m j e Q I G S 3 y e G U x V K 3 g p H 1 1 O / 9 c i N F b G + w 3 H C u 4 o O t I g E o + i k h 6 w T R u R p 0 s N e u e J X / R n I M g l y U o E c 9 V 7 5 q 9 O P W a q 4 R i a p t e 3 A T 7 C b U Y O C

S T 4 p d V L L E 8 p G d M D b j m q q u O 1 m s 4 M n 5 M Q p f R L F x p V G M l N / T 2 R U W T t W o e t U F I d 2 0 Z u K / 3 n t F K P L b i Z 0 k i L X b L 4 o S i X B m E y / J 3 1 h O E M 5 d o Q y I 9 y t h A 2 p o Q x d R i U X Q r D 4 8 j J p n l U D v x r c n l d q V 3 k c R T i C Y z i F A C 6 g B j d Q h w Y w U P A M r / D m G e

/ F e / c + 5 q 0 F L 5 8 5 h D / w P n 8 A w Z y Q X Q = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " l S 9 + r

v 3 / Y d d s Z 2 c v T X E S P X l h a U U = " > A A A B 8 H i c b V A 9 S w N B E J 2 L X z F + R S 1 t F o N g F e 5 E 0 D J o Y x n B J E o S w t 5 m L 1 m y u 3 f s z g n x y K + w s V D E 1 p 9 j 5 7 9 x k 1 y h i Q 8 G H u / N M D M v T K S w 6 P v f X m F l d W 1 9 o 7

h Z 2 t r e 2 d 0 r 7 x 8 0 b Z w a x h s s l r G 5 D 6 n l U m j e Q I G S 3 y e G U x V K 3 g p H 1 1 O / 9 c i N F b G + w 3 H C u 4 o O t I g E o + i k h 6 w T R u R p 0 s N e u e J X / R n I M g l y U o E c 9 V 7 5 q 9 O P W a q 4 R i a p t e 3 A T 7 C b U Y O C

S T 4 p d V L L E 8 p G d M D b j m q q u O 1 m s 4 M n 5 M Q p f R L F x p V G M l N / T 2 R U W T t W o e t U F I d 2 0 Z u K / 3 n t F K P L b i Z 0 k i L X b L 4 o S i X B m E y / J 3 1 h O E M 5 d o Q y I 9 y t h A 2 p o Q x d R i U X Q r D 4 8 j J p n l U D v x r c n l d q V 3 k c R T i C Y z i F A C 6 g B j d Q h w Y w U P A M r / D m G e

/ F e / c + 5 q 0 F L 5 8 5 h D / w P n 8 A w Z y Q X Q = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " l S 9 + r

v 3 / Y d d s Z 2 c v T X E S P X l h a U U = " > A A A B 8 H i c b V A 9 S w N B E J 2 L X z F + R S 1 t F o N g F e 5 E 0 D J o Y x n B J E o S w t 5 m L 1 m y u 3 f s z g n x y K + w s V D E 1 p 9 j 5 7 9 x k 1 y h i Q 8 G H u / N M D M v T K S w 6 P v f X m F l d W 1 9 o 7

h Z 2 t r e 2 d 0 r 7 x 8 0 b Z w a x h s s l r G 5 D 6 n l U m j e Q I G S 3 y e G U x V K 3 g p H 1 1 O / 9 c i N F b G + w 3 H C u 4 o O t I g E o + i k h 6 w T R u R p 0 s N e u e J X / R n I M g l y U o E c 9 V 7 5 q 9 O P W a q 4 R i a p t e 3 A T 7 C b U Y O C

S T 4 p d V L L E 8 p G d M D b j m q q u O 1 m s 4 M n 5 M Q p f R L F x p V G M l N / T 2 R U W T t W o e t U F I d 2 0 Z u K / 3 n t F K P L b i Z 0 k i L X b L 4 o S i X B m E y / J 3 1 h O E M 5 d o Q y I 9 y t h A 2 p o Q x d R i U X Q r D 4 8 j J p n l U D v x r c n l d q V 3 k c R T i C Y z i F A C 6 g B j d Q h w Y w U P A M r / D m G e

/ F e / c + 5 q 0 F L 5 8 5 h D / w P n 8 A w Z y Q X Q = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " l S 9 + r

v 3 / Y d d s Z 2 c v T X E S P X l h a U U = " > A A A B 8 H i c b V A 9 S w N B E J 2 L X z F + R S 1 t F o N g F e 5 E 0 D J o Y x n B J E o S w t 5 m L 1 m y u 3 f s z g n x y K + w s V D E 1 p 9 j 5 7 9 x k 1 y h i Q 8 G H u / N M D M v T K S w 6 P v f X m F l d W 1 9 o 7

h Z 2 t r e 2 d 0 r 7 x 8 0 b Z w a x h s s l r G 5 D 6 n l U m j e Q I G S 3 y e G U x V K 3 g p H 1 1 O / 9 c i N F b G + w 3 H C u 4 o O t I g E o + i k h 6 w T R u R p 0 s N e u e J X / R n I M g l y U o E c 9 V 7 5 q 9 O P W a q 4 R i a p t e 3 A T 7 C b U Y O C

S T 4 p d V L L E 8 p G d M D b j m q q u O 1 m s 4 M n 5 M Q p f R L F x p V G M l N / T 2 R U W T t W o e t U F I d 2 0 Z u K / 3 n t F K P L b i Z 0 k i L X b L 4 o S i X B m E y / J 3 1 h O E M 5 d o Q y I 9 y t h A 2 p o Q x d R i U X Q r D 4 8 j J p n l U D v x r c n l d q V 3 k c R T i C Y z i F A C 6 g B j d Q h w Y w U P A M r / D m G e / F e / c + 5 q 0 F L 5 8 5 h D / w P n 8 A w Z y Q X Q = = < / l a t e x i t > c i,t

< l a t e x i t s h a 1 _ b a s e 6 4 = " G 2

V W R 8 g i q P 7 7 i Y 8 h y F X p k X Q c t n g = " > A A A B 9 H i c b V D L S g N B E O y N r x h f U Y 9 e B o P g Q c K u C H o M e v E Y w T w g W c L s Z D Y Z M j u 7 z v Q G w r L f 4 c W D I l 7 9 G G / + j Z P H Q R M L G o q q b r q 7 g k Q K g 6 7 7 7 R T W 1 j c 2 t 4 r b p Z 3 d v f 2 D 8 u F R 0 8 S p Z r z B Y h n r d k A N l 0 L x B g q U v J 1 o T q N A 8 l Y w u p v 6 r T H X R s T q E S c J 9 y M 6 U C I U j K K V / K w b h I T l v U x c Y N 4 r V 9 y q O w N Z J d 6 C V G C B e q / 8 1 e 3 H L I 2 4 Q i a p M R 3 P T d D P q E b B J M 9 L 3 d T w h L I R H f C O p Y p G 3 P j Z 7 O i c n F m l T 8 J Y 2 1 J I Z u r v i Y x G x k y i w H Z G F I d m 2 Z u K / 3 m d F M M b P x M q S Z E r N l 8 U p p J g T K Y J k L 7 Q n K G c W E K Z F v Z W w o Z U U 4 Y 2 p 5 I N w V

t + e Z U 0 L 6 u e W / U e r i q 1 2 0 U c R T i B U z g H D 6 6 h B v d Q h w Y w e I J n e I U 3 Z + y 8 O O / O x 7 y 1 4 C x m j u E P n M 8 f n C O R + w = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " G 2

V W R 8 g i q P 7 7 i Y 8 h y F X p k X Q c t n g = " > A A A B 9 H i c b V D L S g N B E O y N r x h f U Y 9 e B o P g Q c K u C H o M e v E Y w T w g W c L s Z D Y Z M j u 7 z v Q G w r L f 4 c W D I l 7 9 G G / + j Z P H Q R M L G o q q b r q 7 g k Q K g 6 7 7 7 R T W 1 j c 2 t 4 r b p Z 3 d v f 2 D 8 u F R 0 8 S p Z r z B Y h n r d k A N l 0 L x B g q U v J 1 o T q N A 8 l Y w u p v 6 r T H X R s T q E S c J 9 y M 6 U C I U j K K V / K w b h I T l v U x c Y N 4 r V 9 y q O w N Z J d 6 C V G C B e q / 8 1 e 3 H L I 2 4 Q i a p M R 3 P T d D P q E b B J M 9 L 3 d T w h L I R H f C O p Y p G 3 P j Z 7 O i c n F m l T 8 J Y 2 1 J I Z u r v i Y x G x k y i w H Z G F I d m 2 Z u K / 3 m d F M M b P x M q S Z E r N l 8 U p p J g T K Y J k L 7 Q n K G c W E K Z F v Z W w o Z U U 4 Y 2 p 5 I N w V

t + e Z U 0 L 6 u e W / U e r i q 1 2 0 U c R T i B U z g H D 6 6 h B v d Q h w Y w e I J n e I U 3 Z + y 8 O O / O x 7 y 1 4 C x m j u E P n M 8 f n C O R + w = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " G 2

V W R 8 g i q P 7 7 i Y 8 h y F X p k X Q c t n g = " > A A A B 9 H i c b V D L S g N B E O y N r x h f U Y 9 e B o P g Q c K u C H o M e v E Y w T w g W c L s Z D Y Z M j u 7 z v Q G w r L f 4 c W D I l 7 9 G G / + j Z P H Q R M L G o q q b r q 7 g k Q K g 6 7 7 7 R T W 1 j c 2 t 4 r b p Z 3 d v f 2 D 8 u F R 0 8 S p Z r z B Y h n r d k A N l 0 L x B g q U v J 1 o T q N A 8 l Y w u p v 6 r T H X R s T q E S c J 9 y M 6 U C I U j K K V / K w b h I T l v U x c Y N 4 r V 9 y q O w N Z J d 6 C V G C B e q / 8 1 e 3 H L I 2 4 Q i a p M R 3 P T d D P q E b B J M 9 L 3 d T w h L I R H f C O p Y p G 3 P j Z 7 O i c n F m l T 8 J Y 2 1 J I Z u r v i Y x G x k y i w H Z G F I d m 2 Z u K / 3 m d F M M b P x M q S Z E r N l 8 U p p J g T K Y J k L 7 Q n K G c W E K Z F v Z W w o Z U U 4 Y 2 p 5 I N w V

t + e Z U 0 L 6 u e W / U e r i q 1 2 0 U c R T i B U z g H D 6 6 h B v d Q h w Y w e I J n e I U 3 Z + y 8 O O / O x 7 y 1 4 C x m j u E P n M 8 f n C O R + w = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " G 2

V W R 8 g i q P 7 7 i Y 8 h y F X p k X Q c t n g = " > A A A B 9 H i c b V D L S g N B E O y N r x h f U Y 9 e B o P g Q c K u C H o M e v E Y w T w g W c L s Z D Y Z M j u 7 z v Q G w r L f 4 c W D I l 7 9 G G / + j Z P H Q R M L G o q q b r q 7 g k Q K g 6 7 7 7 R T W 1 j c 2 t 4 r b p Z 3 d v f 2 D 8 u F R 0 8 S p Z r z B Y h n r d k A N l 0 L x B g q U v J 1 o T q N A 8 l Y w u p v 6 r T H X R s T q E S c J 9 y M 6 U C I U j K K V / K w b h I T l v U x c Y N 4 r V 9 y q O w N Z J d 6 C V G C B e q / 8 1 e 3 H L I 2 4 Q i a p M R 3 P T d D P q E b B J M 9 L 3 d T w h L I R H f C O p Y p G 3 P j Z 7 O i c n F m l T 8 J Y 2 1 J I Z u r v i Y x G x k y i w H Z G F I d m 2 Z u K / 3 m d F M M b P x M q S Z E r N l 8 U p p J g T K Y J k L 7 Q n K G c W E K Z F v Z W w o Z U U 4 Y 2 p 5 I N w V

t + e Z U 0 L 6 u e W / U e r i q 1 2 0 U c R T i B U z g H D 6 6 h B v d Q h w Y w e I J n e I U 3 Z + y 8 O O / O x 7 y 1 4 C x m j u E P n M 8 f n C O R + w = = < / l a t e x i t >

x i < l a t e x i t s h a 1 _ b a s e 6 4 = " R 3 k

V m a u p H w F q D y H V / v Q Y X a W x T L E = " > A A A B 8 H i c b V A 9 S w N B E J 2 L X z F + R S 1 t F o N g F e 5 E 0 D J o Y x n B J E o S w t 5 m L 1 m y u 3 f s z o n h y K + w s V D E 1 p 9 j 5 7 9 x k 1 y h i Q 8 G H u / N M D M v T K S w 6 P v f X m F l d W 1 9 o 7 h Z 2 t r e 2 d 0 r 7 x 8 0 b Z w a x h s s l r G 5 D 6 n l U m j e Q I G S 3 y e G U x V K 3 g p H 1 1 O / 9 c i N F b G + w 3 H C u 4 o O t I g E o + i k h 6 w T R u R p 0 h O 9 c s W v + j O Q Z R L k p A I 5 6 r 3 y V 6 c f s 1 R x j U x S a 9 u B n 2 A 3 o w Y F k 3 x S 6 q S W J 5 S N 6 I C 3 H d V U c d v N Z g d P y I l T + i S K j S u N Z K b + n s i o s n a s Q t e p K A 7 t o j c V / / P a K U a X 3 U z o J E W u 2 X x R l E q C M Z l + T / r C c I Z y 7 A h l R r h b C R t S Q x m 6 j E o u h G D x 5 W X S P K s G f j W 4 P a / U r v I 4 i n A E x 3 A K A V x A D W 6 g D g 1 g o O A Z X u H N M 9 6 L 9 + 5 9 z F s L X j 5 z C H / g f f 4

A r e K Q U A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " R 3 k

V m a u p H w F q D y H V / v Q Y X a W x T L E = " > A A A B 8 H i c b V A 9 S w N B E J 2 L X z F + R S 1 t F o N g F e 5 E 0 D J o Y x n B J E o S w t 5 m L 1 m y u 3 f s z o n h y K + w s V D E 1 p 9 j 5 7 9 x k 1 y h i Q 8 G H u / N M D M v T K S w 6 P v f X m F l d W 1 9 o 7 h Z 2 t r e 2 d 0 r 7 x 8 0 b Z w a x h s s l r G 5 D 6 n l U m j e Q I G S 3 y e G U x V K 3 g p H 1 1 O / 9 c i N F b G + w 3 H C u 4 o O t I g E o + i k h 6 w T R u R p 0 h O 9 c s W v + j O Q Z R L k p A I 5 6 r 3 y V 6 c f s 1 R x j U x S a 9 u B n 2 A 3 o w Y F k 3 x S 6 q S W J 5 S N 6 I C 3 H d V U c d v N Z g d P y I l T + i S K j S u N Z K b + n s i o s n a s Q t e p K A 7 t o j c V / / P a K U a X 3 U z o J E W u 2 X x R l E q C M Z l + T / r C c I Z y 7 A h l R r h b C R t S Q x m 6 j E o u h G D x 5 W X S P K s G f j W 4 P a / U r v I 4 i n A E x 3 A K A V x A D W 6 g D g 1 g o O A Z X u H N M 9 6 L 9 + 5 9 z F s L X j 5 z C H / g f f 4

A r e K Q U A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " R 3 k

V m a u p H w F q D y H V / v Q Y X a W x T L E = " > A A A B 8 H i c b V A 9 S w N B E J 2 L X z F + R S 1 t F o N g F e 5 E 0 D J o Y x n B J E o S w t 5 m L 1 m y u 3 f s z o n h y K + w s V D E 1 p 9 j 5 7 9 x k 1 y h i Q 8 G H u / N M D M v T K S w 6 P v f X m F l d W 1 9 o 7 h Z 2 t r e 2 d 0 r 7 x 8 0 b Z w a x h s s l r G 5 D 6 n l U m j e Q I G S 3 y e G U x V K 3 g p H 1 1 O / 9 c i N F b G + w 3 H C u 4 o O t I g E o + i k h 6 w T R u R p 0 h O 9 c s W v + j O Q Z R L k p A I 5 6 r 3 y V 6 c f s 1 R x j U x S a 9 u B n 2 A 3 o w Y F k 3 x S 6 q S W J 5 S N 6 I C 3 H d V U c d v N Z g d P y I l T + i S K j S u N Z K b + n s i o s n a s Q t e p K A 7 t o j c V / / P a K U a X 3 U z o J E W u 2 X x R l E q C M Z l + T / r C c I Z y 7 A h l R r h b C R t S Q x m 6 j E o u h G D x 5 W X S P K s G f j W 4 P a / U r v I 4 i n A E x 3 A K A V x A D W 6 g D g 1 g o O A Z X u H N M 9 6 L 9 + 5 9 z F s L X j 5 z C H / g f f 4

A r e K Q U A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " R 3 k

V m a u p H w F q D y H V / v Q Y X a W x T L E = " > A A A B 8 H i c b V A 9 S w N B E J 2 L X z F + R S 1 t F o N g F e 5 E 0 D J o Y x n B J E o S w t 5 m L 1 m y u 3 f s z o n h y K + w s V D E 1 p 9 j 5 7 9 x k 1 y h i Q 8 G H u / N M D M v T K S w 6 P v f X m F l d W 1 9 o 7 h Z 2 t r e 2 d 0 r 7 x 8 0 b Z w a x h s s l r G 5 D 6 n l U m j e Q I G S 3 y e G U x V K 3 g p H 1 1 O / 9 c i N F b G + w 3 H C u 4 o O t I g E o + i k h 6 w T R u R p 0 h O 9 c s W v + j O Q Z R L k p A I 5 6 r 3 y V 6 c f s 1 R x j U x S a 9 u B n 2 A 3 o w Y F k 3 x S 6 q S W J 5 S N 6 I C 3 H d V U c d v N Z g d P y I l T + i S K j S u N Z K b + n s i o s n a s Q t e p K A 7 t o j c V / / P a K U a X 3 U z o J E W u 2 X x R l E q C M Z l + T / r C c I Z y 7 A h l R r h b C R t S Q x m 6 j E o u h G D x 5 W X S P K s G f j W 4 P a / U r v I 4 i n A E x 3 A K A V x A D W 6 g D g 1 g o O A Z X u H N M 9 6 L 9 + 5 9 z F s L X j 5 z C H / g f f 4

A r e K Q U A = = < / l a t e x i t > a t < l a t e x i t s h a 1 _ b a s e 6 4 = " O h y 8 P y / < l a t e x i t s h a 1 _ b a s e 6 4 = " L V 4 Y c F O c o j a p 3 B T 6 B S X O a s Q 8 e R 8 = " > A A A B 7 3 i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B a h X k o i g h 6 L X j x W s B / Q h r L Z b N q l m 0 3 c n Q i l 9 E 9 4 8 a C I V / + O N / + N 2 z Y H b X 0 w 8 H h v h p l 5 Q S q F Q d f 9 d g p r 6 x u b W 8 X t 0 s 7 u 3 v 5 B + f C o Z Z J M M 9 5 k i U x 0 J 6 C G S 6 F 4 E w V K 3 k k 1 p 3 E g e T s Y 3 c 7 8 9 h P X R i T q A c c p 9 2 M 6 U C I S j K K

V N T U 4 M V s P L C K X 4 O 6 8 1 I U = " > A A A B 8 H i c b V D L S g N B E O y N r x h f U Y 9 e B o P g K e y K o M e g F 4 8 R z E O S E G Y n s 8 m Q m d l l p l c I S 7 7 C i w d F v P o 5 3 v w b J 8 k e N L G g o a j q p r s r T K S w 6 P v f X m F t f W N z q 7 h d 2 t n d 2 z 8 o H x 4 1 b Z w a x h s s l r F p h 9 R y K T R v o E D J 2 4 n h V I W S t 8 L x 7 c x v P X F j R a w f c J L w n q J D L S L B K D r p M e u G E a H T P v b L F b / q z 0 F W S Z C T C u S o 9 8 t f 3 U H M U s U 1 M k m t 7 Q R + g r 2 M G h R M 8 m m p m 1 q e U D a m Q 9 5 x V F P F b S + b H z w l Z 0 4 Z k C g 2 r j S S u f p 7 I q P K 2 o k K X a e i O L L L 3 k z 8 z + u k G F 3 3 M q G T F L l m i 0 V R K g n G Z P Y 9 G Q j D G c q J I 5 Q Z 4 W 4 l b E Q N Z e g y K r k Q g u W X V

V N T U 4 M V s P L C K X 4 O 6 8 1 I U = " > A A A B 8 H i c b V D L S g N B E O y N r x h f U Y 9 e B o P g K e y K o M e g F 4 8 R z E O S E G Y n s 8 m Q m d l l p l c I S 7 7 C i w d F v P o 5 3 v w b J 8 k e N L G g o a j q p r s r T K S w 6 P v f X m F t f W N z q 7 h d 2 t n d 2 z 8 o H x 4 1 b Z w a x h s s l r F p h 9 R y K T R v o E D J 2 4 n h V I W S t 8 L x 7 c x v P X F j R a w f c J L w n q J D L S L B K D r p M e u G E a H T P v b L F b / q z 0 F W S Z C T C u S o 9 8 t f 3 U H M U s U 1 M k m t 7 Q R + g r 2 M G h R M 8 m m p m 1 q e U D a m Q 9 5 x V F P F b S + b H z w l Z 0 4 Z k C g 2 r j S S u f p 7 I q P K 2 o k K X a e i O L L L 3 k z 8 z + u k G F 3 3 M q G T F L l m i 0 V R K g n G Z P Y 9 G Q j D G c q J I 5 Q Z 4 W 4 l b E Q N Z e g y K r k Q g u W X V

V N T U 4 M V s P L C K X 4 O 6 8 1 I U = " > A A A B 8 H i c b V D L S g N B E O y N r x h f U Y 9 e B o P g K e y K o M e g F 4 8 R z E O S E G Y n s 8 m Q m d l l p l c I S 7 7 C i w d F v P o 5 3 v w b J 8 k e N L G g o a j q p r s r T K S w 6 P v f X m F t f W N z q 7 h d 2 t n d 2 z 8 o H x 4 1 b Z w a x h s s l r F p h 9 R y K T R v o E D J 2 4 n h V I W S t 8 L x 7 c x v P X F j R a w f c J L w n q J D L S L B K D r p M e u G E a H T P v b L F b / q z 0 F W S Z C T C u S o 9 8 t f 3 U H M U s U 1 M k m t 7 Q R + g r 2 M G h R M 8 m m p m 1 q e U D a m Q 9 5 x V F P F b S + b H z w l Z 0 4 Z k C g 2 r j S S u f p 7 I q P K 2 o k K X a e i O L L L 3 k z 8 z + u k G F 3 3 M q G T F L l m i 0 V R K g n G Z P Y 9 G Q j D G c q J I 5 Q Z 4 W 4 l b E Q N Z e g y K r k Q g u W X V

V N T U 4 M V s P L C K X 4 O 6 8 1 I U = " > A A A B 8 H i c b V D L S g N B E O y N r x h f U Y 9 e B o P g K e y K o M e g F 4 8 R z E O S E G Y n s 8 m Q m d l l p l c I S 7 7 C i w d F v P o 5 3 v w b J 8 k e N L G g o a j q p r s r T K S w 6 P v f X m F t f W N z q 7 h d 2 t n d 2 z 8 o H x 4 1 b Z w a x h s s l r F p h 9 R y K T R v o E D J 2 4 n h V I W S t 8 L x 7 c x v P X F j R a w f c J L w n q J D L S L B K D r p M e u G E a H T P v b L F b / q z 0 F W S Z C T C u S o 9 8 t f 3 U H M U s U 1 M k m t 7 Q R + g r 2 M G h R M 8 m m p m 1 q e U D a m Q 9 5 x V F P F b S + b H z w l Z 0 4 Z k C g 2 r j S S u f p 7 I q P K 2 o k K X a e i O L L L 3 k z 8 z + u k G F 3 3 M q G T F L l m i 0 V R K g n G Z P Y 9 G Q j D G c q J I 5 Q Z 4 W 4 l b E Q N Z e g y K r k Q g u W X V

V O o N q j 4 U J n v f L F b f m z k F W i Z e T C u R o 9 M t f v T B h W c w V M k m N 6 X p u i v 6 E a h R M 8 m m p l x m e U j a i A 9 6 1 V N G Y G 3 8 y v 3 d K z q w S k i j R t h S S u f p 7 Y k J j Y 8 Z x Y D t j i k O z 7 M 3 E / 7 x u h t G 1 P x E q z Z A r t l g U Z Z J g Q m b P k 1 B o z l C O L a F M C 3 s r Y U O q K U M b U c m G 4 C 2 / v

E p a F z X P r X n 3 l 5 X 6 T R 5 H E U 7 g F K r g w R X U 4 Q 4 a 0 A Q G E p 7 h F d 6 c R + f F e X c + F q 0 F J 5 8 5 h j 9 w P n 8 A Z J m P i A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " L V 4 Y c F O c o j a p 3 B T 6 B S X O a s Q 8 e R 8 = " > A A A B 7 3 i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B a h X k o i g h 6 L X j x W s B / Q h r L Z b N q l m 0 3 c n Q i l 9 E 9 4 8 a C I V / + O N / + N 2 z Y H b X 0 w 8 H h v h p l 5 Q S q F Q d f 9 d g p r 6 x u b W 8 X t 0 s 7 u 3 v 5 B + f C o Z Z J M M 9 5 k i U x 0 J 6 C G S 6 F 4 E w V K 3 k k 1 p 3 E g e T s Y 3 c 7 8 9 h P X R i T q A c c p 9 2 M 6 U C I S j K K

V O o N q j 4 U J n v f L F b f m z k F W i Z e T C u R o 9 M t f v T B h W c w V M k m N 6 X p u i v 6 E a h R M 8 m m p l x m e U j a i A 9 6 1 V N G Y G 3 8 y v 3 d K z q w S k i j R t h S S u f p 7 Y k J j Y 8 Z x Y D t j i k O z 7 M 3 E / 7 x u h t G 1 P x E q z Z A r t l g U Z Z J g Q m b P k 1 B o z l C O L a F M C 3 s r Y U O q K U M b U c m G 4 C 2 / v

E p a F z X P r X n 3 l 5 X 6 T R 5 H E U 7 g F K r g w R X U 4 Q 4 a 0 A Q G E p 7 h F d 6 c R + f F e X c + F q 0 F J 5 8 5 h j 9 w P n 8 A Z J m P i A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " L V 4 Y c F O c o j a p 3 B T 6 B S X O a s Q 8 e R 8 = " > A A A B 7 3 i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B a h X k o i g h 6 L X j x W s B / Q h r L Z b N q l m 0 3 c n Q i l 9 E 9 4 8 a C I V / + O N / + N 2 z Y H b X 0 w 8 H h v h p l 5 Q S q F Q d f 9 d g p r 6 x u b W 8 X t 0 s 7 u 3 v 5 B + f C o Z Z J M M 9 5 k i U x 0 J 6 C G S 6 F 4 E w V K 3 k k 1 p 3 E g e T s Y 3 c 7 8 9 h P X R i T q A c c p 9 2 M 6 U C I S j K K

V O o N q j 4 U J n v f L F b f m z k F W i Z e T C u R o 9 M t f v T B h W c w V M k m N 6 X p u i v 6 E a h R M 8 m m p l x m e U j a i A 9 6 1 V N G Y G 3 8 y v 3 d K z q w S k i j R t h S S u f p 7 Y k J j Y 8 Z x Y D t j i k O z 7 M 3 E / 7 x u h t G 1 P x E q z Z A r t l g U Z Z J g Q m b P k 1 B o z l C O L a F M C 3 s r Y U O q K U M b U c m G 4 C 2 / v

E p a F z X P r X n 3 l 5 X 6 T R 5 H E U 7 g F K r g w R X U 4 Q 4 a 0 A Q G E p 7 h F d 6 c R + f F e X c + F q 0 F J 5 8 5 h j 9 w P n 8 A Z J m P i A = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " L V 4 Y c F O c o j a p 3 B T 6 B S X O a s Q 8 e R 8 = " > A A A B 7 3 i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B a h X k o i g h 6 L X j x W s B / Q h r L Z b N q l m 0 3 c n Q i l 9 E 9 4 8 a C I V / + O N / + N 2 z Y H b X 0 w 8 H h v h p l 5 Q S q F Q d f 9 d g p r 6 x u b W 8 X t 0 s 7 u 3 v 5 B + f C o Z Z J M M 9 5 k i U x 0 J 6 C G S 6 F 4 E w V K 3 k k 1 p 3 E g e T s Y 3 c 7 8 9 h P X R i T q A c c p 9 2 M 6 U C I S j K K

V O o N q j 4 U J n v f L F b f m z k F W i Z e T C u R o 9 M t f v T B h W c w V M k m N 6 X p u i v 6 E a h R M 8 m m p l x m e U j a i A 9 6 1 V N G Y G 3 8 y v 3 d K z q w S k i j R t h S S u f p 7 Y k J j Y 8 Z x Y D t j i k O z 7 M 3 E / 7 x u h t G 1 P x E q z Z A r t l g U Z Z J g Q m b P k 1 B o z l C O L a F M C 3 s r Y U O q K U M b U c m G 4 C 2 / v

E p a F z X P r X n 3 l 5 X 6 T R 5 H E U 7 g F K r g w R X U 4 Q 4 a 0 A Q G E p 7 h F d 6 c R + f F e X c + F q 0 F J 5 8 5 h j 9 w P n 8 A Z J m P i A = = < / l a t e x i t >

Stochastic Action Sampling Pre5Trained Language Models

Teacher Action a T < l a t e x i t s h a 1 _ b a s e 6 4 = " Y N u N 9 Y R P 1 n

V P 2 f Y v N f i N Q k h U t L 4 = " > A A A B / H i c b V D L S s N A F J 3 4 r P U V 7 d L N Y B F c l U Q E X R b d u K z Q F 7 S x T K a T d u h k E m Z u h B D i r 7 h x o Y h b P 8 S d f + O k z U J b D w w c

z r m X e + b 4 s e A a H O f b W l v f 2 N z a r u x U d / f 2 D w 7 t o + O u j h J F W Y d G I l J 9 n 2 g m u G Q d 4 C B Y P 1 a M h L 5 g P X 9 2 W / i 9 R 6 Y 0 j 2 Q b 0 p h 5 I Z l I H n B K w E g j u z b 0 g 4 z k D 9 k w J D A F y N p 5 P r L r T s O Z A 6 8 S t y R 1 V K I 1 s r + G 4 4 g m I

Z N A B d F 6 4 D o x e B l R w K l g e X W Y a B Y T O i M T N j B U k p B p L 5 u H z / G Z U c Y 4 i J R 5 E v B c / b 2 R k V D r N P T N Z B F R L 3 u F + J 8 3 S C C 4 9 j I u 4 w S Y p I t D Q S I w R L h o A o + 5 Y h R E a g i h i p u s m E 6 J I h R M X 1 V T g r v 8 5 V X S v W i 4 T s O 9 v

6 w 3 b 8 o 6 K u g E n a J z 5 K I r 1 E R 3 q I U 6 i K I U P a N X 9 G Y 9 W S / W u / W x G F 2 z y p 0 a + g P r 8 w e 8 + 5 V 2 < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " Y N u N 9 Y R P 1 n

V P 2 f Y v N f i N Q k h U t L 4 = " > A A A B / H i c b V D L S s N A F J 3 4 r P U V 7 d L N Y B F c l U Q E X R b d u K z Q F 7 S x T K a T d u h k E m Z u h B D i r 7 h x o Y h b P 8 S d f + O k z U J b D w w c

z r m X e + b 4 s e A a H O f b W l v f 2 N z a r u x U d / f 2 D w 7 t o + O u j h J F W Y d G I l J 9 n 2 g m u G Q d 4 C B Y P 1 a M h L 5 g P X 9 2 W / i 9 R 6 Y 0 j 2 Q b 0 p h 5 I Z l I H n B K w E g j u z b 0 g 4 z k D 9 k w J D A F y N p 5 P r L r T s O Z A 6 8 S t y R 1 V K I 1 s r + G 4 4 g m I

Z N A B d F 6 4 D o x e B l R w K l g e X W Y a B Y T O i M T N j B U k p B p L 5 u H z / G Z U c Y 4 i J R 5 E v B c / b 2 R k V D r N P T N Z B F R L 3 u F + J 8 3 S C C 4 9 j I u 4 w S Y p I t D Q S I w R L h o A o + 5 Y h R E a g i h i p u s m E 6 J I h R M X 1 V T g r v 8 5 V X S v W i 4 T s O 9 v

6 w 3 b 8 o 6 K u g E n a J z 5 K I r 1 E R 3 q I U 6 i K I U P a N X 9 G Y 9 W S / W u / W x G F 2 z y p 0 a + g P r 8 w e 8 + 5 V 2 < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " Y N u N 9 Y R P 1 n

V P 2 f Y v N f i N Q k h U t L 4 = " > A A A B / H i c b V D L S s N A F J 3 4 r P U V 7 d L N Y B F c l U Q E X R b d u K z Q F 7 S x T K a T d u h k E m Z u h B D i r 7 h x o Y h b P 8 S d f + O k z U J b D w w c

z r m X e + b 4 s e A a H O f b W l v f 2 N z a r u x U d / f 2 D w 7 t o + O u j h J F W Y d G I l J 9 n 2 g m u G Q d 4 C B Y P 1 a M h L 5 g P X 9 2 W / i 9 R 6 Y 0 j 2 Q b 0 p h 5 I Z l I H n B K w E g j u z b 0 g 4 z k D 9 k w J D A F y N p 5 P r L r T s O Z A 6 8 S t y R 1 V K I 1 s r + G 4 4 g m I

Z N A B d F 6 4 D o x e B l R w K l g e X W Y a B Y T O i M T N j B U k p B p L 5 u H z / G Z U c Y 4 i J R 5 E v B c / b 2 R k V D r N P T N Z B F R L 3 u F + J 8 3 S C C 4 9 j I u 4 w S Y p I t D Q S I w R L h o A o + 5 Y h R E a g i h i p u s m E 6 J I h R M X 1 V T g r v 8 5 V X S v W i 4 T s O 9 v

6 w 3 b 8 o 6 K u g E n a J z 5 K I r 1 E R 3 q I U 6 i K I U P a N X 9 G Y 9 W S / W u / W x G F 2 z y p 0 a + g P r 8 w e 8 + 5 V 2 < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " Y N u N 9 Y R P 1 n

V P 2 f Y v N f i N Q k h U t L 4 = " > A A A B / H i c b V D L S s N A F J 3 4 r P U V 7 d L N Y B F c l U Q E X R b d u K z Q F 7 S x T K a T d u h k E m Z u h B D i r 7 h x o Y h b P 8 S d f + O k z U J b D w w c

z r m X e + b 4 s e A a H O f b W l v f 2 N z a r u x U d / f 2 D w 7 t o + O u j h J F W Y d G I l J 9 n 2 g m u G Q d 4 C B Y P 1 a M h L 5 g P X 9 2 W / i 9 R 6 Y 0 j 2 Q b 0 p h 5 I Z l I H n B K w E g j u z b 0 g 4 z k D 9 k w J D A F y N p 5 P r L r T s O Z A 6 8 S t y R 1 V K I 1 s r + G 4 4 g m I

Z N A B d F 6 4 D o x e B l R w K l g e X W Y a B Y T O i M T N j B U k p B p L 5 u H z / G Z U c Y 4 i J R 5 E v B c / b 2 R k V D r N P T N Z B F R L 3 u F + J 8 3 S C C 4 9 j I u 4 w S Y p I t D Q S I w R L h o A o + 5 Y h R E a g i h i p u s m E 6 J I h R M X 1 V T g r v 8 5 V X S v W i 4 T s O 9 v

6 w 3 b 8 o 6 K u g E n a J z 5 K I r 1 E R 3 q I U 6 i K I U P a N X 9 G Y 9 W S / W u / W x G F 2 z y p 0 a + g P r 8 w e 8 + 5 V 2 < / l a t e x i t > a S < l a t e x i t s h a 1 _ b a s e 6 4 = " 5 N

K q R o Y C u Z B d M 4 Z C a 1 Q a y p R L + r I = " > A A A B / H i c b V D L S s N A F J 3 4 r P U V 7 d L N Y B F c l U Q E X R b d u K x o H 9 D G M p l O 2 q G T S Z i 5 E U K I v + L G h S J u / R B 3 / o 2 T N g t t P T B w O O d e 7 p n j x 4 J r c J x v a 2 V 1 b X 1 j s 7 J V 3 d 7 Z 3 d u 3 D w 4 7 O k o U Z W 0 a i U j 1 f K K Z 4 J K 1 g Y N g v V g x E v q C d f 3 p d e F 3 H 5 n S P J L 3 k M b M C 8 l Y 8 o B T A k Y a 2 r W B H 2 Q k f 8 g G I Y E J Q H a X 5 0 O 7 7 j S c G f A y c U t S R y V a Q / t r M I p o E j I J V B C t + 6 4 T g 5 c R B Z w K l l c H i W Y x o V M y Z n 1 D J Q m Z 9 r J Z + B y f G G W E g 0 i Z J w H P 1 N 8 b G Q m 1 T k P f T B Y R 9 a J X i P 9 5 / Q S C S y / j M k 6 A S T o / F C Q C Q 4 S L J v C I K 0 Z B p I Y Q q r j J i u m E K E L B 9 F U 1 J b i L X 1 4 m n b O G 6 z T c 2 / N 6 8 6 q s o 4 K O 0 D E 6 R S 6 6 Q E 1 0 g 1 q o j S h K 0 T N 6 R W / W k / V i v V s f 8 9 E V

q 9 y p o T + w P n 8 A u 3 W V d Q = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " 5 N

K q R o Y C u Z B d M 4 Z C a 1 Q a y p R L + r I = " > A A A B / H i c b V D L S s N A F J 3 4 r P U V 7 d L N Y B F c l U Q E X R b d u K x o H 9 D G M p l O 2 q G T S Z i 5 E U K I v + L G h S J u / R B 3 / o 2 T N g t t P T B w O O d e 7 p n j x 4 J r c J x v a 2 V 1 b X 1 j s 7 J V 3 d 7 Z 3 d u 3 D w 4 7 O k o U Z W 0 a i U j 1 f K K Z 4 J K 1 g Y N g v V g x E v q C d f 3 p d e F 3 H 5 n S P J L 3 k M b M C 8 l Y 8 o B T A k Y a 2 r W B H 2 Q k f 8 g G I Y E J Q H a X 5 0 O 7 7 j S c G f A y c U t S R y V a Q / t r M I p o E j I J V B C t + 6 4 T g 5 c R B Z w K l l c H i W Y x o V M y Z n 1 D J Q m Z 9 r J Z + B y f G G W E g 0 i Z J w H P 1 N 8 b G Q m 1 T k P f T B Y R 9 a J X i P 9 5 / Q S C S y / j M k 6 A S T o / F C Q C Q 4 S L J v C I K 0 Z B p I Y Q q r j J i u m E K E L B 9 F U 1 J b i L X 1 4 m n b O G 6 z T c 2 / N 6 8 6 q s o 4 K O 0 D E 6 R S 6 6 Q E 1 0 g 1 q o j S h K 0 T N 6 R W / W k / V i v V s f 8 9 E V

q 9 y p o T + w P n 8 A u 3 W V d Q = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " 5 N

K q R o Y C u Z B d M 4 Z C a 1 Q a y p R L + r I = " > A A A B / H i c b V D L S s N A F J 3 4 r P U V 7 d L N Y B F c l U Q E X R b d u K x o H 9 D G M p l O 2 q G T S Z i 5 E U K I v + L G h S J u / R B 3 / o 2 T N g t t P T B w O O d e 7 p n j x 4 J r c J x v a 2 V 1 b X 1 j s 7 J V 3 d 7 Z 3 d u 3 D w 4 7 O k o U Z W 0 a i U j 1 f K K Z 4 J K 1 g Y N g v V g x E v q C d f 3 p d e F 3 H 5 n S P J L 3 k M b M C 8 l Y 8 o B T A k Y a 2 r W B H 2 Q k f 8 g G I Y E J Q H a X 5 0 O 7 7 j S c G f A y c U t S R y V a Q / t r M I p o E j I J V B C t + 6 4 T g 5 c R B Z w K l l c H i W Y x o V M y Z n 1 D J Q m Z 9 r J Z + B y f G G W E g 0 i Z J w H P 1 N 8 b G Q m 1 T k P f T B Y R 9 a J X i P 9 5 / Q S C S y / j M k 6 A S T o / F C Q C Q 4 S L J v C I K 0 Z B p I Y Q q r j J i u m E K E L B 9 F U 1 J b i L X 1 4 m n b O G 6 z T c 2 / N 6 8 6 q s o 4 K O 0 D E 6 R S 6 6 Q E 1 0 g 1 q o j S h K 0 T N 6 R W / W k / V i v V s f 8 9 E V

q 9 y p o T + w P n 8 A u 3 W V d Q = = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " 5 N

K q R o Y C u Z B d M 4 Z C a 1 Q a y p R L + r I = " > A A A B / H i c b V D L S s N A F J 3 4 r P U V 7 d L N Y B F c l U Q E X R b d u K x o H 9 D G M p l O 2 q G T S Z i 5 E U K I v + L G h S J u / R B 3 / o 2 T N g t t P T B w O O d e 7 p n j x 4 J r c J x v a 2 V 1 b X 1 j s 7 J V 3 d 7 Z 3 d u 3 D w 4 7 O k o U Z W 0 a i U j 1 f K K Z 4 J K 1 g Y N g v V g x E v q C d f 3 p d e F 3 H 5 n S P J L 3 k M b M C 8 l Y 8 o B T A k Y a 2 r W B H 2 Q k f 8 g G I Y E J Q H a X 5 0 O 7 7 j S c G f A y c U t S R y V a Q / t r M I p o E j I J V B C t + 6 4 T g 5 c R B Z w K l l c H i W Y x o V M y Z n 1 D J Q m Z 9 r J Z + B y f G G W E g 0 i Z J w H P 1 N 8 b G Q m 1 T k P f T B Y R 9 a J X i P 9 5 / Q S C S y / j M k 6 A S T o / F C Q C Q 4 S L J v C I K 0 Z B p I Y Q q r j J i u m E K E L B 9 F U 1 J b i L X 1 4 m n b O G 6 z T c 2 / N 6 8 6 q s o 4 K O 0 D E 6 R S 6 6 Q E 1 0 g 1 q o j S h K 0 T N 6 R W / W k / V i v V s f 8 9 E V

q 9 y p o T + w P n 8 A u 3 W V d Q = = < / l a t e x i t > Figure 2 : Illustration of proposed methods.

Figure 2: Illustration of proposed methods.

Each instruction x i is a sequence of L i words,

x i = [x i,1 , x i,2 , ..., x i,L i ].

Given X , the goal is to train an agent to navigate from a starting position s 0 to a target position, via completing a T -

step trajectory τ = [s 0 , a 0 , s 1 , a 1 , • • • , s T , a T ],

where s t and a t are the visual state and navigation action, respectively, at step t. The training dataset D E = {τ , X } consists of example pairs of instruction set X and a corresponding expert trajectory τ . Our goal is to learn a policy π θ (τ |X ) that maximizes the log-likelihood of the target trajectory τ given instructions X :

log π θ (τ |X ) = T t=1 log π θ (a t |s t , X ), (1)

where θ are trainable parameters. The policy is usually parameterized as an attention-based seq2seq model, with a language encoder z t = f θ E (x), and an action decoder a t = f θ D (z t , s t ). Successful navigation depends on (i) precisely grounding the instructions X in τ in various environments, and (ii) correctly making the current decision a t based on previous actions/observations

τ

To address these concerns, we propose PRESS, illustrated in Figure 2 .

2.1 Instruction Understanding With Pretrained Language Models

At each step t, the agent decides where to navigate by updating a dynamic understanding of the instructions z t , according to its current visual state s t . Given instruction x, the language encoder proceeds in two steps, end-to-end, by considering a function decomposition f θ E = f θx→e • f θe→z :

• f θx→e : x → e, where x = [x 1 , • • • , x L ]

= {c i,t } M i=1

into a single joint feature

z t = 1 M M i=1 c i,t . 4

Previous methods in VLN learn e either from pretrained word embeddings (Pennington et al., 2014) which do not take into account word context, or from scratch. As a result, their representations do not capture contextual information within each instruction. More importantly, they tend to overfit the training instructions associated with seen environments, limiting their utility in unseen environments. To remedy these issues, we propose to represent e with contextualized word embeddings produced using large-scale pretrained language models, such as BERT and GPT.

Instruction Encoder. The agent's memory vector h t−1 captures the perception and action history and is used to attend to the instruction x.

c i,t = L l=1 α l h e l (2)

where α l = Softmax(h t h e l ), α l places more weight on the word representations that are most relevant to the agent's current status.

Decoder. At each step, the agent takes an action a t , and the environment returns new visual observations; the agent first performs one-hop visual attention f (•) to all the visual image features s t , based on its previous memory vector h t−1 . Then, the agent updates its visual state s t as the weighted sum of the panoramic features, s t = j γ t,j s t,j . The attention weight γ t,j for the j-th visual feature s t,j represents its importance with respect to the previous history context h t−1 , computed as γ t,j = Softmax((W h h t−1 ) W s s t,j ) (Fried et al., 2018) where Softmax(r j ) = exp(r j )/ j exp(r j ), W h and W s are trainable projection matrices.

h t = f θ D ([s t , a t−1 ], h t−1 ) (3)

where a t−1 is the action taken at previous step, and θ D are the LSTM decoder parameters.

Two-stage learning. The parameters of our agent are θ = {θ x→e , θ e→z , θ D }. In practice, we find that the agent overfits quickly, when the full model is naively fine-tuned, with θ x→e initialized by pretrained LMs (e.g., BERT). In this paper, we consider a two-stage learning scheme to facilitate the use of pretrained LMs for VLN. (i) Embedding-based stage: We fix θ x→e , and use BERT or GPT to provide instruction embeddings. Only {θ e→z , θ D } are updated (while tuning on validation). (ii) Fine-tuning stage: We train all model parameters θ with a smaller learning rate, so that θ x→e can adapt to our VLN task.

2.2 Stochastic Action Sampling

A core question is how to learn useful state representations s t in Eq. (1) during the trajectory rollout. In other words, which action should we use to interact with the environment to elicit the next state? As noted, most existing work uses one of two schemes: (i) Teacher-forcing (TF), where the agent takes ground-truth actions a T only. Though TF enables efficient training, it results in "exposure bias" because agents must follow learned rather than gold trajectories at test time. In contrast, (ii) Student-forcing (SF), where an action a S is drawn from the current learned policy, allows the agent to learn from its own actions (aligning training and evaluation), however, it is inefficient, as the agent explores randomly when confused or in the early stages of training. In this work, we consider a stochastic scheme (SS) to alternate between choosing actions from a T and a S for state transition s ← g(a T , a S ), inspired by scheduled sampling (Bengio et al., 2015) . As illustrated in Figure 2 , at each step, the agent "flips a coin" with some probability to decide whether to take the teacher's action a T or a sampled one a S :

EQUATION (4): Not extracted; please refer to original document.

where δ ∼ Bernoulli( ). This allows the agent to leverage the advantages of both TF and SF, yielding a faster and less biased learner. We fix as a constant during learning, which is different from the decaying schedule in (Bengio et al., 2015) .

3 Experiments

3.1 Dataset

We use the Room-to-Room dataset for the VLN task, built upon the Matterport3D dataset (Chang et al., 2017) , which consists of 10,800 panoramic views and 7,189 trajectories. Each trajectory is paired with three natural language instructions. The R2R dataset consists of four splits: train seen, validation seen, validation unseen, and test unseen. There is no overlap between seen and unseen environments. At the beginning of each episode, the agent starts at a specific location, and is given natural instructions, the goal of the agent is to navigate to the target location as quickly as possible.

3.2 Baseline Systems

We compare our approach with eight recently published systems:

• RANDOM: an agent that randomly selects a direction and moves five step in that direction (Anderson et al., 2018 ). • SEQ2SEQ: sequence-to-sequence model proposed by Anderson et al. as a baseline for the R2R benchmark (Anderson et al., 2018) and analyzed in (Thomason et al., 2019) . • RPA (Wang et al., 2018) : is an agent which combines model-free and model-based reinforcement learning, using a look-ahead module for planning. • SPEAKER-FOLLOWER (Fried et al., 2018) :

an agent trained with data augmentation from a speaker model with panoramic actions. • SMNA (Ma et al., 2019a) : an agent trained with a visual-textual co-grounding module and progress monitor on panoramic actions. • RCM+SIL(TRAIN) (Wang et al., 2019) : an agent trained with cross-modal grounding locally and globally via reinforcement learning. • REGRETFUL (Ma et al., 2019b) : an agent with a trained progress monitor heuristic for search that enables backtracking. • FAST (Ke et al., 2019) : an agent which combines global and local knowledge to compare partial trajectories of different lengths, enabling efficient backtrack after a mistake. • ENVDROP (Tan et al., 2019): proposed an environment dropout method, which can generate more environments based on the limited seen environments.

3.3 Evaluation Metrics

We benchmark our agent on the following metrics:

TL Trajectory Length measures the average length of the navigation trajectory. NE Navigation Error is the mean of the shortest path distance in meters between the agent's final location and the target location. SR Success Rate with which the agent's final location is less than 3 meters from the target. SPL Success weighted by Path Length trades-off SR against TL.

SPL is the recommended primary metric, other metrics are considered as auxiliary measures.

3.4 Implementation

We use a LSTM/GPT/BERT for the language encoder, and a second single-layer LSTM for the action decoder (h=1024). We use Adamax and batch sizes of 24/16 for pretraining/finetuning. The learning rates for MLE are 1e −4 , during finetuning BERT the learning rate is 5e −5 . Following (Fried et al., 2018) , we use a panoramic action space and the ResNet image features provided by (Anderson et al., 2018) . The code is publicly available here: https://github.com/xjli/r2r_vln.

Table 2: Comparison of PRESS and seq2seq.
Table 3: Comparison with the state-of-the-art methods. Blue indicates best value overall.

3.5 Results

Robust Generalization. First, we compare PRESS to a baseline seq2seq model 5 in two evaluation settings on the validation splits: (1) S: A single instruction is provided to the agent at a time. Thus, three separate navigation trajectories are generated corresponding to three alternative instructions in this setting. We report the averaged performance over three separate runs. (2) M: All three instructions are provided to the agent at once. The seq2seq baseline does not have an aggregation strategy so we report its performance for the single trajectory with maximum likelihood. For PRESS, we aggregate the instructions via context Table 4 : Ablation results of different language pretrainings and training strategies: Teacher Forcing (TF), Student Forcing (SF) and Stochastic Sampling (SS).

Table 4: Ablation results of different language pretrainings and training strategies: Teacher Forcing (TF), Student Forcing (SF) and Stochastic Sampling (SS).
Figure 3: Comparison between the agent equipped with an LSTM instruction encoder and our PRESS agent on a validation unseen environment (path id: 6632), including top-down trajectory view and step-by-step navigation

(2) The second set in Figure 4 shows how the agents trained with different training strategies performs in an unseen environment. The agents trained with teacher-forcing and student-forcing both fail, while PRESS succeeds.

Figure 4: Comparison among the agents trained with teacher-forcing, student-forcing and stochastic sampling strategies on a validation unseen environment (path id: 7201), including top-down trajectory view and step-by-

4 Conclusion

We present PRESS, a navigation agent based on two previously underexplored techniques in VLN: pretrained language models and stochastic action sampling. Our PRESS demonstrates robust generalization in the unseen environments, leading to a new state-of-the-art performance over many of the much more complex approaches previously proposed. As both the components of PRESS can be easily integrated, future models can consider building upon them as a strong baseline system.

Instruction A: Go up the stairs to the right, turn left and go into the room on the left. Turn left and stop near the mannequins. Instruction B: Walk up the small set of stairs. Once you reach the top, turn 45 degrees to your left. Walk through the door at the bottom of the large staircase. After you are inside, turn left and wait near the statue. Instruction C: Walk up the stairs Through the doorway on the left. Make a left in the room and stop before the two manikins. Figure 3 : Comparison between the agent equipped with an LSTM instruction encoder and our PRESS agent on a validation unseen environment (path id: 6632), including top-down trajectory view and step-by-step navigation views. We indicate the start ( ), target ( ) and failure ( ) of agents in an unseen environment. Figure 4 : Comparison among the agents trained with teacher-forcing, student-forcing and stochastic sampling strategies on a validation unseen environment (path id: 7201), including top-down trajectory view and step-bystep navigation views. We indicate the start ( ), target ( ) and failure ( ) of agents in an unseen environment.

This recovers zt = ct when only a single instruction is available.

The baseline seq2seq agent is the FOLLOWER of SPEAKER-FOLLOWER(Fried et al., 2018).