PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World

Authors

  • Rowan Zellers
  • Ari Holtzman
  • Matthew E. Peters
  • Roozbeh Mottaghi
  • Aniruddha Kembhavi
  • Ali Farhadi
  • Yejin Choi
  • ACL/IJCNLP
  • 2021

Abstract

We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model, and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don’t. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence, simulate neurally what might happen next, and then communicate that result through a literal symbolic representation, or natural language. Experimental results show that our model effectively learns world dynamics, along with how to communicate them. It is able to correctly forecast “what happens next” given an English sentence over 80% of the time, outperforming a 100x larger, text-to-text approach by over 10%. Likewise, its natural language summaries of physical interactions are also judged by humans as more accurate than LM alternatives. We present comprehensive analysis showing room for future work.

1 Introduction

As humans, our use of language is linked to the physical world. To process a sentence like "the robot turns on the stove, with a pan on it" (Figure 1), we might imagine a physical Pan object. This meaning representation in our heads can be seen as part of our commonsense world knowledge about what a Pan is and does. We might reasonably predict that the Pan will become Hot, and that if there is an Egg on it, it would become Cooked.

Figure 1: PIGLeT. Through physical interaction in a 3D world, we learn a model for what actions do to objects. We use our physical model as an interface for a language model, jointly modeling elements of language form and meaning. Given an action expressed symbolically or in English, PIGLeT can simulate what might happen next, expressing it symbolically or in English.

As humans, we learn such a commonsense world model through interaction. Young children learn to reason physically about basic objects by manipulating them: observing the properties they have, and how they change if an action is applied to them (Smith and Gasser, 2005). This process is hypothesized to be crucial to how children learn language: the names of these elementary objects become their first "real words," upon which other language is scaffolded (Yu and Smith, 2012).

In contrast, the dominant paradigm today is to train large language or vision models on static data, such as language and photos from the web. Yet such a setting is fundamentally limiting, as suggested empirically by psychologists' failed attempts to get kittens to learn passively (Held and Hein, 1963). More recently, though large Transformers have made initial progress on benchmarks, they have also frequently revealed biases in those same datasets, suggesting they might not be solving the underlying tasks (Zellers et al., 2019b). This has also been argued philosophically in a flurry of recent work.

Figure 2: PIGPeN, a setting for few-shot language-world grounding. We collect data for 280k physical interactions in THOR, a 3D simulator with 20 actions and 125 object types, each with 42 attributes (e.g. isBroken). We annotate 2k interactions with English sentences describing the initial world state, the action, and the action result.

2 PIGPeN: A Resource For Neuro-Symbolic Language Grounding

We introduce PIGPeN as a setting for learning and evaluating physically grounded language understanding. An overview is shown in Figure 2. The idea is that an agent gets access to an interactive 3D environment, where it can learn about the world through interaction: for example, that objects such as a Vase can become Broken if thrown. The goal for a model is to learn natural language meaning grounded in these interactions.

Task definition. Through interaction, an agent observes the interplay between objects o ∈ O (represented by their attributes) and actions a ∈ A through the following transition:

~o, a → ~o′   (1)

Actions change the state of a subset of objects: turning on a Faucet affects a nearby Sink , but it will not change a Mirror on the wall.

To encourage learning from interaction, and not just language, an agent is given a small number of natural language annotations of transitions. We denote these sentences as s_o, describing the state pre-action; s_a, the action; and s_o′, the state post-action. During evaluation, an agent will sometimes encounter new objects o that were not part of the paired training data.
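To make the setup concrete, the sketch below shows one way a single annotated transition could be represented; the class and field names are illustrative assumptions, not the dataset's released schema.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical container for one PIGPeN example (field names are our own).
@dataclass
class Transition:
    objects_pre: List[Dict[str, str]]    # up to two key objects o, each a dict over 42 attributes
    action: Dict[str, str]               # action name plus up to two object arguments
    objects_post: List[Dict[str, str]]   # resulting object states o'
    s_pre: str = ""                      # English description of the pre-action state (s_o)
    s_action: str = ""                   # English description of the action (s_a)
    s_post: str = ""                     # English description of the post-action state (s_o')

example = Transition(
    objects_pre=[{"name": "Vase", "isBroken": "False"}],
    action={"name": "ThrowObject", "arg1": "Vase"},
    objects_post=[{"name": "Vase", "isBroken": "True"}],
    s_action="The robot throws the vase.",
)
```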

We evaluate the model's transfer in two ways:

a. PIGPeN-NLU. A model is given object states o, and an English sentence s_a describing an action. It must predict the grounded object states o′ that result after the action is taken.

b. PIGPeN-NLG. A model is given object states o and a literal action a. It must generate a sentence s_o′ describing the state post-action.

We next describe our environment, feature representation, and language annotation process.

2.1 Environment: THOR

We use AI2-THOR as an environment for this task (Kolve et al., 2017). In THOR, a robotic agent can navigate around and perform rich contextual interactions with objects in a house. For instance, it can grab an Apple, slice it, put it in a Fridge, drop it, and so on. The state of the Apple, such as whether it is sliced or cold, changes accordingly; this is not possible in many other environments.

In this work, we use the underlying THOR simulator as a proxy for grounded meaning. It can be seen as a 'complete' meaning representation (Artzi et al., 2013), as it fully specifies the kind of grounding a model can expect in its perception within THOR.

Objects. The underlying THOR representation of each object o is in terms of 42 attributes; we provide a list in Appendix B. We treat these attributes as words specific to an attribute-level dictionary; for example, Hot is one of three possible values for an object's temperature, the others being Cold and RoomTemp.

Actions. An action a in THOR is a function that takes up to two objects as arguments. Actions are highly contextual, affecting not only the arguments but potentially other objects in the scene (Figure 2 ). We also treat action names as words in a dictionary.

Filtering out background objects. Most actions change the state of only a few objects, yet there can be many objects in a scene. We keep annotation and computation tractable by having models predict (and humans annotate) possible changes of at most two key objects in the scene. As knowing when an object doesn't change is also important, we include non-changing objects if fewer than two change.

Exploration. Any way of exploring the environment is valid for our task; however, we found that exploring intentionally was needed to yield good coverage of interesting states. Similar to prior work for instruction following (Shridhar et al., 2020), we designed an oracle to collect diverse and interesting trajectories {o, a, o′}. Our oracle randomly selects one of ten high-level tasks; see Appendix B for the list. These in turn require randomly choosing objects in the scene, e.g. a Vase and a Laptop in Figure 2. We randomize the manner in which the oracle performs the task to discover diverse situations.

In total, we sampled 20k trajectories. From these, we extracted 280k transitions (instances of Eqn. 1) where at least one object changes state, for training. A simplified sketch of this collection loop is shown below.
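The AI2-THOR calls in the sketch (Controller, step, last_event, metadata) follow the simulator's public API, but the task generators, their plan() method, and the bookkeeping are illustrative placeholders rather than our exact oracle.

```python
import random
from ai2thor.controller import Controller  # AI2-THOR simulator

def collect_trajectory(controller, task_generators, key_object_ids):
    """Run one randomly chosen high-level task and record (o, a, o') transitions."""
    transitions = []
    task = random.choice(task_generators)            # e.g. put_X_in_Y, throw_X_at_Y, ... (Appendix B)
    for action_kwargs in task.plan(controller):      # hypothetical planner yielding THOR action dicts
        before = {o["objectId"]: o for o in controller.last_event.metadata["objects"]}
        event = controller.step(**action_kwargs)     # e.g. dict(action="ThrowObject", moveMagnitude=100)
        after = {o["objectId"]: o for o in event.metadata["objects"]}
        if any(before.get(oid) != after.get(oid) for oid in key_object_ids):
            transitions.append((before, action_kwargs, after))  # keep it if a key object changed state
    return transitions

# usage (hypothetical generators): collect_trajectory(Controller(scene="FloorPlan1"), generators, ids)
```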

2.2.1 Data Selection For Annotation

We select 2k action state-changes from trajectories held out from the training set. We select them while also balancing the distribution of action types, to ensure broad coverage in the final dataset. We are also interested in a model's ability to generalize to new object categories, beyond what it has read about or observed in a training set. We thus select 30 objects to be "unseen," and exclude these from paired environment-language training data. We sample 500 state transitions containing only "seen" objects to be the training set; we use 500 for validation and 1000 for testing.

2.2.2 Natural Language Annotation

Workers on Mechanical Turk were shown an environment in THOR before and after a given action a. Each view contains the THOR attributes of the two key objects. Workers then wrote three English sentences, corresponding to s_o, s_a, and s_o′ respectively. Workers were instructed to write at a particular level of detail: enough so that a reader could infer "what happens next" from s_o and s_a, yet without mentioning redundant attributes. We provide more details in Appendix C.

3 Modeling PIGLeT

In this section, we describe our PIGLeT model. First, we learn a neural physical dynamics model from interactions, and second, we integrate it with a pretrained model of language form.

Figure 3: PIGLeT architecture. We pretrain a model of physical world dynamics by learning to transform objects ~o and actions a into new updated objects ~o′. Our underlying world dynamics model – the encoder, the decoder, and the action application module, can augment a language model with grounded commonsense knowledge.

3.1 Modeling Physical Dynamics

We take a neural, auto-encoder style approach to model world dynamics. An object o gets encoded as a vector h_o ∈ R^{d_o}. The model likewise encodes an action a as a vector h_a ∈ R^{d_a}, using it to manipulate the hidden states of all objects. The model can then decode any object hidden representation back into a symbolic form.

3.1.1 Object Encoder And Decoder

We use a Transformer (Vaswani et al., 2017) to encode objects into vectors h_o ∈ R^{d_o}, and then another to decode from this representation.

Encoder. Objects o are provided to the encoder as a set of attributes, with categories c_1, ..., c_n. Each attribute category c has its own vocabulary and embedding E_c. For each object o, we first embed all the attributes separately and feed the result into a Transformer encoder T_enc. This gives us (with position embeddings omitted for clarity):

h_o = T_enc(E_{c_1}(o_{c_1}), ..., E_{c_n}(o_{c_n}))   (2)

Decoder. We can then convert back into the original symbolic representation through a left-to-right Transformer decoder, which predicts attributes one by one from c_1 to c_n. This captures the inherent correlation between attributes while making no independence assumptions; we discuss our attribute ordering in Appendix A.2. The probability of predicting the next attribute o_{c_{i+1}} is then given by:

p(o_{c_{i+1}} | h_o, o_{:c_i}) = T_dec(h_o, E_{c_1}(o_{c_1}), ..., E_{c_i}(o_{c_i}))   (3)
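A minimal PyTorch-style sketch of the attribute-level object encoder (cf. Eqn. 2) is below; the layer sizes, pooling choice, and module structure are our assumptions rather than the released implementation, and the left-to-right attribute decoder would be built analogously.

```python
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """Embed each attribute with its own vocabulary, then pool with a Transformer (cf. Eqn. 2)."""
    def __init__(self, attr_vocab_sizes, d_model=256, n_layers=3, n_heads=4):
        super().__init__()
        # one embedding table E_c per attribute category c
        self.attr_embeddings = nn.ModuleList(
            [nn.Embedding(v, d_model) for v in attr_vocab_sizes]
        )
        self.pos = nn.Parameter(torch.zeros(len(attr_vocab_sizes), d_model))  # position embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, attrs):  # attrs: [batch, n_attrs] integer attribute ids
        tokens = torch.stack(
            [emb(attrs[:, i]) for i, emb in enumerate(self.attr_embeddings)], dim=1
        ) + self.pos                      # [batch, n_attrs, d_model]
        h = self.encoder(tokens)
        return h.mean(dim=1)              # pooled object vector h_o (mean pooling is an assumption)
```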

3.1.2 Modeling Actions As Functions

We treat actions a as functions that transform the state of all objects in the scene. Actions in our environment take at most two arguments, so we embed the action a and the names of its arguments, concatenate them, and pass the result through a multilayer perceptron, yielding a vector representation h_a.

Applying Actions. We use the encoded action h_a to transform all objects in the scene, obtaining updated representations ĥ_o for each one. We take a global approach, jointly transforming all objects. This takes into account that interactions are contextual: turning on a Faucet might fill up a Cup if and only if there is one beneath it.

Letting the observed objects in the interaction be o_1 and o_2, with encodings h_{o_1} and h_{o_2} respectively, we model the transformation via the following multilayer perceptron:

[ĥ_{o_1}, ĥ_{o_2}] = MLP_apply(h_a, h_{o_1}, h_{o_2})   (4)

The result can be decoded into symbolic form using the object decoder (Equation 3).
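A sketch of the action-application module (cf. Eqn. 4) is below; the hidden sizes and choice of activation are illustrative.

```python
import torch
import torch.nn as nn

class ApplyAction(nn.Module):
    """Jointly transform both key objects given the action vector (cf. Eqn. 4)."""
    def __init__(self, d_obj=256, d_act=256, d_hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_act + 2 * d_obj, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, 2 * d_obj),
        )

    def forward(self, h_a, h_o1, h_o2):
        out = self.mlp(torch.cat([h_a, h_o1, h_o2], dim=-1))
        h_o1_hat, h_o2_hat = out.chunk(2, dim=-1)  # updated states for both objects at once
        return h_o1_hat, h_o2_hat
```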

3.1.3 Loss Function And Training

We train our dynamics model on (o, a, o′) transitions. The model primarily learns by running o and a through the model, predicting the updated output state ĥ_o, and minimizing the cross-entropy of generating the attributes of the real changed object o′. We also regularize the model by encoding objects o and o′ and having the model learn to reconstruct them. We weight all these cross-entropy losses equally. We discuss our architecture in Appendix A.1; it uses 3-layer Transformers, totalling 17M parameters.
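The sketch below shows one equally-weighted formulation of this objective, assuming a decoder callable that returns per-attribute logits; the interface is hypothetical and only one object is shown for brevity.

```python
import torch.nn.functional as F

def dynamics_loss(decoder, h_o_pre, h_o_hat, attrs_pre, attrs_post):
    """Cross-entropy for predicting o' from (o, a), plus reconstruction of o, weighted equally."""
    def attr_xent(hidden, target_attrs):
        # decoder(hidden, target_attrs) -> list of [batch, vocab_c] logits, one per attribute c (assumed API)
        logits = decoder(hidden, target_attrs)
        return sum(F.cross_entropy(l, target_attrs[:, i]) for i, l in enumerate(logits))

    loss_predict = attr_xent(h_o_hat, attrs_post)   # main loss: decode the updated state into o'
    loss_recon = attr_xent(h_o_pre, attrs_pre)      # regularizer: reconstruct the input object o
    return loss_predict + loss_recon
```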

3.2 Language Grounding

After pretraining our physical dynamics model, we integrate it with a Transformer Language Model (LM). In our framework, the role of the LM is both to encode natural language sentences of actions into a hidden state approximating h_a, and to summarize the result of an interaction (o, a, o′) in natural language.

Choice of LM. Our framework is compatible with any language model. However, to explore the impact of pretraining data on grounding later in this paper, we pretrain our own with an architecture identical to the smallest GPT2 (Radford et al., 2019; 117M parameters). To handle both classification and generation well, we mask out only part of the attention weights, allowing the model to encode a "prefix" bidirectionally; it generates subsequent tokens left-to-right (Dong et al., 2019). We pretrain the model on Wikipedia and books; details are in Appendix D.
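One common way to realize this partial attention masking, in the spirit of Dong et al. (2019), is sketched below; this is our illustration of the idea, not the paper's exact implementation.

```python
import torch

def prefix_lm_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend): prefix tokens attend bidirectionally within the
    prefix, while the remaining tokens attend to the prefix and to earlier tokens only."""
    mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))  # causal by default
    mask[:prefix_len, :prefix_len] = True                                  # full attention inside the prefix
    return mask
```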

We next discuss architectural details of performing the language transfer, along with optimization.

3.2.1 Transfer Architecture

English actions to vector form. Given a natural language description s_a of an action a, like "The robot throws the vase," our model for PIGPeN-NLU learns to parse this sentence into a neural representation h_a, so the dynamics model can simulate the result. We do this by encoding s_a through our language model, T_LM, with a learned linear transformation over the resulting (bidirectional) encoding. The resulting vector h_{s_a} can then be used by Equation 4.
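A sketch of this step is below; the LM's calling convention and the mean pooling over the bidirectional encoding are assumptions, as the text only specifies a learned linear transformation.

```python
import torch.nn as nn

class ActionFromText(nn.Module):
    """Map the LM's encoding of s_a to a vector usable as h_a in Eqn. 4 (a sketch)."""
    def __init__(self, lm, d_lm=768, d_act=256):
        super().__init__()
        self.lm = lm                          # prefix-bidirectional Transformer LM (T_LM)
        self.proj = nn.Linear(d_lm, d_act)    # learned linear transformation

    def forward(self, token_ids, attention_mask):
        hidden = self.lm(token_ids, attention_mask=attention_mask)  # [batch, seq, d_lm]; assumed API
        mask = attention_mask.float().unsqueeze(-1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)               # mean over non-pad tokens (assumption)
        return self.proj(pooled)                                    # h_{s_a}
```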

Summarizing the result of an action. For PIGPeN-NLG, our model simulates the result of an action a neurally, resulting in a predicted hidden state ĥ_o for each object o in the scene. To write an English summary describing "what changed," we first learn a lightweight fused representation of the transition, aggregating the initial and final states along with the action through a multilayer perceptron. For each object o_i we have:

h_{Δo_i} = MLP(h_{o_i}, h_a, ĥ_{o_i})   (5)

We then use the sequence [h_{Δo_1}, h_{Δo_2}] as bidirectional context for our LM to decode from. Additionally, since our test set includes novel objects not seen in training, we provide the names of the objects as additional context for the LM generator (e.g. 'Vase, Laptop'); this allows the LM to copy those names over rather than hallucinate wrong ones. Importantly, we only provide the surface-form names, not underlying information about these objects or their usage as with the few-shot scenarios in the recent GPT-3 experiments (Brown et al., 2020), necessitating that PIGLeT learn what these names mean through interaction.
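A sketch of the fusion step (cf. Eqn. 5) is below; the hidden sizes and two-layer form of the MLP are assumptions.

```python
import torch
import torch.nn as nn

class FuseTransition(nn.Module):
    """Fuse pre-state, action, and simulated post-state into one vector per object (cf. Eqn. 5)."""
    def __init__(self, d_obj=256, d_act=256, d_lm=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_obj + d_act, d_lm),
            nn.GELU(),
            nn.Linear(d_lm, d_lm),
        )

    def forward(self, h_o, h_a, h_o_hat):
        # the resulting h_{Δo_i} vectors are fed to the LM as bidirectional prefix context
        return self.mlp(torch.cat([h_o, h_a, h_o_hat], dim=-1))
```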

3.2.2 Loss Functions And Training.

Modeling text generation allows us to incorporate a new loss function: minimizing the negative log-likelihood of generating each token of s_o′ given the previous words and the result of Equation 5:

p(s_{o′,i+1} | s_{o′,1:i}) = T_LM(h_{Δo_1}, h_{Δo_2}, s_{o′,1:i})   (6)

We do the same for the object states s_o pre-action, using h_{o_i} as the corresponding hidden states.

For PIGPeN-NLU, where no generation is needed, optimizing this generation objective is not strictly necessary. However, as we show later, it provides additional signal to the model, improving overall accuracy by several percentage points.

4 Experiments

We test our model's ability to encode language into a grounded form (PIGPeN-NLU), and decode that grounded form into language (PIGPeN-NLG).

4.1 PIGPeN-NLU Results

We first evaluate models by their performance on PIGPeN-NLU: given objects o and a sentence s_a describing an action, a model must predict the resulting state of objects o′. We primarily evaluate models by accuracy, scoring the fraction of objects for which they get all attributes correct. We compare with the following strong baselines:

a. No Change: this baseline copies the initial state of all objects o as the final state o′.

b. GPT3-175B (Brown et al., 2020), a very large language model used for 'few-shot' learning with a prompt. For GPT3, and other text-to-text models, we encode and decode the symbolic object states in a JSON-style dictionary format, discussed in Appendix A.4.

c. T5 (Raffel et al., 2019). With this model we use the same 'text-to-text' format, but here we train it on the paired data from PIGPeN. We consider varying sizes of T5, from T5-Small, the closest in size to PIGLeT, up to T5-11B, roughly 100x the size.

d. (Alberti et al., 2019; Zellers et al., 2019a)-style, where grounded visual information is fed into a BERT model as tokens; the transformer performs the grounded reasoning. We adapt it for our task by using our base LM and feeding in object representations from our pretrained object encoder, also as tokens. Our object decoder predicts the object, given the LM's pooled hidden state. This is the "pretrained dynamics" variant; we also consider a version with a randomly initialized dynamics model instead.

e. (Gupta and Durrett, 2019)-style. This paper proposes using Transformers to model physical state, for tasks like entity tracking in recipes. Here, the authors propose decoding a physical-state attribute (like isCooked) by feeding the model a label-specific [CLS] token, and then mapping the result through a hidden layer. We do this and use a similar object encoder as our (Alberti et al., 2019)-style baseline.

We discuss hyperparameters in Appendix A.3.

Results. From the results (Table 1), we can draw several patterns. Our model, PIGLeT, performs best at getting all attributes correct, doing so over 80% of the time on both validation and test sets, even for novel objects not seen during training. The next closest model is T5-11B, which scores 68% on validation. Though it gets 77% when evaluated on objects 'seen' during training, that number drops by over 18% for unseen objects. In contrast, PIGLeT has a modest gap of 3%. This suggests that our approach is particularly effective at connecting unpaired language and world representations. At the other extreme, GPT3 does poorly in its 'few-shot' setting, suggesting that size is no replacement for grounded supervision.

Table 1: Overall results. Left: we show the model accuracies at predicting all attributes of an object correctly. We compare PIGLeT with ‘text-to-text’ approaches that represent the object states as a string, along with BERT-style approaches with additional machinery to encode inputs or decode outputs. PIGLeT outperforms a T5 model 100x its size (11B params) and shows gains over the BERT-style models that also model action dynamics through a language transformer. Right: we show several attribute-level accuracies, along with the number of categories per attribute; PIGLeT outperforms baselines by over 4 points for some attributes such as size and distance.

Model | Accuracy (val; %)
PIGLeT, No Pretraining | 10.4
PIGLeT, Non-global MLP_apply | 72.0
PIGLeT, Global MLP_apply | 78.5
PIGLeT, Global MLP_apply, Gen. loss | 81.8
PIGLeT, Symbols Only (Upper Bound) | 89.3

Table 2: Ablation study on PIGPeN-NLU’s validation set. Our model improves 6% by modeling global dynamics of all objects in the scene, versus applying actions to single objects in isolation. We improve another 3% by adding an auxiliary generation loss.

PIGLeT also outperforms 'BERT style' approaches that control for the same language model architecture, but perform the physical reasoning inside the language transformer rather than as a separate model. Performance drops when the physical decoder must be learned from few paired examples (as in Gupta and Durrett (2019)); it drops even further when neither model is given access to our pretrained dynamics model, with both baselines then underperforming 'No Change.' This suggests that our approach of having a physical reasoning model outside of an LM is a good inductive bias.

4.1.1 Ablation Study

In Table 2 we present an ablation study of PIGLeT's components. Of note, by using a global representation of objects in the world (Equation 4), we get over 6% improvement over a local representation where objects are manipulated independently. We get another 3% boost by adding a generation loss, suggesting that learning to generate summaries helps the model better connect the world to language. Last, we benchmark how much headroom there is on PIGPeN-NLU by evaluating model performance on a 'symbols only' version of the task, where the symbolic action a is given explicitly to our dynamics model. This upper bound is roughly 7% higher than PIGLeT, suggesting space for future work.

4.2 PIGPeN-NLG Results

Next, we turn to PIGPeN-NLG: given objects o and the literal next action a, a model must generate a sentence s_o′ describing what will change in the scene. We compare with the following baselines:

a. T5. We use a T5 model that is given a JSON-style dictionary representation of both o and a; it is finetuned to generate summaries s_o′.

b. LM Baseline. We feed our LM hidden states h_o from our pretrained encoder, along with its representation of a. The key difference between it and PIGLeT is that we do not allow it to simulate neurally what might happen next: MLP_apply is never used here.

Size matters. Arguably the most important factor controlling the fluency of a language generator is its size (Kaplan et al., 2020). Since our LM could also be scaled up to arbitrary size, we control for size in our experiments and only consider models the size of GPT2-base (117M) or smaller; we thus compare against T5-Small, as T5-Base has 220M parameters. We discuss optimization and sampling hyperparameters in Appendix A.3.

Evaluation metrics. We evaluate models over the validation and test sets. We consider three main evaluation metrics: BLEU (Papineni et al., 2002) with two references, the recently proposed BERTScore (Zhang et al., 2020), and a human evaluation. Humans rate both the fluency of the post-action text and its faithfulness to the true action result, on a scale from −1 to 1.

Results. We show our results in Table 3. Of note, PIGLeT is competitive with T5 and significantly outperforms the pure LM baseline, which uses a pretrained encoder for object states yet has the physical simulation piece MLP_apply removed. This suggests that simulating world dynamics not only allows the model to predict what might happen next, but also leads to more faithful generation.

Table 3: Text generation results on PIGPeN-NLG, showing models of roughly equivalent size (up to 117M parameters). Our PIGLeT outperforms the LM baseline (using the same architecture but omitting the physical reasoning component) by 4 BLEU points, 2 BERTScore F1 points, and 0.35 points in a human evaluation of language faithfulness to the actual scene.

5 Analysis

5.1 Qualitative examples.

We show two qualitative examples in Figure 4, covering both PIGPeN-NLU and PIGPeN-NLG.

Figure 4: Qualitative examples. Our model PIGLeT reliably predicts what might happen next (like the Mug becoming empty in Row 1), in a structured and explicit way. However, it often struggles at generating sentences for unseen objects.

In the first row, the robot empties a held Mug that is filled with water. PIGLeT gets the state right, and generates a faithful sentence summarizing that the Mug becomes empty. T5 struggles somewhat, emptying the water from both the Mug and the (irrelevant) Sink. It also generates text saying that the Sink, instead of the Mug, becomes empty. In the second row, PIGLeT correctly predicts the next object states, but its generated text is incomplete: it should also write that the Mug becomes filled with Coffee. T5 makes the same mistake in generation, and it also underpredicts the state changes, omitting all changes to the Mug.

We suspect that T5 struggles here in part because Mug is an unseen object. T5 only experiences it through language-only pretraining, but this might not be enough for a fully grounded representation.

5.2 Representing Novel Words

The language models that perform best today are trained on massive datasets of text. However, this has unintended consequences (Bender et al., 2021), and it is unlike how children learn language, acquiring novel words from experience (Carey and Bartlett, 1978). The large scale of our pretraining datasets might allow models to learn to perform physical-commonsense-like tasks for the wrong reasons, overfitting to surface patterns rather than learning meaningful grounding.

We investigate the extent of this by training a 'zero-shot' version of our backbone LM on Wikipedia and books; the only difference is that we explicitly exclude all sentences mentioning one of our "unseen" object categories.

Figure 5: PIGPeN-NLU performance of a zero-shot PIGLeT, that was pretrained on Books and Wikipedia without reading any words of our ‘unseen’ objects like ‘mug.’ It outperforms a much bigger T5-11B overall, though is in turn beaten by PIGLeT on unseen objects like ‘Sink’ and ‘Microwave.’

In this setting, not only must PIGLeT learn to ground words like 'mug,' it must do so without having seen the word 'mug' during pretraining. This is significant because we count over 20k instances of 'Mug' words (including morphological variants) in our pretraining data. We show results in Figure 5. A version of PIGLeT with the zero-shot LM does surprisingly well, achieving 80% accuracy at predicting the state changes for Mug, despite never having been pretrained on the word. This even outperforms T5 at the overall task. Nevertheless, PIGLeT outperforms it by roughly 7% on unseen objects, with notable gains of over 10% on highly dynamic objects like Toasters and Sinks.

6 Related Work

Grounded commonsense reasoning. In this work, we study language grounding and commonsense reasoning at the representation and concept level. The aim is to train models that learn to acquire concepts more like humans do, rather than to perform well on a downstream task that (for humans) requires commonsense reasoning. Thus, this work is somewhat different from other 3D embodied tasks like QA (Gordon et al., 2018; Das et al., 2018), along with past work for measuring such grounded commonsense reasoning, like SWAG, HellaSWAG, and VCR (Zellers et al., 2018, 2019b). The knowledge covered is also different, as it is self-contained within THOR. While VCR, for instance, includes many visual situations about what people are doing, this paper focuses on learning the physical properties of objects.

Zero-shot generalization. There has been much past work on learning 'zero-shot': often learning about the grounded world from language, and transferring that knowledge to vision. Techniques for this include using word embeddings (Frome et al., 2013) and dictionary definitions (Zellers and Choi, 2017). In this work, we propose the inverse. This direction has been used to learn better word embeddings or semantic tuples (Yatskar et al., 2016), but we consider learning a component to be plugged into a deep Transformer language model. Past work evaluating these types of zero-shot generalization has also looked into how well models can compose concepts in language (Lake and Baroni, 2018; Ruis et al., 2020). Our work considers elements of compositionality through grounded transfer. For example, in PIGPeN-NLG, models must generate sentences about the equivalent of dropping a 'dax', despite never having seen one before. However, our work is also contextual, in that the outcome of 'dropping a dax' might depend on external attributes (like how high we are dropping it from).

Structured Models for Attributes and Objects. The idea of modeling actions as functions that transform objects has been explored in the computer vision space (Wang et al., 2016). Past work has also built formal structured models for connecting vision and language (Matuszek et al., 2012; Krishnamurthy and Kollar, 2013); we instead take a neural approach and connect today's best models of language form to similarly neural models of a simulated environment.

7 Conclusion

In this paper, we presented PIGLeT, an approach for jointly modeling language form and meaning, and PIGPeN, a testbed for evaluating it. PIGLeT performs well at grounding language to the (simulated) world.

A Model implementation details and hyperparameters.

We discuss the architectures and learning hyperparameters of our various models in the subsections below.

A.1 Physical Dynamics Model

We implemented our dynamics model with three Transformer layers for both the encoder and the decoder, and a hidden dimension of 256 for objects and actions. The resulting model has 17 million parameters. We pretrained the model for 20 epochs over 280k state transitions, with a batch size of 1024, using an Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-3.

A.2 Ordering attributes in decoding.

Recall that we use a left-to-right Transformer to decode into an attribute representation, predicting attributes one by one from c_1 to c_n. Our model is agnostic to the actual order: no matter what the order is, it still models a decomposition of the joint probability of generating that object. In practice, we use the object's name as the first predicted attribute c_1, and order the rest by descending vocabulary size, so as to predict harder attributes first.

A.3 Optimization Hyperparameters Chosen

We finetuned PIGLeT for both tasks with an Adam optimizer (Kingma and Ba, 2014). We did a small grid search for hyperparameter values, choosing the best learning rate from {2e-5, 1e-5, 1e-6} by accuracy on the development set, and likewise the best batch size (16 or 32). We also considered freezing the physical dynamics backbone as a hyperparameter; we found that freezing it slightly boosted performance on PIGPeN-NLG, but not on PIGPeN-NLU. We trained our model for 80 epochs on paired data. We trained the baseline models with the same backbone in the same way, using similar hyperparameters. However, we found that after 80 epochs the baseline models without pretrained dynamics failed to converge, so we finetuned them for 200 epochs total. For T5, we used similar hyperparameter ranges as for the other models. However, because T5 uses a different optimizer (AdaFactor; Shazeer and Stern (2018)), which operates on a slightly different scale, we used a different set of learning rates, choosing the best one over {1e-4, 2e-4, 4e-4}.

Search. Both of our tasks involve left-to-right decoding. We used argmax (greedy) search for PIGPeN-NLU, finding that it worked well for this 'closed-ended generation' style task. On the other hand, we used Nucleus Sampling for PIGPeN-NLG, as there are often several ways to communicate a state transition; here we set p = 0.8.
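For reference, a minimal top-p (nucleus) sampling step over a single vector of next-token logits looks like the sketch below; this is the standard algorithm, not our exact decoding code.

```python
import torch

def nucleus_sample(logits, p=0.8):
    """Sample from the smallest set of tokens whose cumulative probability exceeds p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep = cumulative - sorted_probs < p          # keep tokens until the mass before them reaches p
    kept = sorted_probs * keep
    choice = torch.multinomial(kept / kept.sum(), num_samples=1)
    return sorted_idx[choice]                     # id of the sampled token
```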

A.4 Encoding The Input For Text-To-Text Models

Text-to-text models, needless to say, can only handle text. We encode the world states into a representation suitable for these models by formatting the object states as a JSON-style dictionary of keys and values. We had to make several modifications to the encoding relative to default JSON: because we handle a lot of attributes in this task, and JSON quote characters take up a lot of space in a BPE encoding, we strip the quote characters and lowercase everything (which also helps BPE efficiency). We put parentheses around each object and give names to all 'binned' attributes. We have models decode directly into this kind of format when predicting state changes. Though the T5 models usually get the format right, we often have to sanitize the text in order for it to be a valid object state in our framework. This is especially an issue with GPT3, since it is given limited supervision (we squeeze 3 examples into the 2048-BPE-token context window) and often makes up new names and attributes. Thus, for each word not in an attribute's vocabulary, we use a Levenshtein distance heuristic to match the invalid choice with its closest (valid) option. If the model fails to generate anything for a certain attribute key, for example if it does not include something like openable anywhere, we copy the representation of the input object for that attribute, thereby making the default assumption that attributes do not change.
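The worked example from the original document was not extracted here; the sketch below gives a hypothetical illustration of the quote-free, lowercased format described above, together with a simple sanitization step (using difflib as a stand-in for the Levenshtein heuristic).

```python
import difflib

def encode_object(obj: dict) -> str:
    """Serialize an object's attributes in a quote-free, lowercased, parenthesized style (illustrative)."""
    inner = ", ".join(f"{k.lower()}: {str(v).lower()}" for k, v in obj.items())
    return f"({inner})"

def sanitize_value(predicted: str, vocab: list, default: str) -> str:
    """Snap an out-of-vocabulary prediction to its closest valid option, else fall back to a default."""
    if predicted in vocab:
        return predicted
    close = difflib.get_close_matches(predicted, vocab, n=1)
    return close[0] if close else default

# e.g. encode_object({"name": "Mug", "isFilledWithLiquid": True, "temperature": "RoomTemp"})
#  -> "(name: mug, isfilledwithliquid: true, temperature: roomtemp)"
```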

B All THOR Attributes

We list a table with all of the attributes we used for this work in Table 4 .

Table 4: All attributes that we consider for this work in THOR. We list the attribute’s name, the size of the attribute vocabulary, and the range of values the attribute can take on. For attributes like ‘mass’, ‘size’, and ‘distance’, we note that the underlying simulator stores them as floats; we bin them to 8 values for this work. All the values for attributes with a vocabulary size of 2 are boolean.

C Turk Annotation Details

We followed crowdsourcing best practices, such as using a qualification exam, giving feedback to workers, and paying workers well (above $15 per hour). Each of our HITs required writing three sentences, and we paid Mechanical Turk workers 57 cents per HIT. We used three workers per example, allowing us to have multiple language references for evaluation. A screenshot of our user interface is shown in Figure 6 .

Figure 6: Our user interface for Mechanical Turk annotation.

D Our Pretrained Language Model

We use our own pretrained language model primarily because it allows us to investigate the impact of data on model performance. We trained a prefix-masked language model (Dong et al., 2019) on Wikipedia and book data, mimicking the data used by the original BERT paper (Devlin et al., 2019). We trained the model for 60000 iterations, at a batch size of 8192 sequences, each of length 512. This corresponds to 50 epochs over the dataset. We masked inputs in the bidirectional prefix with SpanBERT masking (Joshi et al., 2020). Since BERT-style 'masked' inputs are easier to predict than tokens generated left-to-right, we reduced the loss component of left-to-right generation by a factor of 20, roughly balancing the two loss components.

Figure 7: Counts of zero-shot words that appear in BERT’s training data (Wikipedia and Toronto Books). For example, in the 4 billion words BERT is trained on, it sees the word ‘Bed’ almost 500k times. This might allow it to perform superficially well at answering questions about beds – while not necessarily possessing deep physical knowledge about them.


Generator | Description
put_X_in_Y | Samples an object X from the scene, and a receptacle Y. Tries to put it in Y.
throw_X_at_Y | Samples two objects X and Y from the scene. Picks up X, moves to face Y, and throws it forward with variable intensity.
toggle_X | Samples an object X, and turns it on or off.
slice_X | Samples an object X and a surface Y. Picks up X, places it on Y, and cuts it.
dirty_X | Samples an object X, and makes it dirty.
clean_X | Samples a dirty object X. Finds a Sink near a Faucet, and places X inside. Turns the Faucet on/off, cleaning X.
toast_bread | Finds some Bread, slicing it if necessary, places it in a Toaster, then turns it on.
brew_coffee | Picks up a Mug, places it under a CoffeeMachine, and turns the machine on.
fry_X | Picks up a food X, slices it if necessary, and puts it in a Pot or Pan. Brings it to a StoveBurner and turns the burner on.
microwave_X | Picks up an object X and slices it if necessary. Places it in a Microwave, closes it, and then turns it on.
fill_X | Picks up an object X, places it under a Faucet. Turns the Faucet on/off, then pours out the liquid.

Table 5: Trajectory generation functions that we used to sample 'interesting' physical interactions, such as the effects that actions will have on objects, and which actions will succeed or not.
