OCNLI: Original Chinese Natural Language Inference
Despite the tremendous recent progress on natural language inference (NLI), driven largely by large-scale investment in new datasets (e.g., SNLI, MNLI) and advances in modeling, most progress has been limited to English due to a lack of reliable datasets for most of the world’s languages. In this paper, we present the first large-scale NLI dataset (consisting of ~56,000 annotated sentence pairs) for Chinese, called the Original Chinese Natural Language Inference dataset (OCNLI). Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation. Instead, we elicit annotations from native speakers specializing in linguistics. We closely follow the annotation protocol used for MNLI, but create new strategies for eliciting diverse hypotheses. We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find even the best performing models to be far outpaced by human performance (~12% absolute performance gap), making it a challenging new resource that we hope will help to accelerate progress in Chinese NLU. To the best of our knowledge, this is the first human-elicited MNLI-style corpus for a non-English language.
In the last few years, natural language understanding has made considerable progress, driven largely by the availability of large-scale datasets and advances in neural modeling (Peters et al., 2018; Devlin et al., 2019). At the center of this progress has been natural language inference (NLI), which focuses on the problem of deciding whether two statements are connected via an entailment or a contradiction. NLI profited immensely from new datasets such as the Stanford NLI (SNLI; Bowman et al., 2015) and Multi-Genre NLI (MNLI; Williams et al., 2018) datasets. However, as is often the case, this progress has centered around the English language, given that the most well-known datasets are limited to English. Efforts to build comparable datasets for other languages have largely focused on (automatically) translating existing English NLI datasets (Conneau et al., 2018). But this approach comes with its own issues (see Section 2).
To overcome these shortcomings and contribute to ongoing progress in Chinese NLU, we present the first large-scale NLI dataset for Chinese called the Original Chinese Natural Language Inference dataset (OCNLI). Unlike previous approaches, we rely entirely on original Chinese sources and use native speakers of Chinese with special expertise in linguistics and language studies for creating hypotheses and for annotation. Our dataset contains ∼56,000 annotated premise-hypothesis pairs and follows a similar procedure of data collection to the English MNLI. Following MNLI, the premises in these sentence pairs are drawn from multiple genres (5 in total), including both written and spoken Chinese (see Table 1 for examples). To ensure annotation quality and consistency, we closely mimic MNLI's original annotation protocols for monitoring annotator performance. We find that our trained annotators have high agreement on label prediction (with ∼98% agreement based on a 3-vote consensus). To our knowledge, this dataset constitutes the first large-scale NLI dataset for Chinese that does not rely on automatic translation.
Additionally, we establish baseline results based on a standard set of NLI models (Chen et al., 2017) tailored to Chinese, as well as new pre-trained Chinese transformer models (Cui et al., 2019). We find that our strongest model, based on RoBERTa (Liu et al., 2019), performs far behind expert human performance (∼78% vs. ∼90% accuracy on our test data). These results show that the dataset is challenging without using special filtering that has accompanied many recent NLI datasets.
arXiv:2010.05444v1 [cs.CL] 12 Oct 2020
Premise: 但是不光是中国，日本，整个东亚文化都有这个特点就是被权力影响很深 But not only China and Japan, the entire East Asian culture has this feature, that is it is deeply influenced by the power.
Majority label: Entailment. All labels: E E E E E. Hypothesis: 有超过两个东亚国家有这个特点 More than two East Asian countries have this feature.
Premise: (We need to) perfect our work and trade policies.
Majority label: Entailment. All labels: E E E E E. Hypothesis: 贸易政策体系还有不足之处 (Our) trade policies still need to be improved.
Premise: 咖啡馆里面对面坐的年轻男女也是上一代的故事，她已是过来人了 Stories of young couples sitting face-to-face in a cafe is already something from the last generation. She has gone through all that.
Genre and level: LIT medium. Majority label: Contradiction. All labels: C C C N N. Hypothesis: 男人和女人是背对背坐着的 The man and the woman are sitting back-to-back.
Majority label: Neutral. All labels: N E N N N. Hypothesis: 专门行政法规是解决拖欠工资问题的根本途径 (Designing) specific administrative regulations is the most fundamental way of solving the issue of wage delays.
Premise: Today, this conference which has drawn much attention finally took place in Bonn.
Majority label: Neutral. All labels: N N N N C. Hypothesis: 这一会议原定于昨天举行 This conference was scheduled to be held yesterday.
Premise: 嗯,今天星期六我们这儿,嗯哼. En, it's Saturday today in our place, yeah.
Majority label: Contradiction. All labels: C C C C C. Hypothesis: 昨天是星期天 It was Sunday yesterday.
Table 1: Examples from the MULTICONSTRAINT elicitation of our Chinese NLI dataset, one from each of the five text genres. easy: 1st hypothesis the annotator wrote for that particular premise and label; medium: 2nd hypothesis; hard: 3rd hypothesis. Bold label shows the majority vote from the annotators.
Contributions of this paper: 1) We introduce a new, high quality dataset for NLI for Chinese, based on Chinese data sources and expert annotators; 2) We provide strong baseline models for the task, and establish the difficulty of our task through experiments with recent pre-trained transformers.
3) We also demonstrate the benefit of naturally annotated NLI data by comparing performance with large-scale automatically translated datasets.
Majority label: Contradiction. All labels: C C C C C. Hypothesis: 我没房 I don't have a house.
Majority label: Neutral. All labels: N N N N N. Hypothesis: 是别人想问我借这个东西 Someone else is trying to borrow this from me.
Premise: 桥一顶一顶地从船上过去，好像进了一扇一扇的门 Bridge after bridge was passed above the boat, just like going through door after door.
2 Related Work
Natural language inference (NLI), or recognizing textual entailment (RTE), is a long-standing task in NLP. Since we cannot cover the whole field, we focus on existing datasets and current systems.
Data: To date, there exist numerous datasets for English, ranging from smaller, more linguistically oriented resources such as FraCaS (Cooper et al., 1996), to larger ones like the RTE challenges (Dagan et al., 2005) and SICK (Marelli et al., 2014). Perhaps the most influential are the two large-scale, human-elicited datasets: the Stanford Natural Language Inference Corpus (SNLI) (Bowman et al., 2015), whose premises are taken from image captions, and the Multi-Genre Natural Language Inference Corpus (MNLI) (Williams et al., 2018), whose premises are from texts in 10 different genres. Both are built by collecting premises from pre-defined text, then having annotators come up with possible hypotheses and inference labels, which is the procedure we also employ in our work.
These large corpora have been used as part of larger benchmark sets, e.g., GLUE (Wang et al., 2018) , and have proven useful for problems beyond NLI, such as sentence representation and transfer learning (Conneau et al., 2017; Subramanian et al., 2018; Reimers and Gurevych, 2019) , automated question-answering (Khot et al., 2018; Trivedi et al., 2019) and model probing (Warstadt et al., 2019; Geiger et al., 2020; Jeretic et al., 2020) .
The most recent English corpus, Adversarial NLI (Nie et al., 2020), uses the Human-And-Model-in-the-Loop Enabled Training (HAMLET) method for data collection. This annotation method requires an existing NLI corpus to train the model during annotation, which is not possible for Chinese at the moment, as no high-quality Chinese data exist.
In fact, there has been relatively little work on developing large-scale human-annotated resources for languages other than English. Some NLI datasets exist in other languages, e.g., Fonseca et al. (2016) and Real et al. (2020) for Portuguese, Hayashibe (2020) for Japanese, and Amirkhani et al. (2020) for Persian, but none of them have human-elicited sentence pairs. Efforts have largely focused on automatic translation of existing English resources, sometimes coupled with smaller-scale hand annotation by native speakers (Agić and Schluter, 2017). This is also true for some of the datasets included in the recent Chinese NLU benchmark CLUE (Xu et al., 2020) and for XNLI (Conneau et al., 2018), a multilingual NLI dataset covering 15 languages including Chinese.
Table 2: Examples from XNLI (Conneau et al., 2018), showing problems of translationese (top) and poor translation quality (bottom).
While automatically translated data have proven to be useful in many contexts, such as cross-lingual representation learning (Siddhant et al., 2020), there are well-known issues, especially when they are used in place of human-annotated, quality-controlled data. One issue concerns limitations in the quality of automatic translations, resulting in incorrect or unintelligible sentences (e.g., see Table 2b). But even if the translations are correct, they suffer from "translationese", resulting in unnatural language, since lexical and syntactic choices are copied from the source language even though they are atypical for the target language (Koppel and Ordan, 2011; Hu and Kübler, 2020).
A related issue is that a translation approach also copies the cultural context of the source language, such as an overemphasis on Western themes or cultural situations. The latter two issues are shown in Table 2a , where many English names are directly carried over into the Chinese translation, along with aspects of English syntax, such as long relative clauses, which are common in English but dispreferred in Chinese (Lin, 2011).
Systems: As inference is closely related to logic, there has always been a line of research building logic-based or logic-and-machine-learning hybrid models for NLI/RTE problems (e.g., MacCartney, 2009; Abzianidze, 2015; Martínez-Gómez et al., 2017; Yanaka et al., 2018). However, in recent years, large datasets such as SNLI and MNLI have been almost exclusively approached by deep learning models. For example, several transformer architectures achieve impressive results on MNLI, with the current state-of-the-art T5 (Raffel et al., 2019) reaching 92.1/91.9% accuracy on the matched and mismatched sets.
Re-implementations of these transformer models for Chinese have led to similar successes on related tasks. For example, Cui et al. (2019) report that a large RoBERTa model, pre-trained with whole-word masking, achieves the highest accuracy (81.2%) among their transformer models on XNLI. In the CLUE benchmark (Xu et al., 2020), the same RoBERTa model also achieves the highest aggregated score from eight tasks. We will use this model to establish baselines on our new dataset.
The advances in dataset creation have led to an increased awareness of systematic biases in existing datasets (Gururangan et al., 2018) , as measured through partial-input baselines, e.g., the hypothesis-only baselines explored in Poliak et al. (2018) where a model can achieve high accuracy by only looking at the hypothesis and ignoring the premise completely (see also Feng et al. (2019)). These biases have been mainly associated with the annotators (crowd workers in MNLI's case) who use certain strategies to form hypotheses of a specific label, e.g., adding a negator for contradictions.
There have been several recent attempts to reduce such biases (Belinkov et al., 2019; Sakaguchi et al., 2020; Nie et al., 2020). There has also been a large body of work using probing datasets/tasks to stress-test NLI models trained on datasets such as SNLI and MNLI, in order to expose the weaknesses and biases in either the models or the data (Dasgupta et al., 2018; Naik et al., 2018; McCoy et al., 2019). For this work, we closely monitor the hypothesis-only and other biases but leave systematic filtering/bias-reduction/stress-testing for future work. An interesting future challenge will involve seeing how such techniques, which focus exclusively on English, transfer to other languages such as Chinese.
3 Creating OCNLI
Here, we describe our data collection and annotation procedures. Following the standard definition of NLI (Dagan et al., 2006) , our data consists of ordered pairs of sentences, one premise sentence and one hypothesis sentence, annotated with one of three labels: Entailment, Contradiction, or Neutral (see examples in Table 1 ). Following the strategy that Williams et al. (2018) established for MNLI, we start by selecting a set of premises from a collection of multi-genre Chinese texts, see Section 3.1. We then elicit hypothesis annotations based on these premises using expert annotators (Section 3.2). We develop novel strategies to ensure that we elicit diverse hypotheses. We then describe our verification procedure in Section 3.3.
3.1 Selecting The Premises
Our premises are drawn from the following five text genres: government documents, news, literature, TV talk shows, and telephone conversations. The genres were chosen to ensure varying degrees of formality, and they were collected from different primary Chinese sources. The government documents are taken from annual Chinese government work reports 2 . The news data are extracted from the news portion of the Lancaster Corpus of Mandarin Chinese (McEnery and Xiao, 2004). The data in the literature genre are from two contemporary Chinese novels 3 , and the TV talk show data and telephone conversations are extracted from transcripts of the talk show Behind the headlines with Wentao 4 and the Chinese Callhome transcripts (Wheatley, 1996), respectively.
As for pre-processing, annotation symbols in the Callhome transcripts were removed and we limited our premise selection to sentences containing 8 to 50 characters.
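The length filter described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name `select_premises` is hypothetical, the removal of Callhome annotation symbols is omitted because the symbols are not listed here, and counting every non-whitespace character (punctuation included) is an assumption.

```python
def select_premises(sentences, min_chars=8, max_chars=50):
    """Keep sentences whose character count lies in [min_chars, max_chars]."""
    selected = []
    for sentence in sentences:
        text = sentence.strip()
        # Assumption: length is counted over all non-whitespace characters,
        # punctuation included; the paper does not spell this out.
        n_chars = sum(1 for ch in text if not ch.isspace())
        if min_chars <= n_chars <= max_chars:
            selected.append(text)
    return selected


candidates = [
    "昨天是星期天",                  # 6 characters: too short
    "嗯,今天星期六我们这儿,嗯哼.",    # 15 characters: kept
    "桥" * 51,                      # 51 characters: too long
]
premises = select_premises(candidates)
```

Only the second candidate survives the filter; both bounds are inclusive in this sketch.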
3.2 Hypothesis Generation
One issue with the existing data collection strategies in MNLI is that humans tend to use the simplest strategies to create the hypotheses, such as negating a sentence to create a contradiction. This makes the problem unrealistically easy. To create more realistic, and thus more challenging data, we propose a new hypothesis elicitation method called multi-hypothesis elicitation. We collect four sets of inference pairs and compare the proposed method with the MNLI annotation method, where a single annotator creates an entailed sentence, a neutral sentence and a contradictory sentence given a premise (Condition: SINGLE).
Multi-hypothesis elicitation In this newly proposed setting, we ask the writer to produce three sentences per label, resulting in three entailments, three neutrals and three contradictions for each premise (Condition: MULTI). That is, we obtain a total of nine hypotheses if the writer is able to come up with that many inferences, which is indeed the case for most premises in our experiment. Our hypothesis is that by asking writers to produce three sentences for each type of inference, we push them to think beyond the easiest case. We call the 1st, 2nd and 3rd hypothesis by an annotator per label easy, medium and hard respectively, with the assumption that they start with the easiest inferences and then move on to harder ones. First experiments show that MULTI is more challenging than SINGLE, and at the same time, inter-annotator agreement is slightly higher than for SINGLE (see Section 3.3).
However, we also found that MULTI introduces more hypothesis-only bias. Especially in contradictions, negators such as 没有 ("no/not") stood out as cues, similar to what had been reported in SNLI and MNLI (Poliak et al., 2018; Gururangan et al., 2018; Pavlick and Kwiatkowski, 2019). Therefore, we experiment with two additional strategies to control the bias, resulting in MULTIENCOURAGE (encourage the annotators to write more diverse hypotheses) and MULTICONSTRAINT (put constraints on what they can produce), which will be explained in detail below.
These four strategies result in four different subsets. Table 3 gives a summary of these subsets.
Instructions for hypothesis generation The basis of our instructions is very similar to that for MNLI, but we modified them for each setting:
SINGLE We asked the writer to produce one hypothesis per label, the same as in MNLI 5 .
MULTI Instructions are the same except that we ask for three hypotheses per label.
MULTIENCOURAGE We encouraged the writers to write high-quality hypotheses by telling them explicitly which types of data we are looking for, and promised a monetary bonus to those who met our criteria after we examined their hypotheses. Among our criteria are: 1) we are interested in diverse ways of making inferences, and 2) we are looking for contradictions that do not contain a negator.
MULTICONSTRAINT We put constraints on hypothesis generation by specifying that only one out of the three contradictions can contain a negator, and that we would randomly check the produced hypotheses, with violations of the constraint resulting in lower payment. We also provided extra examples in the instructions to demonstrate contradictions without negators. These examples are drawn from the hypotheses collected from prior data.
We are also aware of other potential biases or heuristics in human-elicited NLI data such as the lexical overlap heuristic (McCoy et al., 2019). Thus, in all our instructions, we made explicit to the annotators that no hypothesis should overlap more than 70% with the premise. However, examining how prevalent such heuristics are in our data requires constructing new probing datasets for Chinese, which is beyond the scope of this paper.
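The two constraints given to writers (at most one negated contradiction out of three, and at most 70% overlap with the premise) could be checked automatically along the following lines. This is a sketch under stated assumptions, not the procedure used for OCNLI, where checks were done by random manual inspection: the negator list beyond 没有 is hypothetical, and measuring overlap as the share of hypothesis characters that also occur in the premise is one assumed reading of "overlap".

```python
# Hypothetical negator list: only 没有 is named in the text; the rest are assumptions.
NEGATORS = ("没有", "不", "没", "别", "未")


def contains_negator(hypothesis, negators=NEGATORS):
    """True if any negator string occurs in the hypothesis."""
    return any(neg in hypothesis for neg in negators)


def negator_constraint_ok(contradictions, max_with_negator=1):
    """True if at most `max_with_negator` contradiction hypotheses contain a negator."""
    count = sum(contains_negator(h) for h in contradictions)
    return count <= max_with_negator


def char_overlap(premise, hypothesis):
    """Share of hypothesis characters that also occur in the premise (assumed metric)."""
    hyp_chars = [ch for ch in hypothesis if not ch.isspace()]
    if not hyp_chars:
        return 0.0
    premise_chars = set(premise)
    shared = sum(1 for ch in hyp_chars if ch in premise_chars)
    return shared / len(hyp_chars)


def overlap_ok(premise, hypothesis, max_overlap=0.7):
    """True if the hypothesis stays within the allowed overlap with the premise."""
    return char_overlap(premise, hypothesis) <= max_overlap
```

A character-level measure is used because it needs no word segmenter; a word-level measure would require a segmentation tool and could give different values.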
5 See Appendix A for the complete instructions.
Annotators We hired 145 undergraduate and graduate students from several top-tier Chinese universities to produce hypotheses. All of the annotators (writers) are native speakers of Chinese and are majoring in Chinese or other languages. They were paid roughly 0.3 RMB (0.042 USD) per P-H pair. No single annotator produced an excessive amount of data to avoid annotator-bias (for a discussion of this, see Geva et al. (2019) ).
3.3 Data Verification
Following SNLI and MNLI, we perform data verification, where each premise-hypothesis pair is assigned a label by four independent annotators (labelers). Together with the original label assigned by the annotator, each pair has five labels. We then use the majority vote as the gold label. We selected a subset of the writers from the hypothesis generation experiment to be our labelers. For each subset, about 15% of the total data were randomly selected and relabeled. The labelers were paid 0.2 RMB (0.028 USD) for each pair.
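The gold-label rule above can be illustrated with a short sketch. It assumes that "majority" means more than half of the five labels (at least 3), and it uses the "-" marker for pairs without a majority; the function names are hypothetical.

```python
from collections import Counter


def gold_label(labels, no_majority="-"):
    """Return the label chosen by more than half of the annotators, else `no_majority`."""
    label, count = Counter(labels).most_common(1)[0]
    # Assumption: a majority requires more than half of the votes (3 of 5).
    return label if count > len(labels) / 2 else no_majority


def consensus_rate(all_labels, min_votes=3):
    """Share of pairs in which some label received at least `min_votes` votes."""
    hits = sum(
        1 for labels in all_labels
        if Counter(labels).most_common(1)[0][1] >= min_votes
    )
    return hits / len(all_labels)


pairs = [
    ["C", "C", "C", "N", "N"],  # majority: C
    ["N", "E", "N", "N", "N"],  # majority: N
    ["E", "E", "N", "N", "C"],  # no label reaches 3 votes
]
gold = [gold_label(labels) for labels in pairs]
```

The first two label sets are taken from the examples shown with Table 1; the third is an invented case that shows the no-majority outcome.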
Relabeling results Our results, shown in Table 4, are very close to the numbers reported for SNLI/MNLI, with labeler agreement even higher than SNLI/MNLI for SINGLE and MULTI.
Crucially, the three MULTI subsets, created using the three variants of the multi-hypothesis generation method, have similar agreement to MNLI, suggesting that producing nine hypotheses for a given premise is feasible. Furthermore, the agreement rates on the medium and hard portions of the subsets are only slightly lower than on the easy portion, with 3-label agreement rates of at least 97.90% (see Table 10 in the Appendix), suggesting that our data in general is of high quality. Agreement is lower for MULTICONSTRAINT, showing that it may be difficult to produce many hypotheses under these constraints.
Table 4 (note): For XNLI (Conneau et al., 2018), the numbers are for the English portion of the dataset, which is the only language that has been relabelled.
In a separate relabeling experiment, we examine the quality of human-translated examples from the XNLI dev set. The results show considerably lower agreement: the majority vote of our five annotators agrees with the XNLI gold label only 67% of the time, as compared to the lowest rate of 88.2% on MULTICONSTRAINT. Additionally, 11.6% of the XNLI dev examples in Chinese contain more than 10 Roman letters, which are extremely rare in original, everyday Chinese speech/text. These results suggest that XNLI is less suitable as a validation set for Chinese NLI, and thus we excluded the XNLI dev set from our evaluation. For further details, see Appendix C.
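A count of this kind can be reproduced with a few lines of code. The sketch below is an assumption about how such a statistic might be computed, not the authors' script: it treats each example as a single string (for instance premise and hypothesis concatenated) and counts only ASCII letters as Roman letters. The second example sentence is invented for illustration.

```python
def count_roman_letters(text):
    """Count ASCII letters (A-Z, a-z) in `text`."""
    return sum(1 for ch in text if ch.isascii() and ch.isalpha())


def share_with_many_roman_letters(examples, threshold=10):
    """Share of examples containing more than `threshold` Roman letters."""
    if not examples:
        return 0.0
    flagged = [ex for ex in examples if count_roman_letters(ex) > threshold]
    return len(flagged) / len(examples)


examples = [
    "昨天是星期天",                  # no Roman letters
    "我在 New York University 读书",  # invented example with 17 Roman letters
]
share = share_with_many_roman_letters(examples)
```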
3.4 The Resulting Corpus
Overall, we have a corpus of more than 56,000 inference pairs in Chinese. We have randomized the total of 6,000 relabeled pairs from MULTIENCOURAGE and MULTICONSTRAINT and used them as the development and test sets, each consisting of 3,000 examples. All pairs from SINGLE and MULTI, plus the remaining 26,211 pairs from MULTIENCOURAGE and MULTICONSTRAINT, are used for the training set, about 50,000 pairs 6 . This split ensures that all labels in the development and test sets have been verified, and that the number of pairs in the easy, medium and hard portions is roughly the same in both sets. It is also closer to a realistic setting where contradictions without negation are much more likely. Pairs that do not receive a majority label in our relabeling experiment are marked with "-" as their label, and can thus be excluded if necessary.
6 We note that given the constraint of having an equal number of easy, medium and hard examples in the dev/test sets, the resulting corpus ended up having high premise overlap between training and dev/test sets, in contrast to the original MNLI design. To ensure that such premise overlap does not bias the current models and inflate performance, we experimented with a smaller non-overlap train and test split, which was constructed by filtering parts of the training set. This led to comparable results, despite the non-overlap split being much smaller in size, which we detail in Appendix G. Both the overlap and non-overlap splits will be released for public use, as well as part of the public leaderboard at https://www.cluebenchmarks.com/nli.html.
4.1 Experimental Setup
To demonstrate the difficulty of our dataset, we establish baselines using several widely-used NLI models tailored to Chinese 7 . This includes the baselines originally used in Williams et al. (2018), such as the continuous bag of words (CBOW) model, the biLSTM encoder model and an implementation of ESIM (Chen et al., 2017) 8 . In each case, we use pre-trained Chinese character embeddings in place of the original GloVe embeddings.
7 Additional details about all of our models and hyperparameters are included as supplementary material.
We also experiment with state-of-the-art pre-trained transformers for Chinese (Cui et al., 2019) using the fine-tuning approach from Devlin et al. (2019). Specifically, we use the Chinese versions of BERT-base (Devlin et al., 2019) and RoBERTa-large with whole-word masking (see details in Cui et al. (2019)). In both cases, we rely on the publicly-available TensorFlow implementation provided in the CLUE benchmark (Xu et al., 2020) 9 . Following Bowman et al. (2020), we also fine-tune hypothesis-only variants of our main models to measure annotation artifacts.
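As described in Section 4.2, the hypothesis-only variants are obtained by replacing every premise in train and test with the same non-word. A minimal sketch of that preprocessing step follows; the field names `premise`, `hypothesis` and `label` and the placeholder string `@@@` are assumptions made for illustration, not the format of the released data.

```python
def to_hypothesis_only(examples, placeholder="@@@"):
    """Return copies of `examples` whose premise is replaced by one fixed non-word."""
    # Each example is copied, so the original data are left untouched.
    return [{**example, "premise": placeholder} for example in examples]


train = [
    {
        "premise": "嗯,今天星期六我们这儿,嗯哼.",
        "hypothesis": "昨天是星期天",
        "label": "contradiction",
    },
]
hypothesis_only_train = to_hypothesis_only(train)
```

Because every premise becomes identical, any accuracy above the majority-class rate must come from cues in the hypothesis alone.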
To measure human performance, we employed an additional set of 5 Chinese native speakers to annotate a sample (300 examples) of our OCNLI test set. This follows exactly the strategy used for measuring human performance in GLUE, and provides a conservative estimate of human performance in that annotators were provided with minimal amounts of task training (see Appendix E for details).
Datasets In addition to experimenting with OCNLI, we also compare the performance of our main models against models fine-tuned on the Chinese training data of XNLI (Conneau et al., 2018) (an automatically translated version of MNLI), as well as combinations of OCNLI and XNLI. The aim of these experiments is to evaluate the relative advantage of automatically translated data. We also compare both models against the CLUE diagnostic test from Xu et al. (2020), which is a set of 514 NLI problems that was annotated by an independent set of Chinese linguists.
To analyze the effect of our different hypothesis elicitation strategies, we look at model performance on different subsets of OCNLI. Due to the way in which the data is partitioned (all of SINGLE and MULTI are in the training set), it is difficult to fine-tune on OCNLI and test on all four subsets. We instead use an XNLI trained model, which is independent of any biases related to our annotation process, to probe the difficulty of our different subsets.
4.2 Baseline Results And Analysis
In this section, we describe our main results.
How Difficult Is OCNLI?
To investigate this, we train/fine-tune all five neural architectures on OCNLI training data and test on the OCNLI test set. The main results are shown in Table 5. All of the non-transformer models perform poorly, while BERT and RoBERTa reach a ∼20 percentage-point advantage over the strongest of these models (ESIM). This shows the relative strength of pre-trained models on our task.
9 See: https://github.com/CLUEbenchmark/CLUE
We find that while transformers strongly outperform other baseline models, our best model, based on RoBERTa, is still about 12 points below human performance on our test data (i.e., 78.2% versus 90.3%). This suggests that models have considerable room for improvement, and provides additional evidence of task difficulty. In comparison, these transformer models reach human-like performance in many of the GLUE (Wang et al., 2018) and SuperGLUE tasks. For NLI specifically, the performance of the English RoBERTa on MNLI is 90.4%, only about 2 percentage points below the human score (Bowman et al., 2020). We see a similar trend for BERT, which is about 18 points behind human performance on OCNLI, whereas the difference is roughly 8 points for MNLI (Devlin et al., 2019). We also see much room for improvement on the CLUE diagnostic task, where our best model achieves only 61.3% (a slight improvement over the result reported in Xu et al. (2020)).
We also looked at how OCNLI fares on hypothesis-only tests, where all premises in train and test are replaced by the same non-word, thus forcing the system to make predictions on the hypothesis only. Table 7 shows the performance of these models on different portions of OCNLI. These results show that our elicitation gives rise to annotation artifacts in a way similar to most benchmark NLI datasets (e.g., OCNLI: ∼66%; MNLI: ∼62% and SNLI: ∼69%, as reported in Bowman et al. (2020) and Poliak et al. (2018), respectively). We specifically found that negative polarity items ("any", "ever"), negators and "only" are among the indicators for contradictions, whereas "at least" biases towards entailments. We see no negators for the MULTICONSTRAINT subset, which shows the effect of putting constraints on the hypotheses that the annotators can produce. Instead, "only" is correlated with contradictions. A more detailed list is shown in Figure 8, listing individual word and label pairs with high pointwise mutual information (PMI). PMI was also used by Bowman et al. (2020) for the English NLI datasets.
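For reference, the PMI between a word w and a label l is log [ p(w, l) / (p(w) p(l)) ], with probabilities estimated from counts over the hypotheses. The sketch below computes this without smoothing; the exact variant used for OCNLI (smoothing, frequency cut-offs, word segmentation) is not specified here, so those choices are assumptions, and the toy hypotheses are invented and already segmented into words.

```python
import math
from collections import Counter


def pmi_table(examples):
    """Map (token, label) to PMI in bits; `examples` holds (tokens, label) pairs."""
    examples = list(examples)
    n = len(examples)
    token_counts, label_counts, joint_counts = Counter(), Counter(), Counter()
    for tokens, label in examples:
        label_counts[label] += 1
        for token in set(tokens):  # count each token once per hypothesis
            token_counts[token] += 1
            joint_counts[(token, label)] += 1
    table = {}
    for (token, label), joint in joint_counts.items():
        p_joint = joint / n
        p_token = token_counts[token] / n
        p_label = label_counts[label] / n
        table[(token, label)] = math.log2(p_joint / (p_token * p_label))
    return table


# Invented, pre-segmented toy hypotheses (C: contradiction, E: entailment, N: neutral).
toy = [
    (["我", "没有", "房"], "C"),
    (["他", "没有", "来"], "C"),
    (["我", "有", "房"], "E"),
    (["他", "来", "了"], "N"),
]
pmi = pmi_table(toy)
```

In the toy data the negator 没有 occurs only in contradictions, so its PMI with the contradiction label is positive (1 bit), while 我 is spread evenly and gets a PMI of 0 with that label.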
Given that the large literature on adversarial filtering and adversarial learning (Belinkov et al., 2019) has been limited to English and to much larger datasets that are easier to filter, we see extending these methods to our dataset and to Chinese as an interesting challenge for future research.
Comparison with XNLI To ensure that our dataset is not easily solved by simply training on existing translations of MNLI, we show the performance of BERT and RoBERTa when trained on XNLI but tested on OCNLI. The results in Table 6 (column XNLI) show a much lower performance than when the systems are trained on OCNLI, even though XNLI contains 8 times more examples 10 . While these results are not altogether comparable, given that the OCNLI training data was generated from the same data sources and annotated by the same annotators (see Geva et al. (2019)), we see these results as noteworthy given that XNLI is currently the largest available multi-genre NLI dataset for Chinese. The results are indicative of the limitations of current models trained solely on translated data. More strikingly, we find that when OCNLI and XNLI are combined for fine-tuning (column Combined in Table 6), this improves performance over the results using XNLI, but reaches lower accuracies than fine-tuning on the considerably smaller OCNLI (except for the diagnostics). Figure 1 shows a learning curve comparing model performance on the independent CLUE diagnostic test. Here we see that the OCNLI model reaches its highest performance at 30,000 examples, while the XNLI model still shows improvements at 50,000 examples. Additionally, OCNLI reaches the same performance as the model fine-tuned on the full XNLI set at around 25,000 examples. This provides additional evidence of the importance of having reliable human annotation for NLI data.
Understanding the OCNLI Subsets To better understand the effect of having three annotator hypotheses per premise, constituting three difficulty levels, and having four elicitation modes, we carried out a set of experiments with XNLI-finetuned models on the different subsets. We used XNLI to avoid imposing specific preferences on the models. Table 9 shows a consistent decrease in accuracy from SINGLE through MULTICONSTRAINT, and a mostly consistent decrease from easy to hard (exception: between easy and medium in MULTI). Both trends suggest that multi-hypothesis elicitation and improved instructions lead to more challenging elicited data.
In this paper, we presented the Original Chinese Natural Language Inference (OCNLI) corpus, the first large-scale, non-translated NLI dataset for Chinese. Our dataset is composed of 56,000 premise-hypothesis pairs, manually created by university students with a background in language studies, using premises from five genres and an enhanced protocol from the original MNLI annotation scheme. Results using BERT and RoBERTa show that our dataset is challenging for the current best pre-trained transformer models, the best of which is ∼ 12 percentage-points below human performance. We also demonstrate the relative advantage of using our human constructed dataset over machine translated NLI such as XNLI.
To encourage more progress on Chinese NLU, we are making our dataset publicly available for the research community at https://github.com/CLUEbenchmark/OCNLI and will be hosting a leaderboard in the Chinese Natural Language Understanding (CLUE) (Xu et al., 2020) benchmark (https://www.cluebenchmarks.com/nli.html).
Given the wide impact that large-scale NLI datasets, such as SNLI and MNLI, have had on recent progress in NLU for English, we hope that our resource will likewise help accelerate progress on Chinese NLU. In addition to making more progress on Chinese NLI, future work will also focus on using our dataset for Chinese model probing (e.g., building on work such as Warstadt et al. (2019); Jeretic et al. (2020)) and sentence representation learning (Reimers and Gurevych, 2019), as well as for investigating bias-reduction techniques (Clark et al., 2019; Belinkov et al., 2019) for languages other than English.
B Relabeling Results For Different Portions
C Relabeling Results For XNLI Development Set
For this experiment, we follow the same procedure as the relabeling experiment for OCNLI data. We randomly selected 200 examples from XNLI dev, and mixed them with 200 examples from our SINGLE subset (which had already been relabelled) for another group of annotators to label. The labelers for these 400 pairs were undergraduate students who did not participate in hypothesis generation, so as to avoid biasing towards our data. The labeling results for XNLI are presented in Table 11. For only 67% of the 200 pairs does the label from our annotators match the label given in XNLI dev. 8.5% of the pairs are considered to be irrelevant by the majority of our annotators. As we mentioned in the introduction, there are other issues with XNLI, such as the presence of many Roman letters (867 (11.56%) examples in XNLI dev have more than 10 Roman letters), which prevent us from using it as proper evaluation data for Chinese NLI.
D Model Details And Hyper-Parameters
We experimented with the following models:
• Continuous bag-of-words (CBOW), where each sentence is represented as the sum of its Chinese character embeddings, which is then passed on to a 3-layer MLP.
• Bi-directional LSTM (biLSTM), where the sentences are represented as the average of the states of a bidirectional LSTM.
• Enhanced Sequential Inference Model (ESIM), which is MNLI's implementation of the ESIM model (Chen et al., 2017).
• BERT base for Chinese (BERT), which is a 12-layer transformer model with a hidden size of 768, pre-trained with 0.4 billion tokens of the Chinese Wikipedia dump (Devlin et al., 2019). We use the implementation from the CLUE benchmark (Xu et al., 2020) 11.
• RoBERTa large pre-trained with whole word masking (wwm) and extended (ext) data (RoBERTa), which is based on RoBERTa and has 24 layers with a hidden size of 1024, pre-trained with 5.4 billion tokens and released by Cui et al. (2019). We use the implementation from the CLUE benchmark.
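To make the two simplest sentence encoders in the list above concrete, the sketch below shows only the pooling step: summing character embeddings (CBOW) and averaging a sequence of state vectors (biLSTM). It is an illustration in plain Python with toy vectors, not the MNLI baseline code; the function names, the handling of unknown characters, and the embedding values are assumptions, and the MLP and LSTM themselves are omitted.

```python
def cbow_sentence_vector(sentence, embeddings, dim):
    """Sum the embeddings of the characters in `sentence`.

    `embeddings` maps a character to a list of `dim` floats.
    Characters without an embedding are skipped (an assumption)."""
    total = [0.0] * dim
    for ch in sentence:
        vec = embeddings.get(ch)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return total


def average_states(states):
    """Average a non-empty list of equally sized state vectors,
    as in the biLSTM sentence representation."""
    dim = len(states[0])
    return [sum(state[i] for state in states) / len(states) for i in range(dim)]


# Toy 2-dimensional character embeddings, for illustration only.
toy_embeddings = {"我": [1.0, 2.0], "们": [0.5, -1.0]}
sentence_vector = cbow_sentence_vector("我们", toy_embeddings, dim=2)
```

Summation makes the CBOW representation insensitive to character order, which is one reason it serves as a weak baseline next to the sequence models.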
For CBOW, biLSTM and ESIM, we use Chinese character embeddings from 12, and modify the implementation from MNLI 13 to work with Chinese. Our BERT and RoBERTa models are both fine-tuned for 3 epochs with a learning rate of 2e-5 and a batch size of 32. Our hyper-parameters deviate slightly from those used in CLUE and Cui et al. (2019) 14, because we found them to perform better when tuned against our dev sets (as opposed to XNLI or the machine-translated CMNLI in CLUE).
E Determining Human Baselines
We follow the procedures in prior work to obtain conservative human baselines on OCNLI. Specifically, we first prepared 20 training examples from OCNLI.train and instructions similar to those in the relabeling experiment. We then asked 5 undergraduate students who did not participate in any part of our previous experiments to perform the labeling. They were first provided with the instructions as well as the 20 training examples, which they were asked to label after reading the instructions. They were then given the answers and explanations for the training examples. Finally, they were given a random sample of 300 examples from the OCNLI test set for labeling. We computed the majority label from their annotations and compared it against the gold label in OCNLI.test to obtain the accuracy. For pairs with no majority label, we use the most frequent label in OCNLI.test (neutral), following the same prior work. There are only 2 (0.7%) such cases. The results are presented in Table 12.
We performed the same experiment with 5 linguistics PhDs, who are already familiar with the NLI task from their research, and thus their results may be biased. We see a higher 5-label agreement and similar accuracy compared against the gold label of OCNLI.test. We use the score from the undergraduate students as our human baseline, as it is the "unbiased" score obtained with the same procedure as the prior work whose protocol we follow.
The human score of OCNLI is similar to that of MNLI (92.0%/92.8% for matched and mismatched respectively).
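The majority-vote scoring described in this appendix can be sketched as follows. This is a minimal illustration, assuming that "majority" means a label chosen by more than half of the annotators (at least 3 of 5); the function names and the toy labels are hypothetical.

```python
from collections import Counter


def majority_label(labels, fallback="neutral"):
    """Return the label chosen by more than half of the annotators.

    If no label reaches a strict majority, return `fallback`, which
    stands in for the most frequent label of the test set."""
    label, votes = Counter(labels).most_common(1)[0]
    if votes > len(labels) / 2:
        return label
    return fallback


def human_accuracy(annotations, gold_labels, fallback="neutral"):
    """Compare the per-example majority label against the gold labels."""
    predicted = [majority_label(labels, fallback) for labels in annotations]
    correct = sum(p == g for p, g in zip(predicted, gold_labels))
    return correct / len(gold_labels)


# Toy example with 5 annotators per pair.
annotations = [
    ["entailment", "entailment", "neutral", "entailment", "contradiction"],
    ["entailment", "entailment", "neutral", "neutral", "contradiction"],
]
gold = ["entailment", "contradiction"]
accuracy = human_accuracy(annotations, gold)
```

The second toy pair has a 2-2-1 split, so it receives the fallback label, which mirrors the handling of the 2 no-majority cases reported above.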
F More Examples From OCNLI
We present more OCNLI pairs in Table 13.
12 https://github.com/Embedding/Chinese-Word-Vectors
13 https://github.com/NYU-MLL/multiNLI
14 https://github.com/ymcui/Chinese-BERT-wwm/
G Filtering Training Data
To mimic the MNLI setting where the training data and the evaluation data (dev/test) have no overlapping premises, we filtered out the pairs in the current training set whose premise can also be found in evaluation. This means the removal of about 20k pairs in OCNLI.train, and results in a new training set which we call OCNLI.train.small, while the development and test sets remain the same. We fine-tune the biLSTM, BERT and RoBERTa models on the new, filtered training data, and the results are presented in Table 14 . We observe that our models have a 1.5-2.5% drop in performance when trained with the filtered training data. Note that OCNLI.train.small is only 60% of OCNLI.train in size, so we consider this drop to be moderate and expected, and more likely the result of reduced training data, rather than the removal of overlapping premises.
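The premise-overlap filter described above can be sketched in a few lines. The example assumes that each pair is a dictionary with a `premise` field and that overlap means exact string equality; both are assumptions for illustration and may not match the actual OCNLI file format or the matching criterion used.

```python
def filter_overlapping_premises(train_pairs, eval_pairs):
    """Drop every training pair whose premise also occurs in the
    evaluation data (dev and test combined)."""
    eval_premises = {pair["premise"] for pair in eval_pairs}
    return [pair for pair in train_pairs if pair["premise"] not in eval_premises]


# Toy data: the first training premise also appears in evaluation.
train = [
    {"premise": "他昨天去了北京", "hypothesis": "他去过北京", "label": "entailment"},
    {"premise": "会议推迟到下周", "hypothesis": "会议取消了", "label": "contradiction"},
]
evaluation = [
    {"premise": "他昨天去了北京", "hypothesis": "他没出过门", "label": "contradiction"},
]
train_small = filter_overlapping_premises(train, evaluation)
```

Collecting the evaluation premises into a set keeps each membership check constant-time on average, so the filter makes a single pass over the training data.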