CLUE: A Chinese Language Understanding Evaluation Benchmark

Authors

  • Liang Xu
  • Xuanwei Zhang
  • Lu Li
  • Hai Hu
  • Chenjie Cao
  • Weitang Liu
  • Junyi Li
  • Yudong Li
  • Kai Sun
  • Yechen Xu
  • Yiming Cui
  • Cong Yu
  • Qianqian Dong
  • Yingtao Tian
  • Dian Yu
  • Bo Shi
  • Jun-jie Zeng
  • Rongzhao Wang
  • Weijian Xie
  • Yanting Li
  • Yina Patterson
  • Zuoyu Tian
  • Yiwen Zhang
  • He Zhou
  • Shaoweihua Liu
  • Qipeng Zhao
  • Cong Yue
  • Xinru Zhang
  • Z. Yang
  • Kyle Richardson
  • Zhenzhong Lan
  • COLING
  • 2020

Abstract

The advent of natural language understanding (NLU) benchmarks for English, such as GLUE and SuperGLUE, allows new NLU models to be evaluated across a diverse set of tasks. These comprehensive benchmarks have facilitated a broad range of research and applications in natural language processing (NLP). The problem, however, is that most such benchmarks are limited to English, which has made it difficult to replicate many of the successes in English NLU for other languages. To help remedy this issue, we introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark. CLUE is an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text. To establish results on these tasks, we report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models (9 in total). We also introduce a number of supplementary datasets and additional tools to help facilitate further progress on Chinese NLU. Our benchmark is released at https://www.cluebenchmarks.com

1 Introduction

Full-network pre-training methods such as BERT (Devlin et al., 2019) and their improved versions (Yang et al., 2019; Lan et al., 2019) have led to significant performance boosts across many natural language understanding (NLU) tasks. One key driving force behind such improvements and rapid iterations of models is the general use of evaluation benchmarks. These benchmarks use a single metric to evaluate the performance of models across a wide range of tasks. However, existing language evaluation benchmarks are mostly in English, e.g., GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). To the best of our knowledge, there is no general language understanding evaluation benchmark for Chinese, a language whose speakers account for one-fourth of the world's population and which is one of the official languages of the United Nations. Chinese is also linguistically very different from English and other Indo-European languages, which necessitates an evaluation benchmark specifically designed for Chinese. Without such a benchmark, it is difficult for researchers in the field to gauge how good their Chinese language understanding models are.

To address this problem and facilitate research on the Chinese language, we introduce a comprehensive Chinese Language Understanding Evaluation (CLUE) benchmark that contains a collection of nine different natural language understanding tasks, including semantic similarity, natural language inference, short text classification, long text classification with a large number of classes, and different types of machine reading comprehension. To better understand the challenges posed by these tasks, we evaluate them using several popular pre-trained language understanding models for Chinese. Overall, we find that these tasks display different levels of difficulty, manifested in the different accuracies achieved across models as well as in the gap between human and machine performance.

The size and quality of unlabeled corpora play an essential role in language model pre-training (Devlin et al., 2019; Yang et al., 2019; Lan et al., 2019). There are already popular pre-training corpora for English, such as Wikipedia and the Toronto Book Corpus. However, we are not aware of any large-scale, open-source pre-training dataset in Chinese, and different Chinese models are trained on different and relatively small corpora. Therefore, it is difficult to compare performance across model architectures. This difficulty motivates us to construct and release a standard CLUE pre-training dataset: a corpus with over 214 GB of raw text and roughly 76 billion Chinese words. We also introduce a diagnostic dataset hand-crafted by linguists. Similar to GLUE, this dataset is designed to highlight common linguistic knowledge and logical operators that we expect models to handle well.

Overall, we present in this paper: (1) a Chinese natural language understanding benchmark that covers a variety of sentence classification and machine reading comprehension tasks, at different levels of difficulty, in different sizes and forms; (2) a large-scale raw corpus for general-purpose pre-training in Chinese, so that comparisons across different model architectures are as meaningful as possible; (3) a diagnostic evaluation dataset developed by linguists that covers multiple linguistic phenomena, some of which are unique to Chinese; and (4) a user-friendly toolkit, as well as an online leaderboard with an auto-evaluation system, supporting all our evaluation tasks and models, with which researchers can reproduce experimental results and easily compare the performance of different submitted models.

2 Related Work

It has been common practice to evaluate language representations on different intrinsic and downstream linguistic tasks. For example, Mikolov et al. (2013) measure word embeddings through semantic and syntactic analogy tasks. Pennington et al. (2014) further expand the testing set to include other word similarity and named entity recognition tasks. A similar pattern has emerged for sentence representations. However, as different researchers use different evaluation pipelines on different datasets, results reported in different papers are not always fully comparable. Especially when the datasets are small, a minor change in evaluation can lead to big differences in outcomes.

SentEval (Conneau and Kiela, 2018) addresses this problem by introducing a standard evaluation pipeline over a set of popular sentence embedding evaluation datasets. GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) further improve on SentEval in the pursuit of consistently reported results for natural language understanding tasks, introducing a set of more difficult datasets and a model-agnostic evaluation pipeline. Along with other reading comprehension tasks like SQuAD (Rajpurkar et al., 2016) and RACE (Lai et al., 2017), GLUE and SuperGLUE have become standard testing benchmarks for full-network pre-training methods such as BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2019).

We believe a similar problem exists in Chinese language understanding evaluation. Although more and more Chinese linguistic tasks (Liu et al., 2018; Cui et al., 2019b) have been proposed, there is still a need for a standard evaluation pipeline and an evaluation benchmark with a set of diverse and difficult language understanding tasks.

3 Clue Overview

CLUE consists of 1) nine language understanding tasks in Chinese, 2) a large-scale raw dataset for pre-training and a small hand-crafted diagnostic dataset for linguistic analysis, and 3) a ranking system, a leaderboard and a toolkit.

3.1 Task Selection

For this benchmark, we selected nine different tasks to make sure it tests different aspects of pre-trained models. To ensure the quality and coverage of the language understanding tasks, we select tasks using the following criteria:

Diversity The tasks in CLUE should vary in terms of task type, text length, the type of understanding required, the number of training examples, etc.

Well-defined and easy-to-process We select tasks that are well-defined, and we pre-process them for our users so that they can focus on modeling.

Moderately difficult: challenging but solvable To be included in CLUE, a task should not be too simple or already solved, so as to encourage researchers to design better models (e.g., the multiple-choice machine reading comprehension task).

Representative and useful Our tasks should be representative of common language understanding tasks and easily applicable to real-world situations (e.g., a classification task with many labels, or a semantic similarity task).

Tailored to Chinese-specific characteristics Ideally, tasks should measure the ability of models to handle Chinese-specific linguistic phenomena (e.g., four-character idioms).

Following the above criteria, we select nine tasks covering a wide range of abilities for Chinese. These include three single-sentence tasks, three sentence-pair tasks, and three machine reading comprehension tasks.

3.2 Large-Scale Pre-Training Dataset

We collect data from the internet and preprocess it to make a large pre-training dataset for Chinese language processing researchers. In the end, a total of 214 GB of raw text with around 76 billion Chinese words is collected in our pre-training corpus (see Section 5 for details).

3.3 Diagnostic Dataset

In order to measure how well models handle specific language understanding phenomena, we hand-craft a diagnostic dataset that covers nine linguistic and logical phenomena (details in Section 7).

3.4 Leaderboard

We also provide a leaderboard for users to submit their own results on CLUE. The evaluation system will give final scores for each task when users submit their predicted results. To encourage reproducibility, we mark the score of a model as "certified" if it is open-source, and we can reproduce the results.

3.5 Toolkit

To make the CLUE benchmark easier to use, we also offer a toolkit named PyCLUE implemented in TensorFlow (Abadi et al., 2016). PyCLUE supports mainstream pre-training models and a wide range of target tasks. Different from existing pre-training model toolkits (Wolf et al., 2019; Zhao et al., 2019), PyCLUE is designed with the goal of quickly validating model performance on the CLUE benchmark. We implement many pre-trained model baselines on the CLUE benchmark and provide interfaces to support the evaluation of users' custom models.
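To illustrate the kind of workflow such a toolkit is meant to shorten, the sketch below fine-tunes a Chinese pre-trained model on a CLUE-style single-sentence classification task with the Hugging Face transformers library. This is not the PyCLUE API; the model name, file path, and JSON field names are illustrative assumptions.

```python
# A minimal sketch (not the actual PyCLUE API) of fine-tuning a Chinese
# pre-trained model on a CLUE-style single-sentence classification task.
# The model name, file path, and JSON field names are assumptions.
import json
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=15)  # e.g., 15 classes for TNEWS

def load_examples(path):
    # One JSON object per line, e.g. {"sentence": "...", "label": 3}.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def collate(batch):
    enc = tokenizer([ex["sentence"] for ex in batch], padding=True,
                    truncation=True, max_length=128, return_tensors="pt")
    enc["labels"] = torch.tensor([ex["label"] for ex in batch])
    return enc

train_loader = DataLoader(load_examples("tnews_train.json"), batch_size=32,
                          shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in train_loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```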

4 Tasks

CLUE has nine Chinese NLU tasks, covering single-sentence classification, sentence-pair classification, and machine reading comprehension. Descriptions and statistics of these tasks are given in Table 1, and development set examples are shown in Table 5 in the Appendix.

Table 1: Task descriptions and statistics. TNEWS has 15 classes; IFLYTEK has 119 classes; all other classification tasks are binary classification.
Table 5: Development set examples from the tasks in CLUE. Bold text represents part of the example format for each task. Chinese text is part of the model input, and the corresponding text in italics is the English version translated from that. Underlined text is specially marked in the input. Text in a monospaced font represents the expected model output.

4.1 Single Sentence Tasks

TNEWS TouTiao Text Classification for News Titles 2 consists of Chinese news published by TouTiao before May 2018, with a total of 73,360 titles. Each title is labeled with one of 15 news categories (finance, technology, sports, etc.) and the task is to predict which category the title belongs to. To make the dataset more discriminative, we use cross-validation to filter out some of the easy examples (see Section D Dataset Filtering in the Appendix for details). We then randomly shuffle and split the whole dataset into a training set, development set and test set.

IFLYTEK IFLYTEK (IFLYTEK CO., 2019) contains 17,332 app descriptions. The task is to assign each description into one of 119 categories, such as food, car rental, education, etc. A data filtering technique similar to the one used for the TNEWS dataset has been applied.

CLUEWSC2020 The Chinese Winograd Schema Challenge dataset 3 is an anaphora/coreference resolution task in which the model is asked to decide whether a pronoun and a noun (phrase) in a sentence co-refer (binary classification). It is built following similar datasets in English (e.g., Levesque et al., 2012; Wang et al., 2019). Sentences in the dataset are hand-picked from 36 contemporary literary works in Chinese, and their anaphora relations are then hand-annotated by linguists, amounting to 1,838 questions in total. Details of the dataset will be updated at https://www.cluebenchmarks.com/.

4.2 Sentence Pair Tasks

Tasks in this section ask a model to predict relations between sentence pairs, or abstract-keyword pairs.

AFQMC The Ant Financial Question Matching Corpus 4 comes from the Ant Technology Exploration Conference (ATEC) developer competition. It is a binary classification task that aims to predict whether two sentences are semantically similar.

CMNLI Chinese Multi-Genre Natural Language Inference is a three-way sentence-pair classification task: given a premise and a hypothesis, the model predicts whether the hypothesis is entailed by, contradicts, or is neutral with respect to the premise. Its training data are translated into Chinese from the English MNLI dataset (see Section 7.2).

CSL The Chinese Scientific Literature dataset contains Chinese paper abstracts and their keywords from core Chinese journals, covering multiple fields of natural and social sciences. We generate fake keywords through tf-idf and mix them with the real keywords. Given an abstract and some keywords, the task is to tell whether all of the keywords are original keywords of the paper. It mainly evaluates the ability of models to judge whether keywords summarize the document.
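The paper states only that fake keywords are generated through tf-idf; the sketch below is one plausible reading of that step, in which high tf-idf terms of an abstract that are not among its gold keywords serve as distractors. The function name, the assumption that abstracts are already word-segmented, and the number of fakes are all illustrative.

```python
# A hedged sketch of tf-idf-based distractor ("fake keyword") generation
# for a CSL-style task. Assumes abstracts are already word-segmented
# (space-separated), since TfidfVectorizer splits on whitespace-delimited
# word boundaries; the exact CSL construction procedure may differ.
from sklearn.feature_extraction.text import TfidfVectorizer

def fake_keywords(abstracts, real_keywords, n_fake=2):
    # abstracts: list of segmented abstract strings.
    # real_keywords: list of sets of gold keywords, one set per abstract.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(abstracts)
    vocab = vectorizer.get_feature_names_out()
    fakes = []
    for i, gold in enumerate(real_keywords):
        row = tfidf[i].toarray().ravel()
        # Rank this abstract's terms by tf-idf and keep top non-gold terms.
        ranked = [vocab[j] for j in row.argsort()[::-1] if row[j] > 0]
        fakes.append([w for w in ranked if w not in gold][:n_fake])
    return fakes
```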

4.3 Machine Reading Comprehension

CMRC 2018 CMRC 2018 (Cui et al., 2019b) is a span-extraction dataset for Chinese machine reading comprehension. It contains about 19,071 human-annotated questions over Wikipedia paragraphs. Each sample consists of a context, a question, and the corresponding answer, where the answer is a text span in the context.

ChID ChID (Zheng et al., 2019) is a large-scale Chinese IDiom cloze test dataset, which contains about 498,611 passages with 623,377 blanks, collected from news, novels, and essays. The candidate pool contains 3,848 Chinese idioms. For each blank in a passage, there are ten candidate idioms: one gold option, several similar idioms, and the rest randomly chosen from the dictionary.

C3 C3 (Sun et al., 2019b) is the first free-form multiple-choice machine reading comprehension dataset for Chinese. Given a document, either a dialogue or a more formally written mixed-genre text, and a free-form question that is not limited to a single question type (e.g., yes/no questions), the task is to select the correct answer from the (2 to 4) options associated with the question. We employ all of the 19,577 general-domain problems for 13,369 documents and follow the original data split. These problems are collected from language exams carefully designed by educational experts to evaluate the reading comprehension ability of language learners, similar to its English counterparts RACE (Lai et al., 2017) and DREAM (Sun et al., 2019a).

5 Pre-Training Dataset

Large-scale language data is a prerequisite for model pre-training. Corpora of various sizes have been compiled and utilized in English, e.g., the Wikipedia Corpus, the BooksCorpus, and the more recent C4 corpus (Raffel et al., 2019).

For Chinese, however, existing public pre-training datasets are much smaller than the English datasets. For example, the Chinese Wikipedia dataset only contains around 1.1 GB of raw text. We thus collect a large-scale, clean, crawled Chinese corpus to fill this gap.

A total of 214 GB of raw text with around 76 billion words is collected. It consists of three parts: CLUECorpus2020-small, CLUECorpus2020, and CLUEOSCAR.

CLUECorpus2020-small This part contains 14 GB of Chinese text, consisting of four sub-parts: News, WebText, Wikipedia and Comments. The details are as follows:

• News This sub-corpus is crawled from We Media (self-media) platforms, with a total of 2.5 million news articles from roughly 63K sources. It is around 8 GB of raw text with 3 billion Chinese words.

• Wikipedia This sub-corpus comes from the Chinese Wikipedia dump 6 , containing around 1.1 GB of raw text with 0.4 billion Chinese words on a wide range of topics.

• Comments These comments are collected from E-commerce websites including Dianping.com and Amazon.com by SophonPlus 5 . This subset has approximately 2.3 GB of raw texts with 0.8 billion Chinese words.

CLUECorpus2020 (Xu et al., 2020) This part contains 100 GB of Chinese raw text retrieved from Common Crawl. It is a well-defined dataset that can be used directly for pre-training without additional pre-processing. CLUECorpus2020 contains around 29K separate files, each following the pre-training format of the training set.

CLUEOSCAR OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. Its Chinese portion contains 250 GB of raw text. We do further filtering and finally obtain 100 GB of Chinese text.

6 Experiments

6.1 Baselines

For baselines, we evaluate a variety of pre-trained models on the CLUE tasks. We implement models in both TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019). The original code for these baselines will be made available in our GitHub repository.

Architecture Our baselines take a pre-trained model and fine-tune it on each CLUE task with one additional output layer. For single-sentence tasks, we encode the sentence and then pass the pooled output to a classifier. For sentence-pair tasks, we encode the sentence pair with a separator and then pass the pooled output to a classifier. For machine reading comprehension, the extraction-style and multiple-choice-style tasks are handled differently: for the former, we use two fully connected layers on top of the encoder output to predict the start and end positions of the answer; for the latter, we encode each candidate-context pair, pass the pooled outputs through a shared classifier, and obtain a score per candidate.
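The sketch below shows the three kinds of output heads described above on top of a generic Hugging Face-style encoder; it is an illustration of the baseline architecture, not the authors' exact implementation.

```python
# PyTorch sketch of the three baseline output heads; `encoder` is assumed
# to be a Hugging Face-style model exposing pooler_output / last_hidden_state.
import torch.nn as nn

class ClassificationHead(nn.Module):
    # Single-sentence and sentence-pair tasks: pooled output -> classifier.
    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        pooled = self.encoder(input_ids, attention_mask=attention_mask).pooler_output
        return self.classifier(pooled)                        # [batch, num_labels]

class SpanExtractionHead(nn.Module):
    # Extraction-style MRC: per-token logits for answer start and end.
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.start = nn.Linear(hidden_size, 1)
        self.end = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        tokens = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.start(tokens).squeeze(-1), self.end(tokens).squeeze(-1)

class MultipleChoiceHead(nn.Module):
    # Multiple-choice MRC: each (context, candidate) pair is encoded and
    # scored by a shared classifier; the highest-scoring candidate wins.
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # input_ids, attention_mask: [batch, num_choices, seq_len]
        b, c, L = input_ids.shape
        pooled = self.encoder(input_ids.view(b * c, L),
                              attention_mask=attention_mask.view(b * c, L)).pooler_output
        return self.scorer(pooled).view(b, c)                 # [batch, num_choices]
```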

Models We evaluate CLUE on the following publicly available pre-trained models:

• BERT-base, a BERT model with a 12-layer Transformer (Vaswani et al., 2017) and a hidden size of 768. It is trained on the Chinese Wikipedia dump with about 0.4 billion tokens and published by Devlin et al. (2019).

• BERT-wwm-ext-base, a model with the same configuration as BERT-base, except that it uses whole word masking and is trained on an additional 5 billion tokens (Cui et al., 2019a).

• ALBERT-tiny. ALBERT (Lan et al., 2019) is a state-of-the-art language representation model, but at the time of writing its authors had not published Chinese versions. Due to limited computational resources, we train a tiny version of ALBERT with only 4 layers and a hidden size of 312 on the CLUE pre-training corpus.

• ERNIE-base (Sun et al., 2019c) extends BERT-base with additional training data and leverages knowledge from Knowledge Graphs.

• XLNet-mid 7 , a model with 24 layers and a hidden size of 768, using a SentencePiece tokenizer and other techniques from Yang et al. (2019).

• RoBERTa-large uses a 24-layer Transformer (Vaswani et al., 2017) with a hidden size of 1024, trained on the CLUE pre-training corpus with a sequence length of 256. Its training procedure is similar to that of RoBERTa (Liu et al., 2019).

• RoBERTa-wwm-ext-base (Cui et al., 2019a) uses a 12-layer Transformer (Vaswani et al., 2017) with a hidden size of 768. It uses whole word masking and is trained on the same data as BERT-wwm-ext-base but follows the RoBERTa training procedure.

• RoBERTa-wwm-ext-large (Cui et al., 2019a) has the network structure of RoBERTa-large and the training procedure of RoBERTa-wwm-ext-base.

Table 2: Performance of baseline models on the CLUE benchmark. For results of newly submitted models, including NEZHA-large (Huawei Noah's Ark Lab), ALBERT-xxlarge (Alibaba PAI) and UER (UER), check the leaderboard (http://www.CLUEbenchmark.com). Avg is the average over all tasks. Bold text denotes the best result in each column. Underline indicates the best result among the models. We report EM for CMRC 2018 and accuracy for all other tasks.
Table 3: Two-stage human performance scores and the best accuracy of models comparison. “avg” denotes the mean score from the three annotators. “majority” shows the performance if we take the majority vote from the labels given by the annotators. Bold text denotes the best result among human and model performance.
Table 4: The CLUE diagnostics: Example test items in 9 linguistic categories, with their gold labels and model predictions, as well as model accuracy. E = entailment, N = neutral, C = contradiction. BE = BERT-base, RO = RoBERTa-wwm-ext-large, XL = XLNet-mid.

More details of these models can be found in Table 6 in the Appendix and in the corresponding papers.

Table 6: Parameters for pre-training. “BERT-base” is released by Google (Devlin et al., 2019). “WWM” stands for whole word masking. “ext” stands for extended data; different models may use different extended data. “∼BERT” means similar to Google's Chinese BERT.

Fine-tuning We fine-tune the pre-trained models separately for each task. Hyperparameters are chosen based on the performance of each model on the development set. We also use early stopping to select the best checkpoint.
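A small sketch of the checkpoint-selection strategy just described, with hypothetical train_epoch_fn and dev_score_fn helpers: fine-tune for several epochs, evaluate on the development set after each one, and keep the best checkpoint (early stopping).

```python
# Early stopping on development-set performance; the training and
# evaluation helpers are placeholders for a task-specific loop.
import copy

def fine_tune_with_early_stopping(model, train_epoch_fn, dev_score_fn,
                                  max_epochs=10, patience=2):
    best_score, best_state, since_best = float("-inf"), None, 0
    for epoch in range(max_epochs):
        train_epoch_fn(model)            # one pass over the training data
        score = dev_score_fn(model)      # e.g., dev accuracy or EM
        if score > best_score:
            best_score = score
            best_state = copy.deepcopy(model.state_dict())
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:   # stop when dev score stops improving
                break
    model.load_state_dict(best_state)    # restore the best checkpoint
    return model, best_score
```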

6.2 Human Performance

Human performance numbers for CMRC 2018, ChID and C3 are provided in the original papers (Sun et al., 2019b; Cui et al., 2019b; Zheng et al., 2019). For the remaining tasks in CLUE, we ask human annotators to label 100 randomly chosen items from the test set and compare the annotators' majority vote against the gold labels.

We are first interested in how human annotators would perform if they have only seen the task instructions and very few pieces of labeled data, i.e. no training on the task. The results are shown in the upper half of Table 3 . We can see that the performance of our annotators is not very satisfactory when they have little or no training.

Next, we follow the procedure in SuperGLUE (Wang et al., 2019) to train the annotators before asking them to work on the test data. Specifically, each annotator is first asked to annotate 30 to 50 pieces of data from the development set and then compare their labels with the gold ones. They are encouraged to discuss their mistakes and questions with other annotators until they are confident about the task. They then annotate 100 pieces of test data, which are used to compute our final human performance, shown in the lower half of Table 3 and the last row of Table 2. We observe increases in accuracy ranging from 8.0% (AFQMC) to 19.5% (CSL). We discuss human performance in light of the models' performance in the next section.
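For concreteness, the snippet below shows how majority-vote human performance can be computed from per-item annotator labels; the data layout (three labels per item plus a gold label) is an assumption for illustration.

```python
# Majority-vote accuracy over items labeled by several annotators.
from collections import Counter

def majority_vote_accuracy(annotations, gold):
    # annotations: per-item lists of annotator labels, e.g. [["1", "1", "0"], ...]
    # gold: list of gold labels, one per item.
    correct = 0
    for labels, g in zip(annotations, gold):
        majority_label, _ = Counter(labels).most_common(1)[0]
        correct += int(majority_label == g)
    return correct / len(gold)

# Three annotators, three items: items 1 and 2 are right by majority vote.
print(majority_vote_accuracy(
    [["1", "1", "0"], ["0", "0", "0"], ["1", "0", "0"]],
    ["1", "0", "1"]))  # -> 0.666...
```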

6.3 Benchmark Results

Each model is fine-tuned three times, and the test results come from the run that performs best on the development set. Results are shown in Table 2.

Model Performance Analysis The first thing we notice is that larger pre-trained models, more pre-training data, and whole word masking all lead to better results. Specifically, RoBERTa-wwm-ext-large outperforms the other models on all tasks, particularly on machine reading tasks such as C3.

Next, we want to highlight the results of ALBERT-tiny, which has fewer than 1/20 of the parameters of the BERT-base model. Our results suggest that for single-sentence or sentence-pair tasks, the performance of small models is not far behind that of much larger models. However, for tasks involving more global understanding, small model size seriously hurts the results, as illustrated by ALBERT-tiny's low accuracy on all three machine reading tasks.

It should be noted that XLNet-mid, which is based on SentencePiece (Kudo and Richardson, 2018), a common unsupervised text tokenizer for English, performs poorly on token-level Chinese tasks such as span-extraction MRC (CMRC 2018). This highlights the gap between tokenizers designed for English and those needed for Chinese, since there are no natural spaces between words in Chinese text.
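One way this gap manifests is span alignment: if a subword tokenizer merges characters across the boundary of a gold answer, no sequence of whole tokens reproduces the answer exactly, which hurts exact-match scores. The tokenizations below are hypothetical and only meant to illustrate the effect.

```python
# Illustration (hypothetical tokenizations, not XLNet's actual vocabulary)
# of why subword tokenization can make a gold answer span unrecoverable.

def char_span_recoverable(tokens, answer):
    # tokens: tokenization of the context whose concatenation is the context.
    # Returns True if the answer's character span starts and ends on token
    # boundaries, i.e., the answer can be predicted as a whole-token span.
    context = "".join(tokens)
    start = context.find(answer)
    if start < 0:
        return False
    end = start + len(answer)
    boundaries, offset = {0}, 0
    for tok in tokens:
        offset += len(tok)
        boundaries.add(offset)
    return start in boundaries and end in boundaries

context_chars = list("北京是中华人民共和国的首都")          # character-level tokens
context_subwords = ["北京", "是", "中华人民共和国", "的首都"]  # hypothetical subwords
answer = "首都"  # "the capital"
print(char_span_recoverable(context_chars, answer))     # True
print(char_span_recoverable(context_subwords, answer))  # False: answer sits inside "的首都"
```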

Analysis of Tasks Humans are very accurate in multiple-choice reading comprehension (C3), whereas machines struggle with it (ALBERT-tiny has a very low accuracy of about 32%, probably due to the small size of the model). The situation is similar for CLUEWSC2020, where the best model score is far behind human performance. Note that in SuperGLUE, RoBERTa did very well on the English WSC (89% against 100% for humans), whereas in our case the performance of the RoBERTa variants is still much lower than the average human performance, though better than that of the other models. On the other hand, tasks such as CSL and ChID seem to be of equal difficulty for humans and machines, with accuracies in the 80s for both. For humans, the keyword judgment task (CSL) is hard because the fake keywords all come from the abstract of the journal article, which contains many technical terms: annotators are unlikely to perform well when working with unfamiliar jargon. Surprisingly, the hardest dataset for both humans and machines is a single-sentence task: TNEWS. One possible reason is that news titles can fall under multiple categories (e.g., finance and technology) at the same time, while there is only one gold label in TNEWS.

The best machine results remain far below human performance, 11.6 points lower on average. This leaves considerable room for improvement and indicates that solving CLUE is worthwhile as a driving force for the development of future models and methods.

7 Diagnostic Dataset for CLUE

7.1 Dataset Creation

In order to examine whether trained models can master linguistically important and meaningful phenomena, we follow GLUE (Wang et al., 2018) and provide a diagnostic dataset, set up as a natural language inference task in which the model predicts whether a hypothesis is entailed by, contradicts, or is neutral with respect to a given premise. Crucially, we did not translate the English diagnostics into Chinese, as the items in that dataset may be specific to the English language or to American/Western culture. Instead, we had several Chinese linguists hand-craft 514 sentence pairs in idiomatic Chinese from scratch. These pairs cover nine linguistic phenomena and are manually labeled by the same group of linguists. We ensured that the labels are balanced (the majority baseline is 35.1%). Examples are shown in Table 4. Some of the categories directly address the unique linguistic properties of Chinese. For instance, items in the "Time of event" category test models on their ability to handle aspect markers such as 着 (imperfective marker), 了 (perfective marker), and 过 (experiential marker), which convey information about the time of an event, whether it is happening now or has already happened in the past. We believe that for a model to make robust inferences, it needs to understand such uniquely Chinese phenomena and also possess other important linguistic abilities, such as handling anaphora resolution (Webster et al., 2018) and monotonicity reasoning (Yanaka et al., 2019; Richardson et al., 2020).

7.2 Evaluation And Error Analysis

We evaluate three representative models on the diagnostic dataset: BERT-base, XLNet-mid, and RoBERTa-wwm-ext-large. Each model is fine-tuned on the CMNLI training set, which is translated into Chinese from the English MNLI dataset, and then tested on our diagnostic dataset. As illustrated in Table 4, the highest accuracy is only about 61%, which indicates that models have a hard time solving these linguistically challenging problems. We believe this suggests room for improvement in both the models and the inference datasets used for fine-tuning. A breakdown of results is presented in the last few columns of Table 4. Monotonicity is the hardest category, similar to the GLUE diagnostics (Wang et al., 2018). XLNet-mid also seems to have a hard time with double negation, highlighting the difficulty of monotonicity for some models. An interesting case is the lexical semantics example in Table 4, where the two two-character words "sad" (难过 hard-pass) and "ugly" (难看 hard-look) share the same first character (难 hard). Thus the premise and hypothesis differ only in the last character, a difference that all three models have decided to ignore. One possible explanation is that current state-of-the-art models for Chinese also rely on a simple lexical overlap heuristic, as illustrated for English by McCoy et al. (2019).

8 Conclusions And Future Work

In this paper, we present the Chinese Language Understanding Evaluation (CLUE) benchmark, which consists of nine natural language understanding tasks and a linguistically motivated diagnostic dataset, along with an online leaderboard for model evaluation. To the best of our knowledge, CLUE is the first comprehensive language understanding benchmark developed for Chinese. We evaluate several of the latest language representation models on CLUE and analyze their results. In addition, we release a large, clean, crawled raw text corpus that can be used directly for pre-training Chinese models. An analysis is conducted on the diagnostic dataset created by our linguists, which illustrates the limited ability of state-of-the-art models to handle some Chinese linguistic phenomena.

Our results suggest that although current state-of-the-art models can achieve relatively high scores on many tasks, they still fall behind human performance in general. Also, small models such as ALBERT-tiny can come close to larger ones on simple tasks but may stumble on tasks requiring understanding of longer texts.

In the future, as more small models become publicly available, our leaderboard will reflect this trend. We will also include more results from state-of-the-art models such as ALBERT-xxlarge (Lan et al., 2019) and increase the diversity and difficulty of the tasks in CLUE.

A Dataset Samples

We have compiled examples of each dataset for reference in Table 5. Some of them are truncated because the sentences are too long. For the complete datasets, please refer to the related papers. We will also release the download links for these datasets in the final version of the paper. For instance, the ChID example in Table 5 offers candidate idioms glossed in English as ["seeking instant benefit", "to overdo it", "take the branch for the root" (answer)], and the C3 example is as follows:

document: 男:我们坐在第七排,应该能看清楚字幕吧? 女:肯定可以,对了,我们得把手机设成振动。
document (en): Man: Our seats are in the seventh row. We should be able to see the subtitles clearly, right? Woman: Absolutely. By the way, we should set the phone to vibrate.
question: 他们最可能在哪儿?
question (en): Where does the conversation most probably take place?
candidates: ["图书馆", "体育馆", "电影院" (answer), "火车站"]
candidates (en): ["In a library", "In a stadium", "In a cinema" (answer), "At a train station"]

B Additional Parameters

B.1 Hyperparameters For Pre-Training

Although we did not train most of the models ourselves, we list the hyperparameters used for pre-training in Table 6 for reference.

B.2 Hyperparameters For Fine-Tuning

Hyperparameters for fine-tuning in our experiments are listed in Table 7 .

Table 7: Parameters for fine-tuning. CMRC* stands for the CMRC 2018 dataset. All* refers to ALBERT-tiny, BERT-base, BERT-wwm-ext-base, ERNIE-base, RoBERTa-large, XLNet-mid, RoBERTa-wwm-ext-base and RoBERTa-wwm-ext-large. It should be noted that RoBERTa-large is pre-trained with a sequence length of 256, shorter than the 512 used for the others. We therefore limit the input length of RoBERTa-large to 256 for CMRC* and use strided text spans to mitigate the problem. However, this limitation of RoBERTa-large may decrease performance on datasets whose inputs cannot be effectively shortened, such as C3.

C Additional Baseline Details

CSL In generating negative samples for CSL, we replace only one of the real keywords with a fake one. When fine-tuning on the CSL task, we found that some of the larger models converge only at very small learning rates, for example, 5e-6.

IFLYTEK There are 126 categories in the original IFLYTEK dataset. However, some of them have very few examples. We excluded the classes that have fewer than 10 examples so that we can apply the cross-validation filtering technique described in Section D. During the experiments, we also found that fine-tuning ALBERT-tiny requires a larger number of epochs to converge compared to other models. Also, sentences in the IFLYTEK dataset are relatively long compared to those in other sentence classification tasks. However, most of the useful information is located at the beginning of a sentence. We therefore choose a maximum length of 128.

D Dataset Filtering

In order to increase model differentiation and the difficulty of the datasets, we use four-fold cross-validation to filter the IFLYTEK and TNEWS datasets. We divide each dataset into four folds and use three of them to fine-tune ALBERT-tiny. The fine-tuned model is then used to identify and filter out the easy examples in the remaining fold.
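The following is a schematic sketch of this cross-validation filtering, under assumed details: the confidence threshold and the train/predict helpers are hypothetical, since the paper only states that "easy" examples are filtered out.

```python
# Four-fold cross-validation filtering: fine-tune a small model on three
# folds, then drop held-out examples the model gets right with high
# confidence. Threshold and helper functions are illustrative assumptions.
import random

def cross_validation_filter(examples, train_fn, predict_fn,
                            n_folds=4, confidence=0.9):
    # examples: list of (text, label); train_fn(train_set) returns a model;
    # predict_fn(model, text) returns (predicted_label, probability).
    random.shuffle(examples)
    folds = [examples[i::n_folds] for i in range(n_folds)]
    kept = []
    for i, held_out in enumerate(folds):
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train_set)
        for text, label in held_out:
            pred, prob = predict_fn(model, text)
            # Keep the example unless the model finds it easy.
            if not (pred == label and prob >= confidence):
                kept.append((text, label))
    return kept
```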

2 https://github.com/fatecbf/toutiao-text-classfication-dataset/
3 https://github.com/CLUEbenchmark/LightLM
4 https://dc.cloud.alipay.com/index#/topic/intro?id=3
5 https://github.com/SophonPlus/ChineseNlpCorpus/
6 https://dumps.wikimedia.org/zhwiki/latest/
7 https://github.com/ymcui/Chinese-PreTrained-XLNet