
Easy, Reproducible and Quality-Controlled Data Collection with CROWDAQ


  • Qiang Ning
  • Hao Wu
  • Pradeep Dasigi
  • Dheeru Dua
  • Matt Gardner
  • Robert L Logan IV
  • Ana Marasović
  • Zhen Nie
  • 2020


High-quality and large-scale data are key to success for AI systems. However, large-scale data annotation efforts are often confronted with a set of common challenges: (1) designing a user-friendly annotation interface; (2) training enough annotators efficiently; and (3) ensuring reproducibility. To address these problems, we introduce CROWDAQ, an open-source platform that standardizes the data collection pipeline with customizable user-interface components, automated annotator qualification, and saved pipelines in a reusable format. We show that CROWDAQ simplifies data annotation significantly on a diverse set of data collection use cases, and we hope it will be a convenient tool for the community.

1 Introduction

Data is the foundation of training and evaluating AI systems. Efficient data collection is thus important for advancing research and building time-sensitive applications. Data collection projects typically require many annotators working independently to achieve sufficient scale, either in dataset size or collection time. To work with multiple annotators, data requesters (i.e., AI researchers and engineers) usually need to design a user-friendly annotation interface and a quality control mechanism. However, this involves a lot of overhead: we often spend most of the time resolving frontend bugs and manually checking or communicating with individual annotators to filter out those who are unqualified, instead of focusing on core research questions.

Another issue that has recently gained more attention is reproducibility. Dodge et al. (2019) and Pineau (2020) provide suggestions for system reproducibility, and Bender and Friedman (2018) and Gebru et al. (2018) propose "data statements" and "datasheets for datasets" for data collection reproducibility. However, due to irreproducible human interventions in training and selecting annotators and the potential difficulty in replicating the annotation interfaces, it is often difficult to reuse or extend an existing data collection project.

We introduce CROWDAQ, an open-source data annotation platform for NLP research designed to minimize overhead and improve reproducibility. It makes the following contributions. First, CROWDAQ standardizes the design of data collection pipelines and separates that design from its software implementation. This standardization allows requesters to design data collection pipelines declaratively without worrying about many engineering details, which is key to solving the aforementioned problems (Sec. 2).

Second, CROWDAQ automates qualification control via multiple-choice exams. We also provide detailed reports on these exams so that requesters know how well annotators are doing and can adjust bad exam questions if needed (Sec. 2).

Third, CROWDAQ carefully defines a suite of pre-built UI components that one can use to compose complex annotation user-interfaces (UIs) for a wide variety of NLP tasks without expertise in HTML/CSS/JavaScript (Sec. 3). For non-experts on frontend design, CROWDAQ can greatly improve efficiency in developing these projects.

Fourth, a dataset collected via CROWDAQ can be more easily reproduced or extended by future data requesters, because they can simply copy the pipeline and pay for additional annotations, or treat an existing pipeline as a starting point for new projects.

In addition, CROWDAQ integrates many other useful features: requesters can conveniently monitor the progress of annotation jobs, check whether they are paying annotators fairly, and inspect the agreement level of different annotators on CROWDAQ. Finally, Sec. 4 shows how to use CROWDAQ and Amazon Mechanical Turk (MTurk) to collect data for an example project. More use cases can be found in our documentation.

2 Standardized Data Collection Pipeline

A data collection project with multiple annotators generally includes some or all of the following: (1) Task definition, which describes what should be annotated. (2) Examples, which enhance annotators' understanding of the task. (3) Qualification, which tests annotators' understanding of the task; only those who qualify can continue, and this step is very important for filtering out unqualified annotators. (4) Main annotation process, where qualified annotators work on the task. CROWDAQ provides easy-to-use functionality for each of these components of the data collection pipeline, which we describe next.

INSTRUCTION A Markdown document that defines a task and instructs annotators how to complete the task. It supports various formatting options, including images and videos.

TUTORIAL Additional training material provided in the form of multiple-choice questions with provided answers, which workers can use to gauge their understanding of the INSTRUCTION. We have received many messages from annotators saying that TUTORIALS are quite helpful for learning tasks.

EXAM A collection of multiple-choice questions similar to TUTORIAL, but for which answers are not provided to participants. EXAM is used to test whether an annotator understands the instructions sufficiently to provide useful annotations. Participants will only have a finite number of opportunities specified by the requesters to work on an EXAM, and each time they will see a random subset of all the exam questions. After finishing an EXAM, participants are informed of how many mistakes they have made and whether they have passed, but they do not receive feedback on individual questions. Therefore, data requesters should try to design better INSTRUCTIONS and TUTORIALS instead of using EXAM to teach annotators.
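To make the EXAM mechanics concrete, here is a minimal Python sketch of the workflow just described. This is an illustration we wrote, not CROWDAQ's actual implementation; the function names, the pass threshold, and the response format are assumptions:

```python
import random

# Illustrative sketch (not CROWDAQ's implementation) of one EXAM attempt:
# each attempt samples a random subset of questions, grading is automatic,
# and the participant only learns how many mistakes were made and whether
# they passed -- never which individual questions were wrong.

def run_exam_attempt(question_pool, answer_key, responses_fn,
                     questions_per_attempt=5, pass_threshold=0.8):
    """Run one exam attempt and return (passed, num_mistakes)."""
    sampled = random.sample(question_pool, questions_per_attempt)
    responses = responses_fn(sampled)  # participant answers the sampled subset
    mistakes = sum(1 for q in sampled if responses[q] != answer_key[q])
    passed = (1 - mistakes / questions_per_attempt) >= pass_threshold
    return passed, mistakes

# Example: a pool of 8 question ids; the key marks "A" correct everywhere,
# and the simulated participant always answers "A".
pool = [f"q{i}" for i in range(8)]
key = {q: "A" for q in pool}
passed, mistakes = run_exam_attempt(pool, key, lambda qs: {q: "A" for q in qs})
print(passed, mistakes)  # a perfect participant passes with 0 mistakes
```

A requester-side loop would additionally track how many attempts each participant has used and stop offering the exam once the configured limit is reached.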

Figure 1: Data collection using CROWDAQ and MTurk. Note that this is a general workflow and one can use only part of it, or use it to build even more advanced workflows.
Figure 2: Illustration of an acceptability judgement task deployed on CROWDAQ. Because contexts can contain html, this CROWDAQ user was easily able to highlight relevant spans of text for crowd workers using tags.

We restrict TUTORIALS and EXAMS to always be in a multiple-choice format, irrespective of the original task format, because it is natural for humans to learn and to be tested in a discriminative setting. An important benefit of using multiple-choice questions is that their evaluation can be automated easily, minimizing the effort a requester spends on manual inspections. Another convenient feature of CROWDAQ is that it displays useful statistics to requesters, such as the distribution of scores in each exam and which questions annotators often make mistakes on, which can highlight areas of improvement in the INSTRUCTION and TUTORIAL. Below is the JSON syntax to specify TUTORIALS/EXAMS (see Fig. 3 and Fig. 4 in the appendix).
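For instance, per-question error statistics of the kind mentioned above could be computed from graded exam responses along the following lines. The flat record format here is a hypothetical example, not CROWDAQ's export schema:

```python
from collections import Counter

# Hypothetical graded exam responses: one record per (worker, question).
responses = [
    {"worker": "w1", "question": "q1", "correct": True},
    {"worker": "w1", "question": "q2", "correct": False},
    {"worker": "w2", "question": "q1", "correct": True},
    {"worker": "w2", "question": "q2", "correct": False},
    {"worker": "w2", "question": "q3", "correct": True},
]

attempts = Counter(r["question"] for r in responses)
errors = Counter(r["question"] for r in responses if not r["correct"])
error_rate = {q: errors[q] / attempts[q] for q in attempts}

# q2 is missed by every worker: a strong hint that the INSTRUCTION or
# TUTORIAL should be improved, or the question itself revised.
worst = max(error_rate, key=error_rate.get)
print(worst, error_rate[worst])  # q2 1.0
```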

Figure 3: Specification and visualization of TUTORIALS. In this particular TUTORIAL, there are eight questions and the example participant has only made a choice on one of them. Please see https://beta.crowdaq.com/w/tutorial/qiang/CrowdAQ-demo for this interface.
Figure 4: Specification and visualization of EXAMS. In this particular EXAM, the requester has specified that each time a participant will see five questions randomly sampled from the pool, and each participant has only two opportunities to pass. Please see https://beta.crowdaq.com/w/exam/qiang/CrowdAQ-demo for this interface.

"question_set": [
  {
    "type": "multiple-choice",
    "question_id": ...,
    "context": [{
      "type": "text",
      "text": "As of Tuesday, 144 of the state's then-294 deaths involved nursing homes or long-term care facilities."
    }],
    "question": {
      "question_text": "In \"294 deaths\", what should you label as the quantity?",
      "options": {"A": "294", "B": "294 deaths"}
    },
    "answer": "A",
    "explanation": {
      "A": "Correct",
      "B": "In our definition, the quantity should be \"294\"."
    }
  },
  ...
]

TASK For example, if we are doing sentence-level sentiment analysis, then a TASK is to display a specific sentence and require the annotator to provide a label for its sentiment. A collection of TASKS is bundled into a TASK SET that we can launch as a group. Unlike TUTORIALS and EXAMS, where CROWDAQ's implementation only needs to handle multiple-choice questions, a major challenge for TASK is how to meet the different annotation-UI requirements of different datasets in a single framework, which we discuss next.
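For the sentence-level sentiment example above, such a TASK could be specified with the context and annotation components described in Sec. 3. The sketch below is illustrative: the sentence and option labels are ours, not part of the paper:

```json
{
  "contexts": [
    {
      "type": "text",
      "label": "Sentence",
      "text": "The service was slow but the food was excellent.",
      "id": "sentence"
    }
  ],
  "annotations": [
    {
      "type": "multiple-choice",
      "prompt": "What is the sentiment of this sentence?",
      "options": {"A": "Positive", "B": "Negative", "C": "Mixed"},
      "id": "sentiment"
    }
  ]
}
```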

3 Customizable Annotation Interface

It is time-consuming for non-experts on the frontend to design annotation UIs for various datasets. At present, requesters can only reuse the UIs of very similar tasks, and even then they often need to make modifications that require additional testing and debugging. CROWDAQ comes with a variety of built-in resources for easily creating UIs, which we will explain using an example data collection project centered around confirmed COVID-19 cases and deaths mentioned in news snippets.

3.1 Concepts

The design of CROWDAQ's annotation UI is built on a few key concepts. First, every TASK is associated with contexts: a list of objects of any type among text, html, image, audio, and video. The contexts remain visible to the annotators during the entire annotation process until they move to the next TASK, so a requester can use contexts to show any useful information to the annotators. Below is an example showing notes and a target news snippet (see Fig. 5 in the appendix for visualization). CROWDAQ is integrated with online editors that can auto-complete, give error messages, and quickly preview any changes.

Figure 5: How CROWDAQ renders the context specification in Sec. 3. A major difference from TUTORIALS is that the participant will not see the answers. Please see https://beta.crowdaq.com/w/task/qiang/CrowdAQ-demo/quantity_extraction_typing for this interface.

"contexts": [
  {
    "label": "Note",
    "type": "html",
    "html": "Remember to ...",
    "id": "note"
  },
  {
    "type": "text",
    "label": "The snippet was from an article published on 2020-05-20 10:30:00",
    "text": "As of Tuesday, 144 of the state's then-294 deaths involved nursing homes or long-term care facilities.",
    "id": "snippet"
  }
],

Second, each TASK may have multiple annotations. Although the number of dataset formats can be arbitrary, we observe that the most basic formats fall into the following categories: multiple-choice, span selection, and free text generation. For instance, to emulate the data collection process used for the CoNLL-2003 shared task on named entity recognition (Tjong Kim Sang and De Meulder, 2003), one could use a combination of a span selection (for selecting a named entity) and a multiple-choice question (for selecting whether it is a person, location, etc.); for the process used for natural language inference in SNLI (Bowman et al., 2015), one could use an input box (for writing a hypothesis) and a multiple-choice question (for selecting whether the hypothesis entails or contradicts the premise); for reading comprehension tasks in the question-answering (QA) format, one could use an input box (for writing a question) and a multiple-choice question (for yes/no answers; Clark et al. (2019)), a span selection (for span-based answers; Rajpurkar et al. (2016)), or another input box (for free-text answers; Kočiský et al. (2018)).
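As an illustration of the SNLI-style combination above, the hypothesis-writing setup could be composed roughly as follows. Note that "text-input" is an assumed type name for the input-box annotation, used here for illustration only; "multiple-choice" and "span-from-text" are the type names that appear verbatim in this paper:

```json
"annotations": [
  {
    "type": "text-input",
    "prompt": "Write a hypothesis about the premise above.",
    "id": "hypothesis"
  },
  {
    "type": "multiple-choice",
    "prompt": "Does your hypothesis entail, contradict, or remain neutral toward the premise?",
    "options": {"A": "Entailment", "B": "Contradiction", "C": "Neutral"},
    "id": "label"
  }
]
```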

These annotation types are built into CROWDAQ, which requesters can easily use to compose complex UIs. For our example project, we would like the annotator to select a quantity from the "snippet" object in the contexts, and then tell us whether it is relevant to COVID-19 (see below for how to build it and Fig. 6 in the appendix for visualization).

Figure 6: In this UI the annotator is asked to select a valid quantity and then choose whether it is relevant to COVID-19.

"annotations": [
  {
    "type": "span-from-text",
    "from_context": "snippet",
    "prompt": "Select one quantity from below.",
    "id": "quantity"
  },
  {
    "type": "multiple-choice",
    "prompt": "Is this quantity related to COVID-19?",
    "options": {
      "A": "Relevant",
      "B": "Not relevant"
    },
    "id": "relevance"
  }
]

Third, a collection of annotations can form an annotation group, and a TASK can have multiple such groups. For complex TASKS, this kind of semantic hierarchy provides a big-picture view for both the requesters and the annotators. We can also provide very useful features for annotation groups. For example, we can put the annotations object above into an annotation group and require 1-3 responses in this group. Below is its syntax, and Fig. 7 in the appendix shows the result.
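As a rough illustration of such a group, consider the sketch below; the "annotation_group" wrapper and the "min_responses"/"max_responses" field names are our assumptions for illustration, not CROWDAQ's exact syntax:

```json
{
  "annotation_group": {
    "id": "quantities",
    "min_responses": 1,
    "max_responses": 3,
    "annotations": [
      {
        "type": "span-from-text",
        "from_context": "snippet",
        "prompt": "Select one quantity from below.",
        "id": "quantity"
      },
      {
        "type": "multiple-choice",
        "prompt": "Is this quantity related to COVID-19?",
        "options": {"A": "Relevant", "B": "Not relevant"},
        "id": "relevance"
      }
    ]
  }
}
```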