Searching for scientific evidence in a pandemic: An overview of TREC-COVID

Authors

  • Kirk Roberts
  • Tasmeer Alam
  • Steven Bedrick
  • Dina Demner-Fushman
  • Kyle Lo
  • Ian Soboroff
  • Ellen Voorhees
  • Lucy Lu Wang
  • William Hersh

Journal of Biomedical Informatics, 2021

Abstract

We present an overview of the TREC-COVID Challenge, an information retrieval (IR) shared task to evaluate search on scientific literature related to COVID-19. The goals of TREC-COVID include the construction of a pandemic search test collection and the evaluation of IR methods for COVID-19. The challenge was conducted over five rounds from April to July, 2020, with participation from 92 unique teams and 556 individual submissions. A total of 50 topics (sets of related queries) were used in the evaluation, starting at 30 topics for Round 1 and adding 5 new topics per round to target emerging topics at that state of the still-emerging pandemic. This paper provides a comprehensive overview of the structure and results of TREC-COVID. Specifically, the paper provides details on the background, task structure, topic structure, corpus, participation, pooling, assessment, judgments, results, top-performing systems, lessons learned, and benchmark datasets.

1. Introduction

The Coronavirus Disease 2019 (COVID-19) pandemic has resulted in an enormous demand for and supply of evidence-based information. On the demand side, there are numerous information needs regarding the basic biology, clinical treatment, and public health response to COVID-19. On the supply side, there have been a vast number of scientific publications, including preprints. Yet despite the large supply of available scientific evidence, and beyond the medical aspects of the pandemic, COVID-19 has resulted in an "infodemic" as well [1] [2] [3], with large amounts of confusion, disagreement, and distrust about the available information.

A key component of identifying available evidence is accessing the scientific literature using the best possible information retrieval (IR, or search) systems. As such, there was a need for the rapid implementation of IR systems tuned for such an environment and for a comparison of the efficacy of those systems. A common approach to large-scale comparative evaluation of IR systems is the challenge evaluation, with the largest and best-known example being the Text Retrieval Conference (TREC) organized by the US National Institute of Standards and Technology (NIST) [4]. The TREC framework was applied to the COVID-19 Open Research Dataset (CORD-19), a dynamic resource of scientific papers on COVID-19 and related historical coronavirus research [5].

The primary goal of the TREC-COVID Challenge was to build a test collection for evaluating search engines dealing with the complex information landscape in events such as a pandemic.

Since IR focuses on large document collections and it is infeasible to manually judge every document for every topic, IR test collections are generally built via manual judgment using participants' retrieval results to guide the selection of which documents to judge. This allows for a wide variety of search techniques to identify potentially relevant documents, and focuses the manual effort on just those documents most likely to be relevant. Thus, to build an excellent test collection for pandemics, it is necessary to conduct a shared task such as TREC-COVID with a large, diverse set of participants.

A critical aspect of a pandemic is the temporal nature of the event: as new information arises, a search engine must adapt to these changes, including the rapid pace with which new discoveries are added to the growing corpus of scientific knowledge on the pandemic. The three distinct aspects of temporality in the context of the pandemic are (1) rapidly changing information needs: as knowledge about the pandemic grows, the information needs evolve to include both new aspects of the existing topics and entirely new topics; (2) a rapidly changing state of knowledge, reflected both in the high rate at which new work is published and in the rate at which initial publications are edited; and (3) heterogeneity of the relevant work: whereas in traditional biomedical collections the documents and journals are peer-reviewed, in a pandemic scenario any publication, e.g., a preprint, containing new information may be relevant and may actually contain the most up-to-date information. The result of all these factors is that the best search strategy at the beginning of a pandemic (small amounts of scattered information, many unknowns) may be different from the best strategy mid-pandemic (a rapidly growing burst of information with some emerging answers; unknowns still exist but are better defined) or after the pandemic (many more answers, but a corpus that has evolved significantly through time and may require filtering out much of the early-pandemic information that has become outdated). TREC-COVID models the pandemic stage using a multi-round structure, where more documents are available and additional topics are added as new questions emerge.

The other critical aspect of a pandemic from an IR perspective is the ability to gather feedback on search performance as a pandemic proceeds. As new topics emerge, judgments on these topics can be collected (manually or automatically, e.g. click data) that can be used to improve search performance (both on that topic of interest and other topics). This is subject to similar temporality constraints as above: feedback is only available on documents that previously exist, while the amount of feedback data available steadily grows over the course of the pandemic.

These two aspects, the temporality of the data and the availability of relevance feedback for model development, are the two core contributions of TREC-COVID from an IR perspective. From a biomedical perspective, TREC-COVID's contributions include its unique focus on an emerging infectious disease, the inclusion of both peer-reviewed and preprint articles, and its substantial size in terms of the number of judgments and the proportion of the collection that was judged.

Finally, a practical contribution of TREC-COVID was the rapid availability of its manual judgments so that public-facing COVID-19-focused search engines could tune their approach to best help researchers and consumers find evidence in the midst of the pandemic.

TREC-COVID was structured as a series of rounds to capture these changes. Over five rounds of evaluation, TREC-COVID received 556 submissions from 92 participating teams. The final test collection contains 69,318 manual judgments on 50 topics important to COVID-19. Each round included an increasing number of topics pertinent to the pandemic, where each topic is a set of queries around a common theme (e.g., dexamethasone) provided at three levels of granularity (described in Section 4). Capturing the evolving corpus proved to be quite challenging as preprints were released, updated, and published, sometimes with substantial changes in content. An additional benefit of the multi-round structure of the collection was its support for research on relevance feedback, that is, supervised machine learning techniques that find additional relevant documents for a topic by exploiting existing relevance judgments.

This paper provides a complete overview of the entire TREC-COVID Challenge. In prior publications, we provided our initial rationale for TREC-COVID and its structure [6] as well as a snapshot of the task after the first round [7]. Additional post-hoc evaluations have also been conducted comparing system qualities [8] and assessing the quality of the final collection [9]. This paper presents a description of the overall challenge now that it has formally concluded.

Section 2 places TREC-COVID within the scientific context of IR shared tasks. Section 3 provides an overview of the overall task structure. Section 4 explains the topic structure, how the topics were created, and what types of topics were used. Section 5 details the corpus that systems searched over. Section 6 provides the participation statistics and submission information. Section 7 describes how those runs were pooled to select documents for evaluation. Section 8 details the assessment process: who performed the judging, how it was done, and what types of judgments were made. Section 9 describes the resulting judgment sets. Section 10 provides the overall results of the participant systems across the different metrics used in the task. Section 11 contains short descriptions of the systems with published descriptions. Section 12 discusses some of the lessons learned by the TREC-COVID organizers, including lessons for IR research in general, COVID-19 search in particular, and the construction of pandemic test collections should the unfortunate opportunity arise to create another such test collection amidst a new pandemic. Finally, Section 13 describes the different benchmark test collections resulting from TREC-COVID. All data produced during TREC-COVID has been archived on the TREC-COVID web site at http://ir.nist.gov/trec-covid/.

2. Related Work

While there has never been an IR challenge evaluation specifically for pandemics, there is a rich history of biomedical IR evaluations, especially within TREC. Similar to TREC-COVID, most of these evaluations have focused on retrieving biomedical literature. The TREC Genomics track (2003-2007) [10] [11] [12] [13] [14] [15] targeted biomedical researchers interested in the genomics literature. The TREC Clinical Decision Support track (2014-2016) [16] [17] [18] targeted clinicians interested in finding evidence for diagnosis, testing, and treatment of patients. The TREC Precision Medicine track (2017-2020) [19] [20] [21] [22] refined that focus to oncologists interested in treating cancer patients with actionable gene mutations. Beyond these, the TREC Medical Records track [23, 24] focused on retrieving patient records for building research cohorts (e.g., for clinical trial recruitment). The Medical ImageCLEF tasks [25] [26] [27] [28] focused on the multi-modal (text and image) retrieval of medical images (e.g., chest x-rays). Finally, the CLEF eHealth tasks [29, 30] focused largely on retrieval for health consumers (patients, caregivers, and other non-medical professionals). TREC-COVID differs from these in terms of medical content, as no prior evaluation had focused on infectious diseases, much less pandemics. However, TREC-COVID also differs from these tasks in terms of its temporal structure, which enables evaluating how search engines adapt to a changing information landscape.

As mentioned earlier, TREC-COVID provided infrastructural support for research on relevance feedback. Broadly speaking, a relevance feedback technique is any search method that uses known relevant documents to retrieve additional relevant documents for the same topic. The now-classic "more like this" query is a prototypical relevance feedback search. Information filtering, in which a user's standing information need is used to select the documents in a stream that should be returned to the user, can be cast as a feedback problem in that feedback from the user on documents retrieved earlier in the stream informs the selection of documents later in the stream. TREC focused research on the filtering task with the Filtering track, and in TREC 2002 the track organizers used relevance feedback algorithms to select documents for assessors to judge to create the ground truth data for the track [31]. But the filtering task is a special case of feedback where the emphasis is on the on-line learning of the information need. Other TREC tracks, including the Robust track, Common Core track, and the current Deep Learning track, reused topics from one test collection to target a separate document set; in these tracks the focus has been on the viability of transfer learning. TREC also included a Relevance Feedback track in TRECs 2008 and 2009 [32] with the explicit goal of creating an evaluation framework for the direct comparison of feedback reformulation algorithms. The track created the framework, but it was based on an existing test collection with randomly selected, very small numbers of relevant documents as the test conditions. TREC-COVID also enabled participants to compare feedback techniques using identical relevance sets, but in contrast to the other tracks, these sets were naturally occurring and relatively large, were targeted at the same document set, and contained multiple iterations of feedback.

3. Task Structure

The standard TREC evaluation involves providing participants with a fixed corpus and a fixed set of topics, as well as having a timeline that lasts several months (2-6 months to submit results, 1-3 months to conduct assessment). As previously described, these constraints are not compatible with pandemic search, since the corpus is constantly growing, topics of interest are constantly emerging, systems need to be built quickly, and assessment needs to occur rapidly.

Hence, the structure of a pandemic IR shared task must diverge from the standard TREC model in several important and novel ways.

TREC-COVID was conceived as a multi-round evaluation, where in each round an updated corpus would be used, the number of topics would increase, and participants would submit new results. An initial, somewhat arbitrary, choice of five rounds was proposed to ensure enough iterations to evaluate the temporal aspects of the task while keeping manual assessment feasible. The time between rounds was proposed to be limited to just 2-3 weeks in order to capture rapid snapshots of the state of the pandemic. Ultimately, the task did indeed last five rounds and the iteration format was largely adopted.

Figure 1. High-level structure of TREC-COVID.

A high-level overview of the structure of TREC-COVID is shown in Figure 1, which highlights the interactions between the rounds, the assessment process, and the corpus. Table 1 provides the timeline of the task, including the round, start/end dates, release and size of the corpus, number of topics, participation, and cumulative judgments available after the completion of that round. The start date of a round is when the topics, as well as the manual judgments from the prior round, were made available. The end date is when submissions were due for that round. In between rounds, manual judging occurred for the prior round (referred to below as the X.0 judging for Round X).

Table 1. Overview of the TREC-COVID timeline over the five rounds.

During the next round, while participants were developing systems using the Round X.0 (and all prior) data, additional manual judging occurred for the prior round (referred to as the X.5 judging). However, these judgments were not made available until the conclusion of the next round (Round X+1). This enabled a near-constant judging process that maximized the number of manual judgments while still keeping to a rapid iteration schedule. As can be seen in Table 1, Round 1 started with 30 topics and 5 new topics were added every round. This allowed emerging "hot" topics to be added in response to the evolving nature of the pandemic.

The participation numbers in Table 1 reflect the number of unique teams for each round and the total number of submissions for that round. Teams were restricted to a maximum of three submissions per round, except for Round 5 when the limit was eight submissions. The participation numbers include a baseline "team" and several baseline submissions starting in Round 2. The baselines were provided by the University of Waterloo based on the Anserini toolkit [33, 34] to provide a common yardstick between rounds and to encourage teams to use all three of their submissions for non-baseline methods.
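For concreteness, the sketch below shows what a BM25 run of this general kind might look like using Pyserini, the Python front end to Anserini. It is an illustration under stated assumptions, not the official baseline configuration: the index path is a placeholder for a locally built CORD-19 index, and the example queries are illustrative.

```python
# Sketch of a BM25 run in the spirit of the Anserini baselines (not the official configuration).
# Assumes a Lucene index over a CORD-19 release has been built locally at indexes/cord19.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher('indexes/cord19')   # hypothetical local index path
searcher.set_bm25(k1=0.9, b=0.4)              # Anserini's default BM25 parameters

queries = {'1': 'coronavirus origin', '2': 'coronavirus response to weather changes'}  # illustrative

with open('bm25_baseline.run', 'w') as out:
    for topic_id, query in queries.items():
        hits = searcher.search(query, k=1000)  # at most 1000 documents per topic
        for rank, hit in enumerate(hits, start=1):
            # Conventional TREC run format: topic Q0 docid rank score run-tag
            out.write(f'{topic_id} Q0 {hit.docid} {rank} {hit.score:.4f} bm25_baseline\n')
```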

The manual judgment numbers in Table 1 are cumulative, reflecting all judgments available after the completion of that round.

4. Topics

The search topics have a three-part structure, with increasing levels of granularity. The query is a few keywords, analogous to most queries submitted to search engines. The question is a natural language question that more clearly expresses the information need and is a more complete alternative to the query. Finally, the narrative is a longer exposition of the topic, which provides more details and possible clarifications, but does not necessarily contain all the information provided in the question. Table 2 lists three example topics from different rounds. All topics referred directly to COVID-19 or the SARS-CoV-2 virus, but in some cases the broader term "coronavirus" was used in either the query or question. For some of these topics, background information on other coronaviruses could be partially relevant, but this was left to the discretion of the manual assessors. See Voorhees et al. [7] for a more thorough discussion of this terminology issue. The topics were designed to be responsive to many of the scientific needs of the major stakeholders of the biomedical research community, and were intentionally balanced between bench science (e.g., microbiology, proteomics, drug modeling), clinical science, and public health. Several efforts were made to ensure the topics were broadly representative of the needs of the pandemic. Calls were put out via social media asking for community input for topic ideas.

Table 2. Three example topics from different rounds (not extracted; please refer to the original document).
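To make the three levels of granularity concrete, the sketch below shows a single topic represented as a Python structure. The wording is invented for illustration (loosely inspired by the dexamethasone theme mentioned in Section 1) and is not one of the official 50 topics, which are distributed on the TREC-COVID website.

```python
# Illustrative TREC-COVID topic with its three fields; the wording is hypothetical,
# not one of the official 50 topics.
example_topic = {
    'number': 51,  # hypothetical number; official topics are numbered 1-50
    'query': 'dexamethasone covid-19 treatment',
    'question': 'Is dexamethasone an effective treatment for patients with severe COVID-19?',
    'narrative': ('Relevant documents discuss clinical evidence for or against the use of '
                  'dexamethasone or other corticosteroids in treating COVID-19 patients, '
                  'including trial results and dosing considerations.'),
}
```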

Queries submitted to the National Library of Medicine were examined to gauge concerns of the wider public. Additionally, the streams of prominent Twitter medical influencers were examined to identify hot topics in the news. The iterative nature of the task also enabled the topics to adapt to the evolving needs of the pandemic. For every round, five new topics were created to both address deficiencies in the existing topics and include recently high-profile topics that had received little scientific attention at the time of the prior rounds (e.g., the major dexamethasone trial [36] was not published until July, just in time for Round 5). Table 3 lists the query for all the topics used in the task, along with an extension of the Soni & Roberts [35] categories to all 50 topics.

Table 3: All 50 topics (only the Query field) along with the research field and function categories assigned to each topic.

Again, these categories were not intended to be authoritative, merely to help balance the types of topics used in the challenge and to aid in post-hoc analysis. Many, or even most, topics could feasibly fit into multiple categories; they are provided here to give insight into the types of topics used in the challenge.

5. Corpus

The corpus for TREC-COVID was the COVID-19 Open Research Dataset (CORD-19) [5], with each round using the release available at the start of that round (see Table 1). The majority of the documents in CORD-19 were published in 2020 and are on the subject of COVID-19. Around a quarter of the articles are in the field of virology, followed by articles on the medical specialties of immunology, surgery, internal medicine, and intensive care medicine, as classified by Microsoft Academic fields of study [38]. The corpus has been used by clinical researchers as a source of documents for systematic literature reviews on COVID-19, and has been the foundation of many dozens of search and exploration interfaces for COVID-19 literature [39].

6. Participation

Teams submitted runs (synonymous with 'submissions'), where a run consists of a ranked list of documents from the corpus for each topic, sorted by decreasing likelihood (in the system's estimation) that the document is relevant to the topic. A TREC-COVID run was required to contain at least one and no more than 1000 documents per topic. Runs were categorized by submission type: automatic runs involved no human intervention, manual runs involved some human modification of the queries or results, and feedback runs made use of the relevance judgments released in prior rounds. Table 4 lists the teams that participated across the five rounds, along with their run counts per round.
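Runs were submitted in the conventional TREC run format (one whitespace-separated line per topic-document pair); assuming that format, the sketch below checks the size constraints described above.

```python
# Check that a run file (assumed conventional TREC format: topic Q0 docid rank score tag)
# has between 1 and 1000 documents for every expected topic.
from collections import defaultdict

def check_run(path, expected_topics, max_per_topic=1000):
    per_topic = defaultdict(int)
    with open(path) as f:
        for line in f:
            topic_id, _q0, _docid, _rank, _score, _tag = line.split()
            per_topic[topic_id] += 1
    too_many = {t: n for t, n in per_topic.items() if n > max_per_topic}
    missing = [t for t in expected_topics if per_topic[t] == 0]
    return too_many, missing

# Example: too_many, missing = check_run('myrun.txt', [str(t) for t in range(1, 51)])
```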

7. Pooling

Relevance judgments are what turns a set of topics and documents into a retrieval test collection. The judgments are the set of documents that should be returned for a topic and are used to compute evaluation scores of runs. When the scores of two runs produced using the same test collection are compared, the system that produced the run with the higher score is assumed to be the better search system. Ideally we would have a judgment for every document in the corpus for every topic in the test set, but humans need to make these judgments (if the relevance of a document could be automatically determined then the information retrieval problem itself is solved), so a major design decision in constructing a collection is selecting which documents to show to a human annotator for each topic. The goal of the selection process is to obtain a representative set of the relevant documents so that the score comparisons are fair for any pair of runs.

In general, the more judgments that can be obtained the more fair the collection will be, but judgment budgets are almost always determined by external resource limits. For TREC-COVID, the limiting factor was time. Since the time between rounds was short, the amount of time available for relevance annotation was also short. Based on previous TREC biomedical tracks, we estimated that we would be able to obtain approximately 100 judgments per topic per week with two weeks per TREC-COVID round, though that estimate proved to be somewhat low.

For most retrieval test collections, the number of relevant documents for a topic is very much smaller than the number of documents in the collection, small enough that the expected number of relevant documents found is zero when selecting documents to be judged uniformly at random while fitting within the judgment budget. But, search systems actively try to retrieve relevant documents at the top of their ranked lists, so the union of the set of top-ranked documents from many different runs should contain the majority of the relevant documents.

This insight led to a process known as pooling, which was first suggested by Spärck Jones and van Rijsbergen [40] and was used to build the original TREC ad hoc collections. When scoring runs using relevance judgments produced through pooling, most IR evaluation measures treat a document that has no relevance judgment (because it was not shown to an annotator) as if it had been judged not relevant. Figure 2 shows, for each round, the number of documents judged in the top 50 ranks of each submission; the medians generally increased over the different TREC-COVID rounds. This was mainly caused by the submitted runs becoming more similar to one another as the rounds progressed, except for Round 5, where many more documents overall were judged since it was the last round, which allowed for more judging time. There is a decrease in the median number judged between Rounds 2 and 3. This dip is explained by the fact that the CORD-19 release used in Round 3 was much larger than the one used in Round 2 (see Table 1), so runs had both more room to diverge and significantly less training data for the new portion of the corpus.
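A minimal sketch of this kind of depth-based pooling is shown below: the top λ documents of each judged run for a topic are unioned, and documents already judged in earlier rounds are excluded. The data structures and the fixed depth are simplifications; as noted elsewhere in this paper, the official pools varied the depth by run and by round.

```python
# Depth-lambda pooling sketch: union the top-ranked documents across judged runs,
# skipping documents already judged for this topic in earlier rounds.
def build_pool(runs, topic_id, depth, already_judged):
    """runs: list of {topic_id: [docid, ...]} ranked lists, best first.
    already_judged: set of docids judged for this topic in prior rounds."""
    pool = set()
    for run in runs:
        for docid in run.get(topic_id, [])[:depth]:
            if docid not in already_judged:
                pool.add(docid)
    return pool

# Example: pool = build_pool(judged_runs, topic_id='35', depth=50, already_judged=prior_docids)
```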

But what about runs that have little overlap with other runs and thus have relatively few judged documents to inform evaluation scores? Figure 2 shows that some runs with very little overlap with other runs were submitted to TREC-COVID. Even runs with relatively many judged documents can have unjudged documents at ranks important to the evaluation measure being used to score the run (for example, unjudged documents at ranks 8-10 when evaluating using Precision@10).

Figure 2: Number of documents judged in the top 50 ranks of a submission by round. The black line within a box is the median number of documents judged for that submission over the set of topics in that round. Judged submissions (submissions that contributed to the qrels) are plotted in light blue and unjudged submissions are in dark blue.

8. Assessment

The goal of the assessment process is to manually label all of the pooled results for relevance to the corresponding topic. In TREC-COVID, each result could receive one of three possible judgment labels: not relevant, partially relevant, or fully relevant. Before assigning topics, all assessors were asked for their preferences for judging individual topics, with the hope of aligning topics with expertise. While it is ideal in an IR evaluation to limit each topic to one assessor, the constraints of both timing and funding made this infeasible. However, to every extent possible, assessors were assigned the same topics as in prior rounds in order to minimize intra-topic disagreements. Double assessment was not performed, as single assessment has become standard in IR evaluations.

The web-based assessment platform used for TREC-COVID is shown in Figure 3. A URL corresponds to one topic for one assessor. For assessors assigned more than one topic, or for topics whose judgments needed to be split between multiple assessors in a round, multiple URLs were used. The assessor was provided with a list of articles to judge on the left, the topic information at the top of the page, and an iframe with the HTML/PDF of the article to be judged taking up most of the screen. No specific requirements were placed on the assessor (e.g., they did not have to read the entire article). It is assumed an assessor can judge 50 articles for a topic in one hour.

Figure 3. Assessment platform.

Figure 4 shows the number of judgments made for each topic, by round. As can be seen, an attempt was made to increase the number of judgments for later topics, so these were often pooled to a greater depth than the earlier topics. A consequence of pooling to the depth of each judged run, as opposed to some fixed depth across runs for a topic, is a fair degree of variability in the number of judgments per topic. In general, the greater the agreement between runs for a topic, the fewer articles needed to be judged; topics with sizable disagreement between runs required a wider net to be cast to identify the relevant articles. Pooling to a specific depth on each run accommodates both situations.

Figure 4: The number of articles judged per topic, by round.

9. Judgments

The prior section described the manual assessment process. This section describes how those manual judgments are organized into distinct judgment sets to facilitate the evaluation of participant runs. After assessment is performed, the judgments are organized in files known as qrels, which are posted on the TREC-COVID web site. An entry in a qrels file has the form topic-number iteration document-id judgment, where topic-number designates the topic the judgment applies to, document-id is a CORD-19 document identifier, and judgment is 0 for not relevant, 1 for partially relevant, and 2 for fully relevant. The iteration field records the round in which the judgment was made. Annotators continued to make judgments during the weeks when participants were creating their runs for the next round, and judgments made during these weeks are labeled as "half round" judgments. That is, a document labeled as being judged in Round X.5 was selected to be judged from a run submitted to Round X but was used to score runs submitted to Round X+1. For Round 0.5, the documents were selected from runs produced by the organizers that were not official submissions. The judgment set for half Round X.5 was created by pooling runs submitted to Round X deeper (i.e., using a larger value of λ) and/or adding to the set of judged runs. Documents that had been previously judged were removed from those pools.
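Assuming the conventional whitespace-separated TREC qrels layout described above, a qrels file can be read into nested dictionaries as in the sketch below.

```python
# Parse a TREC-COVID qrels file with lines of the form:
#   topic-number  iteration  document-id  judgment
# judgment: 0 = not relevant, 1 = partially relevant, 2 = fully relevant
def load_qrels(path):
    qrels = {}       # {topic_id: {docid: judgment}}
    iterations = {}  # {(topic_id, docid): round recorded in the iteration field}
    with open(path) as f:
        for line in f:
            topic_id, iteration, docid, judgment = line.split()
            qrels.setdefault(topic_id, {})[docid] = int(judgment)
            iterations[(topic_id, docid)] = iteration
    return qrels, iterations
```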

Runs submitted to Round X were scored using only the judgments made in judgment Rounds X-1.5 and X, not the cumulative set of judgments to that point. This was necessitated by the fact that the relevance judgments from prior rounds were available to the participants at the time they created their submissions, and they could use those judgments to create their runs (these were the feedback runs). To avoid the methodological misstep of using the same data for both training and test, TREC-COVID used residual collection evaluation [41] in all rounds after the first. In residual collection evaluation, any document that has already been judged for a topic is conceptually removed from the collection before scoring. Thus, participants were told not to include any previously judged documents in the ranked lists they submitted (even if a run did not make use of the judgments), and any previously judged documents that were nonetheless submitted were automatically deleted from runs. The runs were then scored using the qrels built for that round. The runs that are archived on the web site are the runs as scored, that is, with all previously judged documents removed.
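In code, residual collection evaluation amounts to dropping every previously judged topic-document pair from a submitted ranked list before it is scored against the current round's qrels; a minimal sketch follows.

```python
# Residual collection filtering: remove documents judged in prior rounds from a run's
# ranked lists before scoring against the current round's qrels.
def residual_run(run, prior_qrels):
    """run: {topic_id: [docid, ...]} ranked best first.
    prior_qrels: {topic_id: {docid: judgment}} accumulated over all earlier rounds."""
    filtered = {}
    for topic_id, ranking in run.items():
        judged = prior_qrels.get(topic_id, {})
        filtered[topic_id] = [docid for docid in ranking if docid not in judged]
    return filtered
```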

The combination of residual collection evaluation and a dynamic corpus results in a complicated structure. While later releases of CORD-19 are generally larger than earlier releases, later releases are not strict supersets of the earlier ones: articles can be dropped from a release because, for example, the article is no longer available from the original source or no longer qualifies as part of the collection according to CORD-19 construction processes. Sometimes a "dropped" article has actually just been given a new document id, as can happen when a preprint is published and thus appears in a different venue. Document content can also be updated. For example, CORD-19 went through many changes between the May 1 and May 19 (TREC-COVID Rounds 2 and 3) releases. One result of these changes was that approximately 7000 articles were dropped between the two releases, and approximately 600 of those dropped articles had been judged for relevance. Approximately 2000 of the 7000 dropped articles were ones whose document id had changed.

The valid use of a test collection to score runs requires that the qrels accurately reflect the document set. Documents that are no longer in the collection must be removed from the qrels because otherwise runs would be penalized for not retrieving phantom documents that are marked as relevant. Similarly, the qrels must use the correct document id for the version of the corpus regardless of which round the judgment was made in. Documents whose content was updated must be re-judged to see if the changed content makes a difference to the annotation.

The naming scheme selected for the qrels reflects this complexity. The name of a TREC-COVID qrels file is composed of three parts: the header ("qrels-covid"), the document round (e.g., "d3"), and a range of judgment rounds (e.g., "j0.5-2"). The document round refers to the release that was used in the given TREC-COVID round, and all of the document ids in that file are with respect to that release. The TREC-COVID Complete qrels is the cumulative qrels over all five rounds, with all document ids mapped to the July 16 release of CORD-19, using the document content as of the latest round in which the document was judged, and not including judgments for documents no longer in the collection. Under the naming scheme, this qrels is "qrels-covid_d5_j0.5-5". Note that because of residual collection evaluation, no TREC-COVID submission was scored using this qrels. Round 5 runs were scored using "qrels-covid_d5_j4.5-5".
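As an illustration of how a Round 5 run could be re-scored, the sketch below uses the pytrec_eval package (Python bindings for trec_eval) with the Round 5 qrels named above. The local file names and the simple loaders are assumptions; see the TREC-COVID website for the official files.

```python
# Score a run against the Round 5 qrels (qrels-covid_d5_j4.5-5) with pytrec_eval.
import pytrec_eval

def load_qrels(path):                      # qrels lines: topic iteration docid judgment
    qrels = {}
    with open(path) as f:
        for topic_id, _it, docid, judgment in (line.split() for line in f):
            qrels.setdefault(topic_id, {})[docid] = int(judgment)
    return qrels

def load_run(path):                        # run lines: topic Q0 docid rank score tag
    run = {}
    with open(path) as f:
        for topic_id, _q0, docid, _rank, score, _tag in (line.split() for line in f):
            run.setdefault(topic_id, {})[docid] = float(score)
    return run

qrels = load_qrels('qrels-covid_d5_j4.5-5.txt')   # hypothetical local file name
run = load_run('myrun.txt')                        # the (residual) run to be scored
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'bpref', 'P', 'ndcg_cut'})
per_topic = evaluator.evaluate(run)                # {topic: {measure: value}}
mean_ndcg20 = sum(t['ndcg_cut_20'] for t in per_topic.values()) / len(per_topic)
print(f'mean NDCG@20 = {mean_ndcg20:.4f}')
```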

Table 4: Teams participating in all five TREC-COVID rounds, with run counts for each round. Rounds 1-4 limited participants to 3 runs. Round 5 limited participants to 8 runs.

10. Results Overview

Table 6. Top automatic/feedback runs (best run per team), as determined by NDCG, for each of the five rounds of TREC-COVID. P@N: Precision at rank N; NDCG@N: Normalized Discounted Cumulative Gain at rank N; MAP: Mean Average Precision; bpref: Binary Preference; judged?: whether the run contributed to the pooling.

The top five NDCG automatic/feedback runs (only the best run for each team) for each of the five rounds are shown in Table 6. Tables S1-S3 in the Supplemental Data contain the top 5 NDCG team runs for each of the three run submission types. More detailed per-round tables are available on the TREC-COVID website. Due to the depth of the pools, different rounds utilized different metrics: Rounds 1-3 used P@5 and NDCG@10, while Rounds 4 and 5 used P@20 and NDCG@20. The table also lists which runs were included in the pooling process.
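For readers less familiar with these measures, the sketch below computes P@k and NDCG@k for a single topic from a ranked list and graded judgments (0/1/2). It follows the linear-gain, log2-discount convention of trec_eval's ndcg_cut measure and treats unjudged documents as not relevant, as discussed in Section 7; the example judgments are hypothetical.

```python
import math

def precision_at_k(ranking, judgments, k):
    """Fraction of the top-k documents with a positive relevance judgment."""
    return sum(1 for docid in ranking[:k] if judgments.get(docid, 0) > 0) / k

def ndcg_at_k(ranking, judgments, k):
    """NDCG@k with linear gain and a log2(rank + 1) discount."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    actual = dcg([judgments.get(docid, 0) for docid in ranking])
    ideal = dcg(sorted(judgments.values(), reverse=True))
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical single-topic example with graded judgments (0/1/2)
judgments = {'doc1': 2, 'doc2': 0, 'doc3': 1, 'doc4': 2}
ranking = ['doc1', 'doc5', 'doc3', 'doc2', 'doc4']
print(precision_at_k(ranking, judgments, 5), ndcg_at_k(ranking, judgments, 5))
```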

Figure 5: Distributions of assignments per topic across rounds of judging.

Teams could select one of their runs per round to be judged. Since inferred measures were not used, runs that did not contribute to the pools are at a disadvantage. Most of the runs in Table 6 were judged, though this is likely a combination of the advantage given to a judged run and the fact that teams generally select what they believe to be their best run for judging. Figure 6 shows the distribution of median scores for each topic by round. This empirically shows which topics are "easy" and "difficult", relatively speaking, based on system performance. If topic difficulty were consistent across rounds, the marks for a given topic would fall in roughly the same relative position in each round. This is not the case, which suggests variance in the difficulty of ranking articles at medium ranks (since the later rounds are residual runs) as well as potential variability in the new articles added to CORD-19 for each round. In a sense, this means the difficulty of a topic in a pandemic is partly relative to the time point at which that topic is queried.

Figure 6: Median average precision (AP) scores over all runs submitted to a given round. The topics on the x-axis are sorted by decreasing median AP.

Other trends can be observed in Figure 6 as well. Feedback runs outperform automatic runs, which makes sense as the feedback runs have access to topic-specific information to train their models. The median system also generally improved on a topic over the rounds. This applies both to feedback runs (which makes sense) and automatic runs (which is more surprising), though this could also be an artifact of the weaker teams dropping out of the challenge. A more detailed analysis of runs in Rounds 2 and 5 found that fine-tuning datasets with relevance judgments, MS-MARCO, and CORD-19 document vectors was associated with improved performance in Round 2 but not in Round 5 [8] . This analysis also noted that term expansion was associated with improvement in system performance, and that use of the narrative field in TREC-COVID queries was associated with decreased system performance.

It has been remarked elsewhere that early in a pandemic, feature-rich systems still fail to outperform decades-old IR approaches [35]. The comparison of automatic versus feedback runs above, however, completes the spectrum by demonstrating that machine learning-based, feature-rich systems do indeed outperform non-feedback systems as the information about the pandemic increases.

11. Methods Overview

In this section, we highlight the methods used by a handful of participants that have published papers or preprints on TREC-COVID. IR shared tasks are not well-suited to identifying a "best" method based solely on the ranking metrics from the prior section, and TREC historically has avoided referring to itself as a competition or declaring winners of a particular track.

There are too many factors that go into a search engine's retrieval performance to empirically prove a given technique is better or worse based solely on the system description provided by the authors; a recent work attempts such a comparative analysis of the system features of TREC-COVID participants [8]. Instead, in this section we briefly focus on interesting aspects of TREC-COVID participants to illustrate the state of the field. Note that, as of the time of writing, most participants have not published (via preprint or peer review) a description of their system. What follows is the list of papers that have been reported to the organizers.

SLEDGE [42]. This automatic system used SciBERT [43] to re-rank the output of a BM25 retrieval stage. At least for Round 1, SLEDGE was trained on MS MARCO [44].

[45]. This automatic system combined a question answering and an abstractive summarization model to re-rank the output of a retrieval stage that utilized approximate k-nearest neighbor search over TF-IDF, BM25, and Siamese BERT [46] embeddings.

NIR/RF/RFRR [47]. This work included a neural index run (NIR) automatic system that appended a BioBERT [48] embedding to the traditional document representation, an automatic relevance feedback (RF) system, and a relevance feedback with BERT-based re-ranking (RFRR) system.

Covidex [49] . This feedback system used T5 [50] to re-rank the output of a BM25 retrieval stage. A paragraph-level index was used instead of a document-level index.

PARADE [51] . This feedback system breaks documents into passages for special handling prior to using BERT [52] to re-rank the output of a BM25 retrieval stage.

RRF102 [53] . This feedback system uses rank fusion to combine an ensemble of 102 runs. The constituent runs come from lexical and semantic retrieval systems, pre-trained and fine-tuned BERT rankers, and relevance feedback runs.
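Rank fusion of this general kind can be illustrated with reciprocal rank fusion, which scores each document by summing 1/(k + rank) over the runs in which it appears. The sketch below is a generic illustration; the exact fusion formula and constituent runs used in [53] may differ.

```python
# Generic reciprocal rank fusion sketch for combining ranked lists from multiple runs.
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked docid lists, best first; k=60 is a commonly used constant."""
    scores = {}
    for ranking in rankings:
        for rank, docid in enumerate(ranking, start=1):
            scores[docid] = scores.get(docid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fused = reciprocal_rank_fusion([run_a_topic1, run_b_topic1, run_c_topic1])
```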

Caos-19 [54]. This feedback system relied on a BM25 retrieval stage and added additional topic-relevant terms. These terms were based on Kaggle challenge tasks and WHO research goals.
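Several of the systems above share a two-stage retrieve-then-rerank pattern: a lexical first stage (typically BM25) followed by a neural model that re-scores query-document pairs. The sketch below shows that generic pattern using an off-the-shelf MS MARCO cross-encoder from the sentence-transformers package; it is a simplified illustration, not a reproduction of any particular team's system.

```python
# Generic two-stage retrieve-then-rerank sketch (not any specific TREC-COVID system).
from sentence_transformers import CrossEncoder

def rerank(query, candidates, model_name='cross-encoder/ms-marco-MiniLM-L-6-v2', top_k=100):
    """candidates: list of (docid, text) pairs from a first-stage retriever such as BM25."""
    model = CrossEncoder(model_name)
    pairs = [(query, text) for _docid, text in candidates[:top_k]]
    scores = model.predict(pairs)   # one relevance score per (query, text) pair
    reranked = sorted(zip([docid for docid, _ in candidates[:top_k]], scores),
                      key=lambda item: item[1], reverse=True)
    return reranked                  # [(docid, score), ...] best first

# Example: reranked = rerank('what is the origin of COVID-19', bm25_candidates)
```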

12. Lessons Learned

Here we summarize a handful of the lessons learned in TREC-COVID. Most of what is described here has been discussed in some detail above, but we hope it is useful to organize it more succinctly for emphasis. The lessons are organized according to whether they were anticipated, as well as the extent to which they were addressable during the course of the shared task. We follow this with a set of recommendations for a future pandemic-like IR challenge, should the unfortunate need arise.

Anticipated and Addressed: Some major concerns were anticipated and ended up being well-addressed despite the sizable unknowns that still existed at the time TREC-COVID was launched.

First, our most immediate concern related to the logistics of manual assessment within the timeframes required to meet the goals of TREC-COVID; as described in Section 8, the near-constant judging process and the pooling strategy made this workable within the rapid round schedule.

Second, unlike most other TREC tracks, TREC-COVID could not use the standard methodology of evaluating submissions using all judged documents. Because the judgments made for a round were publicly released after that round to support the use of relevance feedback, we needed to use an evaluation methodology that accounted for the training effect. Residual collection scoring is a traditional approach to feedback evaluation that is easy to understand and easy to implement, and it worked well in TREC-COVID. The most significant drawback to using residual collection scoring is that it forced all submissions to be scored over only a single round's judgments. As it turned out, the judgments from a single round were sufficient for stable comparisons among submissions (see more on this point below).

Anticipated yet Not Problematic: Next, some of our anticipated concerns ultimately ended up working out well, though not due to any specific effort on the part of the organizers.

First, we understood the judgment pools would likely be fairly shallow (that is, we would not identify the vast majority of relevant articles for each topic). This indeed ended up being the case, though not always for the reasons anticipated (see the discussion of topics and the document set below). The scientific problem with shallow pools is that they often lead to unstable estimates of system performance: systems that return a large percentage of top-ranked unjudged results cannot be fairly evaluated against those that consistently return judged results. We thus have differing levels of confidence in the run scores for each system. A thorough evaluation of the stability of the collection in other work [9], however, remarkably shows this not to be the case. Despite the shallow nature of the pools, it does not appear likely that judging significantly more results would have changed many of the system rankings provided in Section 10. While the system scores themselves would certainly be different (probably higher), the relative ranking would hardly change.

Second, the timeline of the task raised the concern about whether participants would be able to develop new approaches or adapt their existing systems within such a compressed schedule; the extent to which the timeline limited system development remains an open question.

A limitation of our evaluation is that it cannot really assess whether there are meaningful differences between runs. As can be seen from Table 6, there is not a strict correlation between metrics, which suggests that even if there are statistically significant differences between runs on a metric, we do not know whether that metric is a good proxy for a user in this context. This would argue for the need for user studies in this domain to better calibrate metrics to actual search workflows. However, the best users for TREC-COVID to study were largely focused on COVID-19-related scientific inquiry and medicine, so conducting a user study in the middle of a pandemic would have been difficult.

Not Anticipated: Some challenges were entirely unanticipated prior to the launch of TREC-COVID. These largely stemmed from the nature of the information content in the document collection, including both its volume and velocity.

First, the quantity and granularity of topics proved a difficult challenge to manage. As described in Section 4, the topics were chosen through a variety of surveillance methods, and the 50 final topics did indeed reflect most of the key information needs of the pandemic (at least in terms of April-June 2020). And yet, the pace of the pandemic certainly resulted in more information needs emerging than could be covered by the 5 new topics added each round. From an evaluation standpoint, however, the real issue was the general nature of many of the topics. This resulted in hundreds of relevant documents for some topics (e.g., topic 38, "COVID inflammatory response", had 765 relevant results), which likely means that there are many more relevant documents that remain unjudged.

The impact of this is lessened by the stability assessment discussed above, but we would still suggest a different topic creation strategy for a future task. While there is no 'ideal' number of relevant documents for a topic for the purpose of IR evaluation, it is generally thought that having more topics, so long as they are nearly fully judged, provides a fairer evaluation than having fewer topics. We would thus recommend having more, but finer-grained, topics. Further, our topic creation strategy did not involve extensive consultation with a wide body of experts (nor could it feasibly have done so, as described above), so coming up with a diverse set of realistic, fine-grained topics across the biological, clinical, and public health sciences would be quite difficult. But we still feel that, looking back, a larger number of finer-grained topics may well have represented a more realistic use case for expert users as well as a better evaluation for IR systems.

Second, an unexpected difficulty was the churn in the document set. We anticipated the document set would grow over the course of the pandemic, but the actual changes were much more significant. The overall document set did substantially grow in size over the course of the pandemic, but different versions of CORD-19 are not proper supersets of one another. Documents get dropped between rounds because they get withdrawn or change status (move from pre-print to published, for example) or no longer meet CORD-19's inclusion criteria. Some documents get new ids (are renamed) because of a status change. Documents that remain in the collection across two versions may contain different content in the versions. Any of the documents that changed or were dropped or renamed might have been judged in a prior round, complicating both the implementation of residual collection scoring and the definition of the judgment set for a given round. The complicated relationship among the judgment sets caused by this churn was the motivation for defining the different sets of relevance judgment files described in Section 13 and posted on the TREC-COVID website.

13. Test Collection

Lastly, we describe the different ways that TREC-COVID can be used as a test collection for IR research. Our goal is both to suggest different mechanisms for evaluation and to identify canonical benchmark tasks. These benchmark tasks include:

1. TREC-COVID Complete. This benchmark utilizes only the final version of CORD-19 used for the challenge (July 16 snapshot) and the Round 5 cumulative qrels file (with 69,318 judgments). This is the closest benchmark to a standard IR ad hoc task with a fixed corpus and no temporal component. It is suitable for automatic and manual approaches.

2. TREC-COVID Chronological. This benchmark utilizes the five cumulative qrels files, one for each round. The purpose of this benchmark is to evaluate automatic and manual approaches to assess their retrieval performance at different stages of the pandemic. Without any kind of learning, some systems may be more or less effective at different stages.

3. TREC-COVID Chronological-ML. This benchmark is the counterpart of the Chronological benchmark for systems that learn from the relevance judgments as rounds progress.

4. TREC-COVID Residual. This benchmark is for feedback systems that are allowed to train on prior rounds but are evaluated only on the new judgments. It differs from the Chronological-ML benchmark in that training is only allowed on the prior rounds and testing occurs on just the residual judgments for that round. This simulates the case where certain "standing topics" receive feedback over time and can be improved for future searchers.

As can be seen, these benchmarks evaluate different aspects of pandemic retrieval, with different benchmarks perhaps being suitable to different research communities.

The qrels that correspond to each of the above benchmarks are available on the TREC-COVID website, organized to be clear as to what judgments are available for training and testing for each round.

14. Conclusion

This paper described the TREC-COVID challenge, an IR shared task conducted in response to the COVID-19 pandemic and inspired by the need to develop search systems in an urgent, rapidly-evolving health crisis. The major goals of the challenge were to evaluate search engine performance for the COVID-19 scientific literature and to build a test collection for pandemic search. In terms of the evaluation, 92 unique teams submitted 556 runs based on manual, automatic, and feedback approaches. In terms of the test collection, we have described four different benchmark datasets based on the TREC-COVID judgments which will be useful for evaluating different perspectives on pandemic search. Overall, the task was extremely popular (exceeding the popularity of any prior TREC evaluation) and, despite the large logistical hurdles, was able to produce a large test collection for inspiring future research in pandemic search.
