Abstract
We introduce a new pretraining approach for language models geared to support multi-document NLP tasks. Our cross-document language model (CD-LM) improves masked language modeling for these tasks with two key ideas. First, we pretrain with multiple related documents in a single input, via cross-document masking, which encourages the model to learn cross-document and long-range relationships. Second, extending the recent Longformer model, we pretrain with long contexts of several thousand tokens and introduce a new attention pattern that uses sequence-level global attention to predict masked tokens, while retaining the familiar local attention elsewhere. We show that our CD-LM sets new state-of-the-art results for several multi-text tasks, including cross-document event and entity coreference resolution, paper citation recommendation, and document plagiarism detection, while using a significantly reduced number of training parameters relative to prior works.
1 Introduction
The majority of NLP research addresses a single text, typically at the sentence or document level. This has been the case both for core language analysis tasks, such as syntactic, semantic and discourse analysis, and for applied tasks, such as question answering (and its reading comprehension variant (Xu et al., 2019)), information extraction, sentiment analysis, etc., where the system output is typically extracted from a single document. Yet, there are important applications which are concerned with aggregated information spread across multiple texts, e.g., multi-document summarization (Fabbri et al., 2019a), cross-document coreference resolution (Cybulska and Vossen, 2014a), and multi-hop question answering (Yang et al., 2018). While providing state-of-the-art results for cross-document tasks, current pretraining methods, developed for a single text, are not geared to fully address the needs of cross-document tasks. As an alternative, we propose the Cross-Document Language Model (CD-LM), a new language model (LM) that is trained in a cross-document manner. We show that it significantly outperforms previous approaches, yielding state-of-the-art results for event and entity cross-document coreference resolution, paper citation recommendation, and document plagiarism detection.
Tasks that consider multiple documents typically require mapping or linking between pieces of information across documents. Such input documents usually contain overlapping information, e.g., Doc 1 and Doc 2 in Fig. 1. Desirably, LMs should be able to align overlapping elements across these related documents. For example, one would expect a competent model to correctly align the events around "name" and "nominates" in Doc 1 and Doc 2, effectively recognizing their relation even though they appear in separate documents. Yet, existing LM pretraining methods do not expose the model to such cross-document information. Here, we propose a scheme that integrates cross-document knowledge already in pretraining, thus allowing the LM to learn to encode relevant cross-document relationships implicitly.
To allow our CD-LM to address large contexts across multiple documents, we leverage the recent appealing architecture of the Longformer model (Beltagy et al., 2020), designed to address long inputs. Specifically, we leverage its global attention mechanism, originally utilized only during task-specific fine-tuning, and extend its use already to pretraining, enabling the model to consider cross-document, as well as long-range within-document, information. While using this mechanism, we introduce a cross-document masking approach. This approach takes as input multiple documents containing related, partly overlapping, information. The model is then challenged to unmask the masked tokens while attending to information both in the same document and in the related documents. This way, the model is encouraged to "peek" at other documents and map cross-document information, in order to yield better unmasking. Our pretraining procedure yields a generic cross-document language model, which may be leveraged for various cross-document downstream tasks that need to map information across related texts. As mentioned above, our experiments assess the utility of our CD-LM for a range of cross-document tasks, resulting in significant improvements and suggesting its appeal for future work in the cross-document setting.

Figure 1: Document examples from the ECB+ dataset. Doc 1: "... President Obama will name Dr. Regina Benjamin as U.S. Surgeon General in a Rose Garden announcement late this morning. ..." Doc 2: "... Obama nominates new surgeon general: MacArthur "genius grant" fellow Regina Benjamin. ..." In Doc 1 and Doc 2, underlined words represent coreferring events and the same color represents a coreference cluster: the entity clusters are ("Dr. Regina Benjamin", "MacArthur "genius grant" fellow Regina Benjamin") and ("President Obama", "Obama"), and the single event cluster is ("name", "nominates"). These examples are adopted from Cattan et al. (2020).
2 Background
Transformer-based language models (LMs) (Devlin et al., 2019; Yang et al., 2019) have led to significant performance gains in various natural language understanding tasks, mainly within-document tasks. They use multiple self-attention layers to learn to produce high-quality token representations, and were shown to incorporate contextual knowledge by assigning each token a representation that is an attentive function of the entire input context. Such models are trained using the Masked Language Modeling (MLM) objective (known as the pretraining phase): given a piece of text, the model uses the context words surrounding a masked token to predict it, thereby maximizing the likelihood of the input words.
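To make the MLM objective concrete, the following minimal sketch masks one token and asks an off-the-shelf masked LM to recover it, where the reported loss is the negative log-likelihood of the original token. The choice of roberta-base here is for illustration only and is not the model used in this work.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

text = f"President Obama will {tok.mask_token} Dr. Regina Benjamin as Surgeon General."
enc = tok(text, return_tensors="pt")
labels = enc["input_ids"].clone()
labels[enc["input_ids"] != tok.mask_token_id] = -100   # compute the loss only at the masked position

with torch.no_grad():
    out = model(**enc, labels=labels)

mask_pos = (enc["input_ids"] == tok.mask_token_id).nonzero()[0, 1]
print("MLM loss:", out.loss.item())
print("top prediction:", tok.decode(out.logits[0, mask_pos].argmax().item()))
```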
These models have significantly advanced the state of the art in various NLP tasks, mostly via post-pretraining finetuning, e.g., question answering (Yang et al., 2019), coreference resolution, and sentence-level tasks such as those of the GLUE benchmark. Importantly, pretrained LMs eliminate the need for many heavily-engineered, hand-crafted task-specific architectures for downstream tasks. Additionally, Clark et al. (2019) show that BERT's (Devlin et al., 2019) attention heads encode a substantial amount of linguistic knowledge, such as the ability to represent within-document coreference relations. This enables better performance on downstream tasks with limited labeled training data. Despite such models' success in within-document tasks, due to memory and time constraints they limit the input size and can only support a rather small context. Thus, they cannot be readily applied to cross-document tasks, where the input size is large.
Recently, several models were suggested to handle these issues and bypass the length constraint by employing techniques that deal with the computational and memory obstacles (Tay et al., 2020). Examples of such architectures include the Longformer (Beltagy et al., 2020), BigBird (Zaheer et al., 2020), and Linformer, which were introduced to extend the range of context that can be used, both for the pretraining and fine-tuning stages. Specifically, the Longformer model, which we utilize in this work, proposed a localized sliding-window-based attention, termed local attention, to reduce computation and extend previous LMs to support longer sequences. This enabled the processing of long contexts by removing the restrictions on long inputs. In addition, the authors introduced the global attention mode, which allows the LM to build representations based on the full input sequence for prediction, and is used during fine-tuning only. Both the local attention and the global attention modes rely on the standard self-attention score (Vaswani et al., 2017), given by:
Attention(Q, K, V) = softmax(QK^T / √d_k) V,
where the learned linear projection matrices Q, K, V are partitioned into two distinct sets, Ξ_l = {Q_l, K_l, V_l} and Ξ_g = {Q_g, K_g, V_g}, for the local and the global attention modes, respectively. During pretraining, the Longformer assigns the local attention mode to all tokens to optimize the MLM objective. Before task-specific finetuning, the attention mode is predetermined for each input token, assigning global attention to a few targeted tokens (e.g., special tokens) to avoid computational inefficiency. We hypothesize that the global attention mechanism is useful for learning meaningful representations for modeling cross-document relationships. We propose augmenting the pretraining phase to exploit the global attention mode, rather than using it only for finetuning, as described in the next section.
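The following simplified, dense PyTorch sketch illustrates the idea of maintaining two projection sets and choosing, per token, between the local (sliding-window) and global (full-sequence) attention modes. It is not the actual Longformer implementation, which uses banded sparse attention and also lets all tokens attend back to global positions; the window size and the dictionary-of-projections interface are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def mixed_attention(x, is_global, proj_l, proj_g, window=128):
    """x: (seq_len, d); is_global: (seq_len,) bool; proj_l/proj_g: dicts of nn.Linear with keys "q", "k", "v"."""
    seq_len, d = x.shape
    # Each token's query comes from the projection set matching its attention mode.
    q = torch.where(is_global[:, None], proj_g["q"](x), proj_l["q"](x))
    k_l, v_l = proj_l["k"](x), proj_l["v"](x)
    k_g, v_g = proj_g["k"](x), proj_g["v"](x)

    out = torch.empty_like(x)
    for i in range(seq_len):
        if is_global[i]:                              # global token: attend to the whole sequence
            k, v = k_g, v_g
        else:                                         # local token: attend to a sliding window
            lo, hi = max(0, i - window), min(seq_len, i + window + 1)
            k, v = k_l[lo:hi], v_l[lo:hi]
        scores = q[i] @ k.T / math.sqrt(d)
        out[i] = F.softmax(scores, dim=-1) @ v
    return out
```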
3 Cross-Document Language Model

Documents that describe the same event, e.g., different news articles that discuss the same story, usually contain overlapping information. Accordingly, many cross-document tasks may benefit from LM infrastructure that encodes alignments and mappings across texts. For example, for the cross-document coreference resolution task, consider the underlined predicate examples in Fig. 1. One would expect a model to correctly align the events around "name" and "nominates", effectively recognizing their coreference relation even though they appear in separate documents.
Our approach to cross-document language modeling is based on training a Transformer-based LM on sets (clusters) of documents, all describing the same event. Such document clusters are readily available in a variety of existing cross-document benchmarks, such as summarization (e.g., Multi-News (Fabbri et al., 2019b)), cross-document coreference resolution (e.g., ECB+ (Cybulska and Vossen, 2014b)), and cross-document alignment benchmarks (Zhou et al., 2020). Training the LM over a set of related documents provides the potential to learn cross-text mapping and alignment capabilities as part of the contextualization process. Indeed, we show that this cross-document pretraining strategy directs the model to utilize information across documents for predicting masked tokens, and helps in multiple cross-document downstream tasks.
To support contextualizing information across multiple documents, we need to use efficient Transformer models that scale linearly with input length.
Thus, we base our model on the Longformer (Beltagy et al., 2020). As described in Sec. 2, this is an efficient Transformer model for long sequences that uses a combination of local attention (self-attention restricted to a local sliding window) and global attention (a small set of pre-specified input locations with direct global attention access).
Cross-Document Masking In pretraining, we concatenate the documents of each set using new special document separator tokens. We then mask a random 15% of the input tokens, following BERT's masking procedure, and challenge the model to predict each masked token. Masked tokens are assigned global attention, so their prediction can condition on the entire multi-document input, while all other tokens use local attention.
An illustration of the full cross-document masking procedure is depicted in Fig. 2, where the masked token associated with "nominates" (colored in orange) globally attends to the whole sequence, and the non-masked token for the word "new" (colored in blue) attends to its local context. With regard to the example in Fig. 1, this masking approach aims to implicitly compel the model to learn to correctly predict the word "nominates" by looking at the other document, ideally at the phrase "name", thus enforcing the alignment between the events.
The loss function induced by the above masking method requires an MLM objective that accounts for the entire sequence, namely, the concatenated documents. We mimic the bidirectional conditioning of BERT (Devlin et al., 2019), but instead of using the same attention weights for all tokens, we assign the global attention weights Ξ_g to the masked tokens, so the model can predict each target token in a multi-document context. The unmasked tokens use the local attention weights, Ξ_l. We dub this method cross-document masked language modeling (CD-MLM). The resulting model includes the following new components: newly pretrained special document separator tokens, and pretrained sets of both global and local attention weights, which together form the cross-document language model (CD-LM). The document separator tokens can be useful in downstream tasks for marking document boundaries, while the global attention weights provide better encoding of cross-document self-attentive information.
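A minimal sketch of how one CD-MLM pretraining instance could be assembled is shown below. The separator token strings "<doc-s>" and "</doc-s>" are hypothetical placeholders (the exact token names are not given here), and the sketch masks tokens directly rather than applying BERT's full 80/10/10 replacement scheme.

```python
import random
import torch
from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
# Hypothetical separator names; a model using them would also need resized token embeddings.
tokenizer.add_special_tokens({"additional_special_tokens": ["<doc-s>", "</doc-s>"]})

def build_cdlm_instance(documents, mask_prob=0.15, max_len=4096):
    text = "".join(f"<doc-s>{doc}</doc-s>" for doc in documents)
    enc = tokenizer(text, truncation=True, max_length=max_len, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    labels = torch.full_like(input_ids, -100)          # -100 = ignored by the MLM loss
    global_attention = torch.zeros_like(input_ids)

    special = set(tokenizer.all_special_ids)
    candidates = [i for i, t in enumerate(input_ids.tolist()) if t not in special]
    for i in random.sample(candidates, int(mask_prob * len(candidates))):
        labels[i] = input_ids[i]
        input_ids[i] = tokenizer.mask_token_id
        global_attention[i] = 1                        # masked tokens attend globally

    return {"input_ids": input_ids, "labels": labels,
            "attention_mask": enc["attention_mask"][0],
            "global_attention_mask": global_attention}
```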
Finetuning During finetuning on downstream cross-document tasks, we utilize our model by concatenating the tokens of relevant input documents using the document separator tokens, along with the classification token (referred to as CLS) at the beginning of the input sequence. Moreover, for token-level tasks such as coreference resolution, we assign global attention to several explicit spans of text, as described in Section 5.2. Using global attention on at least one token ensures that the distribution of the data during finetuning is similar to the distribution during pretraining, which avoids pretraining-finetuning discrepancy. Note that this method is much simpler than existing task-specific cross-document models.
4 CD-LM Implementation
In this section, we provide the experimental details of pretraining our CD-LM model and describe the ablation models we used.
Corpus data We use the Multi-News dataset (Fabbri et al., 2019a) as the source of related documents for pretraining. This large-scale dataset contains 44,972 training document-summary clusters, originally intended for multi-document summarization. The number of related source documents (describing the same event) per summary varies from 2 to 10, as detailed in Appendix A.1. We discard the summaries and consider each cluster containing at least 3 related documents for our cross-document pretraining scheme. We compiled the training corpus by concatenating related documents sampled randomly from each cluster, until reaching the input sequence length limit of 4,096 tokens per sample. The average input contains 2.5k tokens, and the 90th percentile of input lengths is 3.8k tokens.
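The following sketch illustrates this corpus construction under the stated constraints (clusters of at least 3 documents, random sampling, 4,096-token budget); the cluster data structure and helper names are assumptions for illustration.

```python
import random

def build_training_samples(clusters, tokenizer, max_tokens=4096, min_docs=3):
    """clusters: list of clusters, each a list of related source-document strings."""
    samples = []
    for docs in clusters:
        if len(docs) < min_docs:
            continue
        docs = random.sample(docs, len(docs))          # random order, no repeats
        picked, used = [], 0
        for doc in docs:
            n = len(tokenizer.tokenize(doc))
            if used + n > max_tokens:                  # stop at the 4,096-token budget
                break
            picked.append(doc)
            used += n
        if len(picked) >= 2:                           # keep only multi-document samples
            samples.append(picked)
    return samples
```

Each resulting sample can then be turned into a CD-MLM instance along the lines of the sketch in Section 3.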
Training and hyperparameters We pretrain the model according to our CD-MLM strategy described in Section 3. To that end, we employ the Longformer-base model (Beltagy et al., 2020) and continue its pretraining for an additional 25k steps. We use the same hyperparameters and follow the exact setting as in Beltagy et al. (2020): input sequences of length 4,096, an effective batch size of 64 (using gradient accumulation and a batch size of 8), a maximum learning rate of 3e-5, and a linear warmup of 500 steps, followed by a power-3 polynomial decay. To speed up training and reduce memory consumption, we used mixed-precision (16-bit) training. The rest of the hyperparameters are the same as for RoBERTa.
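A hedged sketch of this optimization setup, using standard PyTorch and HuggingFace utilities, is given below; the training loop itself is omitted and the exact pretraining code is not reproduced here.

```python
import torch
from transformers import (LongformerForMaskedLM,
                          get_polynomial_decay_schedule_with_warmup)

model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=25_000, power=3.0)

scaler = torch.cuda.amp.GradScaler()   # 16-bit mixed-precision training
accumulation_steps = 8                 # per-device batch size 8, accumulated toward the effective batch of 64
# Training loop omitted: iterate CD-MLM batches, scale the loss, step the
# optimizer every `accumulation_steps` batches, then step the scheduler.
```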
Baseline Language Models In addition to our proposed CD-LM model and the state-of-the-art models detailed in the next sections, we considered the following LM variations in our evaluations, as ablations for our model:
• The plain LONGFORMER-base model, without further pretraining.
• The RAND CD-LM model, based on the Longformer-base model with the additional CD-MLM pretraining, but using random, unrelated documents from various clusters. The amount of data and the pretraining hyperparameters are the same as for CD-LM. This baseline assesses whether pretraining on related documents is beneficial.
When finetuning each of the above models, we restrict each input segment (document/abstract/passage) to a maximum length of 2,047 tokens, so that the entire input, including the CLS token, contains no more than 4,096 tokens.
5 Evaluations And Results
This section presents the intrinsic and extrinsic experiments conducted to evaluate our CD-LM. For the intrinsic evaluation we measure model perplexity, while for the extrinsic evaluations we consider event and entity cross-document coreference resolution, paper citation recommendation, and document plagiarism detection.
5.1 Cross-Document Perplexity
First, we conduct a cross-document perplexity experiment, in a task-independent manner, to assess the contribution of the pretraining process. We used the Multi-News validation and test sets, each containing 5,622 document-summary clusters, to construct the evaluation corpora. We then followed the same protocol as in the pretraining phase: a random 15% of the input tokens are masked and assigned global attention, and the challenge is to predict each masked token given all documents in the input sequence. Perplexity is then measured by exponentiating the average loss.
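A small sketch of this perplexity computation, assuming evaluation batches built with the same masking and global-attention assignment as in pretraining (i.e., batches that already contain MLM labels):

```python
import math
import torch

def masked_lm_perplexity(model, dataloader, device="cuda"):
    """Average the MLM cross-entropy loss over the evaluation set and exponentiate it."""
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for batch in dataloader:                      # batches include input_ids, masks, and labels
            batch = {k: v.to(device) for k, v in batch.items()}
            total_loss += model(**batch).loss.item()
            n_batches += 1
    return math.exp(total_loss / n_batches)
```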
The results are depicted in Table 1. The CD-LM model outperforms the baselines. In particular, its advantage over RAND CD-LM, which was pretrained over an equivalent amount of (unrelated) cross-document data, confirms that cross-document masking during pretraining over related documents indeed helps cross-document masked token prediction. The CD-LM is encouraged to look at the full sequence when predicting a masked token, and therefore exploits related information in other documents as well, not just the local context. The RAND CD-LM is inferior since, in its pretraining phase, it was not exposed to such overlapping, useful information. The plain LONGFORMER model, reported just as a reference point, is expected to have difficulty predicting cross-document tokens; in addition to the reason above, the document separators we use are not part of its embedding set and are randomly initialized for this task. Moreover, recall that the CD-LM and the RAND CD-LM models have two pretrained sets of linear projection weights, one for local attention and one for global attention, whereas the plain LONGFORMER model uses the same weights for both modes and is therefore likely to fail at long-range mask prediction.
5.2 Cross-Document Coreference Resolution
Cross-document (CD) coreference resolution deals with identifying and clustering together textual mentions across multiple documents that refer to the same concept (see examples in Doc 1 and Doc 2 in Fig. 1 ). The considered mentions can be both entity mentions, usually noun phrases, e.g., "Obama" and "President Obama", and event mentions, which are mostly verbs or nominalizations that appear in the text, e.g., "name" and "nominates".
Benchmark. For assessing our CD-LM on CD coreference resolution, we utilized it for an evaluation over the ECB+ corpus (Cybulska and Vossen, 2014a), which is the most commonly used dataset for the task. ECB+ consists of within- and cross-document coreference annotations for entities and events. The ECB+ dataset statistics are described in Appendix A.2. Following previous work, for comparison, we conduct our experiments on gold event and entity mentions. For evaluating the performance of coreference clustering we follow the standard coreference resolution evaluation metrics: MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), CEAF_e (Luo, 2005), their average CoNLL F1, and the more recent LEA metric (Moosavi and Strube, 2016).
Algorithm. Recent approaches for CD coreference resolution train a pairwise scorer to learn the likelihood that two mentions are coreferring. At inference time, agglomerative clustering based on the pairwise scores is applied to form the coreference clusters. Next, we detail our proposed modifications to the pairwise scorer. The current state-of-the-art models (Zeng et al., 2020; Yu et al., 2020) train the pairwise scorer by including only the local contexts (containing sentences) of the candidate mentions. They concatenate the two input sentences and feed them into a Transformer-based LM. Then, some of the resulting token representations are aggregated into a single feature vector, which is passed into an additional MLP-based scorer to produce the coreference probability estimate. To accommodate our proposed CD-LM model, we modify this modeling, as illustrated in Fig. 3. We include the entire documents containing the two candidate mentions, instead of just their containing sentences. We concatenate the relevant documents using the special document separator tokens, then encode them using our CD-LM, along with the CLS token at the beginning of the sequence, as suggested in Section 3. For within-document coreference candidates, we use just the single containing document with the document separator. Inspired by Yu et al. (2020), we use candidate mention marking: we wrap the mentions with special mention-marker tokens. The resulting pairwise feature vector is
m_t(i, j) = [s_t, m_t^i, m_t^j, m_t^i ∘ m_t^j],
where [·] denotes the concatenation operator, ∘ denotes element-wise multiplication, s_t is the cross-document contextualized representation vector of the CLS token, and each of m_t^i and m_t^j is the sum of the token representations of the corresponding mention (i and j). We then train the pairwise scorer according to the finetuning scheme suggested above. At test time, similar to most recent works, we apply agglomerative clustering to merge the most similar cluster pairs. The hyperparameters and further details are given in Appendix B.1.
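The sketch below illustrates this pairwise scorer input; the one-hidden-layer MLP and the span-index interface are assumptions for illustration, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class PairwiseScorer(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 1))

    def forward(self, token_states, cls_index, span_i, span_j):
        """token_states: (seq_len, hidden); span_* = (start, end) token indices, end exclusive."""
        s_t = token_states[cls_index]
        m_i = token_states[span_i[0]:span_i[1]].sum(dim=0)   # sum of mention-i token vectors
        m_j = token_states[span_j[0]:span_j[1]].sum(dim=0)   # sum of mention-j token vectors
        features = torch.cat([s_t, m_i, m_j, m_i * m_j], dim=-1)
        return torch.sigmoid(self.mlp(features))             # coreference probability
```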
Baselines. We consider recent, state-of-the-art baselines that reported results over the ECB+ benchmark. The following baselines were used for both event and entity coreference resolution:
Same Head-Lemma is a simple baseline that merges mentions sharing the same syntactic head-lemma into the same coreference cluster. Barhom et al. (2019) is a model trained jointly for solving both event and entity coreference as a single task. Cattan et al. (2020) is a model trained in an end-to-end manner (jointly learning mention detection and coreference), employing the RoBERTa-large model to encode each document separately and to train a pairwise scorer on top of these representations.
The following baselines were used only for event coreference resolution. They all integrate external linguistic information as additional features: Meged et al. (2020) is an extension of Barhom et al. (2019), leveraging additional side-information acquired from a paraphrase resource (Shwartz et al., 2017). Zeng et al. (2020) is an end-to-end model that encodes the concatenation of the two sentences containing the two mentions with the BERT-large model. Similarly to our proposed algorithm, they feed an MLP-based pairwise scorer with the CLS contextualized token representation and an attentive function of the contextualized representation vectors of the candidate mentions. Yu et al. (2020) is an end-to-end model similar to Zeng et al. (2020), but instead uses RoBERTa-large and does not consider the CLS contextualized token representation for the pairwise classification; it is a non-attentive version of Zeng et al.'s mechanism.

Results. The results on event and entity CD coreference resolution are depicted in Tables 2 and 3. All results are statistically significant using bootstrap and permutation tests with p < 0.001 (Dror et al., 2018). Our CD-LM outperforms the sentence-based models (Zeng et al., 2020; Yu et al., 2020) on event coreference (+1.2 CoNLL F1) and largely surpasses state-of-the-art results on entity coreference (+9.8 CoNLL F1), even though these models utilize external linguistic argument information and include many more parameters (large models vs. our base model). Finally, the RAND CD-LM is inferior to the plain LONGFORMER model, despite the fact that it already has pretrained document separator embeddings. This emphasizes the need for pretraining on related documents rather than random ones, which yields the alignment and paraphrasing capabilities required for coreference detection.
5.3 Paper Citation Recommendation & Plagiarism Detection
We evaluate our CD-LM on the citation recommendation and plagiarism detection benchmarks of Zhou et al. (2020), a recently released suite of cross-document tasks. These tasks share the same objective, categorizing whether a particular relationship holds between two input documents, and therefore correspond to binary classification problems. Citation recommendation deals with detecting whether one reference document should cite the other, while plagiarism detection infers whether one document plagiarizes the other. To compare with recent state-of-the-art models, we utilized the setup and data selection from Zhou et al. (2020), which provides three datasets for citation recommendation and one for plagiarism detection.
Benchmarks. For citation recommendation, we used the ACL Anthology Network Corpus (AAN; Radev et al., 2013), the Semantic Scholar Open Corpus (OC; Bhagavatula et al., 2018) , and the Semantic Scholar Open Research Corpus (S2ORC; Lo et al., 2020) . For plagiarism detection, we used the Plagiarism Detection Challenge (PAN; Potthast et al., 2013) .
AAN is composed of computational linguistics papers published in the ACL Anthology from 2001 to 2014, OC is composed of computer science and neuroscience papers, S2ORC is composed of open access papers across broad domains of science, and PAN is composed of web documents that contain several kinds of plagiarism phenomena. For further dataset preprocessing details and statistics, see Appendix A.3.
Algorithm. For our models, we added the CLS token at the beginning of the input sequence and concatenated the pair of texts, according to the finetuning setup discussed in Section 3. The hyperparameters are detailed in Appendix B.2.
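A sketch of this input construction is shown below; the "<doc-s>"/"</doc-s>" separator strings are the same hypothetical placeholders used earlier, and per-segment truncation to 2,047 tokens (Section 4) is assumed to be applied to each document beforehand.

```python
import torch

def encode_pair(tokenizer, doc_a, doc_b, max_len=4096):
    """Concatenate two documents with separator tokens; global attention on the CLS token only."""
    text = f"<doc-s>{doc_a}</doc-s><doc-s>{doc_b}</doc-s>"
    enc = tokenizer(text, truncation=True, max_length=max_len, return_tensors="pt")
    global_attention = torch.zeros_like(enc["input_ids"])
    global_attention[:, 0] = 1                         # CLS token attends globally
    enc["global_attention_mask"] = global_attention
    return enc

# The encoded pair can then be fed to a sequence-classification head
# (e.g., LongformerForSequenceClassification with num_labels=2) on top of the CD-LM checkpoint.
```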
Baselines. We consider the reported results of the following recent baselines: SMASH (Jiang et al., 2019) is an attentive hierarchical RNN model used for long-document tasks.
SMITH is a BERT-based hierarchical model, similar to the previously suggested hierarchical attention networks (HANs; Yang et al., 2016).
BERT-HAN+CDA (Zhou et al., 2020) is a cross-document attention mechanism (CDA) built on top of hierarchical attention networks (HANs), based on BERT. For more details, see Section 6. We report the results of their finetuned model over the datasets (Zhou et al., 2020, Section 5.2).
Note that both SMASH and SMITH reported results only over the AAN benchmark. In addition, they used a slightly different version of the AAN dataset that includes the full documents, unlike the dataset used by BERT-HAN+CDA, which we utilize as well, and which considers only the documents' abstracts.
Results. The results on citation recommendation over the AAN dataset are depicted in Table 4. We observe that even though several baselines reported results using the full documents, our model outperforms them while using the partial version of the dataset, as in Zhou et al. (2020). Moreover, unlike our model, CDA is task-specific, since it trains new cross-document weights for each task, yet it is still inferior to our model. The results on the rest of the benchmarks are reported in Table 5; as can be seen, our CD-LM consistently outperforms both the prior baseline and the LONGFORMER and RAND CD-LM models.
6 Related Work
Recently, several works proposed equipping LMs with cross-document processing capabilities, mostly by harnessing sequence-to-sequence architectures. Lewis et al. (2020) suggested pretraining an LM by reconstructing a document, given, and conditioned on, related documents. They showed that this technique forces the model to learn how to paraphrase the original reconstructed document, leading to significant performance gains on multi-lingual document summarization and retrieval. However, this work employs a basic retrieval model that does not consider cross-document interactions at all. Other work proposed an end-to-end architecture for improving abstractive summarization: unlike standard LMs, in their pretraining several sentences (and not just tokens) are removed from documents, and the model's task is to recover them. A similar approach was also suggested for single-document summarization. The advantage of such self-supervision approaches is that they produce high-quality summaries without any human annotation, often the bottleneck in purely supervised summarization systems. While these approaches advanced the state of the art in sequence-to-sequence tasks, the encoders they employ support encoding only a single document at a time. In our work, we allow inputs comprised of multiple documents in each sample, to support cross-document contextualization. Moreover, the main drawback of such sequence-to-sequence architectures is that they require a massive amount of data and training time to obtain a plausibly trained model, whereas we use a relatively small corpus.
The closest work to our proposed model is the recent Cross-Document Attention model (CDA) (Zhou et al., 2020). They introduced a cross-document component that enables document-to-document and sentence-to-document alignments. This model is set on top of existing hierarchical document encoding models (Sun et al., 2018; Liu and Lapata, 2019; Guo et al., 2019), which do not consider information across documents by themselves. CDA suggests influencing the document and sentence representations by those of other documents, without considering word-to-word information across documents (which might require an additional quadratic number of parameters). This makes such modeling unsuitable for token-level alignment tasks, such as cross-document coreference resolution. Moreover, unlike our proposed model, which employs a generic cross-document pretraining, the CDA mechanism requires learning the cross-document parameters from scratch for each downstream task. Further, they support cross-document attention between two documents only, while our method does not restrict the number of input documents, as long as they fit the input length of the Longformer.
7 Conclusion
We presented a novel pretraining strategy and technique for cross-document language modeling, providing better encoding for cross-document downstream tasks. Our primary contribution is cross-document masking over clusters of related documents, driving the model to encode cross-document relationships. This was achieved by extending the use of the global attention mechanism of the Longformer model (Beltagy et al., 2020) to pretraining, attending to long-range information across and within documents. Our experiments show that leveraging our cross-document language model yields new state-of-the-art results over several cross-document benchmarks, including the fundamental task of cross-document entity and event coreference, while employing substantially smaller models. We suggest the attractiveness of our CD-LM for neural encoding in cross-document tasks, and propose future research to extend this framework to support sequence-to-sequence cross-document tasks, such as multi-document abstractive summarization.
A Dataset Statistics And Details
In this section, we provide more details about the corpus and benchmark datasets used in our experiments.
A.1 Multi-News Corpus
In Table 6 we list the number of related source documents per cluster. This follows the original dataset construction. Note that the dataset and the statistics are taken from Fabbri et al. (2019a).
A.2 ECB+ Dataset
In Table 7 we list the statistics of the training, development and test splits regarding topics, documents, mentions and coreference clusters. We follow the data split used in previous work (Cybulska and Vossen, 2015; Kenyon-Dean et al., 2018; Barhom et al., 2019): Training topics: 1, 3, 4, 6-11, 13-17, 19-20, 22, 24-33; Validation topics: 2, 5, 12, 18, 21, 23, 34, 35; Test topics: 36-45.
A.3 Paper Citation Recommendation & Plagiarism Detection Datasets
In Table 8 we list the training, development and test split sizes for each benchmark, and in Table 9 we list the document-pair and unique-document counts for each benchmark. The statistics are taken from Zhou et al. (2020).

Table 8: Document-to-document benchmark statistics, detailing the training, validation, and test splits (number of document pairs).

Dataset   Train     Validation   Test
AAN       106,592   13,324       13,324
OC        240,000   30,000       30,000
S2ORC     152,000   19,000       19,000
PAN       17,968    2,908        2,906

Table 9: Document-to-document benchmark statistics: the count of document pairs and the count of unique documents per benchmark.

The preprocessing of the datasets performed by Zhou et al. (2020) includes the following steps:
For AAN, only pairs of documents that include abstracts are considered, and only their abstracts are used. For OC, only one citation per paper is considered, and the dataset was downsampled significantly. For S2ORC, pairs formed of citing sections and the corresponding abstract of the cited paper are included, and the dataset was downsampled significantly. For PAN, pairs of relevant segments were extracted from the texts. For all datasets, negative pairs are sampled randomly. Then, a standard preprocessing step filters out characters that are not digits, letters, punctuation, or white space. The complete dataset statistics are given in Tables 8 and 9.
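A minimal sketch of the character-filtering step, restricted here to ASCII characters as a simplifying assumption:

```python
import string

# Keep only digits, letters, punctuation, and whitespace (ASCII approximation).
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + string.whitespace)

def clean_text(text):
    return "".join(ch for ch in text if ch in ALLOWED)
```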
B Hyperparameter Settings And Training Details
In this section, we elaborate on the hyperparameter choices for our experiments.
B.1 Cross-Document Coreference Resolution
We adopt the same protocol as suggested in Cattan et al. (2020): our training set is composed of positive instances consisting of all pairs of mentions that belong to the same coreference cluster, while the negative examples are randomly sampled. We fine-tune our models for 10 epochs, with an effective batch size of 128. The feature vector is passed through an MLP-based pairwise scorer to produce the coreference probability.
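The pair construction described above can be sketched as follows; the negative-to-positive ratio is an assumption, as the exact sampling proportion is not specified here.

```python
import random
from itertools import combinations

def build_pairs(clusters, neg_ratio=1):
    """clusters: list of lists of mention ids, one inner list per gold coreference cluster."""
    positives = [pair for cluster in clusters for pair in combinations(cluster, 2)]
    if len(clusters) < 2:
        return positives, []                          # no cross-cluster negatives available
    cluster_of = {m: i for i, cluster in enumerate(clusters) for m in cluster}
    all_mentions = list(cluster_of)
    negatives = []
    while len(negatives) < neg_ratio * len(positives):
        a, b = random.sample(all_mentions, 2)
        if cluster_of[a] != cluster_of[b]:            # keep only pairs from different gold clusters
            negatives.append((a, b))
    return positives, negatives
```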
For more details, see the masking procedure of BERT (Devlin et al., 2019).
We use the HuggingFace implementation, https://github.com/huggingface/transformers. The pretraining took 8 days, using eight 48GB RTX8000 GPUs.
They utilized semantic role labeling to add features related to the arguments of each event mention.
Following the most recent work of Zhou et al. (2020), we evaluate our model on their version of the dataset. We also quote the results of the SMASH and SMITH methods, even though they used a somewhat different version of this dataset; hence their results are not fully comparable to those of our model and BERT-HAN+CDA.
We used the implementation from https://github.com/ariecattan/cross_encoder.
We used the script https://github.com/XuhuiZhou/CDA/blob/master/BERT-HAN/run_ex_sent.sh.