A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration
One of the primary tasks of morphological parsers is the disambiguation of homographs. Particularly difficult are cases of unbalanced ambiguity, where one of the possible analyses is far more frequent than the others. In such cases, there may not exist sufficient examples of the minority analyses in order to properly evaluate performance, nor to train effective classifiers. In this paper we address the issue of unbalanced morphological ambiguities in Hebrew. We offer a challenge set for Hebrew homographs — the first of its kind — containing substantial attestation of each analysis of 21 Hebrew homographs. We show that the current SOTA of Hebrew disambiguation performs poorly on cases of unbalanced ambiguity. Leveraging our new dataset, we achieve a new state-of-the-art for all 21 words, improving the overall average F1 score from 0.67 to 0.95. Our resulting annotated datasets are made publicly available for further research.
It is well known that the distribution of linguistic units, or words, in a language follows a Zipfian distribution (Zipf, 1949), wherein a relatively small number of words appear frequently, and a much larger number of items appear in a long tail of words, as rare events (Czarnowska et al., 2019). Significantly, this also applies to the distribution of analyses of a given homograph. Take for instance the simple POS-tag ambiguity in English between noun and verb (Elkahky et al., 2018). The word "fair" can be used as an adjective ("a fair price") or as a noun ("she went to the fair"). Yet, the distribution of these two analyses is certainly not fair; the adjectival usage is far more frequent than the nominal usage (e.g., in Bird et al. (2008) the former is six times more frequent than the latter). We will call such cases "unbalanced homographs".
Cases of unbalanced homographs pose a formidable challenge for automated morphological parsers and segmenters. In tagged training corpora, the frequent option will naturally account for the overwhelming majority of the occurrences. If the training corpus is not sufficiently large, then the sparsity of the minority analysis will prevent generalization by machine-learning models. By the same token, it can be difficult to evaluate the performance of tagging systems regarding unbalanced homographs, because the sparsity of the minority analysis prevents computation of adequate scoring.
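The evaluation problem can be made concrete with a small worked example. The sketch below is illustrative only: the 30:1 toy sample, the label names, and the helper `f1_for_label` are assumptions introduced here, not data or code from this paper. It shows how a tagger that always outputs the majority analysis can look accurate while scoring zero on the minority analysis.

```python
def f1_for_label(gold, predicted, label):
    """Return the F1 score for one analysis label (0.0 if it has no true positives)."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == label and p == label)
    false_pos = sum(1 for g, p in pairs if g != label and p == label)
    false_neg = sum(1 for g, p in pairs if g == label and p != label)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy sample (hypothetical): 30 majority-analysis instances per minority instance.
gold = ["ADJ"] * 30 + ["NOUN"] * 1
# A tagger that always picks the majority analysis.
predicted = ["ADJ"] * len(gold)

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
f1_majority = f1_for_label(gold, predicted, "ADJ")
f1_minority = f1_for_label(gold, predicted, "NOUN")
macro_f1 = (f1_majority + f1_minority) / 2

print(f"accuracy={accuracy:.3f} majority F1={f1_majority:.3f} "
      f"minority F1={f1_minority:.3f} macro F1={macro_f1:.3f}")
```

With a single minority instance in the sample, the minority score also rests on one example, which is why a corpus with substantial attestation of each analysis is needed before such scores become meaningful.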
The empirical consequences of unbalanced homographs are magnified in morphologically rich languages (MRLs), including many Semitic languages, where distinct morphemes are often affixed to the word itself, resulting in additional ambiguity (Fabri et al., 2014; Habash et al., 2009). Furthermore, in many Semitic MRLs, the letters are almost entirely consonantal, omitting vowels. This results in a particularly high number of homographs, each with a different pronunciation and meaning.
In this paper, we focus upon unbalanced homographs in Hebrew, a highly ambiguous MRL in which vowels are generally omitted (Itai and Wintner, 2008; Adler and Elhadad, 2006). Take for example the Hebrew word מדינה. This frequent word is generally read as a single nominal morpheme, מְדִינָה, meaning "country". However, it can also be read as מִדִּינָהּ, "from the law/judgment of her", wherein the initial and final letters both serve as distinct morphemes. This last usage is far less common, and, in an overall distribution, it would be relegated to the long tail, with very few attestations in any given corpus.
Hebrew is a low-resource language, and as such, the problem of unbalanced homographs is particularly acute. Existing tagged corpora of Hebrew are of limited size, and in most cases of unbalanced homographs, the corpora do not provide sufficient examples either to evaluate performance regarding minority analyses or to train an effective classifier.
Here, we propose to overcome this difficulty by means of a challenge set: a group of specialized training sets, each of which focuses upon one particular homograph, offering substantial attestations of the competing analyses. Designing such contrast sets that expose particularly hard unbalanced cases was recently proposed as a complementary evaluation effort for a range of NLP tasks by Gardner et al. (2020). Notably, all tasks therein focus exclusively on English, and do not make any reference to morphology. Another, particularly successful, instance of this approach is the Noun/Verb challenge set for English built by Elkahky et al. (2018). Yet, heretofore, no challenge sets have been built to address cases of unbalanced homographs in Hebrew.
In order to fill this lacuna, we built a challenge set for 12 frequent cases of unbalanced Hebrew homographs. Each of these words admits of two possible analyses, each with its own diacritization and interpretation. 1 For each of the possible analyses, we gather 400 to 2,500 sentences exemplifying such usage, from a varied corpus consisting of news, books, and Wikipedia. Furthermore, in order to highlight the particular problem regarding unbalanced homographs, we add an additional 9 cases of balanced homographs, for contrast and comparison. All in all, the corpus contains over 56K sentences. 2
2 Description Of The Corpus
In Table 1 we list the 21 homographs addressed in our challenge set. For each case, we specify the frequency of each analysis in naturally-occurring Hebrew text, and the ratio between them. 3 The 21 homographs include a wide range of homograph types. Some are cases of different POS types: Adj vs. Prep (13), Noun vs. Verb (15, 18), Pronoun vs. Prep (2, 4), Noun vs. Prep (9), etc. Other cases differ in terms of whether the final letter should be segmented as a suffix (10, 13, 20). In some instances, the morphology is the same, but the difference lies in the stem/lexeme (5, 7, 8, 11).
In choosing our 21 homographs, we first assembled a list of the most frequent homographs in the Hebrew language. For the simplicity of this initial proof of concept, we constrained our list to homographs with only two primary analyses. We also constrained our list to cases where the two analyses represent different lexemes, skipping over cases in which the difference is only one of inflection. Further, some cases were filtered out due to data sparsity. Finally, we also included a number of less frequent homographs, to allow for a comparison between frequent and infrequent homographs.
In order to gather sentences for the contrast sets, we first sampled 5000 sentences for each target word, and sent them to student taggers. For balanced homographs, with ratios of 1:3 or less, this process handily provided a sufficiently large number of sentences for each of the two analyses. However, regarding cases of unbalanced homographs, wherein the naturally occurring ratio of the majority analysis to the minority analysis can be 30:1 or even 129:1, this initial corpus was far from adequate. We used two methods to identify additional candidate sentences: (1) We ran texts through an automated Hebrew diacritizer (Shmidman et al., 2020) and took the cases where the word was diacritized as the minority analysis. (2) Where relevant, we leveraged optional Hebrew orthographic variations which indicate that a given word is intended in one specific way. These candidate sentences were then sent to student taggers to confirm that the minority analysis was in fact intended. Our student taggers tagged approximately 300 sentences per hour. Evaluation of their work revealed that they averaged an accuracy of 98 percent. In order to overcome this margin of error, we employed a Hebrew language expert who proofread the resulting contrast sets. In our final corpus, each analysis of each homograph is attested in at least 400 sentences, and usually in 800-2.5K sentences (full details in Appendix Table 1).
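Method (1) above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: the function `select_minority_candidates`, the callable `diacritize` (assumed to return one vocalized form per input token), the stub `toy_diacritize`, and the transliterated strings are all hypothetical stand-ins. In particular, the stub does not reproduce the interface of the diacritizer of Shmidman et al. (2020).

```python
def select_minority_candidates(sentences, target, minority_form, diacritize):
    """Keep (sentence, token index) pairs where the diacritizer chose the minority form."""
    candidates = []
    for sentence in sentences:
        tokens = sentence.split()
        vocalized = diacritize(tokens)  # assumed: one vocalized form per token
        for index, token in enumerate(tokens):
            if token == target and vocalized[index] == minority_form:
                # One candidate per matching instance of the target word.
                candidates.append((sentence, index))
    return candidates


def toy_diacritize(tokens):
    """Hypothetical stub: choose the minority reading only after the token 'psq'."""
    output = []
    for index, token in enumerate(tokens):
        if token == "mdynh":
            after_psq = index > 0 and tokens[index - 1] == "psq"
            output.append("midinah" if after_psq else "medina")
        else:
            output.append(token)
    return output


# Transliterated placeholder sentences (not real corpus data).
sentences = ["hw gr b mdynh", "psq mdynh hyh qsh", "mdynh w psq mdynh"]
candidates = select_minority_candidates(sentences, "mdynh", "midinah", toy_diacritize)
for sentence, index in candidates:
    print(index, sentence)
```

Because the diacritizer's output is only a guess, the selected candidates would still need the human confirmation step described above before entering the corpus.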
One issue we encountered when collecting naturally-occurring Hebrew sentences is that a small number of specific word-neighbors and collocations tend to dominate the examples. As an example: the word אפשר can be vocalized as אֶפְשָׁר ("possible", the majority case), or אִפְשֵׁר ("he allowed"). However, over one third of the naturally occurring cases of the majority case boil down to some 90 frequently occurring collocations, such as אי אפשר ("impossible") or הא אפשר ("is it possible?"). As such, a machine-learning model would overfit to those specific collocations, rather than learning more generic overarching patterns
[...] of nine cases of balanced homographs. As expected, YAP does considerably better here: all F1 scores are above .5, and four of the cases are above .8. The weakest cases are those in which YAP has to differentiate between an unsegmented noun and a case of a noun plus possessive suffix (cases 14, 20). In both of these cases, YAP scores an F1 of approximately .56 (which, interestingly, is precisely on par with the analogous unbalanced case).
In Table 3, we display results regarding our specialized classifiers. In most cases, using a biLSTM over the entire sentence context performs better than a concatenation of the three neighbor words on each side. In terms of the encoding method for the context words, word2vec performs better than the morphological lattice. This may be because word2vec can better represent the regularly expected usage of the neighboring words, while the morphology lattice represents all possible analyses with equal likelihood. A second possibility is that the contrast sets were not sufficiently large to optimally train the embeddings of the morphological characteristics, whereas word2vec embeddings have the benefit of pretraining on over 100M words. The combination of the latter two methods overall outperforms each one of them individually; thus, although word2vec succeeds in encoding most of what is needed to differentiate between the options, the information provided by the morph lattice sometimes helps to make the correct call.
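For readers unfamiliar with the simpler of the two context encodings, the sketch below shows one way a "three neighbor words on each side" feature vector could be assembled. It is an assumption-laden illustration rather than the implementation evaluated here: the function `window_features`, the zero padding at sentence edges, the exclusion of the target word itself, and the two-dimensional `toy_embed` stand-in for word2vec are all hypothetical choices.

```python
def window_features(tokens, target_index, embed, dim, width=3):
    """Concatenate the vectors of `width` neighbors on each side of the target word."""
    padding = [0.0] * dim  # assumed: zero vector where the window leaves the sentence
    features = []
    for offset in range(-width, width + 1):
        if offset == 0:
            continue  # assumed: the ambiguous word itself is not encoded
        position = target_index + offset
        if 0 <= position < len(tokens):
            features.extend(embed(tokens[position]))
        else:
            features.extend(padding)
    return features


def toy_embed(token):
    """Hypothetical 2-dimensional stand-in for a pretrained word vector."""
    return [float(len(token)), float(token.count("a"))]


tokens = ["a", "bb", "ccc", "dddd", "ee"]
features = window_features(tokens, target_index=1, embed=toy_embed, dim=2)
print(len(features), features)
```

The resulting vector has a fixed length of 2 × width × dim regardless of sentence length, whereas a biLSTM reads the whole sentence; this difference in available context is one plausible reason for the gap reported above.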
In Table 4, we compare the results of our composite method with those of YAP. Our specialized classifiers set a new SOTA for all the cases.
5 Related Work
Many recent papers have proposed global or unsupervised methods for homograph disambiguation in English (e.g. Liu et al. (2018); Wilks and Stevenson (1997); Chen et al. (2009)). While such methods have obvious advantages, they have limited applicability to Hebrew. As noted, in Hebrew the majority of the words are ambiguous, including the core building blocks of the language; without these anchors, global approaches tend to result in poor performance regarding unbalanced homographs.
The problem of Hebrew diacritization is analogous to that of Arabic diacritization; Arabic, like Hebrew, is a morphologically-rich language written without diacritics, resulting in high ambiguity. Many recent studies have proposed machine-learning approaches for the prediction of Arabic diacritics across a given text (e.g. Bebah et al. (2014); Belinkov and Glass (2015); Neme and Paumier (2019); Fadel et al. (2019a,b); Darwish et al. (2020)). However, these studies all perform evaluations on standard Arabic textual datasets, and do not evaluate accuracy regarding minority options of unbalanced homographs. We believe that these models would likely benefit from specialized challenge sets of the sort presented here to overcome the specific hurdle of unbalanced homographs.
Due to high morphological ambiguity, as well as the lack of diacritics, Semitic languages pose a particularly difficult disambiguation task, especially when it comes to unbalanced homographs. For such cases, specialized contrast sets are needed, both in order to evaluate performance of existing tools, as well as in order to train effective classifiers. In this paper, we construct a new challenge set for Hebrew disambiguation, offering comprehensive contrast sets for 21 frequent Hebrew homographs. These contrast sets empirically demonstrate the limitations of reported SOTA results when it comes to unbalanced homographs; a model may report a SOTA for a benchmark, yet fail miserably on real-world rare-but-important cases. Our new corpus will allow Hebrew NLP researchers to test their models in an entirely new fashion, evaluating the ability of the models to predict minority homograph analyses, as opposed to existing Hebrew benchmarks which tend to represent the language in terms of its majority usage. Furthermore, our corpus will allow researchers to train their own classifiers and leverage them within a pipeline architecture. We envision the classifiers positioned at the beginning of the pipeline, disambiguating frequent forms from the get-go, and yielding improvement down the line, ultimately improving results for downstream tasks (e.g. NMT). Indeed, as we have demonstrated, neural classifiers trained on our contrast sets handily achieve a new SOTA for all of the homographs in the corpus.
1 In some of the cases, additional analyses are theoretically possible, but exceedingly rare.
2 In cases where a given sentence contains more than one instance of the target word, the sentence is included multiple times, once for each instance.
3 All statistics in this paper regarding the distribution of Hebrew word analyses are based upon an in-house annotated 2.4M word corpus maintained by DICTA.