APPLICATION OF THE METHOD OF SEARCHING FOR SOURCES OF INTERLANGUAGE BORROWINGS IN TYPOLOGICALLY DIFFERENT LOW-RESOURCE LANGUAGES
Keywords:
cross-language borrowing, candidate retrieval, plagiarism, natural language processing, text uniqueness, low-resource languages
Abstract
Although many approaches exist for detecting cross-language borrowings and, in particular, for candidate retrieval, their applicability and effectiveness for low-resource languages remain poorly studied. This paper presents the results of applying the authors' method for retrieving cross-language borrowing candidates, which was initially applied exclusively to the Armenian-English language pair, to several typologically different low-resource languages, such as Georgian, Greek, Finnish, and Romanian. The main idea of the method is to compare multilingual texts separately by part of speech, in particular by nouns, verbs, and adjectives. Because this comparison focuses on the most informative lexical units, it is assumed to produce better results than comparing all words. This assumption follows from the hypothesis that nouns carry the greatest semantic weight in a sentence, followed by adjectives and verbs. Applying the method to languages that differ in morphological and syntactic structure aims to confirm this hypothesis beyond the original language pair and to demonstrate the generalizability of the proposed method. The experiments were conducted on texts taken from Wikipedia. The paper also presents the results of applying the method with synonym recognition via WordNet across all considered language pairs, with the aim of improving result quality. These results can support the development of cross-language borrowing detection systems adapted to specific low-resource languages.
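The idea of comparing texts separately by part of speech can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that both texts have already been lemmatized, POS-tagged, and mapped to a common language, that tags follow the Universal POS naming (NOUN, VERB, ADJ), and it uses Jaccard overlap as a stand-in similarity measure. All function names and sample data are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumption: Universal POS tag names; the paper does not specify a tag set here.
POS_TAGS = ("NOUN", "VERB", "ADJ")

TaggedText = Iterable[Tuple[str, str]]  # (lemma, POS tag) pairs


def lemmas_by_pos(tagged: TaggedText) -> Dict[str, Set[str]]:
    """Group lemmas into one set per part of speech, ignoring other tags."""
    groups: Dict[str, Set[str]] = {pos: set() for pos in POS_TAGS}
    for lemma, pos in tagged:
        if pos in groups:
            groups[pos].add(lemma.lower())
    return groups


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pos_similarity(suspicious: TaggedText, candidate: TaggedText) -> Dict[str, float]:
    """Compare two texts separately for nouns, verbs, and adjectives."""
    s = lemmas_by_pos(suspicious)
    c = lemmas_by_pos(candidate)
    return {pos: jaccard(s[pos], c[pos]) for pos in POS_TAGS}


# Hypothetical, already-translated and tagged sample data.
suspicious = [("cat", "NOUN"), ("sit", "VERB"), ("mat", "NOUN"),
              ("soft", "ADJ"), ("on", "ADP")]
candidate = [("cat", "NOUN"), ("sleep", "VERB"), ("mat", "NOUN"),
             ("soft", "ADJ"), ("the", "DET")]

scores = pos_similarity(suspicious, candidate)
print(scores)
```

Keeping one score per part of speech, rather than a single score over all words, lets a retrieval step rank candidates by noun overlap alone or by any weighted combination of the three scores.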



