Acquisition of lexical information from a large textual Italian corpus

Bindi R
2003

Abstract

Information other than that usually found in machine-readable dictionaries or manually encoded by lexicographers is urgently needed. Different sources must be exploited if we want to overcome the “lexical bottleneck” of Natural Language Processing (NLP). Valuable data can be found by processing large textual corpora, where actual language usage can be investigated directly. These data typically concern various kinds of syntagmatic relations, which are particularly problematic in many NLP applications. The paper describes how such data can be extracted, at least partially, by processing and analysing large text corpora with quantitative/statistical methods. We describe two types of quantitative analysis, which aim to extract information on the strength of association between two words and on fixed phrases and idioms. We observe how the association ratio provides quantitative evidence for a number of lexical, syntactic and semantic relationships between word pairs. One of our claims is that the linguistic information embodied in all these quite different types of lexical collocation can be helpful for lexical disambiguation in analysis and crucial for lexical selection in generation. This is a step towards a more objective lexicography and a more “data-based” linguistics.
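
The abstract does not define the association ratio. As a rough illustration only, the sketch below assumes the common formulation of the measure as log2(P(x, y) / (P(x) P(y))), where P(x, y) is estimated from the number of times word y follows word x within a fixed window. The function name, the window size and the toy token list are hypothetical choices made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def association_ratios(tokens, window=5, min_pair_count=1):
    """Score ordered word pairs (x, y) by log2(P(x, y) / (P(x) * P(y))).

    Assumption: a pair is counted each time y occurs after x within
    `window - 1` following tokens; all probabilities are plain relative
    frequencies over the corpus size n (no smoothing, no window scaling).
    """
    n = len(tokens)
    word_counts = Counter(tokens)

    # Count ordered co-occurrences inside the sliding window.
    pair_counts = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + window]:
            pair_counts[(x, y)] += 1

    scores = {}
    for (x, y), f_xy in pair_counts.items():
        if f_xy < min_pair_count:
            continue  # very rare pairs give unstable estimates
        p_xy = f_xy / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    # Toy input; a real corpus would be tokenised and far larger.
    toy_tokens = "a b a b c d".split()
    ranked = sorted(
        association_ratios(toy_tokens, window=2).items(),
        key=lambda item: item[1],
        reverse=True,
    )
    for pair, score in ranked:
        print(pair, round(score, 3))
```

A positive score means the pair co-occurs more often than its individual word frequencies would predict. On a real corpus a frequency threshold such as `min_pair_count` matters, because this kind of ratio tends to overrate pairs that are seen only once or twice.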
Istituto di linguistica computazionale "Antonio Zampolli" - ILC

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/37647