Conference proceedings paper, 2024, ENG

Towards the Automated Population of Thesauri Using BERT: A Use Case on the Cybersecurity Domain

E. Cardillo (1); A. Portaro (1); M. Taverniti (1); C. Lanza (1); R. Guarasci (2)

IIT-CNR, Cosenza, Italy (1); Università della Calabria, ICAR CNR (2)

The present work explores methods that leverage the widely used BERT model to support the population and enrichment of domain-oriented controlled vocabularies such as thesauri. Starting from BERT's embeddings, we extracted information from a sample corpus of cybersecurity-related documents and present a novel Natural Language Processing-inspired pipeline that combines neural language models, knowledge graph extraction, and natural language inference to identify implicit relations (adaptable to thesaural relationships) and domain concepts with which to populate a domain thesaurus. Preliminary results are promising: they indicate that the proposed methodology is effective and, more broadly, that LLMs, BERT in particular, can be applied to enrich specialized controlled vocabularies with new knowledge.

The 12th International Conference on Emerging Internet, Data & Web Technologies (EIDWT-2024), Naples, Italy, 21-23 February 2024

Keywords

Thesauri, Domain-specific language modeling, Semantic analysis, Knowledge Extraction, LLMs

CNR authors

Lanza Claudia, Guarasci Raffaele, Portaro Alessio, Taverniti Maria, Cardillo Elena

CNR institutes

IIT – Istituto di informatica e telematica

ID: 492270

Year: 2024

Type: Conference proceedings paper

Creation: 2024-01-30 11:46:49.000

Last update: 2024-01-30 16:48:39.000

External links

OAI-PMH: Dublin Core

OAI-PMH: Mods

OAI-PMH: RDF

External IDs

CNR OAI-PMH: oai:it.cnr:prodotti:492270