Multilingual, Multi-Domain NLP for Structured Data
SCROLL is a special-purpose, production-level, Java-based natural language processing framework.
The name stands for Semantic CROss-Lingual Label parser. The parts of the name, together with the framework's cross-domain design, are explained below:
- label parser because it is a special-purpose NLP tool tuned to the analysis of the type of short block language commonly appearing in metadata fields, catalogues, classifications, or other structured data records;
- semantic because the goal of its operation is to perform semantic analysis by linking common words to concepts (word sense disambiguation) and named entities to individuals (named entity disambiguation);
- cross-lingual because the framework is designed to be generic enough to integrate processing pipelines for multiple languages, with some components shared across languages while others are language-specific;
- cross-domain because the framework can detect the domain(s) underlying the pieces of text (e.g., healthcare, finance, tourism) and adapt its semantic processing accordingly.
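To make the semantic and cross-domain points above more concrete, the following is a minimal sketch of domain-driven sense selection. It is not SCROLL's actual API or algorithm: the class name, the toy sense inventory, the cue-word lists, and the overlap-counting heuristic are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: every name and data item here is hypothetical, not part of SCROLL.
class DomainSenseSketch {

    // Toy sense inventory: word -> (domain -> concept gloss).
    static final Map<String, Map<String, String>> SENSES = Map.of(
        "bank", Map.of("finance", "financial institution",
                       "tourism", "river bank"),
        "operation", Map.of("healthcare", "surgical procedure",
                            "finance", "business transaction"));

    // Toy cue words that hint at a domain.
    static final Map<String, Set<String>> DOMAIN_CUES = Map.of(
        "finance", Set.of("account", "loan", "interest"),
        "healthcare", Set.of("patient", "hospital", "diagnosis"),
        "tourism", Set.of("hotel", "river", "tour"));

    // Picks the domain whose cue words overlap most with the label tokens.
    static Optional<String> detectDomain(List<String> tokens) {
        String best = null;
        int bestHits = 0;
        // Iterate in sorted order so that ties are resolved deterministically.
        for (String domain : new TreeSet<>(DOMAIN_CUES.keySet())) {
            int hits = 0;
            for (String token : tokens) {
                if (DOMAIN_CUES.get(domain).contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = domain;
            }
        }
        return Optional.ofNullable(best);
    }

    // Returns the sense of the word in the detected domain, if one is known.
    static Optional<String> disambiguate(String word, List<String> tokens) {
        Map<String, String> sensesByDomain = SENSES.get(word);
        if (sensesByDomain == null) {
            return Optional.empty();
        }
        return detectDomain(tokens).map(sensesByDomain::get);
    }

    public static void main(String[] args) {
        // prints "financial institution"
        System.out.println(
            disambiguate("bank", List.of("bank", "account", "number")).orElse("unknown"));
        // prints "river bank"
        System.out.println(
            disambiguate("bank", List.of("river", "bank", "hotel")).orElse("unknown"));
    }
}
```

In this sketch the same word receives a different concept depending on the domain suggested by the other tokens in the label; a real system would rely on proper lexical resources rather than hand-written maps.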
SCROLL currently supports Arabic, Chinese, English, German, Italian, Mongolian, and Spanish.
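The cross-lingual design described above, in which some components are shared across languages while others are language-specific, can be sketched as a registry of per-language pipelines. This is a hedged illustration rather than SCROLL's real architecture: the `Step` interface, the language codes, and the toy stop-word lists are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Illustrative only: every name and data item here is hypothetical, not part of SCROLL.
class LabelPipelineSketch {

    // A pipeline step maps a list of tokens to a new list of tokens.
    interface Step extends Function<List<String>, List<String>> {}

    // Shared component: the same lowercasing step is reused by every pipeline below.
    static final Step LOWERCASE = tokens -> {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    };

    // Language-specific component: a stop-word filter built from a per-language list.
    static Step stopWordFilter(Set<String> stopWords) {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String token : tokens) {
                if (!stopWords.contains(token)) {
                    out.add(token);
                }
            }
            return out;
        };
    }

    // Registry: language code -> ordered list of steps.
    static final Map<String, List<Step>> PIPELINES = Map.of(
        "en", List.of(LOWERCASE, stopWordFilter(Set.of("of", "the"))),
        "it", List.of(LOWERCASE, stopWordFilter(Set.of("di", "il", "la"))));

    // Splits a short label on whitespace and runs it through the language's pipeline.
    static List<String> process(String language, String label) {
        List<Step> steps = PIPELINES.get(language);
        if (steps == null) {
            throw new IllegalArgumentException("Unsupported language: " + language);
        }
        List<String> tokens = Arrays.asList(label.trim().split("\\s+"));
        for (Step step : steps) {
            tokens = step.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(process("en", "Date of Birth"));   // prints [date, birth]
        System.out.println(process("it", "Data di nascita")); // prints [data, nascita]
    }
}
```

Under these assumptions, adding a language means registering a new list of steps, reusing the shared ones and supplying only the language-specific pieces.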
SCROLL Vision and Mission
Natural language text is pervasive in structured data sets (relational database tables, spreadsheets, XML documents, RDF graphs, etc.), so data processing applications need some level of natural language understanding. Although these formal or semi-formal data models were designed to ease machine processing, they still tend to contain a large amount of informal natural language text within schema elements, data values, and metadata. Extracting meaning from such text is a key step in well-known tasks such as data analysis, data integration, linguistic and semantic processing, and information retrieval. It is made difficult, however, by the heterogeneity of languages, domains, syntax, orthography, and naming conventions, all typical of natural language text.
To tackle these issues, SCROLL provides cross-lingual and cross-domain natural language processing tailored to the specific needs of structured data. The fundamental idea behind the platform is that linguistic processing is shared across languages and across application domains. This is achieved by building linguistic resources and tools collaboratively, using a shared infrastructure.
SCROLL is a collaborative research and development project, intended to be carried out as a shared effort among partners worldwide. The source code and documentation of SCROLL are available to partners, who are free to use the tool for their own projects. In return, they are encouraged to join our efforts in one or more of the following areas:
- the development of language-specific components and resources;
- multilingual and multi-domain semantic analysis;
- the block language of structured data.
The ultimate goal is to build a community of NLP researchers and developers who participate in joint R&D projects on the reuse of techniques and resources across languages and domains, and on the design of new multilingual and multi-domain approaches to known NLP problems.
Our Guidelines for bootstrapping cross-lingual NLP development for structured data are available for download.
Publications
- Abed Alhakim Freihat, Gábor Bella, Hamdy Mubarak, and Fausto Giunchiglia. A Single-Model Approach for Arabic Segmentation, POS Tagging, and Named Entity Recognition. Proceedings of the International Conference on Natural Language and Speech Processing, Algiers, Algeria, 2018.
- Gábor Bella, Fausto Giunchiglia, and Fiona McNeill. Language and Domain Aware Lightweight Ontology Matching. Journal of Web Semantics, vol. 43, March 2017, pp. 1-17.
- Gábor Bella, Alessio Zamboni, and Fausto Giunchiglia. Domain-Based Sense Disambiguation on Multilingual Structured Data. Proceedings of the ECAI 2016 Workshop on Diversity Aware Artificial Intelligence, The Hague, Netherlands, 2016.
- Zoljargal Munkhjargal, Gábor Bella, Altangerel Chagnaa, and Fausto Giunchiglia. Named Entity Recognition for the Mongolian Language. Proceedings of the 18th International Conference on Text, Speech, and Dialogue, Pilsen, Czech Republic, 2015.