Research Object Components and Services in Earth Sciences

Find more relevant research objects and structure them through semantic intelligence.

Developed for the European Virtual Research Environment in Earth Sciences (EVER-EST). More information available here

Semantic Enrichment Service

The semantic enrichment process generates new metadata from the content of research objects. This metadata comprises the main concepts found in the text-bearing resources of the research object, the main knowledge areas in which these concepts are most frequently used, the main expressions found in the text (known in computational linguistics as noun phrases), and named entities, which are further classified into people, organizations, and places. The core of the semantic enrichment process is Expert System's Cogito software. Cogito uses a proprietary semantic network in which words sharing the same meaning are grouped into concepts, and concepts are related to each other by linguistic relations such as hypernymy and hyponymy, among many others. The semantics of the generated metadata is therefore explicit, since the concepts are grounded in the semantic network.

Information retrieval processes, including search engines and recommender systems, can benefit from working with concepts instead of the character strings that represent words. Concepts yield a more complete and accurate set of results, and they enable exploration of the research object collection through facets wherever the semantic metadata is available.
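To illustrate why matching on concepts can return more complete results than matching on strings, here is a minimal sketch. The word-to-concept table, the concept identifiers, and the function names are all hypothetical stand-ins for illustration; they are not taken from Cogito or its semantic network.

```python
# Toy stand-in for a semantic network: several words map to one concept.
# All entries and identifiers here are hypothetical, for illustration only.
WORD_TO_CONCEPT = {
    "earthquake": "c:earthquake",
    "quake": "c:earthquake",
    "seism": "c:earthquake",
    "volcano": "c:volcano",
}


def concepts_of(text):
    """Return the set of known concepts mentioned in a piece of text."""
    words = text.lower().split()
    return {WORD_TO_CONCEPT[w] for w in words if w in WORD_TO_CONCEPT}


def search_by_concept(query, documents):
    """Return ids of documents sharing at least one concept with the query."""
    wanted = concepts_of(query)
    return [
        doc_id
        for doc_id, text in documents.items()
        if wanted & concepts_of(text)
    ]


def search_by_string(query, documents):
    """Baseline: return ids of documents containing the literal query string."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if query.lower() in text.lower()
    ]
```

With documents titled "Quake damage assessment" and "Seism catalogue", a string search for "earthquake" finds neither, while the concept search finds both because all three words resolve to the same concept.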

Enrichment API

The goal of the semantic enrichment service is to make research objects easier to find by adding, alongside the user-generated annotations, new semantic metadata that is automatically gathered from the research object content, specifically from the resources containing text. To elicit the metadata, the research object content is extracted and the resources that may contain text are identified according to the following resource types: Document (wf4ever:Document), BibliographicResource (dcterms:BibliographicResource), Conclusions (roterms:Conclusions), Hypothesis (roterms:Hypothesis), ResearchQuestion (roterms:ResearchQuestion), and Paper (roterms:Paper). The files associated with these resources must be of one of the following types: Word documents, PDF documents, text files, or PowerPoint presentations. Once the resource files are identified, their text is extracted using open source tools such as Apache PDFBox to read PDF documents and Apache POI to read Word documents and PowerPoint presentations.
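The resource selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the Resource class, the function names, and the mapping from file types to extensions are hypothetical and are not the service's actual implementation.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Resource types that may carry text, as listed in the description above.
TEXT_RESOURCE_TYPES = {
    "wf4ever:Document",
    "dcterms:BibliographicResource",
    "roterms:Conclusions",
    "roterms:Hypothesis",
    "roterms:ResearchQuestion",
    "roterms:Paper",
}

# Assumption: Word, PDF, text, and PowerPoint files are recognised by
# extension. The real service may detect file types differently.
TEXT_FILE_EXTENSIONS = {".doc", ".docx", ".pdf", ".txt", ".ppt", ".pptx"}


@dataclass(frozen=True)
class Resource:
    """Hypothetical view of a resource aggregated in a research object."""
    path: str
    rdf_types: frozenset


def is_text_candidate(resource):
    """True if the resource has a text-bearing type and a supported file type."""
    has_text_type = bool(resource.rdf_types & TEXT_RESOURCE_TYPES)
    extension = PurePosixPath(resource.path).suffix.lower()
    return has_text_type and extension in TEXT_FILE_EXTENSIONS


def select_text_resources(resources):
    """Keep only the resources whose text should be extracted."""
    return [r for r in resources if is_text_candidate(r)]
```

A resource passes only when both conditions hold, so a PDF typed with some other resource type, or a CSV file typed as a Document, would be skipped.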

All these pieces of text, plus the title and description of the research object, are fed into COGITO to generate the metadata representing the research object content. COGITO is Expert System's patented technology for processing natural language text. COGITO can identify the following metadata types in the text:

     •   Main Concepts: Most frequent Sensigrafo concepts mentioned in the text.
     •   Main Domains: Fields of knowledge in which the main concepts are most commonly used.
     •   Main Lemmas: Most frequent lemmas found in the text.
     •   Main Compound Terms: Most relevant phrases or collocations found in the text.
     •   Named Entities: All the named entities found in the text, classified into People, Organizations, and Places.
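One way to picture the metadata types listed above is as a simple record, together with a frequency-based selection such as the one that "most frequent lemmas" suggests. The class, its field names, and the helper function are hypothetical; COGITO's actual output format and ranking method are not described here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SemanticMetadata:
    """Hypothetical container mirroring the metadata types listed above."""
    main_concepts: list = field(default_factory=list)
    main_domains: list = field(default_factory=list)
    main_lemmas: list = field(default_factory=list)
    main_compound_terms: list = field(default_factory=list)
    people: list = field(default_factory=list)
    organizations: list = field(default_factory=list)
    places: list = field(default_factory=list)


def most_frequent(items, k):
    """Return the k most frequent items, most frequent first.

    Assumption: plain counting stands in for whatever ranking COGITO uses.
    """
    return [item for item, _count in Counter(items).most_common(k)]


# Usage: lemmas are assumed to have been produced by an earlier analysis step.
lemmas = ["flood", "river", "flood", "basin", "river", "flood"]
metadata = SemanticMetadata(main_lemmas=most_frequent(lemmas, 2))
```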

All these metadata types are added to the research object as annotations, written according to the content description vocabulary. The ROHUB search engine leverages these annotations to produce more accurate results and to provide new facets for exploring the research object collection. In addition, the recommender service uses this structured metadata to suggest research objects of interest to users. For a complete description of the recommender and the Collaboration Spheres, the reader is referred to deliverable D4.4, section 3.