Extracting Quality Knowledge from Natural Language Text

Scientific Area:

  • Knowledge Extraction (from textual resources)
  • Semantic Web 

Description of Activities:

The aim of this research activity is to investigate techniques to accurately extract knowledge from natural language text, and to expose the extracted knowledge according to Semantic Web best practices (e.g., ontological schemas) and formats (e.g., RDF/OWL). The focus of the work is not on improving specific Natural Language Processing (NLP) techniques or tasks, but rather on exploiting them effectively to extract high-quality, coherent knowledge from a textual document.

The research and development activities will be carried out within the context of a state-of-the-art knowledge extraction framework, PIKES. Different internship/thesis opportunities are available, to be discussed and agreed with the candidate:

  • KE1: Research and development of techniques to interpret the output of different NLP tools in order to extract coherent knowledge from text. Knowledge extraction is typically the result of combining different NLP techniques, implemented in different tools, each extracting a different kind of information: e.g., named entities, semantic frames, temporal expressions, entity types. As this information is derived independently, contradictory content may be produced from the same piece of text: e.g., for the same textual span, one tool may extract a reference to an entity of the 21st century, while another may ground the content in the first century B.C. The goal of this activity is to research and develop techniques to coherently select, harmonize, and complement the content returned by the various NLP tools used, in order to improve the overall quality of the knowledge extracted from text.

  • KE2: Interfacing PIKES with the KnowledgeStore. The KnowledgeStore is a framework for storing, managing, retrieving, and querying both unstructured (e.g., text) and structured (e.g., RDF triples) content in an integrated and interlinked way. The goal of this activity is to properly interface PIKES with the KnowledgeStore, so that the input, intermediate results, and final output of PIKES can be directly injected into a running KnowledgeStore instance.

  • KE3: Adding NLP modules to PIKES. PIKES is developed in a modular way, so that additional NLP tools can be integrated to perform further NLP analyses. The goal of this activity is to add existing NLP modules (e.g., for relation extraction or opinion extraction) to PIKES. Investigation of modules based on deep-learning technologies is possible and encouraged. Besides integrating the code of these tools, this will require adapting the PIKES data model to represent the additional content, and providing a principled way of representing this content among the knowledge returned by the tool.

  • KE4: Developing a richer frame-based ontological schema for KE, based on FrameBase and YAGO. Frame-based KE tools like PIKES rely on the detection of predicate-argument frames in text, based on predicate models like FrameNet, PropBank, NomBank, and VerbNet. These frames are then converted to an ontological representation, for which the FrameBase ontology has recently been proposed. The goal of this activity is to extend the FrameBase ontology, which currently models predicates and their arguments, with additional ontological constraints about the expected YAGO types of arguments (e.g., the agent of a 'speak' predicate has YAGO type 'human') and with the alignment between FrameBase predicates and YAGO types, aiming at a richer and consistent ontological schema for frame-based KE.
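
To give a flavour of the problem addressed in KE1, the following is a minimal sketch of one possible harmonization strategy. It is an illustration only, not how PIKES works: the Annotation class, the per-annotation confidence score, the tool names, and the labels are all hypothetical. When two annotations overlap on the same textual span but carry different labels, the sketch keeps the one with the higher confidence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AnnotationHarmonizer {

    /** Hypothetical annotation: producing tool, character span, label, confidence. */
    public static class Annotation {
        public final String tool;
        public final int begin;   // inclusive start offset
        public final int end;     // exclusive end offset
        public final String label;
        public final double confidence;

        public Annotation(String tool, int begin, int end, String label, double confidence) {
            this.tool = tool;
            this.begin = begin;
            this.end = end;
            this.label = label;
            this.confidence = confidence;
        }

        /** True if the two character spans share at least one position. */
        public boolean overlaps(Annotation other) {
            return begin < other.end && other.begin < end;
        }

        @Override
        public String toString() {
            return tool + "[" + begin + "," + end + ") " + label + " (" + confidence + ")";
        }
    }

    /**
     * Greedy selection (an assumed strategy): visit annotations from most to
     * least confident and drop any that overlaps an already kept annotation
     * while carrying a different label.
     */
    public static List<Annotation> harmonize(List<Annotation> input) {
        List<Annotation> sorted = new ArrayList<>(input);
        sorted.sort((a, b) -> Double.compare(b.confidence, a.confidence));

        List<Annotation> kept = new ArrayList<>();
        for (Annotation candidate : sorted) {
            boolean conflict = false;
            for (Annotation k : kept) {
                if (candidate.overlaps(k) && !candidate.label.equals(k.label)) {
                    conflict = true;
                    break;
                }
            }
            if (!conflict) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two hypothetical tools disagree on the same span [0, 6).
        List<Annotation> annotations = Arrays.asList(
            new Annotation("linkerA", 0, 6, "entity:ModernPerson", 0.6),
            new Annotation("linkerB", 0, 6, "entity:AncientPerson", 0.9),
            new Annotation("typer", 10, 15, "type:Place", 0.5));

        for (Annotation a : harmonize(annotations)) {
            System.out.println(a);
        }
    }
}
```

A realistic solution would have to go beyond pairwise label comparison, for example by checking candidate entities against temporal expressions or entity types extracted from the same text; devising such techniques is precisely the subject of KE1.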

Required Skills and knowledge:

  • Solid Java programming skills;

  • Basic knowledge of Semantic Web formats, languages, and technologies: RDF, SPARQL, OWL, LOD;

  • Basic knowledge of Natural Language Processing (e.g., what is NLP, what are the typical NLP tasks for knowledge extraction);

  • Willingness to study new, challenging research topics and technologies;

  • Commitment to work in a research-driven environment;

  • Problem-solving attitude.

Competencies to be Acquired:

  • Participation in the R&D activities of a leading EU research institute;

  • Acquisition of advanced knowledge and skills in Semantic Web and knowledge extraction techniques and technologies;

  • Acquisition of advanced knowledge and skills in knowledge representation and ontological modelling;

  • Contribution to the development of a state-of-the-art research-driven tool.

Duration: 3 to 12 months approximately, based on the planned activities.

Preferential Background: Computer Science, ICT

Selection Procedure: A short task-based assessment will be conducted at the beginning of the internship/thesis to assess the skills and capabilities of the student in accomplishing the planned activity.

Contact Person: Dr. Marco Rospocher