EntityMentionEnhancementEngine

From IKS Project

This engine creates fise:EntityAnnotations based on an embedded Lucene index of the top 10000 DBpedia entities (e.g. famous persons, places and organisations), ranked by the number of incoming links in the graph of Wikipedia articles.

It does not work directly on the parsed content, but processes named entities extracted by an NLP (natural language processing) engine. For these it creates EntityAnnotations that suggest matches to famous persons, locations and organizations as referenced in DBpedia (and hence Wikipedia).

Processed Annotations (Input)

This engine consumes typed fise:TextAnnotations. The type of the occurrence must match one of the types defined in the DBpedia ontology. More concretely, it filters for enhancements that conform to the following requirements:

 ?enhancement rdf:type fise:TextAnnotation .
 ?enhancement dc:type ?dbpediatype .
 ?enhancement fise:selected-text ?name .
 ?enhancement fise:selection-context ?context .
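
The selection expressed by these patterns can be illustrated with a minimal Python sketch. It is not the engine's implementation: triples are modelled as plain tuples, the helper name select_typed_text_annotations is hypothetical, and the sample data is invented for illustration. It keeps only enhancements that are fise:TextAnnotations and carry all three remaining properties.

```python
from collections import defaultdict

# Properties required in addition to rdf:type (prefixed names kept as plain strings).
REQUIRED = ("dc:type", "fise:selected-text", "fise:selection-context")


def select_typed_text_annotations(triples):
    """Return {enhancement: {property: value}} for every subject that is a
    fise:TextAnnotation and has all properties listed in REQUIRED.
    Hypothetical helper, for illustration only."""
    by_subject = defaultdict(dict)
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            # A resource may have several rdf:type values, so collect them in a set.
            by_subject[subject].setdefault("rdf:type", set()).add(obj)
        else:
            by_subject[subject][predicate] = obj

    selected = {}
    for subject, props in by_subject.items():
        if "fise:TextAnnotation" not in props.get("rdf:type", set()):
            continue
        if all(p in props for p in REQUIRED):
            selected[subject] = {p: props[p] for p in REQUIRED}
    return selected


# Invented sample data: id1 is fully typed, id2 lacks dc:type and is skipped.
triples = [
    ("urn:enhancement:text-annotation:id1", "rdf:type", "fise:TextAnnotation"),
    ("urn:enhancement:text-annotation:id1", "dc:type", "dbpedia:Place"),
    ("urn:enhancement:text-annotation:id1", "fise:selected-text", "New Zealand"),
    ("urn:enhancement:text-annotation:id1", "fise:selection-context",
     "She travelled to New Zealand last year."),
    ("urn:enhancement:text-annotation:id2", "rdf:type", "fise:TextAnnotation"),
    ("urn:enhancement:text-annotation:id2", "fise:selected-text", "Bob"),
    ("urn:enhancement:text-annotation:id2", "fise:selection-context", "Bob said hello."),
]

selected = select_typed_text_annotations(triples)
```

In this sketch the untyped annotation is dropped simply because it has no dc:type value; checking that the type actually belongs to the DBpedia ontology is left out.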

The engine then performs a Lucene query on its internal index for entities of type ?dbpediatype whose name fuzzily matches ?name, ordered by the similarity between ?context and the text of the Wikipedia abstract of the matching entities. The context information is thus used to perform entity disambiguation with the MoreLikeThis similarity matching tool provided by the Lucene project:

 http://lucene.apache.org/java/3_0_2/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThisQuery.html
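
To make the lookup more concrete, here is a toy Python sketch of the same idea: filter candidates by type, require a fuzzy name match, then order by context similarity. It does not use Lucene or MoreLikeThis. The standard-library difflib.SequenceMatcher stands in for the fuzzy name match, and a simple word-overlap (Jaccard) score stands in for the context similarity. The in-memory index, the abstracts, the threshold and the function names are all assumptions invented for illustration.

```python
import re
from difflib import SequenceMatcher

# Tiny stand-in for the entity index; entries and abstracts are invented.
INDEX = [
    {"uri": "http://dbpedia.org/resource/New_Zealand", "label": "New Zealand",
     "types": {"dbpedia:Place", "dbpedia:Country"},
     "abstract": "New Zealand is an island country in the Pacific Ocean"},
    {"uri": "http://dbpedia.org/resource/Zealand", "label": "Zealand",
     "types": {"dbpedia:Place"},
     "abstract": "Zealand is the largest island of Denmark"},
    {"uri": "http://dbpedia.org/resource/Neil_Young", "label": "Neil Young",
     "types": {"dbpedia:Person"},
     "abstract": "Neil Young is a Canadian singer and songwriter"},
]


def words(text):
    """Lower-cased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def context_similarity(context, abstract):
    """Jaccard overlap of word sets; a crude substitute for MoreLikeThis."""
    a, b = words(context), words(abstract)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lookup(name, dbpedia_type, context, min_name_ratio=0.6, limit=3):
    """Return up to `limit` entity URIs of the given type whose label fuzzily
    matches `name`, ordered by descending context similarity."""
    hits = []
    for entity in INDEX:
        if dbpedia_type not in entity["types"]:
            continue  # wrong type
        ratio = SequenceMatcher(None, name.lower(), entity["label"].lower()).ratio()
        if ratio < min_name_ratio:
            continue  # name too different
        hits.append((context_similarity(context, entity["abstract"]), entity["uri"]))
    hits.sort(reverse=True)
    return [uri for _, uri in hits[:limit]]


suggestions = lookup(
    "New Zealand",
    "dbpedia:Place",
    "They sailed across the Pacific Ocean to the island country of New Zealand",
)
```

Both places pass the fuzzy name filter here, and the surrounding context is what ranks the country ahead of the Danish island; the person entry is excluded by its type alone.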

Output

Each fise:TextAnnotation can be related to up to three suggestions of type fise:EntityAnnotation using the dc:relation property.

The quality of a suggestion is available as a fise:confidence score, a positive real number describing the quality of the match (fuzzy name + context similarity). The absolute value is meaningless, and confidence scores of fise:EntityAnnotations related to different instances of fise:TextAnnotation cannot be compared.
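
As a rough illustration of how a consumer might use these scores, the following Python sketch ranks suggestions within each TextAnnotation and never compares scores across TextAnnotations. The tuple layout, the function name top_suggestions and the sample scores are assumptions made for the example, not part of the engine.

```python
from collections import defaultdict

MAX_SUGGESTIONS = 3  # at most three EntityAnnotations are related to one TextAnnotation


def top_suggestions(entity_annotations, limit=MAX_SUGGESTIONS):
    """Group (text_annotation, entity_reference, confidence) tuples by text
    annotation and order each group by descending confidence.
    Hypothetical helper, for illustration only."""
    grouped = defaultdict(list)
    for text_annotation, entity_reference, confidence in entity_annotations:
        grouped[text_annotation].append((entity_reference, confidence))
    return {
        ta: sorted(candidates, key=lambda c: c[1], reverse=True)[:limit]
        for ta, candidates in grouped.items()
    }


# Invented sample values; only the ordering inside one TextAnnotation is meaningful.
annotations = [
    ("urn:enhancement:text-annotation:id1", "http://dbpedia.org/resource/Auckland", 2.10),
    ("urn:enhancement:text-annotation:id1", "http://dbpedia.org/resource/New_Zealand", 7.66),
    ("urn:enhancement:text-annotation:id2", "http://dbpedia.org/resource/Paris", 0.95),
]

ranked = top_suggestions(annotations)
best_for_id1 = ranked["urn:enhancement:text-annotation:id1"][0][0]
```

The low score for the second TextAnnotation says nothing about whether its suggestion is worse than those of the first one; it is simply the best candidate in its own group.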

The following shows Enhancements created by this engine for the TextAnnotations described in the example for the OpenNLP-NamedEntityExtractionEnhancementEngine.

The example below shows an EntityAnnotation for a TextAnnotation that selects the text "New Zealand" and has the dc:type dbpedia:Place:

urn:enhancement:entity-annotation:id1
     a       fise:EntityAnnotation , fise:Enhancement ;
     dc:creator
             "eu.iksproject.fise.engines.autotagging.impl.EntityMentionEnhancementEngine"^^xsd:string ;
     dc:created
             "2010-06-22T08:41:01.810+02:00"^^xsd:dateTime ;
     fise:entity-label
             "New Zealand"^^xsd:string ;
     fise:entity-reference
             http://dbpedia.org/resource/New_Zealand ;
     fise:entity-type
             dbpedia:Country , owl:Thing , dbpedia:Place , dbpedia:PopulatedPlace ;
     fise:confidence
             "7.657061576843262"^^xsd:double ;
     fise:extracted-from
             urn:content-item:id1 ;
     dc:relation
             urn:enhancement:text-annotation:id1 .

This enhancement links to the identified entity New Zealand (http://dbpedia.org/resource/New_Zealand). It also includes the label and the RDF types of the referred entity.

Configuration

Right now the configuration of the index used by the EntityMentionEnhancementEngine is delegated to the ConfiguredAutotaggerProvider service. This is likely to change in the future.

  1. eu.iksproject.fise.engines.autotagging.indexPath: class loading path of the FSDirectory index to be used for entity lookups

Instructions to build a custom entity index are given in the iks-autotagging project:

 http://code.google.com/p/iks-project/source/browse/sandbox/iks-autotagging/trunk/README.txt