LangId-LanguageIdentificationEnhancementEngine

From IKS Project

Overview

The LangId engine determines the language of a text. The FISE engine provided here is based on the TextCat library (http://textcat.sourceforge.net/). The text to be checked must be plain text and can be supplied in one of two ways:

  • as a plain text content item
  • in the content item's metadata, as the string value of the property
http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent

The result of language identification is added to the content item's metadata as a FISE TextAnnotation, with the detected language given as the string value of the property

http://purl.org/dc/terms/language

This RDF snippet illustrates the output:

  <fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
    <dc:language>en</dc:language>
    <dc:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >eu.iksproject.fise.engines.langid.LangIdEnhancementEngine</dc:creator>
  </fise:TextAnnotation>
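
As a rough illustration of how a client might read the detected language back out of such output, here is a minimal Python sketch. It rests on several assumptions that are not stated on this page: the annotation is wrapped in an rdf:RDF root element, the fise prefix is bound to the namespace shown in FISE_NS, and the data is serialized as RDF/XML in exactly this shape. The function name detected_languages is hypothetical. It uses plain XML parsing from the standard library rather than a real RDF parser, so it should not be relied on for other serializations.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Namespace derived from the property URI http://purl.org/dc/terms/language
DC_NS = "http://purl.org/dc/terms/"
# Assumption: the namespace bound to the "fise" prefix is not given on this page.
FISE_NS = "http://fise.iks-project.eu/ontology/"

# The snippet from above, wrapped in an assumed rdf:RDF root with namespace bindings.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}" xmlns:fise="{FISE_NS}">
  <fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
    <dc:language>en</dc:language>
    <dc:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >eu.iksproject.fise.engines.langid.LangIdEnhancementEngine</dc:creator>
  </fise:TextAnnotation>
</rdf:RDF>"""


def detected_languages(rdf_xml):
    """Return the text of every dc:language element found in the RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    return [el.text.strip() for el in root.iter(f"{{{DC_NS}}}language") if el.text]


if __name__ == "__main__":
    print(detected_languages(SAMPLE))
```

For anything beyond a quick check, a proper RDF library would be the safer choice, since RDF/XML allows the same graph to be written in several different XML shapes.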


Languages

By default, the FISE language identifier distinguishes the languages listed below. The value after each colon is the language label written to the metadata.

  • German: de
  • English: en
  • French: fr
  • Spanish: es
  • Italian: it
  • Swedish: sv
  • Polish: pl
  • Dutch: nl
  • Norwegian: no
  • Finnish: fi
  • Albanian: sq
  • Slovak (ASCII): sk
  • Slovenian (ASCII): sl
  • Danish: da
  • Hungarian: hu

Customization

  1. Property eu.iksproject.fise.engines.langid.probe-length: an integer specifying how many characters are used for identification. A value of 0 or below means that the complete text is used. Otherwise, only a substring of the specified length, taken from the middle of the text, is used. The default value is 400.
  2. The set of supported languages can only be changed through the underlying TextCat system. The languages in use and their mapping to language labels are defined in the file textConf.txt.
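
The effect of the probe-length property can be sketched as follows. This is an illustration in Python, not the engine's actual code. The function name select_probe is hypothetical, and two details are assumptions not stated on this page: that the window is centred exactly on the midpoint of the text, and that a text shorter than the probe length is used in full.

```python
def select_probe(text, probe_length=400):
    """Return the part of `text` that would be handed to language identification."""
    # A value of 0 or below means the complete text is used.
    # Assumption: a text no longer than the probe length is also used in full.
    if probe_length <= 0 or len(text) <= probe_length:
        return text
    # Assumption: the substring is centred on the midpoint of the text.
    start = (len(text) - probe_length) // 2
    return text[start:start + probe_length]


if __name__ == "__main__":
    sample = "x" * 1000
    print(len(select_probe(sample)))      # default probe length
    print(len(select_probe(sample, 0)))   # complete text
```

Taking the probe from the middle rather than the beginning presumably avoids headers, titles, or boilerplate at the start of a document, though this page does not state the reason.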