LangId-LanguageIdentificationEnhancementEngine
From IKS Project
Overview
The LangId engine determines the language of a text. The provided FISE engine is based on the TextCat library (http://textcat.sourceforge.net/). The text to be checked must be available as plain text in one of two ways:
- as a plain text content item
- in the content item's metadata, as the string value of the property http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent
The result of language identification is added to the content item's metadata as a FISE TextAnnotation. The detected language is stored as the string value of the property http://purl.org/dc/terms/language
This RDF snippet illustrates the output:
<fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
  <dc:language>en</dc:language>
  <dc:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">eu.iksproject.fise.engines.langid.LangIdEnhancementEngine</dc:creator>
</fise:TextAnnotation>
Languages
By default, the FISE language identifier distinguishes the languages listed below. The code after each colon is the language label written to the metadata.
- German: de
- English: en
- French: fr
- Spanish: es
- Italian: it
- Swedish: sv
- Polish: pl
- Dutch: nl
- Norwegian: no
- Finnish: fi
- Albanian: sq
- Slovak (ASCII): sk
- Slovenian (ASCII): sl
- Danish: da
- Hungarian: hu
Customization
- Property eu.iksproject.fise.engines.langid.probe-length: an integer specifying how many characters are used for identification. A value of 0 or less means that the complete text is used. Otherwise, only a substring of the specified length, taken from the middle of the text, is used. The default value is 400.
- The set of supported languages can only be changed through the underlying TextCat system. The languages in use and their mapping to language labels are defined in the file textConf.txt.
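The probe-length behaviour described above can be illustrated with a small Java sketch. This is not the engine's actual source code: the class name ProbeWindow and the method select are hypothetical, and the exact rounding used to centre the window is an assumption. Only the rule itself (0 or less means the whole text, otherwise a middle substring, default 400) comes from the property description.

```java
// Hypothetical helper, not part of FISE: shows how a probe window
// could be cut from the middle of a text.
class ProbeWindow {

    // Default value stated in the property description.
    static final int DEFAULT_PROBE_LENGTH = 400;

    // Returns the part of the text that would be handed to the identifier.
    static String select(String text, int probeLength) {
        // 0 or below: use the complete text. A text that is not longer
        // than the probe length is also used completely.
        if (probeLength <= 0 || text.length() <= probeLength) {
            return text;
        }
        // Centre the window; integer division rounds the start index down
        // (assumed rounding, the real engine may differ by one character).
        int start = (text.length() - probeLength) / 2;
        return text.substring(start, start + probeLength);
    }

    public static void main(String[] args) {
        System.out.println(select("abcdefghij", 4));                  // prints "defg"
        System.out.println(select("short", DEFAULT_PROBE_LENGTH));    // prints "short"
    }
}
```

Taking the sample from the middle presumably avoids headers, titles, or boilerplate at the start of a document, which may be less representative of its main language.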

