Metaxa-MetadataTextExtractionEnhancementEngine

From IKS Project

Overview

Metaxa extracts embedded metadata and textual content from a wide variety of document types and formats. The engine is based on the Aperture framework (http://aperture.sourceforge.net/), extended to handle structured content embedded in HTML pages, such as Microformats (http://microformats.org/) and RDFa (http://www.w3.org/TR/rdfa-syntax/). Its text extraction also makes Metaxa suitable as a pre-processor for other components, especially NLP processors and search indexing.

Metaxa creates a FISE Enhancement instance that carries the enhancement's own metadata (creator, creation time, confidence). The extracted metadata, however, are attached directly to the document, identified by its content ID, because they describe properties of the document rather than annotations on its text. Several ontologies are used to describe the different kinds of metadata; an overview is given in the Metadata section below.

An example extraction result for an HTML page that embeds an hCard microformat is shown here to illustrate the major annotation structures.

1. The top-level FISE Enhancement instance:

<urn:enhancement-03c9e85e-2681-21b7-a5af-6da62d67ef6b>
      a       <http://fise.iks-project.eu/ontology/TextAnnotation> , <http://fise.iks-project.eu/ontology/Enhancement> ;
      <http://fise.iks-project.eu/ontology/confidence>
              "1.0"^^<http://www.w3.org/2001/XMLSchema#double> ;
      <http://fise.iks-project.eu/ontology/extracted-from>
              <http://localhost:8080/store/content/mf_example.htm> ;
      <http://purl.org/dc/terms/created>
              "2010-09-22T09:06:53.056+02:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
      <http://purl.org/dc/terms/creator>
              "eu.iksproject.fise.engines.metaxa.MetaxaEngine"^^<http://www.w3.org/2001/XMLSchema#string> .

2. The top-level document metadata from Metaxa:

<http://localhost:8080/store/content/mf_example.htm>
      a       <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#HtmlDocument> ;
      <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#contains>
              <urn:rnd:-9e25553:12b3843df43:-7ffe> ;
      <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#description>
              "Cheap Flights to Tenerife, Arrecife, Paphos, Mahon, Las Palmas, Malaga, Alicante, Faro, Heraklion, Palma and the rest of the World. Flightline searches over 100 Airlines and 30,000 Hotels. ABTA, IATA, ATOL Bonded." ;
      <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#keyword>
              "travel" , "bargain flights" , "late deals" , "hotels" , "air tickets" , "air fares" , "discount travel" , "last minute flights" , "cheap airlines" , "cheap holidays" , "cheap flights" , "flightline" , "hotel reservations" , "discount flights" , "air travel" , "package holidays" ;
      <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent>
              "More Than Just Cheap Flights ..." ;
      <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#title>
              "Flightline | Cheap Flights, Package Holidays, Hotels, Travel Insurance & More" .

3. Embedded hCard microformat data, referenced from the top-level document via the nie:contains property:

<urn:rnd:-9e25553:12b3843df43:-7ffe>
      a       <http://www.w3.org/2006/vcard/ns#VCard> ;
      <http://www.w3.org/2006/vcard/ns#adr>
              <urn:rnd:-9e25553:12b3843df43:-7ffc> ;
      <http://www.w3.org/2006/vcard/ns#fn>
              "Flightgeoline Essex Limited" ;
      <http://www.w3.org/2006/vcard/ns#geo>
              <urn:rnd:-9e25553:12b3843df43:-7ffb> ;
      <http://www.w3.org/2006/vcard/ns#org>
              <urn:rnd:-9e25553:12b3843df43:-7ffd> ;
      <http://www.w3.org/2006/vcard/ns#photo>
              <https://www.flightline.co.uk/common/images/building_banner_sm.jpg> ;
      <http://www.w3.org/2006/vcard/ns#url>
              <http://www.flightline.co.uk> ;
      <http://www.w3.org/2006/vcard/ns#workTel>
              <tel:0800541541> .

<urn:rnd:-9e25553:12b3843df43:-7ffd>
      a       <http://www.w3.org/2006/vcard/ns#Organization> ;
      <http://www.w3.org/2006/vcard/ns#organization-name>
              "Flightline Essex Limited" .

<urn:rnd:-9e25553:12b3843df43:-7ffc>
      a       <http://www.w3.org/2006/vcard/ns#Address> ;
      <http://www.w3.org/2006/vcard/ns#countryName>
              "UK" ;
      <http://www.w3.org/2006/vcard/ns#extendedAddress>
              "Flightline House" ;
      <http://www.w3.org/2006/vcard/ns#locality>
              "Westcliff-on-Sea" ;
      <http://www.w3.org/2006/vcard/ns#postalCode>
              "SS0 7JE" ;
      <http://www.w3.org/2006/vcard/ns#region>
              "Essex" ;
      <http://www.w3.org/2006/vcard/ns#streetAddress>
              "32-38 Milton Road" .

<urn:rnd:-9e25553:12b3843df43:-7ffb>
      a       <http://www.w3.org/2006/vcard/ns#Location> ;
      <http://www.w3.org/2006/vcard/ns#latitude>
              "51.53894902845868" ;
      <http://www.w3.org/2006/vcard/ns#longitude>
              "0.700753927230835" .

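The three parts of the example are linked: the document (content ID) points to the embedded hCard through nie:contains, and the hCard points to its address node through vcard:adr. The following is a minimal sketch of walking those links. It assumes the result graph has been loaded into plain (subject, predicate, object) string tuples; this representation and the helper names objects and embedded_localities are illustrative assumptions, not part of Metaxa or FISE. A real client would normally use an RDF library and distinguish URIs from literals.

```python
# Minimal sketch: the Metaxa result modeled as (subject, predicate, object)
# string tuples. This representation is an assumption made for illustration.
NIE = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
VCARD = "http://www.w3.org/2006/vcard/ns#"

DOC = "http://localhost:8080/store/content/mf_example.htm"
CARD = "urn:rnd:-9e25553:12b3843df43:-7ffe"
ADR = "urn:rnd:-9e25553:12b3843df43:-7ffc"

# A few triples copied from the example output above.
TRIPLES = [
    (DOC, NIE + "contains", CARD),
    (CARD, VCARD + "adr", ADR),
    (CARD, VCARD + "url", "http://www.flightline.co.uk"),
    (ADR, VCARD + "locality", "Westcliff-on-Sea"),
    (ADR, VCARD + "postalCode", "SS0 7JE"),
]


def objects(triples, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def embedded_localities(triples, content_id):
    """Follow nie:contains, then vcard:adr, and collect vcard:locality values."""
    found = []
    for node in objects(triples, content_id, NIE + "contains"):
        for adr in objects(triples, node, VCARD + "adr"):
            found.extend(objects(triples, adr, VCARD + "locality"))
    return found


print(embedded_localities(TRIPLES, DOC))  # ['Westcliff-on-Sea']
```

Starting from the content ID rather than from the enhancement URN reflects that the extracted metadata are attached to the document itself.
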
Supported document types

The default set of document formats supported by Metaxa:

  • Office documents
    • MS-Works
    • MS-Office
    • Excel
    • PowerPoint
    • Word
    • Visio
    • OpenDocument
    • OpenXml
    • Publisher
    • Corel-Presentations
    • QuattroPro
    • WordPerfect
  • Multimedia documents
    • JPG
    • MP3
  • Other
    • PDF
    • RTF
    • Plain Text
    • XML
    • (X)HTML, including these embedded structures and microformats:
      • RDFa
      • geo
      • hAtom
      • hCal
      • hCard
      • hReview
      • rel-license
      • rel-tag
      • xFolk

Textual Content

The plain text content is represented in the Metaxa result by the property:

http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent
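
A downstream component such as an indexer or NLP processor only needs this one property. The sketch below is a hedged illustration of that hand-over: it assumes the result is available as (subject, predicate, object) string tuples, and the functions plain_text and tokenize are hypothetical helpers, with tokenize standing in for whatever real processing follows.

```python
import re

PLAIN_TEXT = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#plainTextContent"


def plain_text(triples, content_id):
    """Return the first nie:plainTextContent value for content_id, or None."""
    for s, p, o in triples:
        if s == content_id and p == PLAIN_TEXT:
            return o
    return None


def tokenize(text):
    """Naive lowercase word tokenizer, a placeholder for a real NLP or indexing step."""
    return re.findall(r"[a-z0-9]+", text.lower())


doc = "http://localhost:8080/store/content/mf_example.htm"
triples = [(doc, PLAIN_TEXT, "More Than Just Cheap Flights ...")]

text = plain_text(triples, doc)
tokens = tokenize(text) if text is not None else []
print(tokens)  # ['more', 'than', 'just', 'cheap', 'flights']
```

Returning None when the property is absent lets the caller skip documents for which no text could be extracted, for example some image formats.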

Metadata

Metaxa uses a set of vocabularies ("ontologies") for structured data representation.

Aperture Core Ontologies

These ontologies belong to the underlying Aperture subsystem and are contained in the package org.semanticdesktop.aperture.vocabulary. The most important ones for document properties are:

  • NIE (Nepomuk Information Element): http://www.semanticdesktop.org/ontologies/2007/01/19/nie#
  • NFO (Nepomuk File Ontology): http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#

Documentation of Aperture's core ontologies is provided in Aperture's Javadoc (http://aperture.sourceforge.net/doc/javadoc/1.5.0/index.html) for the packages in org.semanticdesktop.aperture.vocabulary.

HTML Microformat Extractors

The following table lists the vocabularies Metaxa uses to represent microformat data:

Microformat   Vocabulary (Namespace)
geo           wgs84 (http://www.w3.org/2003/01/geo/wgs84_pos#)
hAtom         atom (http://www.w3.org/2005/Atom#)
              tagging (http://aperture.sourceforge.net/ontologies/tagging#)
hCal          ical (http://www.w3.org/2002/12/cal/icaltzd#)
              vcard (http://www.w3.org/2006/vcard/ns#)
hCard         vcard (http://www.w3.org/2006/vcard/ns#)
hReview       review (http://www.purl.org/stuff/rev#)
              wgs84 (http://www.w3.org/2003/01/geo/wgs84_pos#)
              dc (http://purl.org/dc/elements/1.1/)
              dcterms (http://purl.org/dc/dcmitype/)
              foaf (http://xmlns.com/foaf/0.1/)
              vcard (http://www.w3.org/2006/vcard/ns#)
              tag (http://www.holygoat.co.uk/owl/redwood/0.1/tags/)
rel-license   dc (http://purl.org/dc/elements/1.1/)
rel-tag       tagging (http://aperture.sourceforge.net/ontologies/tagging#)
xFolk         nfo (http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#)
              dc (http://purl.org/dc/elements/1.1/)
              tagging (http://aperture.sourceforge.net/ontologies/tagging#)
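
When consuming Metaxa output it can help to know which microformats may produce data in a given vocabulary. The sketch below encodes the table above as a dictionary, assuming that each row without a microformat name continues the preceding microformat. The names MICROFORMAT_VOCABULARIES and microformats_using are illustrative and do not exist in Metaxa.

```python
# The table above as a mapping from microformat to vocabulary prefixes.
# Assumption: rows without a microformat name belong to the preceding one.
MICROFORMAT_VOCABULARIES = {
    "geo": ["wgs84"],
    "hAtom": ["atom", "tagging"],
    "hCal": ["ical", "vcard"],
    "hCard": ["vcard"],
    "hReview": ["review", "wgs84", "dc", "dcterms", "foaf", "vcard", "tag"],
    "rel-license": ["dc"],
    "rel-tag": ["tagging"],
    "xFolk": ["nfo", "dc", "tagging"],
}


def microformats_using(prefix):
    """Return the sorted names of microformats whose data may use the given vocabulary."""
    return sorted(
        mf for mf, vocabs in MICROFORMAT_VOCABULARIES.items() if prefix in vocabs
    )


print(microformats_using("vcard"))  # ['hCal', 'hCard', 'hReview']
```

A client that only understands vcard data could use such a lookup to decide which HTML sub-extractors are relevant to it.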


Customization

The set of extractors used by Metaxa can be customized via the file src/main/resources/extractionregistry.xml, which lists the ExtractorFactory implementations to be used. One can add further extractors or replace existing ones with different implementations. An extractor in Aperture consists of two classes: one implements the ExtractorFactory interface, which declares the MIME types the extractor is responsible for and creates instances of the second class, which implements the Extractor interface. For details see the Aperture documentation at http://aperture.sourceforge.net/.
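
The two-class split can be illustrated with a small language-neutral sketch. The code below is a hypothetical Python analogue of the pattern, not Aperture's Java API: the class names, method names, and the dictionary-based result are all assumptions made for illustration, and the real interfaces have different signatures.

```python
class PlainTextExtractor:
    """Hypothetical extractor: turns raw bytes into a small metadata dict."""

    def extract(self, data, mime_type):
        text = data.decode("utf-8", errors="replace")
        return {"mimeType": mime_type, "plainTextContent": text}


class PlainTextExtractorFactory:
    """Hypothetical factory: declares MIME types and creates extractor instances."""

    def supported_mime_types(self):
        return {"text/plain"}

    def get(self):
        return PlainTextExtractor()


class ExtractorRegistry:
    """Hypothetical registry mapping each MIME type to the factory that handles it."""

    def __init__(self):
        self._factories = {}

    def add(self, factory):
        for mime_type in factory.supported_mime_types():
            self._factories[mime_type] = factory

    def extract(self, data, mime_type):
        factory = self._factories.get(mime_type)
        if factory is None:
            return None  # no extractor registered for this MIME type
        return factory.get().extract(data, mime_type)


registry = ExtractorRegistry()
registry.add(PlainTextExtractorFactory())
result = registry.extract(b"hello", "text/plain")
print(result)  # {'mimeType': 'text/plain', 'plainTextContent': 'hello'}
```

Keeping the MIME type declaration in the factory means the registry can choose a handler without instantiating any extractor, and a fresh extractor instance is created for each document.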

For the Metaxa HTML extractor an additional registry exists in the file src/main/resources/htmlextractors.xml that specifies a set of subextractors for HTML pages.