RickGeonamesIndexer

From IKS Project

This indexing utility uses the database dump of geonames.org (http://geonames.org) to create a full index, which can be used as a Cache with the CacheStrategy "full", by means of a SolrYard. Before using it, you must install and configure a Solr server as described on the configuration page for the SolrYard, and create and configure a new core that stores the index created by this utility.

Building this Utility

To run this utility from the command line, you need a jar file that includes all the necessary dependencies. First download and install RICK. Then change to the "/indexing/geonames" directory within the main RICK folder and use the

mvn clean package assembly:assembly

command to assemble the Java archive with all the needed dependencies. This step is required because the utility is not built by the standard Maven build process of RICK. If the command finishes successfully, change to the "/target" directory, which now contains a runnable Java archive. You can use

java -jar eu.iksproject.indexing.geonames-0.1-SNAPSHOT-jar-with-dependencies.jar

to check that everything went well and to print the help screen of the indexing tool. Before you can start indexing the geonames.org data, however, you need to complete the additional steps explained below.

Data needed for indexing:

This indexing utility needs several source files, all of which are available from the geonames.org download page. Note that the indexing utility uses the file names of that site as its defaults, so it is highly recommended not to rename the files described below when downloading them. It is best to create a new folder to store all the needed files.

List of Files:

All of these files MUST be present in the same folder, and the indexing utility must be configured to use this folder as its data source.

  • <archive>.zip: This file provides the dump of the main table of the geonames.org database. The site provides downloads for individual countries (e.g. "DE.zip" for all features in Germany), for cities with a population greater than 1k, 5k and 15k, as well as an archive called "allCountries.zip" with all the 7+ million features known to geonames.org. "allCountries.zip" is the default of the indexing utility; however, you can configure the utility to use a different archive with the "-a, --archive" option.
  • countryInfo.txt: This file includes information about all countries of the world. The indexing utility uses only the mapping of the 2-letter country code to the geonames.org ID provided by this file.
  • admin1CodesASCII.txt: This file provides the same information for the first level of administrative regions as "countryInfo.txt" does for countries. Note that the full ID of such a region is the 2-letter country code + '.' + the code of the region. Do not use the Admin1Codes.txt file, because it is missing some needed information.
  • admin2Codes.txt: The same kind of information, but for the second-level administrative regions. The unique key is the key of the level 1 region + '.' + the ID of the level 2 region.
  • hierarchy.zip: This file contains all hierarchy relations provided by geonames.org. It is used in addition to the country and admin 1/2 information to index parent features.
  • alternateNames.zip: This archive includes all the naming information, mainly preferred names, short names and official names in different languages. It also includes postal codes, airport codes, links to the Wikipedia pages of features and some more information.
  • ontology_v2.2.1.rdf: This file is only needed if indexing of the geonames.org ontology is enabled with the "-io,--indexOnt" option; the "-o,--ontology" option sets its file name (currently unimplemented because the genericRdf indexer is not yet available).
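
The composite keys described for admin1CodesASCII.txt and admin2Codes.txt can be sketched as follows. This only illustrates the key layout stated in the list; the function names are hypothetical and the sample codes are illustrative values, not a statement about how the indexer builds the keys internally.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the key layout described in the file list.

# Level 1 key: 2-letter country code + '.' + code of the region.
admin1_key() {
    printf '%s.%s\n' "$1" "$2"
}

# Level 2 key: key of the level 1 region + '.' + ID of the level 2 region.
admin2_key() {
    printf '%s.%s\n' "$(admin1_key "$1" "$2")" "$3"
}

admin1_key AT 05        # prints AT.05
admin2_key AT 05 501    # prints AT.05.501
```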

Please also note the information provided by this file about the files used by the geonames.org indexer.

If any of these files is missing, the indexing utility will throw an error during initialization that provides more information.
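
Before starting a long indexing run, it may be convenient to check the data folder yourself. The following is a minimal sketch, assuming the default file names listed above; check_geonames_files is a hypothetical helper and not part of the indexing utility. The ontology file is left out because it is only needed when ontology indexing is enabled.

```shell
#!/bin/sh
# Sketch: report which of the required geonames.org files are missing.
# check_geonames_files is a hypothetical helper, not part of the utility.
check_geonames_files() {
    dir=$1
    archive=${2:-allCountries.zip}   # default archive of the indexing utility
    missing=0
    for f in "$archive" countryInfo.txt admin1CodesASCII.txt \
             admin2Codes.txt hierarchy.zip; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # The help screen lists alternateNames as ".zip or .txt", so accept either.
    if [ ! -f "$dir/alternateNames.zip" ] && [ ! -f "$dir/alternateNames.txt" ]; then
        echo "missing: alternateNames.zip (or alternateNames.txt)"
        missing=1
    fi
    return "$missing"
}
```

For example, `check_geonames_files ./geonames-data DE.zip && echo "all files present"` checks a folder that uses the German country archive instead of the default; the folder name is a placeholder.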

Using the Indexing Utility

As mentioned above, running the command

java -Xmx1024M -jar eu.iksproject.indexing.geonames-0.1-SNAPSHOT-jar-with-dependencies.jar

shows the help page of the indexing utility. The following code segment reproduces the help screen:

usage: java -Xmx1024M -jar
           eu.iksproject.indexing.geonames-0.1-SNAPSHOT-jar-with-dependen
           cies.jar [options] solrServerUri geonamesDataDumpDir
Description:
This Utility creates a full Yard for geonames.org by using the SolrYard
implementation.
Parameter:
- "-Xmx": This implementation loads alternate labels into memory.
Therefore it needs a lot of memory during indexing. Parse at least
"-Xmx1024M" to provide 1GByte memory to the Java Vm. In case of
OutOfMemory errors you need to increase this value! - solrServerUri : The
URL of the Solr Server used to index the data. Make sure to use the
schema.xml as needed by the SolrYard!
- geonamesDataDumpDir: The relative or absolute path to the Dir with the
geonames.org data required for indexing
Options:
 -a,--archive <arg>     file name of the archive within the data directory
                        (default: 'allCountries.zip')
 -c,--chunksize <arg>   the number of documents stored in one chunk
                        (default: 1000
 -d,--debug             show debug stacktrace upon error
 -h,--help              display this help and exit
 -io,--indexOnt         index also the geonames ontology
 -n,--name <arg>        the id and name used for the Yard (default:
                        'geonames')
 -o,--ontology <arg>    file name of the ontology within the data
                        directory (default: 'ontology_v2.2.1.rdf')
 -s,--start <arg>       the line number of the geonames table to
                        start(default: 0
Required data:
- archive with the toponyms (default 'allCountries.zip', see option 'a'
- countryInfo.txt : additional infos for country codes
- admin1CodesASCII.txt : leval 1 administrative regions
- admin2Codes.txt: Level 2 administrative regions
- alternateNames.zip or .txt: names of features in different languages
- geonames ontology: only needed if '-io' (default 'ontology_v2.2.1.rdf',
see option 'o')
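
To make the usage line more concrete, the following sketch assembles a complete command line from the options in the help screen and prints it instead of running it. The Solr URL and the data directory are placeholders, and index_cmd is a hypothetical helper; the values passed to -n and -c are simply the documented defaults.

```shell
#!/bin/sh
# Sketch: assemble an indexing command line from the options in the help screen.
# index_cmd is hypothetical; the URL and directory below are placeholders.
JAR=eu.iksproject.indexing.geonames-0.1-SNAPSHOT-jar-with-dependencies.jar

index_cmd() {
    solr_uri=$1
    data_dir=$2
    archive=${3:-allCountries.zip}
    # -a selects the archive, -n the Yard name, -c the chunk size;
    # the two positional arguments follow the options, as in the usage line.
    echo "java -Xmx1024M -jar $JAR -a $archive -n geonames -c 1000 $solr_uri $data_dir"
}

# Print the command for indexing only the German archive.
index_cmd http://localhost:8983/solr/geonames ./geonames-data DE.zip
```

Printing the command first makes it easy to review the arguments before starting a run that takes a long time.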

Using the Index

After indexing completes (it takes around 1 hour), all the needed information is in the data directory of the core configured by the solrServerUri. If you want to use the index with another Solr server, just copy this directory (or the whole core) to it.
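
Copying the index to another Solr server can be sketched like this. copy_index and both paths are hypothetical; the sketch only assumes that the index lives in the "data" subfolder of the core's directory, as described above.

```shell
#!/bin/sh
# Sketch: copy the index (the core's data directory) into another core folder.
# copy_index is a hypothetical helper; both arguments are core directories.
copy_index() {
    src_core=$1   # directory of the core that was indexed
    dst_core=$2   # directory of the core on the other Solr server
    mkdir -p "$dst_core" && cp -R "$src_core/data" "$dst_core/"
}
```

A call such as `copy_index /path/to/solr/geonames /path/to/other-solr/geonames` would then place the index under the target core's data directory; both paths are placeholders.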

Configuring the Solr Server to use the index

After completing this step, add the core to the solr.xml (located in the root directory of the Solr server) and configure the core with the geonames.org index by adding the line

<core name="geonames" instanceDir="geonames" />

The attributes are:

  • name: the sub-path of the core, i.e. http://localhost:8983/solr/<name>
  • instanceDir: the name of the subfolder in the file system that holds the configuration of the core. The directory with the index defaults to /<instanceDir>/data, and the configuration defaults to /<instanceDir>/conf.

Now restart the Solr server and the new core should be up and running. You can use http://localhost:8983/solr/<name>/admin/ to validate the configuration of the core.

Configuring RICK to use the geonames.org Index

Note: This assumes that you already have a running Solr server that is configured for use with the RICK SolrYard. If not, see this page for instructions.

Yard Configuration for using the geonames.org index

To use the geonames.org index within RICK, you need to configure two components:

  1. a SolrYard that accesses the data stored in the index
  2. a Cache that manages the data stored in the Yard.

First, configure the SolrYard that the Cache uses to access the geonames.org index. Important parameters:

  • ID: The default ID of a Yard used as a cache is the ID of the referenced site + "Cache". Because of that, "geonamesCache" is the preferred ID.
  • SolrServerURL: This MUST BE the URL to the core of the SolrServer with the geonames.org index.

See also the screenshot of the configuration on the right side.

Cache Configuration for using the geonames.org index

Now you can configure the Cache. Important parameters:

  • Yard: Here you MUST configure the ID of the Yard. Should be "geonamesCache" if you used the default.
  • Cache Mappings: This can be used to configure which information is stored in the Cache. The geonames.org indexing utility already configures the cache with the so-called "Base Mappings", so there is no need to provide any "Additional Mappings". However, it is possible to specify additional fields to be stored for features that are updated via the geonames.org web service.

Configuration of the Referenced Site

Referenced Sites are the components that provide access to sites providing entity information within the RICK. So to use geonames.org as an entity information source within the RICK, you need to configure a Referenced Site.

Referenced Site Configuration for geonames.org

For the full description of how to configure Referenced Sites, see this page. Only the properties that matter for configuring a Referenced Site for geonames.org are described here.

  • Entity Prefix: All entities provided by geonames.org start with "http://sws.geonames.org/". So this URI MUST BE configured in this field. Otherwise queries via the Sites RESTful Service Endpoint or lookup Requests by the RICK will not work for geonames.org
  • Access URI and Dereferencer Implementation: Geonames.org provides CoolURI-based lookup of entities via "http://sws.geonames.org/". If you would like to allow updates of entities in the index by directly accessing the geonames.org database, you can configure "http://sws.geonames.org/" as Access URI and "Cool URI" as Dereferencer Implementation. If you want to work only locally (by using the full index), then configure "NONE" for the Dereferencer Implementation. In that case the Access URI is ignored.
  • Query Service URI and Searcher Implementation: Geonames.org does not provide any query service that is supported by the RICK. Luckily we can use the full index for search. So configure "NONE" for the Searcher Implementation. The Query Service URI is ignored in that case.
  • Cache Strategy: Use the "All" option here, because we want to use the prepared index with all the data of geonames.org.
  • Cache ID: Here you MUST use the ID of the SolrYard with the geonames.org index. If you used the defaults the ID is "geonamesCache".
  • Field Mappings: These are the mappings used in addition to the mappings already defined globally for the RICK. Using the name of geonames.org features as name and importing all the properties defined in the "geonames" namespace is a simple but sufficient configuration. Mappings for the "geo" namespace (like lat, long and alt) are usually already defined by the global configuration of the RICK. However, you might want to check that.
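
The role of the Entity Prefix can be illustrated with a simple prefix test. This is only a sketch of the idea, namely that an entity ID belongs to geonames.org if it starts with the configured prefix; is_geonames_entity is a hypothetical helper, and how the RICK matches entities to sites internally is not described here.

```shell
#!/bin/sh
# Sketch: decide by prefix whether an entity ID belongs to geonames.org.
# is_geonames_entity is a hypothetical helper used only for illustration.
is_geonames_entity() {
    case $1 in
        http://sws.geonames.org/*) return 0 ;;   # starts with the Entity Prefix
        *) return 1 ;;                           # any other ID
    esac
}

if is_geonames_entity "http://sws.geonames.org/2818796/"; then
    echo "handled by the geonames site"
fi
```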

Verification of the geonames.org configuration

There are several possibilities to check if geonames.org was successfully added as Referenced Site to the RICK, but the easiest is to use the following three requests to RESTful services.

  1. Check if the referenced site for geonames.org is active and registered with the RICK by calling
 curl "http://localhost:8080/sites/referenced"

This should give you a JSON Array with the access URIs of all the referenced sites such as

[
  "http:\/\/localhost:8080\/site\/dbPedia\/",
  "http:\/\/localhost:8080\/site\/geonames\/"
]
  2. Search for a spatial object by using the access URI for geonames.org
curl -X POST -d "name=Untersberg&limit=10&offset=0" http://localhost:8080/site/geonames/find

This searches for a mountain near Salzburg and gives the following response:

{
   "query": {
       "selected": ["http:\/\/www.w3.org\/2000\/01\/rdf-schema#label"],
       "constraints": [{
           "type": "text",
           "patternType": "wildcard",
           "text": "Untersberg",
           "field": "http:\/\/www.w3.org\/2000\/01\/rdf-schema#label"
       }]
   },
   "results": [{
       "id": "http:\/\/sws.geonames.org\/2818796\/",
       "http:\/\/www.w3.org\/2000\/01\/rdf-schema#label": [{
           "type": "text",
           "value": "Untersberg"
       }]
   }]
}
  3. Retrieve the entity information for the found entity by calling
curl -X GET -H "Accept: application/json" "http://localhost:8080/site/geonames/entity?id=http://sws.geonames.org/2818796/"

This should give you a response stating that the "Untersberg" is a mountain with an altitude of 1845 m in Germany.

{
   "id": "http:\/\/sws.geonames.org\/2818796\/",
   "site": "geonames",
   "representation": {
       "id": "http:\/\/sws.geonames.org\/2818796\/",
       "http:\/\/www.geonames.org\/ontology#name": [{
           "type": "text",
           "value": "Untersberg"
       }],
       "http:\/\/www.iks-project.eu\/ontology\/rick\/model\/signSite": [{
           "type": "reference",
           "value": "geonames"
       }],
       "http:\/\/purl.org\/dc\/terms\/creator": [{
           "type": "value",
           "value": "http:\/\/www.geonames.org\/"
       }],
       "http:\/\/www.geonames.org\/ontology#alternateName": [{
           "type": "text",
           "value": "Untersberg"
       }],
       "http:\/\/www.w3.org\/2000\/01\/rdf-schema#label": [{
           "type": "text",
           "value": "Untersberg"
       }],
       "http:\/\/www.geonames.org\/ontology#wikipediaArticle": [{
           "type": "reference",
           "value": "http:\/\/en.wikipedia.org\/wiki\/Untersberg"
       }],
       "http:\/\/www.geonames.org\/ontology#featureClass": [{
           "type": "reference",
           "value": "http:\/\/www.geonames.org\/ontology#T"
       }],
       "http:\/\/www.geonames.org\/ontology#parentCountry": [{
           "type": "reference",
           "value": "http:\/\/sws.geonames.org\/2921044"
       }],
       "http:\/\/purl.org\/dc\/terms\/date": [{
           "type": "value",
           "value": "Tue May 13 00:00:00 CEST 1997"
       }],
       "http:\/\/www.geonames.org\/ontology#countryCode": [{
           "type": "value",
           "value": "DE"
       }],
       "http:\/\/www.w3.org\/2003\/01\/geo\/wgs84_pos#long": [{
           "type": "value",
           "value": "12.98333"
       }],
       "http:\/\/www.w3.org\/2003\/01\/geo\/wgs84_pos#alt": [{
           "type": "value",
           "value": "1845"
       }],
       "http:\/\/www.w3.org\/1999\/02\/22-rdf-syntax-ns#type": [{
           "type": "reference",
           "value": "http:\/\/www.geonames.org\/ontology#Feature"
       }],
       "http:\/\/www.geonames.org\/ontology#featureCode": [{
           "type": "reference",
           "value": "http:\/\/www.geonames.org\/ontology#T.MTS"
       }],
       "http:\/\/www.w3.org\/2003\/01\/geo\/wgs84_pos#lat": [{
           "type": "value",
           "value": "47.7"
       }]
   }
}