Content-knowledge-reference-infrastructure
From IKS Project
This is a proposal to define a "Content and Knowledge Reference Infrastructure" as a part of the IKS Stack. The intention of such an infrastructure is to help organizations running an IKS-enabled CMS to empower their business by using publicly available data.
The big picture
A lot of information is available on the Web. To give just a few examples of publicly available data sets:
- 3 million+ entries in the English Wikipedia
- 800,000 albums and 500,000 artists available via musicbrainz.org
- 1.5 million titles and nearly 1 million actors available via imdb.com
- 7 million GIS entries available via geonames.org
- 350 million nodes and 30 million ways available from linkedGeoData.org and openstreetmap.org
A single organization running a CMS can never create proprietary data sets with such coverage and overall quality. So for some application cases there are only two options: (1) no support or (2) using such open data sets. Take the BBC as an example: offerings like BBC Music (http://www.bbc.co.uk/music) or the BBC Wildlife Finder (http://www.bbc.co.uk/wildlifefinder/) would not be possible, at least not in that quality, without using publicly available data.
But there is still a problem: simply referring to such available data is not enough. For each of the above web presentations, the external data just helps the BBC to get its message to its customers. The external information provides more detailed background and fills gaps which would be hard to fill using in-house resources. The message of the content is still controlled by the BBC (e.g. featured broadcasts, Most Played Artists On The BBC, This Week's BBC Music Reviews ...). But a few clicks further down, more and more of the information comes from public data sets. In my opinion the following two blog posts provide good insights into this issue:
- Seth Gottlieb - Content is not Data (http://www.contenthere.net/2008/05/content-is-not-data.html): This describes very well the difference between data and content. Everyone intending to use public data to enhance their content should consider Gottlieb's points.
- Silver Oliver - The importance of curation in a metadata driven information architecture (http://blockslabpillar.com/2010/03/06/the-importance-of-curation-in-a-metadata-data-driven-information-architecture/): This blog post tells a similar story, but from a very different point of view. Oliver writes: "Because the things that live in our model are associated with assets and data, the journalist, in selecting a thing to include in a collection pulls data through the system." This indicates the existence of a domain model. The domain model is filled with data, which hopefully includes data from public data sets. The domain model supports the presentation of the content to the consumers. Journalists can also create collections to bring special stories to their audience. People interested in the methodology the team at the BBC uses to build such web pages should also have a look at http://www.bbc.co.uk/blogs/radiolabs/2009/01/how_we_make_websites.shtml.
All this shows that referring to public data and using it within a CMS is not only a technical challenge; it also has a strong methodological and organizational aspect. So when adding support for referring to content and knowledge to the IKS stack, we need to answer questions about knowledge modelling, organizational aspects and methodology:
- Knowledge modelling: How to link data sets? Access them on demand and/or cache them locally? How to solve entity identification problems (e.g. by asking the user or using some contextual information)? Are there common domain models or at least partial structures (e.g. Persons working for Organizations, Actors performing in Movies, People participating in Meetings/Events ...)? Should common sets of attributes be defined to describe Persons, Organizations, Places, ...?
- Organizational aspects: Does the content agree with the preferred interpretation of the organization (e.g. an Israeli newspaper and an Iranian one would probably not agree on a common description of the "State of Palestine")? Is the quality of the data sufficient? Is the data set trustworthy?
- Methodology aspects: Defining the workflows: e.g. users who suggest new entities or additional content for existing ones; supervisors who need to confirm suggested entities; supervisors who enable/disable whole data sets. Methodologies for content creation and content organization using public data: e.g. defining relations to entities rather than using tags, as suggested by http://stdout.be/2010/tags-dont-cut-it/; using publicly available content to provide detailed background information (like the biographies of artists on BBC Music); using publicly available knowledge to organize content better based on the domain model (like the scientific classification of species used by the BBC Wildlife Finder).
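To illustrate the point about relations versus tags, here is a minimal sketch in Python. The Relation type, the content ids and the example.org URIs are all made up for illustration and are not part of the IKS stack. It only shows why a typed link to an identified entity answers a question that a plain tag cannot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    content_id: str   # content item in the CMS (hypothetical id)
    predicate: str    # kind of relation, e.g. "mentions" or "reviews"
    entity_uri: str   # identifier of the entity in a public data set (placeholder)


# Flat tags: "paris" could be the city, a person or a film.
tags = {"article-1": {"paris"}, "article-2": {"paris"}}

# Relations to entities: each content item points at one identified thing.
relations = [
    Relation("article-1", "mentions", "http://example.org/place/Paris"),
    Relation("article-2", "mentions", "http://example.org/person/Paris_Hilton"),
]


def content_about(entity_uri: str) -> list:
    """Return the ids of all content items related to the given entity."""
    return [r.content_id for r in relations if r.entity_uri == entity_uri]


# The tag lookup cannot tell the two articles apart; the entity lookup can.
by_tag = sorted(cid for cid, item_tags in tags.items() if "paris" in item_tags)
by_entity = content_about("http://example.org/place/Paris")
```

The tag lookup returns both articles, while the entity lookup returns only the one about the city. The same relation records could later carry the domain model mentioned above (e.g. Actors performing in Movies) by using other predicates.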
The Content and Knowledge Reference Infrastructure
Within such a setting, a "Content and Knowledge Reference Infrastructure" would need to support the following things (sorted top down, from the consumer level down to administration and storage; these are only some examples and not a complete list):
- Entity suggestion and the corresponding interaction patterns: e.g. FISE detects the named entity "Paris", the UI suggests that the user tag this content with Paris as defined in Wikipedia, the user accepts the suggestion, and the CMS manages the new entity "Paris" and its mapping to Wikipedia. Note that for suggestions it is central to also suggest entities that have not yet been imported by the system.
- Management and lifecycle support for referred/used entities: e.g. a user tags a content item with "Paris", the CMS creates "Paris" as a proposed entity, and an archivist goes through the list of all proposed tags, checks the metadata and linked content and, if everything is OK, confirms it.
- Mapping/identity management: e.g. an NLP component of FISE detects a place, but no mapping entry can be found in any linked data set. So the CMS creates a local entity, using the marked string as a "label" and the context in the text as a description. An editor checks the list of proposed entities and maps the place manually to a GIS entry of geonames.org. Geonames.org also knows about the appropriate Wikipedia entry, so the editor is also asked whether to map the content/knowledge available from Wikipedia.
- Linked site management and support for common data sets: Administrators need to manage (add/enable/disable/configure) linked sites. The infrastructure should provide components that support common sites out of the box. It should support knowledge extraction based on the IKS ontology (e.g. persons, organizations, countries, regions, landmarks ...). The same would be desirable for content (e.g. text from Wikipedia and pictures from wikimedia.org).
- Use the existing CMS as storage: Such an infrastructure needs to store its data within the CMS. Support for CMIS and JCR should be available out of the box, but it should also be possible to use CMSs which are not based on CMIS or JCR. For search, one would need to check whether the standard search capabilities of CMIS and JCR can be used. As an alternative, this infrastructure may use its own indexing facilities.
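The lifecycle, mapping and storage items above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Entity, State, EntityStore, InMemoryStore and EntityManager are hypothetical, the "Old Mill" place and its example.org URI are placeholders, and the in-memory store merely stands in for a CMIS- or JCR-backed one.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from itertools import count
from typing import Iterable, Protocol


class State(Enum):
    PROPOSED = "proposed"    # suggested by a user or an enhancement engine
    CONFIRMED = "confirmed"  # checked and approved by an archivist or editor


@dataclass
class Entity:
    id: str
    label: str
    description: str = ""
    state: State = State.PROPOSED
    mappings: set[str] = field(default_factory=set)  # URIs in public data sets


class EntityStore(Protocol):
    """Storage backend; a CMIS- or JCR-based class could offer the same methods."""

    def save(self, entity: Entity) -> None: ...
    def get(self, entity_id: str) -> Entity: ...
    def all(self) -> Iterable[Entity]: ...


class InMemoryStore:
    """Stand-in for the CMS repository (assumption, for illustration only)."""

    def __init__(self) -> None:
        self._items: dict[str, Entity] = {}

    def save(self, entity: Entity) -> None:
        self._items[entity.id] = entity

    def get(self, entity_id: str) -> Entity:
        return self._items[entity_id]

    def all(self) -> Iterable[Entity]:
        return list(self._items.values())


class EntityManager:
    def __init__(self, store: EntityStore) -> None:
        self._store = store
        self._ids = count(1)

    def propose(self, label: str, description: str = "",
                mapping: str | None = None) -> Entity:
        """Create a local entity in the PROPOSED state, optionally already mapped."""
        entity = Entity(f"local:{next(self._ids)}", label, description)
        if mapping:
            entity.mappings.add(mapping)
        self._store.save(entity)
        return entity

    def pending(self) -> list[Entity]:
        """Entities still waiting for review by an archivist or editor."""
        return [e for e in self._store.all() if e.state is State.PROPOSED]

    def map_to(self, entity_id: str, uri: str) -> None:
        entity = self._store.get(entity_id)
        entity.mappings.add(uri)
        self._store.save(entity)

    def confirm(self, entity_id: str) -> None:
        entity = self._store.get(entity_id)
        entity.state = State.CONFIRMED
        self._store.save(entity)


manager = EntityManager(InMemoryStore())

# Accepted suggestion: the mapping to Wikipedia is known from the start.
paris = manager.propose("Paris", mapping="http://en.wikipedia.org/wiki/Paris")

# Detected place without a match: label from the marked string, description from context.
place = manager.propose("Old Mill", description="... met at the Old Mill by the river ...")

# An editor later maps the place by hand (placeholder URI) and confirms it.
manager.map_to(place.id, "http://example.org/gis/old-mill")
manager.confirm(place.id)
```

After these calls only "Paris" is left in manager.pending(). Keeping the store behind a small interface is one way to meet the requirement that CMSs not based on CMIS or JCR can be supported as well; whether search goes through the store or through a separate index is left open here, as it is in the text.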
The Symbol Service Specification
The intention of the Symbol Service is to cover the basic services needed by such an infrastructure. This would include the following features:
- entity identification
- mapping/referencing public data
- management of linked sites (holding the data sets)
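As a rough illustration of how these three features could fit together, consider the following Python sketch. It is not taken from the specification: SymbolService, LinkedSite, Candidate and all example.org URIs are hypothetical, and each linked site is reduced to a small in-memory dictionary instead of a remote data set.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    site: str   # name of the linked site that produced the match
    uri: str    # identifier of the entity in that site's data set
    label: str


@dataclass
class LinkedSite:
    name: str
    entries: dict[str, str]  # uri -> label; stands in for a remote data set
    enabled: bool = True

    def find(self, text: str) -> list[Candidate]:
        """Case-insensitive substring match on labels (a deliberately naive lookup)."""
        wanted = text.casefold()
        return [
            Candidate(self.name, uri, label)
            for uri, label in self.entries.items()
            if wanted in label.casefold()
        ]


class SymbolService:
    def __init__(self) -> None:
        self._sites: dict[str, LinkedSite] = {}

    def add_site(self, site: LinkedSite) -> None:
        self._sites[site.name] = site

    def set_enabled(self, name: str, enabled: bool) -> None:
        self._sites[name].enabled = enabled

    def identify(self, text: str) -> list[Candidate]:
        """Collect candidates from all enabled sites; picking one is left to the user."""
        found: list[Candidate] = []
        for site in self._sites.values():
            if site.enabled:
                found.extend(site.find(text))
        return found


service = SymbolService()
service.add_site(LinkedSite("encyclopedia", {
    "http://example.org/enc/Paris": "Paris",
    "http://example.org/enc/Paris_Texas": "Paris, Texas",
}))
service.add_site(LinkedSite("gazetteer", {"http://example.org/gis/1": "Paris"}))

candidates = service.identify("paris")   # matches from both sites
service.set_enabled("gazetteer", False)  # an administrator disables one data set
remaining = service.identify("paris")    # only the encyclopedia is consulted
```

The sketch returns all candidates instead of choosing one, because the entity identification problem is meant to be resolved by the user or by contextual information. A candidate's URI is what a local entity would then store as its mapping to the public data set.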
The symbol service specification is work in progress and can be viewed at https://docs.google.com/Doc?docid=0AXkrJWeeMbEfZGd0OHI1anNfMTJnNW53bTdnNw&hl=en.
The document gives two motivating examples of possible usages of the service. Then, it presents a grounding for symbols that is based on semiotic principles. The intention of this section is to clarify the definitions of symbol and entity and to show that the problems that need solving are at both the technical and the organizational level. The third section applies the findings to the content management domain. The fourth section discusses a first attempt at a data model for the symbol service. The last section is currently a skeleton that needs to be filled with the detailed interface and service specification, which will require discussion amongst practitioners.

