A mock-up demonstrating delivery of search results and documents directly to users

Sigfrid Lundberg (slu@kb.dk)
Digital Development and Production
The Royal Library
Post box 2149
1016 Copenhagen K
Denmark

Background

Any project or service that aims at sharing data across platforms runs into the problem of finding a way to encode the data so that it is readable on all systems.

This problem arises even in a completely homogeneous system that has, for instance, multiple and incompatible database schemas for data that are fundamentally similar from a semantic point of view. This is the situation for a range of our services here at the Royal Library. The problem may arise for many reasons. One is ignorance, but the most common one is that the data in the different databases have been collected for different purposes, each of which calls for a different data model.

The most common solution to the problem is to map the database fields into a set of common elements, and encode the result in some general format. Today this general format is more often than not XML.
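
As a minimal sketch of this kind of mapping, the Python fragment below translates rows from two differently shaped sources into the same common elements and serialises the result as XML. The database names, field names, common element names and the to_common_xml function are all hypothetical; none of them are taken from the Royal Library's systems.

```python
import xml.etree.ElementTree as ET

# Hypothetical field mappings: the database and field names below are
# invented for illustration only.
FIELD_MAPS = {
    "manuscripts_db": {"ms_title": "title", "scribe": "creator"},
    "letters_db": {"heading": "title", "sender": "creator"},
}


def to_common_xml(source, row):
    """Map one database row to the common elements and serialise it as XML."""
    record = ET.Element("record", {"source": source})
    for field, common_name in FIELD_MAPS[source].items():
        # Fields missing from the row are simply skipped.
        if field in row:
            ET.SubElement(record, common_name).text = str(row[field])
    return ET.tostring(record, encoding="unicode")


print(to_common_xml("letters_db", {"heading": "Brev", "sender": "N.N."}))
```

Both sources end up with the same title and creator elements, so software consuming the XML does not need to know which schema a record came from.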

This mock-up attempts to present a way to do this for a special case, namely a concordance search result. The document presenting the concordance is machine-readable, so software can use it for many different purposes, only one of which is to present it visually to a user.

Details

Search result

This directory contains a mock-up showing the encoding of a concordance search result as a TEI P5 document, which links to a single "text context" that is also a TEI document. All these documents are rendered directly in the browser, using client-side XSL. Just click on concordance_encoding.xml to view the "result set".

The details needed for encoding a search result are based on OpenSearch, which is meant for delivering search results using XML syndication formats (like RSS and Atom). However, for this purpose I felt that presenting the search result directly in TEI was a better choice.

The OpenSearch section is for implementing functions such as navigation of search results.
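
To illustrate the kind of navigation this enables, here is a small Python sketch that computes where the previous and next result pages start. It assumes the three values correspond to the OpenSearch response elements totalResults, startIndex and itemsPerPage, and that startIndex is 1-based. The function name page_links is hypothetical, and how the mock-up actually embeds these values in the TEI document is not shown here.

```python
def page_links(total_results, start_index, items_per_page):
    """Return the start indices of the previous and next result pages.

    The arguments are assumed to hold the values of the OpenSearch
    elements totalResults, startIndex (1-based) and itemsPerPage.
    None means that there is no such page.
    """
    previous_start = None
    if start_index > 1:
        # Never step back past the first result.
        previous_start = max(1, start_index - items_per_page)

    next_start = start_index + items_per_page
    if next_start > total_results:
        # The current page already contains the last result.
        next_start = None

    return {"previous": previous_start, "next": next_start}


print(page_links(total_results=25, start_index=11, items_per_page=10))
```

A client can turn the returned indices into "previous" and "next" links, and leave out a link when the value is None.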

Text (source or text context) document

The texts are mainly stolen from DSL. I used the CST part-of-speech (POS) tagger to tag the text context document.

Each word in the document has its own element, with a pointer to the explanation of the POS tag (the explanations are in Swedish and most likely incorrect, since I've guessed their meaning). The text mark-up looks like the following snippet:

      <p>
          <w ana="#N_DEF_SING">fællesmængden</w>
          <w ana="#PRÆP">af</w>
          <w ana="#N_INDEF_PLU">adjektiver</w>
          <w ana="#SKONJ">og</w>
          ...
      </p>
    

The ana attributes refer to the explanations, which are found in the <sourceDesc> of the TEI header.

      <interpGrp>
          <interp xml:id="ADJ">Adjektiv</interp>
          <interp xml:id="ADV">Adverb</interp>
          <interp xml:id="EGEN">Egennamn</interp>
          <interp xml:id="N_DEF_SING">Substantiv, bestämd form singular</interp>
          ...
      </interpGrp>
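
Since the document is meant to be machine-readable, software can follow the ana pointers itself. The Python sketch below shows one way this might be done with the standard library. It makes several assumptions: the sample is assembled from the two snippets above, the <doc> wrapper exists only to make the fragment well-formed, the TEI namespace that a real TEI P5 document declares is left out for brevity, and the function name explain_words is hypothetical. Tags whose explanation is elided in the snippet above (such as PRÆP) are left unresolved rather than guessed.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Sample built from the snippets in this document. The <doc> wrapper is an
# assumption made only to get a single well-formed root element.
SAMPLE = """<doc>
  <interpGrp>
    <interp xml:id="ADJ">Adjektiv</interp>
    <interp xml:id="N_DEF_SING">Substantiv, bestämd form singular</interp>
  </interpGrp>
  <p>
    <w ana="#N_DEF_SING">fællesmængden</w>
    <w ana="#PRÆP">af</w>
  </p>
</doc>"""


def explain_words(xml_text):
    """Pair each word with the explanation its ana attribute points to."""
    root = ET.fromstring(xml_text)
    # Index the explanations by their xml:id.
    explanations = {i.get(XML_ID): i.text for i in root.iter("interp")}
    pairs = []
    for word in root.iter("w"):
        # ana holds a pointer such as "#ADJ"; drop the leading "#".
        target = word.get("ana", "").lstrip("#")
        # Unknown pointers resolve to None instead of raising an error.
        pairs.append((word.text, explanations.get(target)))
    return pairs


for text, explanation in explain_words(SAMPLE):
    print(text, "=>", explanation)
```

The same lookup is what a client-side stylesheet would need to do in order to show the explanation next to each word.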
    

The rest of the content here is either irrelevant or self-explanatory.