A demonstration of part-of-speech tagging of XML-encoded text using an external POS web service accessed from inside an XSL transformation

Bart Jongejan (bart@mail.cst.dk)1 & Sigfrid Lundberg (slu@kb.dk)2

1. Center for Sprogteknologi
Det Humanistiske Fakultet
Københavns Universitet
Njalsgade 140-142, bygn. 25
2300 København S
Denmark

2. Digital Development and Production
The Royal Library
Post box 2149
1016 Copenhagen K
Denmark

Background

Part-of-speech (POS) tagging requires some computational power, more than was generally available some years ago. It does, however, also require specialised know-how and, since taggers are usually based on machine-learning algorithms, a lot of data. Furthermore, the data required depends on language, style and other features, such as when the texts were written. So even if students, researchers and others could in principle have the necessary tools installed locally, it is still sensible to provide tagging as a web service.

XML is today the lingua franca of text encoding. Still, the data formats that popular POS taggers read and write are more often than not highly proprietary, and it is not a trivial task to access a tagger and use it to tag an arbitrary XML document. We demonstrate how that problem can be solved.

This proof of concept contains two main parts:

  1. A POS tagging web service, essentially a version of the CST extended Brill tagger, which in response to an HTTP POST request returns the submitted text, tagged, as text/plain;charset=UTF-8
  2. A POS tagging client, a small utility comprising one script in Perl and another in XSLT. This utility incorporates POS mark-up into an XML document by means of a transformation

Details

The CST Brill tagger reads a text such as this:

Nej, min egen Kære, du var saa modtagelig for Sindsbevægelser. Vi frygtede, saa ophidset som du var, at det skulde gøre for stærkt Ind tryk paa dig. Derfor har du ogsaa gjort langt bedre i at blive hjemme idag.

and, on success, writes the following:

Nej/INTERJ ,/TEGN min/PRON_POSS egen/ADJ Kære/ADJ ,/TEGN du/PRON_PERS var/V_PAST saa/N_INDEF_SING modtagelig/ADJ for/PRÆP Sindsbevægelser*MISCNAMEX/EGEN ./TEGN Vi/PRON_PERS frygtede/V_PAST ,/TEGN saa/N_INDEF_SING ophidset/V_PARTC_PAST som/UNIK du/PRON_PERS var/V_PAST ,/TEGN at/UKONJ det/PRON_DEMO skulde/N_INDEF_SING gøre/V_INF for/PRÆP stærkt/ADV Ind/ADV tryk/N_INDEF_SING paa/V_INF dig/PRON_PERS ./TEGN Derfor/ADV har/V_PRES du/PRON_PERS ogsaa/N_INDEF_SING gjort/V_PARTC_PAST langt/ADV bedre/ADJ i/PRÆP at/UNIK blive/V_INF hjemme/ADV idag/ADV ./TEGN

That is, the text fragment is first tokenized and the tokens are then tagged. The result is written as a space-delimited list of tokens, each consisting of content and tag separated by a slash ('/').
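
The format is simple enough to parse in a few lines. The following Python sketch is only an illustration (the client itself is written in Perl and XSLT, and the function name parse_tagged is ours). It splits the output on whitespace and then splits each token at its last slash, on the assumption that a tag never contains a slash while the content might.

```python
def parse_tagged(tagged):
    """Split tagger output into a list of (content, tag) pairs."""
    pairs = []
    for token in tagged.split():
        # Split at the last slash only: we assume the tag contains no slash
        content, sep, tag = token.rpartition("/")
        if not sep:
            # No slash at all: keep the token and leave the tag empty
            content, tag = token, ""
        pairs.append((content, tag))
    return pairs


sample = "Nej/INTERJ ,/TEGN min/PRON_POSS egen/ADJ Kære/ADJ ,/TEGN"
for content, tag in parse_tagged(sample):
    print(content, tag)
```

Splitting from the right keeps a token whose content is itself a slash, or contains one, in one piece.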

The client processes an XML document using XSLT. A dedicated template matches text nodes:

      <xsl:template match="text()">
        <xsl:apply-templates select="pos:cst_tagger(.)"/>
      </xsl:template>

pos:cst_tagger(.) is a custom XPath function defined in the Perl script. It does the following:

  1. It issues an HTTP POST request to the CST web service
  2. The service returns the text, tagged as described above
  3. The script parses the returned text
  4. The script generates an XML fragment using the XML Document Object Model (DOM)
  5. The DOM object is returned to the XSL processor, where it is subjected to some cosmetic modification and then copied into the document as a replacement for the original text
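
The five steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the Perl script itself. The service URL, the form field name "text", the wrapper element "fragment" and the "w" element with its "pos" attribute are all hypothetical, since none of them is specified here. The HTTP call is kept in a separate function so that the parsing and DOM building can be tried without network access.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical endpoint: the real service URL is not given in the text
SERVICE_URL = "http://example.org/cst-tagger"


def post_to_tagger(text, url=SERVICE_URL):
    """Steps 1-2: POST the text and return the tagged reply as a string."""
    # The form field name "text" is an assumption
    data = urllib.parse.urlencode({"text": text}).encode("utf-8")
    request = urllib.request.Request(url, data=data, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def tag_text_node(text, tagger=post_to_tagger):
    """Steps 3-4: parse the tagged reply and build an XML fragment."""
    fragment = ET.Element("fragment")  # hypothetical wrapper element
    for token in tagger(text).split():
        # Split at the last slash, assuming the tag contains no slash
        content, sep, tag = token.rpartition("/")
        if not sep:
            content, tag = token, ""
        # Element and attribute names are hypothetical
        word = ET.SubElement(fragment, "w", {"pos": tag})
        word.text = content
    # Step 5 would hand this fragment back to the XSL processor
    return fragment


def fake_tagger(text):
    """Stand-in for the web service, returning a fixed reply."""
    return "Nej/INTERJ ,/TEGN min/PRON_POSS"


result = tag_text_node("Nej, min", tagger=fake_tagger)
print(ET.tostring(result, encoding="unicode"))
```

Passing the tagger in as a parameter is a choice made for this sketch: it lets a stand-in such as fake_tagger replace the remote service during testing.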