Part-of-speech (POS) tagging requires some computational power, more than was generally available some years ago. It does, however, require specialised know-how to perform and, since it is usually based on machine-learning algorithms, a lot of data. Furthermore, the data required depends on the language, the style and other features, such as when the texts were written. So even if students, researchers and others could in principle have the necessary tools installed locally, it is still sensible to provide tagging as a web service.
XML is today the lingua franca of text encoding. Still, the data formats that popular POS taggers read and write are more often than not highly proprietary, and it is not a trivial task to access a given tagger and use it to tag an arbitrary XML document. We provide a demonstration of how that problem can be solved.
This proof of concept contains two main parts:
The CST Brill tagger reads a text, such as this:
Nej, min egen Kære, du var saa modtagelig for Sindsbevægelser. Vi frygtede, saa ophidset som du var, at det skulde gøre for stærkt Ind tryk paa dig. Derfor har du ogsaa gjort langt bedre i at blive hjemme idag.
and, on success, writes the following:
Nej/INTERJ ,/TEGN min/PRON_POSS egen/ADJ Kære/ADJ ,/TEGN du/PRON_PERS var/V_PAST saa/N_INDEF_SING modtagelig/ADJ for/PRÆP Sindsbevægelser*MISCNAMEX/EGEN ./TEGN Vi/PRON_PERS frygtede/V_PAST ,/TEGN saa/N_INDEF_SING ophidset/V_PARTC_PAST som/UNIK du/PRON_PERS var/V_PAST ,/TEGN at/UKONJ det/PRON_DEMO skulde/N_INDEF_SING gøre/V_INF for/PRÆP stærkt/ADV Ind/ADV tryk/N_INDEF_SING paa/V_INF dig/PRON_PERS ./TEGN Derfor/ADV har/V_PRES du/PRON_PERS ogsaa/N_INDEF_SING gjort/V_PARTC_PAST langt/ADV bedre/ADJ i/PRÆP at/UNIK blive/V_INF hjemme/ADV idag/ADV ./TEGN
That is, the text fragment is first tokenized, and the tokens are then tagged. The result is written as a space-delimited list of tokens, each consisting of content and tag separated by a slash ('/').
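To make the output format concrete, here is a minimal sketch of how such a line could be split back into token and tag pairs. It is written in Python purely for illustration and is not the code used in this proof of concept; the function name parse_tagged is hypothetical. Splitting on the last slash, so that a token which itself contains a slash stays whole, is an assumption about how the tagger would handle such tokens.

```python
from typing import List, Tuple


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split a line of tagger output into (token, tag) pairs.

    Hypothetical helper: assumes items are separated by whitespace and
    that the tag follows the LAST slash of each item.
    """
    pairs = []
    for item in line.split():
        # rpartition splits on the last '/', keeping earlier slashes in the token.
        token, sep, tag = item.rpartition("/")
        if not sep:
            # No slash at all: keep the item as a token with an empty tag.
            token, tag = item, ""
        pairs.append((token, tag))
    return pairs


if __name__ == "__main__":
    sample = "Nej/INTERJ ,/TEGN min/PRON_POSS egen/ADJ Kære/ADJ ,/TEGN"
    for token, tag in parse_tagged(sample):
        print(f"{token}\t{tag}")
```

Note that any marker the tagger attaches to the token content, such as the *MISCNAMEX suffix in the example output, is left in the token untouched by this sketch.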
The client can process an XML document using XSLT. There is a specific template matching text nodes:
<xsl:template match="text()">
<xsl:apply-templates select="pos:cst_tagger(.)"/>
</xsl:template>
pos:cst_tagger(.) is a custom XPath function defined in the Perl script. It performs the following: