Implementing OTMI
From OpenTextMining
OTMI uses the Atom Syndication Format, and specifically presents OTMI files as Atom Entry Documents. This page defines a concrete implementation of OTMI using the Atom Entry Document serialization; the abstract schema for OTMI is presented in the following.
See OTMI Script for Nature's OTMI generator script.
Both vectors and snippets are generated from the raw text by using a regex to split the text into the appropriate parts. The raw text itself is produced by stripping all mark-up and normalising whitespace. The regex applied to the text is recorded in the OTMI file for each text type, and will typically look something like:
<otmi:vectors>
<otmi:split-regex>(?-mix:\s+\W+|\W+\s+|\s+|\/)</otmi:split-regex>
...
</otmi:vectors>
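The process described above (strip mark-up, normalise whitespace, split on the regex) can be sketched roughly as follows. This is only an illustration in Python, not Nature's generator script. It assumes the split pattern is `\s+\W+|\W+\s+|\s+|/`; the `(?-mix:...)` wrapper is simply how Ruby prints a regex with its flags switched off. The function names `raw_text`, `split_terms` and `word_vector` are hypothetical, the tag stripper is deliberately crude, and lower-casing terms before counting is an assumption rather than something OTMI is known to require.

```python
import re
from collections import Counter

# Assumed split pattern, equivalent to the body of the (?-mix:...) regex.
SPLIT_REGEX = re.compile(r"\s+\W+|\W+\s+|\s+|/")


def raw_text(marked_up):
    """Strip mark-up and normalise whitespace (crude, for illustration only)."""
    no_tags = re.sub(r"<[^>]+>", " ", marked_up)   # replace each tag with a space
    return re.sub(r"\s+", " ", no_tags).strip()    # collapse runs of whitespace


def split_terms(text):
    """Split raw text on the regex, dropping any empty pieces."""
    return [part for part in SPLIT_REGEX.split(text) if part]


def word_vector(text):
    """Count term occurrences (lower-casing is an assumption of this sketch)."""
    return Counter(term.lower() for term in split_terms(text))


sample = "<p>Gene  expression, in <i>vivo</i> and in vitro/culture</p>"
text = raw_text(sample)      # "Gene expression, in vivo and in vitro/culture"
terms = split_terms(text)    # ['Gene', 'expression', 'in', 'vivo', 'and', 'in', 'vitro', 'culture']
vector = word_vector(text)   # vector['in'] is 2
```

Note that with this pattern punctuation is only consumed when it sits next to whitespace, so a full stop at the very end of the text stays attached to the last term.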
