OTMI Principles
From OpenTextMining
These points were picked up from the BioNLP 2007 talk and blogged here:
Comments:
- TH - Tony Hammond
- General Seems that there is not enough appreciation that OTMI is being proposed as a standard framework and methodology for disclosing subscription full text for text mining. That is, most of the features are parametrized and it is up to individual publishers to determine e.g. whether a snippet is a paragraph or a phrase, whether snippets are randomized or not, etc.
- With further publicity we might be able to more effectively communicate this message. - TH
- Random order Questions were asked about whether the order really needs to be shuffled, and whether the snippets could be made larger, e.g. paragraph units. (See point above re publisher choice.)
- This is a publisher choice. Publishers can choose to provide full text within the OTMI framework if they wish. Our feeling is that this represents a fair compromise which will enable a lot of useful things. If that's insufficient then users will have to resort to current practice - i.e. ask the publisher for the full text. - TH
- Stopwords Feeling is that omitting stopwords is just needlessly destructive. Do we need to inflict this lossy transformation on the full text? (It is proper that the OTMI framework allows for this, but do we want to cripple the text thus effectively rendering certain text mining techniques inoperative?)
- We (Nature) have decided not to exclude stopwords. Excluding them was a poor design decision on our part, but the possibility remains there for publishers who would otherwise feel uneasy about releasing text 'in the clear'. - TH
- Word vectors Immediate feeling was that these are pretty much useless as anybody can count, but more practiced hands conceded that these could be a useful 'entry level' for non-specialists, i.e. the vectors could be used to determine a rough and ready document categorization. Related to this were questions on word vectors being made available for a document corpus rather than just the document in question, so that the document could be gauged against a corpus.
- Agreed. Word vectors are to some extent obsoleted by the inclusion of snippets. But they still provide an entry level facility which can simplify some tasks such as a rough and ready document classification. We need to consider what could be done with corpora. - TH
- Sections There was positive feedback re our picking out key sections (methods, conclusions) although there are still questions about section ordering and section naming.
- We need to understand the requirements better and to see how to deliver what people want. - TH
- Tables Do we include table captions? Answer is no, and here I really don't understand why not. Had we thought of making the actual table data available? I don't know, but it probably represents an extra level of complexity because some kind of row/column ordering would need to be preserved.
- Initially we excluded these just for simplicity, but we can always add them in once we figure out how best to represent them. - TH
- Figures We include figure captions, but did we think about adding in (i.e. referencing) the figures themselves? (The figures are currently maintained behind a subscription firewall.)
- OTMI is primarily about opening up subscription text for machine processing. Maybe thumbnails would meet some kind of need. This point needs more feedback. - TH
- References Are references linked back to the original text? I don't think they are properly marked up to allow the reference to be paired off with the source text snippets. Linking them would make a lot of sense.
- Seems reasonable although we need to consider how this plays towards document reassembly. Do references need to be related to snippets, or is it sufficient to relate them back to sections? - TH
- Reuse policy Are snippets of full text able to be reproduced along with annotations on a third-party website?
- Agreed. There should be licensing information included along with the raw OTMI files. We would be open to suggestions. In principle, document sampling should be supported although we would want to discourage third-party syndication of the full OTMI files. These should be retrieved from source. - TH
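To make the parametrized features discussed in the comments above more concrete (snippet size, random order, stopword removal, word vectors), here is a minimal Python sketch. It is only an illustration and does not reflect the actual OTMI file format or any publisher's implementation: the function names (`make_snippets`, `disclose`, `word_vector`), the parameters and the tiny stopword list are all hypothetical.

```python
import random
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; not taken from OTMI.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def make_snippets(text, unit="sentence"):
    """Split text into snippets; the unit is assumed to be a publisher choice."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)      # blank line separates paragraphs
    else:
        parts = re.split(r"(?<=[.!?])\s+", text)  # naive sentence boundary
    return [p.strip() for p in parts if p.strip()]


def disclose(text, unit="sentence", shuffle=True, drop_stopwords=False, seed=None):
    """Return snippets, optionally stripped of stopwords and randomly ordered."""
    snippets = make_snippets(text, unit)
    if drop_stopwords:
        # Lossy step: words are compared without trailing punctuation.
        snippets = [
            " ".join(w for w in s.split()
                     if w.lower().strip(".,;:!?") not in STOPWORDS)
            for s in snippets
        ]
    if shuffle:
        # Shuffle a copy of the order in place; seed makes it repeatable.
        random.Random(seed).shuffle(snippets)
    return snippets


def word_vector(text):
    """Count word occurrences: the 'anybody can count' entry-level feature."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


if __name__ == "__main__":
    sample = "The cell divides. The protein binds to the receptor.\n\nMethods follow."
    print(disclose(sample, unit="sentence", shuffle=True, seed=1))
    print(disclose(sample, unit="paragraph", shuffle=False, drop_stopwords=True))
    print(word_vector(sample).most_common(3))
```

In this sketch a publisher wanting maximum openness would call `disclose` with `unit="paragraph"`, `shuffle=False` and `drop_stopwords=False`, while a more cautious one could shuffle sentence-sized snippets. The word vector is computed from the original text, so it stays intact whatever is done to the snippets.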
