Reference Grammar for OTMI
This annex defines an Augmented Backus-Naur Form (ABNF) specification for OTMI.
; ####################################################################
;
; OTMI - Open Text Mining Interface
;
; Editor: Tony Hammond <mailto:t.hammond@nature.com>
; Authors: Tony Hammond, Timo Hannay, Ben Lund
; Date: July 20, 2007
; Rights: Copyright (c) 2006, 2007 Nature Publishing Group
;
; This specification uses the Augmented Backus-Naur Form (ABNF)
; notation of RFC 4234 [RFC4234] to define OTMI - the Open Text Mining
; Interface proposed by Nature Publishing Group as a means to publish
; content without disclosing narrative intent in order to enable and
; facilitate text-processing applications.
;
; OTMI files are currently serialized as Atom Entry Documents as
; defined by RFC 4287 [RFC4287] but could be serialized variously
; according to other appropriate schemas.
;
; The following core ABNF production is used by this specification
; as defined by Appendix B.1 of RFC 4234: DIGIT.
;
; We also define the following generic datatype productions:
DATETIME = <date-time>
; see RFC 3339 [RFC3339], e.g. 2006-08-03T00:00:00Z
URI = <URI>
; see RFC 3986 [RFC3986] for generic URI syntax
UTF-8 = <UTF-8 encoded text>
; see RFC 3629 [RFC3629] for UTF-8 transformation format encoding
;
; ####################################################################
; Main production
OTMI = key-data bib-data data
;
; ####################################################################
; 1. 'key-data' defines key properties that should be disclosed by all
; 'OTMI' files
key-data = title id link published updated 1*( author ) rights
title = UTF-8
; typically for STM assets the URI in the following production would
; make use of the 'info' URI scheme as defined by RFC 4452 [RFC4452]
id = URI
link = href
href = URI
published = DATETIME
updated = DATETIME
author = name
name = UTF-8
; 'author' is currently broken down only to 'name' level (but could
; be broken down further, e.g. using the FOAF - or 'Friend Of A
; Friend' - vocabulary)
rights = UTF-8
;
; ####################################################################
; 2. 'bib-data' defines bibliographic metadata specific to product
bib-data = <bibliographic metadata specific to product>
; product-specific properties, e.g. for a journal article one might
; require the following properties (here defined in terms of PRISM -
; or 'Publisher Requirements for Industry Standard Metadata' [PRISM]):
;
; bib-data = publicationName volume number startingPage \
; endingPage issn eIssn
;
; publicationName = ; e.g. prism:publicationName
;
; volume = ; e.g. prism:volume
;
; number = ; e.g. prism:number
;
; startingPage = ; e.g. prism:startingPage
;
; endingPage = ; e.g. prism:endingPage
;
; issn = ; e.g. prism:issn
;
; eIssn = ; e.g. prism:eIssn
;
; ####################################################################
; 3. 'data' defines the actual payload for an 'OTMI' file
; 'data' has a 'version' attribute; the minimum 'data' content is
; 'version' followed by one 'section'
data = version [ stoplist ] sections [ floats ] [ references ]
; A - Version
; 'version' follows the standard formula - major/minor/revision,
; e.g. '0.0.0'
version = DIGIT "." DIGIT "." DIGIT
; B - Stoplist
; 'stoplist' references an XML file listing stopwords
stoplist = URI
; C - Sections
; 'sections' content is one or more 'section' elements
sections = 1*( section name )
; 'section' content is a 'section' hierarchy or 'otmi-text'
section = 1*( section name ) / otmi-text
; name of section
name = UTF-8
; D - Floats
; 'floats' are floating elements - 'figure' or 'table' elements
floats = 1*( figure / table )
; 'figure' contains 'title' and/or 'caption' elements
figure = title caption / title / caption
title = full-text
caption = full-text
; 'table' contains 'title' element
table = title
; E - References
; IDs for refs and/or count of refs with no ID
references = 1*( ref-id ) refs-noid / 1*( ref-id ) / refs-noid
; ID for ref
ref-id = URI
; count of refs with no ID
refs-noid = 1*( DIGIT )
; F - OTMI Text
otmi-text = [ vectors ] [ snippets ] [ full-text ]
vectors = number split-regex 1*( vector )
; number of elements in 'vectors' or 'snippets' tables
number = DIGIT
; regex pattern used to split 'reduced-text' (for 'vectors' or
; 'snippets')
split-regex = UTF-8
; word (or, more properly, token) vectors from 'reduced-text'
vector = term count
; token from 'reduced-text'
term = UTF-8
; frequency of term occurrence
count = 1*( DIGIT )
; phrase from 'reduced-text'
snippets = number split-regex 1*( snippet )
snippet = phrase
phrase = UTF-8
; 'full-text' is an uninterrupted run of text
full-text = reduced-text / raw-text
; 'reduced-text' is 'raw-text' with all stopwords removed
reduced-text = UTF-8
; 'raw-text' is flattened text with entities replaced
raw-text = UTF-8
; ####################################################################
;
; Normative References
;
; [RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet:
; Timestamps", RFC 3339, July 2002.
; (Fetched from <http://ietf.org/rfc/rfc3339.txt>.)
; [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
; 10646", STD 63, RFC 3629, November 2003.
; [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
; Resource Identifier (URI): Generic Syntax", STD 66,
; RFC 3986, January 2005.
; (Fetched from <http://ietf.org/rfc/rfc3986.txt>.)
; [RFC4234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
; Specifications: ABNF", RFC 4234, October 2005.
;
; Informative References
;
; [PRISM] Publisher Requirements for Industry Standard Metadata.
; (Fetched from <http://www.prismstandard.org/>.)
; [RFC4287] Nottingham, M. and R. Sayre, "The Atom Syndication
; Format", RFC 4287, December 2005.
; (Fetched from <http://ietf.org/rfc/rfc4287.txt>.)
; [RFC4452] Van de Sompel, H., Hammond, T., Neylon, E. and S. Weibel,
; "The "info" URI Scheme for Information Assets with
; Identifiers in Public Namespaces", RFC 4452, April 2006.
; (Fetched from <http://ietf.org/rfc/rfc4452.txt>.)
;
; ####################################################################
; __END__
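
As a rough, non-normative illustration of the 'version', 'reduced-text' and 'vector' productions above, the following Python sketch checks a version string and derives term/count pairs from a run of raw text. It is not part of the OTMI specification. The function names, the default split pattern, the lower-casing of tokens, and the choice to tokenize once and then drop stopwords are all assumptions made for this example; the grammar itself only says that 'split-regex' splits 'reduced-text' and that each 'vector' pairs a 'term' with a 'count'.

```python
import re
from collections import Counter

# Mirrors the 'version' production: three single DIGITs separated by "."
VERSION_RE = re.compile(r"[0-9]\.[0-9]\.[0-9]")


def is_valid_version(text):
    """Return True if text matches the single-digit 'version' production."""
    return VERSION_RE.fullmatch(text) is not None


def build_vectors(raw_text, stopwords, split_regex=r"\W+"):
    """Derive (term, count) pairs from raw text.

    Assumptions not stated by the grammar: tokens are lower-cased,
    stopwords are given in lower case, and the default split_regex
    is a hypothetical choice.
    """
    # Split on the pattern and discard empty strings left at the edges.
    tokens = [t.lower() for t in re.split(split_regex, raw_text) if t]
    # 'reduced-text' is 'raw-text' with all stopwords removed.
    reduced = [t for t in tokens if t not in stopwords]
    # Each 'vector' pairs a 'term' with its 'count'.
    return Counter(reduced)


if __name__ == "__main__":
    print(is_valid_version("0.0.0"))
    vectors = build_vectors("The cat and the hat and the cat", {"the", "and"})
    for term, count in vectors.most_common():
        print(term, count)
```

A stricter reading of the grammar would first build the 'reduced-text' string and only then apply 'split-regex' to it; for simple word-boundary patterns the two orders give the same tokens, but that equivalence is an assumption of this sketch rather than something the grammar guarantees.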
