Reference Grammar for OTMI

This annex defines an Augmented Backus-Naur Form (ABNF) specification for OTMI.


; #################################################################### 
; 
; OTMI - Open Text Mining Interface 
; 
;   Editor:  Tony Hammond <mailto:t.hammond@nature.com>
;   Authors: Tony Hammond, Timo Hannay, Ben Lund 
;   Date:    July 20, 2007 
;   Rights:  Copyright (c) 2006, 2007 Nature Publishing Group  
; 
; This specification uses the Augmented Backus-Naur Form (ABNF) 
; notation of RFC 4234 [RFC4234] to define OTMI - the Open Text Mining 
; Interface proposed by Nature Publishing Group as a means to publish 
; content without disclosing narrative intent in order to enable and 
; facilitate text-processing applications. 
; 
; OTMI files are currently serialized as Atom Entry Documents as 
; defined by RFC 4287 [RFC4287] but could be serialized variously 
; according to other appropriate schemas. 
; 
; The following core ABNF production is used by this specification 
; as defined by Appendix B.1 of RFC 4234: DIGIT. 
; 
; We also define the following generic datatype productions: 
 
  DATETIME  = 
 
    ; see RFC 3339 [RFC3339], e.g. 2006-08-03T00:00:00Z 
 
  URI  = 
 
    ; see RFC 3986 [RFC3986] for generic URI syntax 
 
  UTF-8  = 
 
    ; see RFC 3629 [RFC3629] for UTF-8 transformation format encoding 
 
; 
; #################################################################### 
; Main production 
 
  OTMI  =  key-data  bib-data  data 
 
; 
; #################################################################### 
; 1. 'key-data' defines key properties that should be disclosed by all 
;    'OTMI' files 
 
  key-data  =  title  id  link  published  updated  1*( author )  rights 
 
    title  =  UTF-8 
 
    ; typically for STM assets the URI in the following production would 
    ; make use of the 'info' URI scheme as defined by RFC 4452 [RFC4452] 
 
    id  =  URI 
 
    link  =  href 
 
      href  =  URI 
 
    published  =  DATETIME 
 
    updated  =  DATETIME 
   
    author  =  name 
 
      name  =  UTF-8 
 
    ; 'author' is currently broken down only to the 'name' level (but 
    ; could be broken down further using a vocabulary such as FOAF, 
    ; 'Friend Of A Friend') 
 
    rights  =  UTF-8 
 
; 
; #################################################################### 
; 2. 'bib-data' defines bibliographic metadata specific to product 
 
  bib-data =  ; bibliographic metadata specific to product 
 
  ; product-specific properties, e.g. for a journal article one might 
  ; require the following properties (here defined in terms of PRISM - 
  ; or 'Publisher Requirements for Industry Standard Metadata' [PRISM]): 
  ; 
  ; bib-data  =  publicationName  volume  number  startingPage 
  ;              endingPage  issn  eIssn  
  ; 
  ;   publicationName  =  ; e.g. prism:publicationName 
  ; 
  ;   volume  =  ; e.g. prism:volume 
  ; 
  ;   number  =  ; e.g. prism:number 
  ; 
  ;   startingPage  =  ; e.g. prism:startingPage 
  ; 
  ;   endingPage  =  ; e.g. prism:endingPage 
  ; 
  ;   issn  =  ; e.g. prism:issn 
  ; 
  ;   eIssn  =  ; e.g. prism:eIssn 
 
; 
; #################################################################### 
; 3. 'data' defines the real payload for an 'OTMI' file 
 
; 'data' is the actual payload and has a 'version' attribute; the
; minimum 'data' content is 'version' followed by one 'section' 

  data  =  version  [ stoplist ]  sections  [ floats ]  [ references ] 
 

; A - Version

  ; 'version' follows the standard formula - major/minor/revision,
  ; e.g. '0.0.0'

  version  =  DIGIT  "."  DIGIT  "."  DIGIT


; B - Stoplist
 
  ; 'stoplist' references an XML file listing stopwords 

  stoplist  =  URI 
  

; C - Sections

  ; 'sections' content is one or more 'section' elements

  sections  =  1*( section name )

    ; 'section' content is a 'section' hierarchy or 'otmi-text'
 
    section  =  1*( section name )  /  otmi-text

    ; name of section 

    name  =  UTF-8 
 

; D - Floats
 
  ; 'floats' are floating elements - 'figure' or 'table' elements

  floats  =  1*( figure / table )

    ; 'figure' contains 'title' and/or 'caption' elements 

    figure  =  title  caption  /  title  /  caption 
  
      title  =  full-text 
 
      caption  =  full-text  

    ; 'table' contains 'title' element 

    table  =  title 
  
      title  =  full-text 


; E - References  

  ; IDs for refs and/or count of refs with no ID 

  references  =  1*( ref-id )  refs-noid  /  1*( ref-id )  /  refs-noid 
 
    ; ID for ref 
 
    ref-id  =  URI 
 
    ; count of refs with no ID 

    refs-noid  =  1*( DIGIT ) 
 

; F - OTMI Text

  otmi-text  =  [ vectors ]  [ snippets ]  [ full-text ] 

    vectors  =  number  split-regex  1*( vector ) 
 
      ; number of elements in 'vectors' or 'snippets' tables

      number  =  DIGIT
 
      ; regex pattern used to split 'reduced-text' (for 'vectors' or 
      ; 'snippets') 

      split-regex  =  UTF-8 
  
      ; word (or, more properly, token) vectors from 'reduced-text' 

      vector  =  term  count 

        ; token from 'reduced-text' 

        term  =  UTF-8 

        ; frequency of term occurrence 

        count  =  1*( DIGIT ) 

    ; phrase from 'reduced-text' 

    snippets  =  number  split-regex  1*( snippet ) 

      snippet  =  phrase 

        phrase  =  UTF-8 

    ; 'full-text' is an uninterrupted run of text

    full-text  =  reduced-text  /  raw-text

      ; 'reduced-text' is 'raw-text' with all stopwords removed 

      reduced-text  =  UTF-8 

      ; 'raw-text' is flattened text with entities replaced 

      raw-text  =  UTF-8 


; #################################################################### 
; 
; Normative References 
; 
; [RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet: 
;           Timestamps", RFC 3339, July 2002. 
;           (Fetched from <http://ietf.org/rfc/rfc3339.txt>.) 
; [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 
;           10646", STD 63, RFC 3629, November 2003. 
; [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 
;           Resource Identifier (URI): Generic Syntax", STD 66, 
;           RFC 3986, January 2005. 
;           (Fetched from <http://ietf.org/rfc/rfc3986.txt>.) 
; [RFC4234] Crocker, D. and P. Overell, "Augmented BNF for Syntax 
;           Specifications: ABNF", RFC 4234, October 2005. 
; 
; Informative References 
; 
; [PRISM]   Publisher Requirements for Industry Standard Metadata. 
;           (Fetched from <http://www.prismstandard.org/>.) 
; [RFC4287] Nottingham, M. and R. Sayre, "The Atom Syndication 
;           Format", RFC 4287, December 2005. 
;           (Fetched from <http://ietf.org/rfc/rfc4287.txt>.) 
; [RFC4452] Van de Sompel, H., Hammond, T., Neylon, E. and S. Weibel, 
;           "The "info" URI Scheme for Information Assets with 
;           Identifiers in Public Namespaces", RFC 4452, April 2006. 
;           (Fetched from <http://ietf.org/rfc/rfc4452.txt>.) 
; 
; #################################################################### 
; __END__ 
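
The structural constraints that the grammar places on 'data' can be sketched in Python. This is an illustrative sketch only and is not part of the OTMI specification. The class and function names (OtmiText, Section, References, check_section, check_data) are hypothetical. It assumes the content has already been extracted from its Atom serialization into plain Python values. It covers only 'version', 'sections' and 'references'; 'stoplist' and 'floats' are left out.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Union

# version = DIGIT "." DIGIT "." DIGIT  (each component is one ASCII digit)
VERSION_RE = re.compile(r"[0-9]\.[0-9]\.[0-9]")


@dataclass
class OtmiText:
    # otmi-text = [ vectors ] [ snippets ] [ full-text ]; every part is optional
    vectors: Optional[dict] = None          # hypothetical shape: term -> count
    snippets: Optional[List[str]] = None
    full_text: Optional[str] = None


@dataclass
class Section:
    # section = 1*( section name ) / otmi-text
    # Assumption: the 'name' paired with each section is stored on the section.
    name: str
    content: Union[List["Section"], OtmiText]


@dataclass
class References:
    # references = 1*( ref-id ) refs-noid / 1*( ref-id ) / refs-noid
    ref_ids: List[str] = field(default_factory=list)
    refs_noid: Optional[int] = None


def check_section(section: Section) -> List[str]:
    """Return the problems found in one section (empty list if none)."""
    if isinstance(section.content, OtmiText):
        return []
    if not section.content:
        return [f"section {section.name!r} has no subsections or text"]
    problems = []
    for child in section.content:
        problems.extend(check_section(child))
    return problems


def check_data(version: str, sections: List[Section],
               references: Optional[References] = None) -> List[str]:
    """Check 'version', 'sections' and the optional 'references' part."""
    problems = []
    if not VERSION_RE.fullmatch(version):
        problems.append(f"version {version!r} is not DIGIT.DIGIT.DIGIT")
    if not sections:
        problems.append("'sections' needs at least one section")
    for section in sections:
        problems.extend(check_section(section))
    if references is not None:
        if not references.ref_ids and references.refs_noid is None:
            problems.append("'references' needs a ref-id or a refs-noid count")
        if references.refs_noid is not None and references.refs_noid < 0:
            problems.append("refs-noid must be a non-negative count")
    return problems
```

Under these assumptions, check_data("1.0.0", [Section("Introduction", OtmiText(full_text="cat sat mat"))]) returns an empty list. A version such as "1.10.0" is reported as a problem, because the grammar allows only a single DIGIT per component.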