Quantity Expressions Dataset

This dataset contains the results of Ulrich Rabenstein’s master’s thesis, in which he developed a framework for detecting quantity expressions in STEM documents.

Accessibility and License

The content of this dataset is licensed to SIGMathLing members for research and tool development purposes.

Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure Agreement because, for most arXiv articles, the right of distribution was given (or assumed) only to arXiv itself.

Contents

Remarks on Annotation Format

The annotations are stored as RDF in a form suitable for the KWARC Annotation Tool (KAT). For more information on KAT and the KAT format, consider reading this and this paper. In the example annotation below,

cse(%2F%2F*%5B%40id%3D'S1.p10.1'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w270'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w272'%5D)

describes the annotated quantity expression. URL-decoding the expression inside the parentheses yields three comma-separated XPaths

//*[@id='S1.p10.1'],//*[@id='S1.p10.1.w270'],//*[@id='S1.p10.1.w272']

where the first path is the common parent, the second path is the start of the annotated range, and the third path is the end of the annotated range.
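
The decoding step described above can be sketched in Python. This is a minimal sketch, not part of the dataset's tooling: the function name `decode_cse` is hypothetical, and it assumes that the XPaths themselves contain no commas.

```python
import re
from urllib.parse import unquote


def decode_cse(fragment):
    """Split a cse(...) fragment into (parent, start, end) XPath strings."""
    match = re.fullmatch(r"cse\((.*)\)", fragment)
    if match is None:
        raise ValueError("not a cse(...) expression: %r" % fragment)
    # Percent-decode the argument list, then split on the separating commas.
    # Assumption: no XPath contains a comma of its own.
    paths = unquote(match.group(1)).split(",")
    if len(paths) != 3:
        raise ValueError("expected 3 XPaths, got %d" % len(paths))
    parent, start, end = paths
    return parent, start, end


fragment = (
    "cse(%2F%2F*%5B%40id%3D'S1.p10.1'%5D"
    "%2C%2F%2F*%5B%40id%3D'S1.p10.1.w270'%5D"
    "%2C%2F%2F*%5B%40id%3D'S1.p10.1.w272'%5D)"
)
parent, start, end = decode_cse(fragment)
print(parent)
print(start)
print(end)
```

For the fragment shown above, the three printed values are the common parent, the start of the range, and the end of the range, in that order.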

<rdf:Description rdf:nodeID="KAT_5764208381">
  <kat:run rdf:nodeID="kat_run"/>
  <kat:kannspec rdf:nodeID="KAT_1_QuantityExpression"/>
  <kat:concept>QuantityExpression</kat:concept>
  <kat:type rdf:resource="http://kwarc.info/semanticextraction/KAnnSpec#quantityexpression"/>
  <kat:annotates rdf:resource="http://localhost/procl.html#cse(%2F%2F*%5B%40id%3D'S1.p10.1'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w270'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w272'%5D)"/>
  <kat:contentmathml rdf:parseType="Literal" score="1">
    <apply>
      <times/>
      <cn>21</cn>
      <apply>
        <times/>
        <apply>
          <csymbol cd="Prefix">Prefix</csymbol>
          <csymbol cd="centi">c</csymbol>
          <csymbol cd="meter">m</csymbol>
        </apply>
      </apply>
    </apply>
  </kat:contentmathml>
</rdf:Description>
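
One way to read such a description programmatically is sketched below with Python's standard `xml.etree.ElementTree`. Several things here are assumptions rather than facts about the dataset: the `kat:` namespace URI is a placeholder (use the one declared in the actual files), the function name `read_annotation` is hypothetical, and the snippet is wrapped in a synthetic `rdf:RDF` root only because the excerpt above carries no namespace declarations.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: placeholder URI; replace with the kat: namespace of the real files.
KAT_NS = "http://example.org/kat#"

SAMPLE = """
<rdf:Description rdf:nodeID="KAT_5764208381">
  <kat:concept>QuantityExpression</kat:concept>
  <kat:annotates rdf:resource="http://localhost/procl.html#cse(%2F%2F*%5B%40id%3D'S1.p10.1'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w270'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w272'%5D)"/>
  <kat:contentmathml rdf:parseType="Literal" score="1">
    <apply><times/><cn>21</cn>
      <apply><times/>
        <apply>
          <csymbol cd="Prefix">Prefix</csymbol>
          <csymbol cd="centi">c</csymbol>
          <csymbol cd="meter">m</csymbol>
        </apply>
      </apply>
    </apply>
  </kat:contentmathml>
</rdf:Description>
"""


def read_annotation(description_xml):
    """Extract the target, concept, and MathML leaves of one rdf:Description."""
    # Wrap the bare snippet in a root element that declares both prefixes.
    wrapped = (
        f'<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:kat="{KAT_NS}">'
        f"{description_xml}</rdf:RDF>"
    )
    desc = ET.fromstring(wrapped).find(f"{{{RDF_NS}}}Description")
    resource = desc.find(f"{{{KAT_NS}}}annotates").get(f"{{{RDF_NS}}}resource")
    # Separate the annotated document from the cse(...) fragment.
    document, fragment = urldefrag(resource)
    mathml = desc.find(f"{{{KAT_NS}}}contentmathml")
    return {
        "document": document,
        "fragment": fragment,
        "concept": desc.findtext(f"{{{KAT_NS}}}concept"),
        "score": mathml.get("score"),
        # Numeric literals and (content dictionary, symbol) pairs, in document order.
        "numbers": [cn.text for cn in mathml.iter("cn")],
        "symbols": [(cs.get("cd"), cs.text) for cs in mathml.iter("csymbol")],
    }


info = read_annotation(SAMPLE)
print(info["numbers"], info["symbols"])
```

The returned `fragment` is still percent-encoded, so it can be URL-decoded and split into its three XPaths as described earlier.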

Download

Evaluation

According to the thesis, a manual validation of 50 randomly selected documents containing in total 646 quantity expressions yielded the following values: