This dataset contains the results of Ulrich Rabenstein’s master's thesis, in which he developed a framework for detecting quantity expressions in STEM documents.
The content of this dataset is licensed to SIGMathLing members for research and tool development purposes.
Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement because, for most arXiv articles, the right of distribution was only given (or assumed) to arXiv itself.
Annotations.zip
: All quantity expressions detected by the spotter in a format suitable for the Kwarc Annotation Tool (KAT).
Documents.zip
: The documents in which quantity expressions were searched. These are modified arXMLiv documents in which each word is wrapped in a <span>. This was required by KAT to annotate words.
Harvest.zip
: Data for math web search.
screen-reader-documents.zip
: The documents prepared in a way that enables screen readers to read out units (“two kilometers” instead of “two k m” for “2km”).

The annotations are stored as RDF in a way suitable for the Kwarc Annotation Tool (KAT). For more information on KAT and the KAT format consider reading this and this paper. In the example annotation below,
cse(%2F%2F*%5B%40id%3D'S1.p10.1'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w270'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w272'%5D)
describes the annotated quantity expression. URL-decoding the expression in the parentheses, we can obtain the three comma-separated XPaths
//*[@id='S1.p10.1'],//*[@id='S1.p10.1.w270'],//*[@id='S1.p10.1.w272']
where the first path is the common parent, the second path is the start of the annotated range, and the third path is the end of the annotated range.
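The decoding step described above can be sketched in Python as follows. This is a minimal illustration, not part of the dataset's tooling: the function name `decode_cse` is hypothetical, and splitting on commas assumes that the XPaths themselves contain no commas, which holds for the simple id lookups shown here.

```python
from typing import Tuple
from urllib.parse import unquote


def decode_cse(fragment: str) -> Tuple[str, str, str]:
    """Split a cse(...) fragment into (parent, start, end) XPaths.

    Hypothetical helper: assumes none of the XPaths contains a comma.
    """
    if not (fragment.startswith("cse(") and fragment.endswith(")")):
        raise ValueError("not a cse(...) fragment: %r" % fragment)
    # Strip the "cse(" prefix and the closing ")" and URL-decode the rest.
    inner = unquote(fragment[len("cse("):-1])
    parts = inner.split(",")
    if len(parts) != 3:
        raise ValueError("expected three XPaths, got %d" % len(parts))
    parent, start, end = parts
    return parent, start, end


example = (
    "cse(%2F%2F*%5B%40id%3D'S1.p10.1'%5D"
    "%2C%2F%2F*%5B%40id%3D'S1.p10.1.w270'%5D"
    "%2C%2F%2F*%5B%40id%3D'S1.p10.1.w272'%5D)"
)
parent, start, end = decode_cse(example)
print(parent)
print(start)
print(end)
```

The three returned strings can then be evaluated against the matching file from Documents.zip with any XPath-capable XML library.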
<rdf:Description rdf:nodeID="KAT_5764208381">
  <kat:run rdf:nodeID="kat_run"/>
  <kat:kannspec rdf:nodeID="KAT_1_QuantityExpression"/>
  <kat:concept>QuantityExpression</kat:concept>
  <kat:type rdf:resource="http://kwarc.info/semanticextraction/KAnnSpec#quantityexpression"/>
  <kat:annotates rdf:resource="http://localhost/procl.html#cse(%2F%2F*%5B%40id%3D'S1.p10.1'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w270'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w272'%5D)"/>
  <kat:contentmathml rdf:parseType="Literal" score="1">
    <apply>
      <times/>
      <cn>21</cn>
      <apply>
        <times/>
        <apply>
          <csymbol cd="Prefix">Prefix</csymbol>
          <csymbol cd="centi">c</csymbol>
          <csymbol cd="meter">m</csymbol>
        </apply>
      </apply>
    </apply>
  </kat:contentmathml>
</rdf:Description>
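As a rough sketch of how such an annotation might be read programmatically, the Python snippet below parses a trimmed copy of the example with the standard library's `xml.etree.ElementTree`. Several things here are assumptions rather than facts about the dataset: the `kat:` namespace URI is a placeholder (the real one is declared in the RDF files themselves), the enclosing `rdf:RDF` element is added only so the fragment parses on its own, and the function name `read_annotation` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# ASSUMPTION: placeholder URI, not the real kat: namespace of the dataset.
KAT_NS = "http://example.org/kat#"

# Trimmed copy of the example annotation, wrapped in an rdf:RDF root
# (the wrapper is an assumption made to get a standalone document).
SNIPPET = """<rdf:RDF xmlns:rdf="%s" xmlns:kat="%s">
<rdf:Description rdf:nodeID="KAT_5764208381">
  <kat:concept>QuantityExpression</kat:concept>
  <kat:annotates rdf:resource="http://localhost/procl.html#cse(%%2F%%2F*%%5B%%40id%%3D'S1.p10.1'%%5D%%2C%%2F%%2F*%%5B%%40id%%3D'S1.p10.1.w270'%%5D%%2C%%2F%%2F*%%5B%%40id%%3D'S1.p10.1.w272'%%5D)"/>
  <kat:contentmathml rdf:parseType="Literal" score="1">
    <apply><times/><cn>21</cn>
      <apply><times/>
        <apply>
          <csymbol cd="Prefix">Prefix</csymbol>
          <csymbol cd="centi">c</csymbol>
          <csymbol cd="meter">m</csymbol>
        </apply>
      </apply>
    </apply>
  </kat:contentmathml>
</rdf:Description>
</rdf:RDF>""" % (RDF_NS, KAT_NS)


def read_annotation(rdf_xml):
    """Return concept, target document, cse fragment, number and symbols."""
    root = ET.fromstring(rdf_xml)
    desc = root.find("{%s}Description" % RDF_NS)
    concept = desc.find("{%s}concept" % KAT_NS).text
    target = desc.find("{%s}annotates" % KAT_NS).get("{%s}resource" % RDF_NS)
    # Everything after '#' is the cse(...) range expression.
    document, _, fragment = target.partition("#")
    mathml = desc.find("{%s}contentmathml" % KAT_NS)
    number = mathml.find(".//cn").text
    # Each csymbol carries its content dictionary (cd) and its surface text.
    symbols = [(s.get("cd"), s.text) for s in mathml.iter("csymbol")]
    return {
        "concept": concept,
        "document": document,
        "fragment": fragment,
        "number": number,
        "symbols": symbols,
    }


annotation = read_annotation(SNIPPET)
print(annotation["concept"], annotation["number"], annotation["symbols"])
```

The sketch only reads the fields shown in the example; a full RDF library would be a more robust choice for processing the complete Annotations.zip archive.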
According to the thesis, a manual validation of 50 randomly selected documents containing a total of 646 quantity expressions yielded the following values: