Scientific statement classification dataset from arXMLiv 08.2018

Part of the arXMLiv project at the KWARC research group

Author

Deyan Ginev

Current release

Accessibility and License

The content of this Dataset is licensed to SIGMathLing members for research and tool development purposes.

Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement, because for most arXiv articles the right of distribution was only given (or assumed) to arXiv itself.

Contents

file name                                        MD5                               size   size unpacked
statement_paragraphs_arxmliv_08_2018.tar         ff48316737b41c13fbaa786eef8d1b6e  22 GB  45 GB
nomath_statement_paragraphs_arxmliv_08_2018.tar  e214eacb3b73fa3e7416f00673f9c298  12 GB  40 GB
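The MD5 digests listed above can be used to check a download before unpacking. The following is a minimal sketch, not part of the release tooling: the helper names md5_of_file and verify_archive are hypothetical, and the sketch assumes the archive keeps the file name shown in the table.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the Contents table above.
EXPECTED_MD5 = {
    "statement_paragraphs_arxmliv_08_2018.tar": "ff48316737b41c13fbaa786eef8d1b6e",
    "nomath_statement_paragraphs_arxmliv_08_2018.tar": "e214eacb3b73fa3e7416f00673f9c298",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path):
    """Hypothetical helper: True if the file's digest matches the published one.

    Assumes the file keeps its published name, which is used as the lookup key.
    """
    path = Path(path)
    return md5_of_file(path) == EXPECTED_MD5[path.name]
```

Calling verify_archive("statement_paragraphs_arxmliv_08_2018.tar") on a complete download should return True. Reading in chunks matters here because the archives are 12 GB and 22 GB.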

Description

For full details, please read our paper announcing the statement classification task.

This is a first public release of an annotated statement dataset derived from arXMLiv, a machine-readable representation of the arXiv corpus of scientific articles.

This resource contains 10,555,689 paragraphs with associated statement labels, realized as one paragraph per file, one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, where we selected only the first paragraph immediately following the heading. Headings include both structural sections (e.g. Introduction) and scholarly statement annotations (e.g. Definition, Proof, Remark).
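The layout described above (one class per subdirectory, one paragraph per file, one sentence per line) can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the root argument is assumed to point at the directory that directly contains the class subdirectories, files are assumed to end in .txt (as in the example sources further down), and UTF-8 encoding is assumed.

```python
from collections import Counter
from pathlib import Path


def iter_paragraphs(root):
    """Yield (label, sentences) for every paragraph file under `root`.

    Assumes `root` directly contains one subdirectory per statement class.
    """
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        # The label is the name of the subdirectory holding the file.
        for path in class_dir.glob("*.txt"):
            text = path.read_text(encoding="utf-8")
            # One sentence per line; skip empty lines.
            sentences = [line for line in text.splitlines() if line.strip()]
            yield class_dir.name, sentences


def count_by_class(root):
    """Count paragraphs per label, e.g. to compare against the published breakdown."""
    return Counter(label for label, _ in iter_paragraphs(root))
```

The generator keeps memory use flat, which is relevant for a dataset of over ten million files. If the archive unpacks into an extra top-level directory, pass that inner directory as root.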

We also include a control dataset of the same statements with all mathematical symbolism omitted (nomath), numbering 10,137,007 paragraphs. This math-free resource is smaller as omitting the formulas results in fewer unique paragraphs. We consider it a useful benchmark when trying to evaluate the specific impact of mathematical expressions on classification performance.

We welcome community feedback on data quality, representation issues, and organization and archival best practices. We plan to release new versions of this data jointly with new releases of the arXMLiv corpus.

Examples

Definition with math lexemes (main data, single sentence, linebreaks for readability):

a directed quantum turing automaton is a quadruple
  italic_T RELOP_equals OPEN_( caligraphic_H PUNCT_, caligraphic_K PUNCT_, caligraphic_L PUNCT_, italic_tau CLOSE_) PUNCT_,
where
  caligraphic_H caligraphic_K and caligraphic_L
are finite dimensional hilbert spaces over the complex field blackboard_C and
  italic_tau METARELOP_colon caligraphic_H MULOP_tensor_product caligraphic_K ARROW_rightarrow
    caligraphic_H MULOP_tensor_product caligraphic_L
is an isometry in fdhilb

source: definition/1e4a1aea317bbf363c5314fb25eaf72c8a350a1007bb8aafc542e188405b93d5.txt

Same definition without math lexemes (nomath data, single sentence, linebreaks for readability):

a directed quantum turing automaton is a quadruple
  where and are finite dimensional hilbert spaces over the complex field and
  is an isometry in fdhilb

nomath source: definition/35b170bae4259a5c430846116142d4e4a45097e52daf818b78ea378d94d14a21.txt
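The relationship between the two examples above can be illustrated in code. In this particular definition, every math lexeme contains an underscore (italic_T, RELOP_equals, caligraphic_H, ...), and dropping those tokens reproduces the nomath sentence. This is an observation about this one example, not a documented rule of the dataset: the function name strip_math_lexemes is hypothetical, and the released nomath data was produced by the extraction pipeline, not by this heuristic.

```python
def strip_math_lexemes(sentence):
    """Drop every whitespace-separated token that contains an underscore.

    Heuristic assumption: math lexemes are exactly the tokens with an
    underscore. This holds for the example below but is not guaranteed
    for the whole dataset.
    """
    return " ".join(token for token in sentence.split() if "_" not in token)


# The definition example from the main data, joined into a single sentence.
with_math = (
    "a directed quantum turing automaton is a quadruple "
    "italic_T RELOP_equals OPEN_( caligraphic_H PUNCT_, caligraphic_K PUNCT_, "
    "caligraphic_L PUNCT_, italic_tau CLOSE_) PUNCT_, "
    "where caligraphic_H caligraphic_K and caligraphic_L "
    "are finite dimensional hilbert spaces over the complex field blackboard_C and "
    "italic_tau METARELOP_colon caligraphic_H MULOP_tensor_product caligraphic_K "
    "ARROW_rightarrow caligraphic_H MULOP_tensor_product caligraphic_L "
    "is an isometry in fdhilb"
)

# The same definition as it appears in the nomath data.
nomath = (
    "a directed quantum turing automaton is a quadruple "
    "where and are finite dimensional hilbert spaces over the complex field and "
    "is an isometry in fdhilb"
)

print(strip_math_lexemes(with_math) == nomath)
```

Because distinct formulas collapse to the same text once the lexemes are removed, paragraphs that differ only in their math become identical, which is consistent with the nomath dataset being smaller.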

Citing this Resource

pure bibTeX

@MISC{SML:statement-classification:08.2018,
  author = {Deyan Ginev},
  title = {Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}},
  howpublished = {\url{https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/}},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2019}
}

bibTeX for the bibLaTeX package (preferred)

@online{SML:statement-classification:08.2018,
  author = {Deyan Ginev},
  title = {Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}},
  url = {https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2019}
}

EndNote

%0 Generic
%T Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}
%A Ginev, Deyan
%D 2019
%I hosted at https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/
%F SML:statement-classification:08.2018b
%O SIGMathLing – Special Interest Group on Math Linguistics

Download

Generated via

Contents Breakdown

statement class   frequency  frequency (nomath)
abstract          1,030,774           1,030,691
acknowledgement     162,230             162,220
affirmation              36                  22
answer                   40                  39
assumption           29,577              26,890
bound                    47                  37
case                  3,256               2,208
claim                89,737              75,778
comment                 325                 322
conclusion          284,585             284,536
condition             3,950               3,508
conjecture           44,893              41,780
constraint              753                 731
convention            2,176               2,160
corollary           436,768             402,728
criterion               236                 219
definition          686,717             667,797
demonstration        23,043              22,842
discussion          116,650             116,643
example             295,152             289,005
exercise                404                 404
expansion                 5                   2
expectation              13                  13
experiment              154                 153
explanation              16                  16
fact                 17,737              16,473
hint                      9                   9
introduction        688,530             688,187
issue                    41                  28
keywords              1,565               1,565
lemma             1,320,646           1,162,559
method               50,968              50,947
notation             16,611              16,077
note                  4,462               4,415
notice                    4                   4
observation          18,776              18,013
overview             11,279              11,277
principle               236                 232
problem              30,369              29,221
proof             2,125,750           2,096,644
proposition         829,068             763,268
question             27,240              26,673
relatedwork          26,300              26,299
remark              639,038             635,180
result              239,905             239,639
rule                    775                 712
solution                163                 144
step                  6,910               6,536
summary                 117                 117
theorem           1,287,653           1,212,044
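
The breakdown shows a strongly imbalanced label distribution, and the nomath reduction differs considerably between classes. The sketch below quantifies both for a handful of rows copied from the table. The function names are hypothetical, and the selection of classes is only illustrative.

```python
# Counts copied from the breakdown table: class -> (frequency, frequency nomath).
COUNTS = {
    "proof": (2_125_750, 2_096_644),
    "lemma": (1_320_646, 1_162_559),
    "theorem": (1_287_653, 1_212_044),
    "abstract": (1_030_774, 1_030_691),
    "exercise": (404, 404),
}

# Total number of paragraphs in the main dataset, as stated in the Description.
TOTAL = 10_555_689


def share_of_total(label):
    """Fraction of all main-dataset paragraphs that carry this label."""
    return COUNTS[label][0] / TOTAL


def nomath_loss(label):
    """Fraction of a class's paragraphs that are absent from the nomath data."""
    full, nomath = COUNTS[label]
    return (full - nomath) / full


for label in COUNTS:
    print(f"{label:10s} share={share_of_total(label):.3f} "
          f"nomath_loss={nomath_loss(label):.3f}")
```

Classes that are typically formula-heavy, such as lemma, lose a noticeably larger fraction of paragraphs in the nomath variant than prose-oriented classes such as abstract. This is worth keeping in mind when comparing classifier scores across the two datasets.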