Scientific statement classification dataset from arXMLiv 08.2018

Part of the arXMLiv project at the KWARC research group

Author

Deyan Ginev

Current release

08.2018

Accessibility and License

The content of this Dataset is licensed to SIGMathLing members for research and tool development purposes.

Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement because, for most arXiv articles, the right of distribution was only given (or assumed) to arXiv itself.

10.5 million plain-text paragraphs associated with a statement class
50 directories, each containing entries from the same class of scientific statements
each filename is a SHA-256 hash of its contents, as a guarantee for uniqueness and random order
two separate tar bundles over the same data, one with and one without lexemes for mathematical expressions
data is extracted from the separately distributed arXMLiv 08.2018 dataset.
see the bottom of this page for a full statement frequency breakdown
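
The naming scheme in the list above can be spot-checked on individual files. The following is a minimal Python sketch, assuming the hash is taken over the raw bytes of the file as stored (this page does not specify the exact input to the hash); the function name `filename_matches_contents` is illustrative and not part of the release.

```python
import hashlib
from pathlib import Path

def filename_matches_contents(path):
    """Check whether a file's stem equals the SHA-256 hex digest of its bytes.

    Assumption: the hash covers the raw file bytes exactly as stored.
    """
    path = Path(path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == path.stem
```

If the check fails on files from an intact download, the assumption about the hashed bytes (for example, trailing newlines or encoding) is the first thing to revisit.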

file name	MD5	size	size unpacked
`statement_paragraphs_arxmliv_08_2018.tar`	`ff48316737b41c13fbaa786eef8d1b6e`	22 GB	45 GB
`nomath_statement_paragraphs_arxmliv_08_2018.tar`	`e214eacb3b73fa3e7416f00673f9c298`	12 GB	40 GB
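
Before unpacking, a downloaded bundle can be compared against the MD5 values in the table above. This is a minimal Python sketch; the helper names `md5_of_file` and `verify_bundle` and any local path passed to them are illustrative assumptions, not part of the release.

```python
import hashlib

# Expected digests, copied from the table above.
EXPECTED_MD5 = {
    "statement_paragraphs_arxmliv_08_2018.tar": "ff48316737b41c13fbaa786eef8d1b6e",
    "nomath_statement_paragraphs_arxmliv_08_2018.tar": "e214eacb3b73fa3e7416f00673f9c298",
}

def md5_of_file(path, chunk_size=1 << 20):
    """Hypothetical helper: stream a file through MD5 in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_bundle(path, name):
    """Return True if the file at `path` matches the published MD5 for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

Reading in chunks keeps memory use flat, which matters for archives of 12 GB and 22 GB.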

Description

For the full details, please read our paper announcing the statement classification task.

This is a first public release of an annotated statement dataset derived from arXMLiv, a machine-readable representation of the arXiv corpus of scientific articles.

This resource contains 10,555,689 paragraphs with associated statement labels, realized as one paragraph per file, one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, where we selected only the first paragraph immediately following the heading. Headings include both structural sections (e.g. Introduction) and scholarly statement annotations (e.g. Definition, Proof, Remark).
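
The layout described above (one subdirectory per class, one paragraph per file, one sentence per line) can be read with a few lines of Python. This is a minimal sketch, assuming the bundle has been unpacked so that `root` directly contains the class subdirectories and that files are UTF-8 encoded; the function name `iter_paragraphs` is illustrative, and the top-level directory name inside the tar is not stated on this page.

```python
from pathlib import Path

def iter_paragraphs(root):
    """Yield (label, sentences) pairs from an unpacked bundle.

    Assumes `root` directly contains one subdirectory per statement class,
    each holding `.txt` files with one sentence per line (UTF-8 assumed).
    """
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # skip stray files next to the class directories
        for txt in sorted(class_dir.glob("*.txt")):
            text = txt.read_text(encoding="utf-8")
            sentences = [line for line in text.splitlines() if line.strip()]
            yield class_dir.name, sentences
```

Because the generator yields one paragraph at a time, it can be consumed lazily without holding all 10.5 million paragraphs in memory.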

We also include a control dataset of the same statements with all mathematical symbolism omitted (nomath), numbering 10,137,007 paragraphs. This math-free resource is smaller because omitting the formulas leaves fewer unique paragraphs. We consider it a useful benchmark for evaluating the specific impact of mathematical expressions on classification performance.

We welcome community feedback on data quality, representation issues, and organization and archival best practices. We plan to release new versions of this data jointly with new releases of the arXMLiv corpus.

Examples

Definition with math lexemes (main data, single sentence, linebreaks for readability):

a directed quantum turing automaton is a quadruple
  italic_T RELOP_equals OPEN_( caligraphic_H PUNCT_, caligraphic_K PUNCT_, caligraphic_L PUNCT_, italic_tau CLOSE_) PUNCT_,
where
  caligraphic_H caligraphic_K and caligraphic_L
are finite dimensional hilbert spaces over the complex field blackboard_C and
  italic_tau METARELOP_colon caligraphic_H MULOP_tensor_product caligraphic_K ARROW_rightarrow
    caligraphic_H MULOP_tensor_product caligraphic_L
is an isometry in fdhilb

source: definition/1e4a1aea317bbf363c5314fb25eaf72c8a350a1007bb8aafc542e188405b93d5.txt

Same definition without math lexemes (nomath data, single sentence, linebreaks for readability):

a directed quantum turing automaton is a quadruple
  where and are finite dimensional hilbert spaces over the complex field and
  is an isometry in fdhilb

nomath source: definition/35b170bae4259a5c430846116142d4e4a45097e52daf818b78ea378d94d14a21.txt
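
The relationship between the two examples above can be approximated with a simple token filter. This is only a heuristic sketch, not the procedure used to build the nomath bundle: it assumes every math lexeme contains an underscore (as all lexemes in the example do) and that no ordinary word does. The function name `drop_math_lexemes` is illustrative.

```python
def drop_math_lexemes(sentence):
    """Heuristic: drop whitespace-separated tokens that contain an underscore.

    Assumption: math lexemes look like `italic_T` or `RELOP_equals`;
    lexemes without an underscore, if any exist, would be kept.
    """
    return " ".join(tok for tok in sentence.split() if "_" not in tok)

example = (
    "where caligraphic_H caligraphic_K and caligraphic_L are finite "
    "dimensional hilbert spaces over the complex field blackboard_C"
)
print(drop_math_lexemes(example))
# prints: where and are finite dimensional hilbert spaces over the complex field
```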

Citing this Resource

pure bibTeX

@MISC{SML:statement-classification:08.2018,
  author = {Deyan Ginev},
  title = {Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}},
  howpublished = {\url{https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/}},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2019}
}

bibTeX for the bibLaTeX package (preferred)

@online{SML:statement-classification:08.2018,
  author = {Deyan Ginev},
  title = {Statement classification dataset, 10.5 million plain-text paragraphs from {arXMLiv:08.2018}},
  url = {https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2019}
}

EndNote

%0 Generic
%T Statement classification dataset, 10.5 million plain-text paragraphs from arXMLiv:08.2018
%A Ginev, Deyan
%D 2019
%I hosted at https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/
%F SML:statement-classification:08.2018b
%O SIGMathLing – Special Interest Group on Math Linguistics

Download

Download links. This is a temporary solution as we are in the process of migrating the data.
SIGMathLing members only. Joining is free and mostly a legal checkmark on our end - all researchers welcome!

Generated via

llamapun 0.3.2

Contents Breakdown

statement class	frequency	frequency (nomath)
abstract	1,030,774	1,030,691
acknowledgement	162,230	162,220
affirmation	36	22
answer	40	39
assumption	29,577	26,890
bound	47	37
case	3,256	2,208
claim	89,737	75,778
comment	325	322
conclusion	284,585	284,536
condition	3,950	3,508
conjecture	44,893	41,780
constraint	753	731
convention	2,176	2,160
corollary	436,768	402,728
criterion	236	219
definition	686,717	667,797
demonstration	23,043	22,842
discussion	116,650	116,643
example	295,152	289,005
exercise	404	404
expansion	5	2
expectation	13	13
experiment	154	153
explanation	16	16
fact	17,737	16,473
hint	9	9
introduction	688,530	688,187
issue	41	28
keywords	1,565	1,565
lemma	1,320,646	1,162,559
method	50,968	50,947
notation	16,611	16,077
note	4,462	4,415
notice	4	4
observation	18,776	18,013
overview	11,279	11,277
principle	236	232
problem	30,369	29,221
proof	2,125,750	2,096,644
proposition	829,068	763,268
question	27,240	26,673
relatedwork	26,300	26,299
remark	639,038	635,180
result	239,905	239,639
rule	775	712
solution	163	144
step	6,910	6,536
summary	117	117
theorem	1,287,653	1,212,044
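
The breakdown above can be recomputed from an unpacked bundle by counting files per class directory. This is a minimal Python sketch, assuming `root` directly contains the class subdirectories and that every paragraph file ends in `.txt`; the function name `class_frequencies` is illustrative.

```python
from collections import Counter
from pathlib import Path

def class_frequencies(root):
    """Count `.txt` files in each class subdirectory of an unpacked bundle."""
    counts = Counter()
    for class_dir in Path(root).iterdir():
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for _ in class_dir.glob("*.txt"))
    return counts
```

Calling `class_frequencies(root).most_common()` lists classes from most to least frequent, which makes the strong imbalance visible (for example, 2,125,750 proof paragraphs against 4 notice paragraphs in the main data).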