arXMLiv 08.2019 - An HTML5 dataset for arXiv.org

Part of the arXMLiv project at the KWARC research group

Author

Release

Accessibility and License

The content of this Dataset is licensed to SIGMathLing members for research and tool development purposes.

Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement because, for most arXiv articles, the right of distribution was only given (or assumed) to arXiv itself.

Contents

subset ID     number of documents    size archived    size unpacked
no_problem    150,701                7.4 GB           57 GB
warning_1     500,000                75 GB            641 GB
warning_2     328,127                50 GB            429 GB
error         395,711                60 GB            521 GB

subset file name                  MD5
arXMLiv_08_2019_no_problem.zip    b70535d607ec916d9f6456b2b1fef421
arXMLiv_08_2019_warning_1.zip     fd4496504020a256f4e4f4200cb731fc
arXMLiv_08_2019_warning_2.zip     5d3ce062a768ce439bd7447f8f011e2b
arXMLiv_08_2019_error.zip         74c91c3b187d151f8bce7bb9936c050f
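Since the archives are large, it can be worth checking a download against the MD5 values listed above before unpacking. The following is a minimal sketch, not an official tool: the helper names `md5_of_file` and `verify_archives` are hypothetical, and it assumes the four zip files sit in a single local directory under the file names from the table.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the table above.
EXPECTED_MD5 = {
    "arXMLiv_08_2019_no_problem.zip": "b70535d607ec916d9f6456b2b1fef421",
    "arXMLiv_08_2019_warning_1.zip": "fd4496504020a256f4e4f4200cb731fc",
    "arXMLiv_08_2019_warning_2.zip": "5d3ce062a768ce439bd7447f8f011e2b",
    "arXMLiv_08_2019_error.zip": "74c91c3b187d151f8bce7bb9936c050f",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archives(directory):
    """Map each archive name to True (match), False (mismatch) or None (not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = None
            continue
        results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the archives were downloaded into the current directory.
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, outcome in verify_archives(".").items():
        print(f"{name}: {labels[outcome]}")
```

Reading in fixed-size chunks keeps memory use flat even for the 75 GB archive; a mismatch most likely indicates an incomplete or corrupted download.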

Description

This is the third public release of the arXMLiv dataset generated by the KWARC research group. It contains 1,374,539 HTML5 scientific documents from the arXiv.org preprint archive, converted from their respective TeX sources. This is an 11% increase in available articles over the 08.2018 release.

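The dataset is segmented into 4 subsets, corresponding to three severity levels of the HTML conversion: no_problem, warning, and error. The warning level is split across two subsets, warning_1 and warning_2.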

This version of the dataset has had minimal manual quality control, and we offer no additional warranty beyond the latexml severity reported.

We welcome community feedback on all of: data quality, representation issues, need for auxiliary resources (e.g. figures, token models), as well as organization and archival best practices. The conversion, build system, and data redistribution efforts are all ongoing projects at the KWARC research group.
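Given the unpacked sizes (up to 641 GB for a single subset), it may be more practical to read documents directly from a zip archive than to extract it. The sketch below uses only the Python standard library. The internal directory layout of the archives is not described on this page, so the code assumes only that each document is stored as a member whose name ends in `.html`; the function name `iter_html_documents` is hypothetical.

```python
import zipfile


def iter_html_documents(archive_path, limit=None):
    """Yield (member_name, html_text) pairs from a zip archive without unpacking it.

    Assumption: documents are archive members whose names end in ".html";
    all other members are skipped.
    """
    with zipfile.ZipFile(archive_path) as archive:
        count = 0
        for info in archive.infolist():
            if limit is not None and count >= limit:
                break
            if info.is_dir() or not info.filename.lower().endswith(".html"):
                continue
            with archive.open(info) as member:
                # Decode leniently so a single malformed byte does not stop the scan.
                text = member.read().decode("utf-8", errors="replace")
            yield info.filename, text
            count += 1


if __name__ == "__main__":
    # Assumption: the no_problem archive is in the current directory.
    for name, html in iter_html_documents("arXMLiv_08_2019_no_problem.zip", limit=3):
        print(name, len(html))
```

Streaming one member at a time keeps disk and memory use small, at the cost of sequential access; for repeated random access to many documents, unpacking may still be preferable.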

Citing this Resource

The dataset should be referenced in all academic publications that present results obtained with its help. The reference should contain the identifier arXMLiv:08.2019 in the title, the author, year, a reference to SIGMathLing, and the URL of the resource description page. For convenience, we supply some records for bibTeX and EndNote below. To cite a particular part of the dataset, use the subset identifiers in the citation, e.g. \cite[no_problem subset]{arXMLiv:08.2019}, or just explain it in the text using the concrete identifier.

pure bibTeX

@MISC{SML:arXMLiv:08.2019,
  author = {Deyan Ginev},
  title = {arXMLiv:08.2019 dataset, an HTML5 conversion of arXiv.org},
  howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/arxmliv-dataset-082019/}},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2019}
}

bibTeX for the bibLaTeX package (preferred)

@online{SML:arXMLiv:08.2019,
  author = {Deyan Ginev},
  title = {arXMLiv:08.2019 dataset, an HTML5 conversion of arXiv.org},
  url = {https://sigmathling.kwarc.info/resources/arxmliv-dataset-082019/},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2019}
}

EndNote

%0 Generic
%T arXMLiv:08.2019 dataset, an HTML5 conversion of arXiv.org
%A Ginev, Deyan
%D 2019
%I hosted at https://sigmathling.kwarc.info/resources/arxmliv-dataset-082019/
%F SML:arXMLiv:08.2019b
%O SIGMathLing – Special Interest Group on Math Linguistics

Download

Generated via