arXMLiv 08.2017 - An HTML5 dataset for arXiv.org

Part of the arXMLiv project at the KWARC research group

Author

Release

Accessibility and License

The content of this Dataset is licensed to SIGMathLing members for research and tool development purposes.

Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement, because for most arXiv articles the right of distribution was given (or assumed) only to arXiv itself.

Contents

subset ID     number of documents   size archived   size unpacked
no_problem    112,088               5 GB            37 GB
warning       574,638               71 GB           595 GB
error         401,644               50 GB           421 GB

subset file name                  MD5
arXMLiv_08_2017_no_problem.zip    036945755c7cc75ea1577cf04ca4fead
arXMLiv_08_2017_warning.zip       c0d5c1baf626225b48264510ac4c6bd5
arXMLiv_08_2017_error.zip         2f4e60b993d85d30523b064c19e45733
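
The MD5 values listed for the three archives can be used to check that a download completed intact. The following is a minimal sketch, assuming the archives sit in a local directory under the file names given in this section; the function names `md5_of_file` and `verify_archives` are illustrative and not part of any arXMLiv tooling.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing in this section.
EXPECTED_MD5 = {
    "arXMLiv_08_2017_no_problem.zip": "036945755c7cc75ea1577cf04ca4fead",
    "arXMLiv_08_2017_warning.zip": "c0d5c1baf626225b48264510ac4c6bd5",
    "arXMLiv_08_2017_error.zip": "2f4e60b993d85d30523b064c19e45733",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archives(directory="."):
    """Map each archive name to True (checksum matches), False (mismatch),
    or None (file not present in the given directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = (md5_of_file(path) == expected)
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, status in verify_archives().items():
        print(f"{name}: {labels[status]}")
```

Reading in chunks matters here because the largest archive is 71 GB; a mismatch most likely indicates a truncated or corrupted download.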

Description

This is the first public release of the arXMLiv dataset generated by the KWARC research group. It contains 1,088,370 HTML5 scientific documents from the arXiv.org preprint archive, converted from their respective TeX sources.

The dataset is segmented into three subsets, each corresponding to a severity level reported by the LaTeXML software responsible for the HTML5 conversion.
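
Since the unpacked subsets range from 37 GB to 595 GB, it can be convenient to read documents directly from a subset archive instead of extracting it first. The sketch below uses Python's standard `zipfile` module. The internal layout of the archives is not described here, so the `.html` suffix filter is an assumption, and `iter_html_documents` is a hypothetical helper name.

```python
import zipfile


def iter_html_documents(archive_path, suffix=".html"):
    """Yield (member_name, text) pairs for each document in a subset archive.

    Assumption: documents are stored as files whose names end in `suffix`;
    adjust the filter if the actual archive layout differs.
    """
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            # Skip directory entries and any non-HTML auxiliary files.
            if info.is_dir() or not info.filename.lower().endswith(suffix):
                continue
            with archive.open(info) as member:
                # Decode leniently: one malformed byte should not abort a long scan.
                yield info.filename, member.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Print the names of the first five documents in the smallest subset.
    for index, (name, _text) in enumerate(
        iter_html_documents("arXMLiv_08_2017_no_problem.zip")
    ):
        if index >= 5:
            break
        print(name)
```

Because the function is a generator, only one document is held in memory at a time, which keeps a full pass over a subset feasible on modest hardware.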

This version of the dataset has had minimal manual quality control, and we offer no additional warranty beyond the severity level reported by LaTeXML.

We welcome community feedback on all of: data quality, representation issues, need for auxiliary resources (e.g. figures, token models), as well as organization and archival best practices. The conversion, build system, and data redistribution efforts are all ongoing projects at the KWARC research group.

A follow-up release is planned for mid-2018, with an up-to-date arXiv dataset and community feedback incorporated. We anticipate annual dataset releases going forward.

Citing this Resource

The dataset should be referenced in all academic publications that present results obtained with its help. The reference should contain the identifier arXMLiv:08.2017 in the title, the author, the year, a reference to SIGMathLing, and the URL of the resource description page. For convenience, we supply records for bibTeX and EndNote below. To cite a particular part of the dataset, use the subset identifier in the citation, e.g. \cite[no_problem subset]{SML:arXMLiv:08.2017}, or name the subset in the text using its identifier.

pure bibTeX

@MISC{SML:arXMLiv:08.2017,
  author = {Deyan Ginev},
  title = {arXMLiv:08.2017 dataset, an HTML5 conversion of arXiv.org},
  howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/arxmliv/}},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2018}
}

bibTeX for the bibLaTeX package (preferred)

@online{SML:arXMLiv:08.2017,
  author = {Deyan Ginev},
  title = {arXMLiv:08.2017 dataset, an HTML5 conversion of arXiv.org},
  url = {https://sigmathling.kwarc.info/resources/arxmliv/},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2018}
}

EndNote

%0 Generic
%T arXMLiv:08.2017 dataset, an HTML5 conversion of arXiv.org
%A Ginev, Deyan
%D 2018
%I hosted at https://sigmathling.kwarc.info/resources/arxmliv/
%F SML:arXMLiv:08.2017b
%O SIGMathLing – Special Interest Group on Math Linguistics

Download

Generated via