arXMLiv 08.2017 - An HTML5 dataset for

Part of the arXMLiv project at the KWARC research group



Accessibility and License

The content of this dataset is licensed to SIGMathLing members for research and tool development purposes.

Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement because, for most arXiv articles, the right of distribution was given (or assumed) only to arXiv itself.


subset ID     number of documents   size archived   size unpacked
no_problem    112,088               5 GB            37 GB
warning       574,638               71 GB           595 GB
error         401,644               50 GB           421 GB

subset file name   MD5
                   036945755c7cc75ea1577cf04ca4fead
                   c0d5c1baf626225b48264510ac4c6bd5
                   2f4e60b993d85d30523b064c19e45733
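The MD5 values listed above can be used to check a downloaded archive for corruption. The following is a minimal sketch, not part of the official tooling: the function names and the `PUBLISHED_MD5` set are illustrative, and because the archive file names are not listed here, the check only tests whether a file's digest matches any of the three published values.

```python
import hashlib

# MD5 digests copied from the table above. Which digest belongs to which
# subset archive is not stated here, so they are kept as an unordered set.
PUBLISHED_MD5 = {
    "036945755c7cc75ea1577cf04ca4fead",
    "c0d5c1baf626225b48264510ac4c6bd5",
    "2f4e60b993d85d30523b064c19e45733",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path):
    """True if the file's MD5 digest equals one of the published values."""
    return md5_of_file(path) in PUBLISHED_MD5
```

Calling `matches_published_checksum` with the local path of a downloaded archive (the path is whatever name the file was saved under) returns `False` for a truncated or altered download.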


This is the first public release of the arXMLiv dataset generated by the KWARC research group. It contains 1,088,370 HTML5 scientific documents from the arXiv preprint archive, converted from their respective TeX sources.
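As a quick consistency check, the per-subset document counts from the table add up to the stated total of 1,088,370. The short sketch below reproduces that arithmetic and derives each subset's share of the dataset; the variable and function names are illustrative only.

```python
# Document counts per subset, as listed in the table above.
SUBSET_COUNTS = {
    "no_problem": 112_088,
    "warning": 574_638,
    "error": 401_644,
}

TOTAL_DOCUMENTS = sum(SUBSET_COUNTS.values())


def subset_shares():
    """Return each subset's fraction of the whole dataset."""
    return {name: count / TOTAL_DOCUMENTS for name, count in SUBSET_COUNTS.items()}


if __name__ == "__main__":
    print(f"total: {TOTAL_DOCUMENTS:,}")
    for name, share in subset_shares().items():
        print(f"{name}: {share:.1%}")
```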

The dataset is segmented into three subsets, each corresponding to a severity level reported by the LaTeXML software responsible for the HTML5 conversion.

This version of the dataset has had minimal manual quality control, and we offer no additional warranty beyond the LaTeXML severity reported.

We welcome community feedback on all of: data quality, representation issues, need for auxiliary resources (e.g. figures, token models), as well as organization and archival best practices. The conversion, build system, and data redistribution efforts are all ongoing projects at the KWARC research group.

A follow-up release is planned for mid-2018, with an up-to-date arXiv dataset and community feedback incorporated. We anticipate annual dataset releases going forward.

Citing this Resource

The dataset should be referenced in all academic publications that present results obtained with its help. The reference should contain the identifier arXMLiv:08.2017 in the title, the author, year, a reference to SIGMathLing, and the URL of the resource description page. For convenience, we supply some records for bibTeX and EndNote below. To cite a particular part of the dataset, use the subset identifiers in the citation, e.g. \cite[no_problem subset]{arXMLiv:08.2017}, or just explain it in the text using the concrete identifier.

pure bibTeX

@misc{arXMLiv:08.2017,
  author = {Deyan Ginev},
  title = {arXMLiv:08.2017 dataset, an HTML5 conversion of},
  howpublished = {hosted at \url{}},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = 2018}

bibTeX for the bibLaTeX package (preferred)

@misc{arXMLiv:08.2017,
  author = {Deyan Ginev},
  title = {arXMLiv:08.2017 dataset, an HTML5 conversion of},
  url = {},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = 2018}


EndNote

%0 Generic
%T arXMLiv:08.2017 dataset, an HTML5 conversion of
%A Ginev, Deyan
%D 2018
%I hosted at
%F SML:arXMLiv:08.2017b
%O SIGMathLing – Special Interest Group on Math Linguistics


Download link (SIGMathLing members only)

Generated via