ar5iv 04.2024 - An HTML5 dataset for arXiv.org

Release

Contents

| subset ID | number of documents | size archived | size unpacked |
|---|---|---|---|
| no_problem | 366,232 | 20 GB | 155 GB |
| warning | 1,304,052 | 216 GB | 2 TB |
| error | 500,515 | 82 GB | 753 GB |
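
Because the unpacked subsets are far larger than the archives, it can be convenient to read documents directly from a ZIP file instead of extracting it first. The following is a minimal sketch using Python's standard `zipfile` module. It makes two assumptions that are not stated on this page: that the converted documents are stored as members whose names end in `.html`, and that they are UTF-8 encoded. The function name `iter_html_members` is hypothetical.

```python
import zipfile


def iter_html_members(archive_path, limit=None):
    """Yield (member_name, html_text) pairs from a ZIP archive without unpacking it.

    Assumption: documents are stored as members ending in ".html" and are
    UTF-8 encoded. Adjust the filter if the archive layout differs.
    """
    with zipfile.ZipFile(archive_path) as archive:
        count = 0
        for info in archive.infolist():
            if limit is not None and count >= limit:
                break
            # Skip directories and anything that is not an HTML document.
            if info.is_dir() or not info.filename.lower().endswith(".html"):
                continue
            with archive.open(info) as member:
                text = member.read().decode("utf-8", errors="replace")
            yield info.filename, text
            count += 1
```

For example, `for name, html in iter_html_members("ar5iv-04-2024-no-problem.zip", limit=5): print(name, len(html))` would list the first five matching documents, if the archive follows the assumed layout.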

Download and License

Description

This is the first public release of the ar5iv dataset, generated by the KWARC research group. It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to the start of April 2024.

As of April 2024, the HTML provided here also seeds the live ar5iv Lab site, maintained by the same author.

For articles with multiple published versions, the underlying TeX sources are the newest ones available, updated as of February 22, 2024.

MD5 file integrity

6ffa80fa273f29716527db36e1841abf  ar5iv-04-2024-no-problem.zip
51582b218f55286e5fe08431eb5e299d  ar5iv-04-2024-warnings.zip
9178d9635085a657956402077b4f8301  ar5iv-04-2024-errors.zip
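
The checksums above can be compared against the downloaded archives with any MD5 tool. As one possible approach, here is a minimal Python sketch using the standard `hashlib` module. It assumes the archives sit in a single directory under the file names listed above. The helper names `md5_of_file` and `verify_downloads` are hypothetical, and files are read in chunks because the archives are tens to hundreds of gigabytes.

```python
import hashlib
from pathlib import Path

# Checksums copied from the listing above.
EXPECTED_MD5 = {
    "ar5iv-04-2024-no-problem.zip": "6ffa80fa273f29716527db36e1841abf",
    "ar5iv-04-2024-warnings.zip": "51582b218f55286e5fe08431eb5e299d",
    "ar5iv-04-2024-errors.zip": "9178d9635085a657956402077b4f8301",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each archive name to True/False (digest match) or None (file missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == expected if path.is_file() else None
    return results


if __name__ == "__main__":
    for name, status in verify_downloads().items():
        label = {True: "OK", False: "MISMATCH", None: "not found"}[status]
        print(f"{name}: {label}")
```

A mismatch most likely indicates an incomplete or corrupted download, in which case the archive should be fetched again.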

Citing this Resource

Pure BibTeX

@MISC{SML:ar5iv:04:2024,
  author = {Deyan Ginev},
  title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org},
  howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/}},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2024} }

BibTeX for the BibLaTeX package (preferred)

@online{SML:ar5iv:04:2024,
  author = {Deyan Ginev},
  title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org},
  url = {https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2024} }

EndNote

%0 Generic
%T ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org
%A Ginev, Deyan
%D 2024
%I hosted at https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/
%F SML:ar5iv:04:2024b
%O SIGMathLing – Special Interest Group on Math Linguistics

Generated via

About

This release is part of the arXMLiv project at the KWARC research group. We are also the team that created and maintains the ar5iv Lab.

The dataset is distributed through hosting provided by the University of Erlangen-Nuremberg (FAU).

Author: Deyan Ginev