318 GB packaged, and 2.9 TB unpacked.
(check the inode availability of the target file system with: df -ih .)
subset ID | number of documents | size archived | size unpacked |
---|---|---|---|
no_problem | 366,232 | 20 GB | 155 GB |
warning | 1,304,052 | 216 GB | 2 TB |
error | 500,515 | 82 GB | 753 GB |
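The figures in the table can be turned into a quick capacity check before unpacking. The sketch below is illustrative and not part of the release: it assumes a Unix-like system (`os.statvfs` is not available on Windows), takes 1 TB as 10^12 bytes, and assumes at least one inode per document, which is a lower bound since a document may unpack into several files. The names `has_capacity` and `check_target` are hypothetical.

```python
import os

# Totals taken from the table above.
NEEDED_BYTES = 2_900 * 10**9               # 2.9 TB unpacked (assuming 1 TB = 10^12 bytes)
DOCUMENTS = 366_232 + 1_304_052 + 500_515  # no_problem + warning + error


def has_capacity(free_bytes, free_inodes,
                 needed_bytes=NEEDED_BYTES, needed_inodes=DOCUMENTS):
    """Return True if both the free space and the free inodes suffice."""
    return free_bytes >= needed_bytes and free_inodes >= needed_inodes


def check_target(path="."):
    """Query the file system holding `path`, similar in spirit to `df -ih .`."""
    stats = os.statvfs(path)
    free_bytes = stats.f_bavail * stats.f_frsize  # space available to unprivileged users
    free_inodes = stats.f_favail                  # inodes available to unprivileged users
    return has_capacity(free_bytes, free_inodes)


if __name__ == "__main__":
    print("documents in release:", DOCUMENTS)
    print("target has capacity:", check_target("."))
```

Some file systems allocate inodes dynamically and report zero for the inode counters, in which case the inode half of this check is not meaningful and only the free-space figure should be trusted.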
This is the first public release of the ar5iv dataset generated by the KWARC research group. It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to the start of April 2024.
As of April 2024, the HTML provided here also seeds the live ar5iv Lab site, maintained by the same author.
For articles with multiple published versions, the underlying TeX sources are the newest ones available, updated as of February 22, 2024.
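Because the unpacked data is roughly nine times larger than the archives, it can be convenient to read documents directly from a zip file. The following is a minimal sketch under stated assumptions: the internal layout of the archives is not described here, so the code simply treats every member ending in `.html` as one document, and counts its MathML `<math>` elements as an example of per-document processing. `MathCounter` and `count_math_per_document` are hypothetical names.

```python
import zipfile
from html.parser import HTMLParser


class MathCounter(HTMLParser):
    """Counts the <math> elements encountered in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "math":
            self.count += 1


def count_math_per_document(archive):
    """Map each .html member of `archive` (a path or file object) to its <math> count.

    Assumption: documents are stored as .html members; other members are skipped.
    """
    counts = {}
    with zipfile.ZipFile(archive) as bundle:
        for name in bundle.namelist():
            if not name.endswith(".html"):
                continue
            parser = MathCounter()
            parser.feed(bundle.read(name).decode("utf-8", errors="replace"))
            parser.close()
            counts[name] = parser.count
    return counts
```

For example, `count_math_per_document("ar5iv-04-2024-no-problem.zip")` would return a dictionary keyed by member name, provided the archive follows the assumed layout. For corpus-scale runs a streaming or parallel approach is likely preferable, since this version keeps all counts in memory.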
6ffa80fa273f29716527db36e1841abf ar5iv-04-2024-no-problem.zip
51582b218f55286e5fe08431eb5e299d ar5iv-04-2024-warnings.zip
9178d9635085a657956402077b4f8301 ar5iv-04-2024-errors.zip
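The values above are 32-character hexadecimal digests, consistent with MD5. A downloaded archive can be checked against them as sketched below. This is an illustrative helper rather than an official tool: it assumes the archives sit in a single directory under the file names listed above, and the function names are hypothetical. Files are read in chunks so that the multi-gigabyte archives are never loaded into memory at once.

```python
import hashlib
from pathlib import Path

# Checksums copied from the list above.
EXPECTED_MD5 = {
    "ar5iv-04-2024-no-problem.zip": "6ffa80fa273f29716527db36e1841abf",
    "ar5iv-04-2024-warnings.zip": "51582b218f55286e5fe08431eb5e299d",
    "ar5iv-04-2024-errors.zip": "9178d9635085a657956402077b4f8301",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archives(directory="."):
    """Return {file name: True/False}; a missing file counts as False."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_archives(".").items():
        print(("OK      " if ok else "MISMATCH") + " " + name)
```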
@MISC{SML:ar5iv:04:2024,
author = {Deyan Ginev},
title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org},
howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/}},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2024} }
@online{SML:ar5iv:04:2024,
author = {Deyan Ginev},
title = {ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org},
url = {https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2024} }
%0 Generic
%T ar5iv:04.2024 dataset, an HTML5 conversion of arXiv.org
%A Ginev, Deyan
%D 2024
%I hosted at https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/
%F SML:ar5iv:04:2024b
%O SIGMathLing – Special Interest Group on Math Linguistics
This release is part of the arXMLiv project at the KWARC research group. We are also the team that created and maintains the ar5iv Lab.
The dataset is distributed through hosting provided by the University of Erlangen-Nuremberg (FAU).
Author: Deyan Ginev