arXMLiv 2020 - An HTML5 dataset for arXiv.org

Release

Contents

Download

Description

This is the fourth public release of the arXMLiv dataset generated by the KWARC research group. It contains HTML5+MathML conversions of the scientific documents from the arXiv.org preprint server, up to and including the end of 2020. It offers a 15% increase in available articles over our 08.2019 release.

The release also provides the associated conversion metadata under meta/grouped_by_severity.zip. The severity information allows filtering by whether the latexml process completed cleanly, with warnings, or with recoverable errors.
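As a rough illustration of severity-based filtering, the Python sketch below reads a zip archive and collects the entries listed in each member. The internal layout of meta/grouped_by_severity.zip is not described on this page, so everything about it here is an assumption: the sketch supposes one plain-text member per severity level with one document per line, and the member names (no_problem.txt, warning.txt, error.txt) and document names are hypothetical stand-ins. Inspect the real archive before adapting it.

```python
import io
import zipfile


def load_severity_groups(zip_source):
    """Map each archive member name to the set of non-empty lines it contains.

    ASSUMPTION: every member is a plain-text list with one document per line.
    """
    groups = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            text = archive.read(info).decode("utf-8", errors="replace")
            groups[info.filename] = {
                line.strip() for line in text.splitlines() if line.strip()
            }
    return groups


def keep(groups, accepted):
    """Return the union of entries from the members named in `accepted`."""
    selected = set()
    for name in accepted:
        selected |= groups.get(name, set())
    return selected


# Synthetic stand-in archive (hypothetical names) so the sketch runs
# without the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("no_problem.txt", "doc-a\ndoc-b\n")
    demo.writestr("warning.txt", "doc-c\n")
    demo.writestr("error.txt", "doc-d\n")
buffer.seek(0)

groups = load_severity_groups(buffer)
print(sorted(keep(groups, ["no_problem.txt", "warning.txt"])))
```

With the real dataset, the in-memory buffer would be replaced by the path to the downloaded archive, and the accepted member names by whatever the archive actually contains.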

A unique feature of the arXMLiv generation process is latexml’s cross-referenced and lexematized MathML representation for math syntax. See the Appendix for an example snippet.

This version of the dataset has had minimal manual quality control, and we offer no additional warranty beyond the latexml severity reported.

Citing this Resource

pure bibTeX

@MISC{SML:arXMLiv:2020,
  author = {Deyan Ginev},
  title = {arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org},
  howpublished = {hosted at \url{https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/}},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2020}
}

bibTeX for the bibLaTeX package (preferred)

@online{SML:arXMLiv:2020,
  author = {Deyan Ginev},
  title = {arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org},
  url = {https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/},
  note = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year = {2020}
}

EndNote

%0 Generic
%T arXMLiv:2020 dataset, an HTML5 conversion of arXiv.org
%A Ginev, Deyan
%D 2020
%I hosted at https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/
%F SML:arXMLiv:2020b
%O SIGMathLing – Special Interest Group on Math Linguistics

Accessibility and License

The content of this Dataset is licensed to SIGMathLing members for research and tool development purposes.

Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement because, for most arXiv articles, the right of distribution was given (or assumed) only to arXiv itself.

Generated via

About

Part of the arXMLiv project at the KWARC research group. Author: Deyan Ginev

Appendix

MathML formula example:

<math id="Sx2.p1.1.m1.1" class="ltx_Math" alttext="\mathbb{E}_{x}" display="inline">
  <semantics id="Sx2.p1.1.m1.1a">
    <msub id="Sx2.p1.1.m1.1.1" xref="Sx2.p1.1.m1.1.1.cmml">
      <mi id="Sx2.p1.1.m1.1.1.2" xref="Sx2.p1.1.m1.1.1.2.cmml">𝔼</mi>
      <mi id="Sx2.p1.1.m1.1.1.3" xref="Sx2.p1.1.m1.1.1.3.cmml">x</mi>
    </msub>
    <annotation-xml encoding="MathML-Content" id="Sx2.p1.1.m1.1b">
      <apply id="Sx2.p1.1.m1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">
        <csymbol cd="ambiguous" id="Sx2.p1.1.m1.1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">subscript</csymbol>
        <ci id="Sx2.p1.1.m1.1.1.2.cmml" xref="Sx2.p1.1.m1.1.1.2">𝔼</ci>
        <ci id="Sx2.p1.1.m1.1.1.3.cmml" xref="Sx2.p1.1.m1.1.1.3">𝑥</ci>
      </apply>
    </annotation-xml>
    <annotation encoding="application/x-tex" id="Sx2.p1.1.m1.1c">
      \mathbb{E}_{x}
    </annotation>
    <annotation encoding="application/x-llamapun" id="Sx2.p1.1.m1.1d">
      blackboard_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT
    </annotation>
  </semantics>
</math>
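
To show how the cross-references in the formula above can be followed, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It pairs each Content MathML node with the presentation node named by its xref attribute and reads out the text annotations. The function names are illustrative rather than part of any dataset tooling, and the sketch assumes the formula is parsed on its own, without XML namespaces, exactly as printed here; formulas embedded in full dataset documents may need namespace-aware handling.

```python
import xml.etree.ElementTree as ET

# The formula from the example above, copied verbatim.
# A raw string keeps the TeX backslashes intact.
SNIPPET = r"""<math id="Sx2.p1.1.m1.1" class="ltx_Math" alttext="\mathbb{E}_{x}" display="inline">
  <semantics id="Sx2.p1.1.m1.1a">
    <msub id="Sx2.p1.1.m1.1.1" xref="Sx2.p1.1.m1.1.1.cmml">
      <mi id="Sx2.p1.1.m1.1.1.2" xref="Sx2.p1.1.m1.1.1.2.cmml">𝔼</mi>
      <mi id="Sx2.p1.1.m1.1.1.3" xref="Sx2.p1.1.m1.1.1.3.cmml">x</mi>
    </msub>
    <annotation-xml encoding="MathML-Content" id="Sx2.p1.1.m1.1b">
      <apply id="Sx2.p1.1.m1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">
        <csymbol cd="ambiguous" id="Sx2.p1.1.m1.1.1.1.cmml" xref="Sx2.p1.1.m1.1.1">subscript</csymbol>
        <ci id="Sx2.p1.1.m1.1.1.2.cmml" xref="Sx2.p1.1.m1.1.1.2">𝔼</ci>
        <ci id="Sx2.p1.1.m1.1.1.3.cmml" xref="Sx2.p1.1.m1.1.1.3">𝑥</ci>
      </apply>
    </annotation-xml>
    <annotation encoding="application/x-tex" id="Sx2.p1.1.m1.1c">
      \mathbb{E}_{x}
    </annotation>
    <annotation encoding="application/x-llamapun" id="Sx2.p1.1.m1.1d">
      blackboard_E start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT
    </annotation>
  </semantics>
</math>"""


def cross_references(math):
    """Pair each Content MathML node with the presentation node its xref names."""
    by_id = {el.get("id"): el for el in math.iter() if el.get("id")}
    pairs = []
    for content in math.iter("annotation-xml"):
        if content.get("encoding") != "MathML-Content":
            continue
        for node in content.iter():
            target = by_id.get(node.get("xref"))
            if target is not None:
                pairs.append(
                    (node.tag, node.get("id"), target.tag, target.get("id"))
                )
    return pairs


def annotations(math):
    """Map each <annotation> encoding to its whitespace-stripped text."""
    return {
        a.get("encoding"): (a.text or "").strip()
        for a in math.iter("annotation")
    }


math = ET.fromstring(SNIPPET)
for content_tag, content_id, pres_tag, pres_id in cross_references(math):
    print(f"{content_tag} ({content_id}) -> {pres_tag} ({pres_id})")
print(annotations(math)["application/x-llamapun"])
```

The lookup runs from the content side to the presentation side; because the presentation nodes carry xref attributes pointing back at the .cmml ids, the same id index would also support the reverse direction.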