First Data Sets (1.1 million scientific HTML5 documents from arXiv and token models)

SIGMathLing has published its first data sets, which also serve as templates for future ones. The content of these data sets is licensed to SIGMathLing members for research and tool development purposes, subject to the SIGMathLing Non-Disclosure Agreement.

This collection of 1.1 million HTML5 documents was developed as part of the arXMLiv project at the KWARC research group. It was created by converting the arXiv collection of scientific preprints, up to August 2017, to HTML5 with LaTeXML, using the CorTeX corpus management system.

The token models are generated from this document collection using the LLaMaPUn and GloVe libraries.
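
As a rough illustration of how such a token model might be consumed, the following Python sketch parses embeddings in the plain-text layout that GloVe conventionally writes (one token per line, followed by its space-separated vector components) and compares two tokens by cosine similarity. It assumes the published models follow that layout, which this announcement does not confirm; the sample lines and their values are made up for the example and are not taken from the data set.

```python
import math
from typing import Dict, Iterable, List


def parse_glove_lines(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse lines of the form 'token v1 v2 ... vn' into a token -> vector map."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, values = parts[0], parts[1:]
        vectors[token] = [float(v) for v in values]
    return vectors


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical three-dimensional sample; real models have many more dimensions.
sample = [
    "theorem 0.1 0.2 0.3",
    "lemma 0.1 0.2 0.25",
    "the 0.9 -0.4 0.0",
]
vectors = parse_glove_lines(sample)
```

To read an actual model, the same function can be given an open file handle instead of the sample list, since it only iterates over lines.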

Details can be found on the SIGMathLing Resource page.