First Data Sets (1.1 Million scientific HTML5 documents from arXiv and token models)

24 Jan 2018

SIGMathLing has published its first data sets, which also serve as templates for future ones. Their content is licensed to SIGMathLing members for research and tool development purposes, subject to the SIGMathLing Non-Disclosure-Agreement.

This collection of 1.1 million HTML5 documents was developed as part of the arXMLiv project at the KWARC research group. It was created by converting the arXiv collection of scientific preprints, up to August 2017, with LaTeXML, using the CorTeX corpus management system.

The token models are generated from this document collection via the LLaMaPuN and GloVe libraries.
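
For readers who want to inspect such token models, the following is a minimal sketch of parsing embeddings, assuming they are distributed in the plain-text format that GloVe writes by default (one token per line, followed by its space-separated vector components). The helper names and the three sample vectors are made up for illustration and are not taken from the data sets.

```python
import io
import math


def load_glove_vectors(handle):
    """Parse GloVe text format: a token followed by its float components."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical three-dimensional sample; real models have far more dimensions.
sample = io.StringIO(
    "theorem 0.1 0.3 0.5\n"
    "lemma 0.2 0.6 1.0\n"
    "proof -0.5 0.1 0.0\n"
)
vectors = load_glove_vectors(sample)
similarity = cosine_similarity(vectors["theorem"], vectors["lemma"])
```

To read an actual file, the in-memory sample can be replaced with an opened file handle, for example `open(path, encoding="utf-8")`, where `path` points to the downloaded vectors file.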

Details can be found on the SIGMathLing Resource page.

