Part of the arXMLiv project at the KWARC research group
The content of this Dataset is licensed to SIGMathLing members for research and tool development purposes.
Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure Agreement because, for most arXiv articles, the right of distribution was only given (or assumed) to arXiv itself.
token_model.zip

glove.arxmliv.5B.300d.zip and vocab.arxmliv.zip

glove.subsets.zip

| subset | documents | paragraphs | sentences |
|---|---|---|---|
| no_problem | 112,088 | 3,760,015 | 17,684,762 |
| warning | 574,638 | 35,215,866 | 144,166,524 |
| error | 401,644 | 28,555,173 | 111,798,273 |
| complete | 1,088,370 | 67,531,054 | 273,649,559 |

| subset | words | formulas | inline cite | numeric literals |
|---|---|---|---|---|
| no_problem | 355,253,671 | 17,020,161 | 2,991,053 | 9,913,009 |
| warning | 2,514,340,590 | 219,167,820 | 20,163,304 | 65,294,846 |
| error | 1,946,207,151 | 169,247,016 | 14,458,082 | 51,730,645 |
| complete | 4,815,801,412 | 405,434,997 | 37,612,439 | 126,938,500 |

| subset | tokens | unique words | unique words (freq 5+) |
|---|---|---|---|
| no_problem | 384,951,086 | 490,134 | 170,615 |
| warning | 2,817,734,902 | 1,200,887 | 422,524 |
| error | 2,180,119,361 | 1,889,392 | 518,609 |
| complete | 5,382,805,349 | 2,573,974 | 746,673 |
Please cite the main dataset when using the word embeddings, as they are generated and distributed jointly. Instructions here
The token model is distributed as 3 subsets: no_problem, warning and error. The complete model is derived via:
cat token_model_no_problem.txt \
token_model_warning.txt \
token_model_error.txt > token_model_complete.txt
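Since the complete model is a plain concatenation of the three subsets, the additive counts in the statistics tables above should sum to the `complete` row. The following is a minimal sanity-check sketch using the token counts copied from those tables; the variable names are illustrative only and not part of the dataset tooling.

```python
# Token counts per subset, copied from the statistics table above.
subset_tokens = {
    "no_problem": 384_951_086,
    "warning": 2_817_734_902,
    "error": 2_180_119_361,
}
complete_tokens = 5_382_805_349

# Unique-word counts per subset, from the same table.
subset_unique_words = {
    "no_problem": 490_134,
    "warning": 1_200_887,
    "error": 1_889_392,
}
complete_unique_words = 2_573_974

# Token counts are additive under concatenation.
total_tokens = sum(subset_tokens.values())
tokens_match = total_tokens == complete_tokens

# Unique-word counts are not additive, because the subset vocabularies overlap.
unique_sum = sum(subset_unique_words.values())
vocab_overlaps = unique_sum > complete_unique_words

print("token counts add up:", tokens_match)
print("subset vocabularies overlap:", vocab_overlaps)
```

The same check applies to the documents, paragraphs, sentences, words, formulas, inline cite and numeric literals columns.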
corpus_token_model example used for token model extraction
mathformula token

citationelement token

NUM token

Evaluation note: These built-in evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline against the non-expert tasks in the default GloVe suite. One would need a set of test cases tailored to scientific discourse to evaluate the arXiv-based models competitively.
In a cloned GloVe repository, start via:
python eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt
abelian is to group as disjoint is to ?
union, cosine distance 0.644784

convex is to concave as positive is to ?
negative, cosine distance 0.802866

finite is to infinite as abelian is to ?
nonabelian, cosine distance 0.664235

quantum is to classical as bottom is to ?
top, cosine distance 0.719843

eq is to proves as figure is to ?
shows, cosine distance 0.674743

In a cloned GloVe repository, start via:
python eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt
lattice
Word: lattice Position in vocabulary: 311
Word Cosine distance
-----------------------------------------------------
lattices 0.811057
honeycomb 0.657262
finite 0.625146
triangular 0.608218
spacing 0.605435
entanglement
Word: entanglement Position in vocabulary: 1293
Word Cosine distance
-----------------------------------------------------
entangled 0.763964
multipartite 0.730231
fidelity 0.653443
concurrence 0.652454
environemtnal 0.646705
negativity 0.646165
quantum 0.639032
discord 0.624222
nonlocality 0.610661
tripartite 0.609896
Word: forgetful Position in vocabulary: 10697
Word Cosine distance
-----------------------------------------------------
functor 0.723019
functors 0.653969
morphism 0.626222
Word: eigenvalue Position in vocabulary: 1212
Word Cosine distance
-----------------------------------------------------
eigenvalues 0.878527
eigenvector 0.766371
eigenfunction 0.761923
eigenvectors 0.747451
eigenfunctions 0.707346
eigenspace 0.661539
corresponding 0.629746
laplacian 0.627187
operator 0.627130
eigen 0.620933
Word: riemannian Position in vocabulary: 2026
Word Cosine distance
-----------------------------------------------------
manifold 0.766196
manifolds 0.745785
metric 0.714120
curvature 0.672975
metrics 0.670006
finsler 0.665079
ricci 0.657058
euclidean 0.650198
endowed 0.626307
riemmanian 0.621626
riemanian 0.618022
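
For readers who want to reproduce queries like the ones above without the GloVe scripts, here is a minimal pure-Python sketch of nearest-neighbour and analogy lookups. It assumes the vectors file uses the usual GloVe text layout (one word per line followed by space-separated floats). The function names and the tiny 3-dimensional `toy` vocabulary are hypothetical and only there to make the example runnable; this is not the code of `distance.py` or `word_analogy.py`. Note that the scores in the listings above are highest for the closest words, so they behave like cosine similarities even though they are labelled "cosine distance".

```python
import math


def load_glove_text(path, limit=None):
    """Read vectors from a GloVe-style text file (assumed layout: word, then floats)."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for i, line in enumerate(handle):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(u, v):
    """Cosine similarity: higher means closer."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def nearest(vectors, query_vec, exclude=(), top_n=5):
    """Return the top_n (word, similarity) pairs closest to query_vec."""
    scored = [
        (cosine(query_vec, vec), word)
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:top_n]]


def analogy(vectors, a, b, c, top_n=1):
    """Answer 'a is to b as c is to ?' via the offset b - a + c on unit vectors."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    target = [y - x + z for x, y, z in zip(va, vb, vc)]
    return nearest(vectors, target, exclude={a, b, c}, top_n=top_n)


# Hypothetical toy vocabulary; the real vectors are 300-dimensional.
toy = {
    "finite": [1.0, 0.0, 0.0],
    "infinite": [1.0, 1.0, 0.0],
    "abelian": [0.0, 0.0, 1.0],
    "nonabelian": [0.0, 1.0, 1.0],
    "lattice": [0.5, 0.0, 0.5],
}

print(analogy(toy, "finite", "infinite", "abelian"))
print(nearest(toy, toy["nonabelian"], exclude={"nonabelian"}, top_n=2))

# With the real model (large file; `limit` keeps only the first lines read):
# vectors = load_glove_text("glove.arxmliv.5B.300d.txt", limit=100000)
# print(nearest(vectors, vectors["lattice"], exclude={"lattice"}))
```

A linear scan in pure Python is slow on the full vocabulary; it is meant to show the computation, and a vectorized library would be the practical choice for bulk queries.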