Part of the arXMLiv project at the KWARC research group
The content of this Dataset is licensed to SIGMathLing members for research and tool development purposes.
Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement because, for most arXiv articles, the right of distribution was only given (or assumed) to arXiv itself.
token_model.zip
glove.arxmliv.5B.300d.zip and vocab.arxmliv.zip
glove.subsets.zip
subset | documents | paragraphs | sentences |
---|---|---|---|
no_problem | 112,088 | 3,760,015 | 17,684,762 |
warning | 574,638 | 35,215,866 | 144,166,524 |
error | 401,644 | 28,555,173 | 111,798,273 |
complete | 1,088,370 | 67,531,054 | 273,649,559 |
subset | words | formulas | inline cite | numeric literals |
---|---|---|---|---|
no_problem | 355,253,671 | 17,020,161 | 2,991,053 | 9,913,009 |
warning | 2,514,340,590 | 219,167,820 | 20,163,304 | 65,294,846 |
error | 1,946,207,151 | 169,247,016 | 14,458,082 | 51,730,645 |
complete | 4,815,801,412 | 405,434,997 | 37,612,439 | 126,938,500 |
subset | tokens | unique words | unique words (freq 5+ ) |
---|---|---|---|
no_problem | 384,951,086 | 490,134 | 170,615 |
warning | 2,817,734,902 | 1,200,887 | 422,524 |
error | 2,180,119,361 | 1,889,392 | 518,609 |
complete | 5,382,805,349 | 2,573,974 | 746,673 |
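As an illustration of how the three columns above relate to one another, the following minimal sketch counts total tokens, unique words, and unique words occurring at least 5 times. It assumes the token model is plain text with whitespace-separated tokens; the function name `vocabulary_stats` and the sample sentences are hypothetical, and this is not the tooling that produced the published figures.

```python
from collections import Counter
from typing import Dict, Iterable


def vocabulary_stats(lines: Iterable[str], min_freq: int = 5) -> Dict[str, int]:
    """Count tokens, unique words, and unique words seen at least min_freq times."""
    counts: Counter = Counter()
    for line in lines:
        # Assumption: tokens are separated by whitespace.
        counts.update(line.split())
    return {
        "tokens": sum(counts.values()),
        "unique_words": len(counts),
        "unique_words_min_freq": sum(1 for c in counts.values() if c >= min_freq),
    }


# Tiny in-memory sample, invented purely for illustration.
sample = ["the lattice is finite", "the the the the lattice"]
print(vocabulary_stats(sample, min_freq=5))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large token model would be streamed line by line rather than loaded into memory at once.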
Please cite the main dataset when using the word embeddings, as they are generated and distributed jointly. Instructions here
The token model is distributed as 3 subsets: no_problem, warning and error. The complete model is derived via:
cat token_model_no_problem.txt \
token_model_warning.txt \
token_model_error.txt > token_model_complete.txt
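Where `cat` is not available, the same concatenation can be sketched in Python. This is a minimal equivalent under the assumption that the three subset files sit in the working directory; the helper name `concatenate_files` is hypothetical.

```python
import shutil
from typing import Iterable


def concatenate_files(parts: Iterable[str], target: str) -> None:
    """Append each file in `parts`, in order, to a newly created `target` file."""
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in chunks so large files are never fully held in memory.
                shutil.copyfileobj(src, out)


# Intended usage, mirroring the shell command:
# concatenate_files(
#     ["token_model_no_problem.txt", "token_model_warning.txt", "token_model_error.txt"],
#     "token_model_complete.txt",
# )
```

Binary mode is used so the bytes are copied unchanged, which matches what `cat` does.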
corpus_token_model example used for token model extraction
mathformula token
citationelement token
NUM token
Evaluation note: These built-in evaluation runs are provided as a sanity check that the generated GloVe models pass a basic baseline against the non-expert tasks in the default GloVe suite. A set of test cases tailored to scientific discourse would be needed to evaluate the arXiv-based models competitively.
In a cloned GloVe repository, start via:
python eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt
abelian is to group as disjoint is to ? union, cosine distance 0.644784
convex is to concave as positive is to ? negative, cosine distance 0.802866
finite is to infinite as abelian is to ? nonabelian, cosine distance 0.664235
quantum is to classical as bottom is to ? top, cosine distance 0.719843
eq is to proves as figure is to ? shows, cosine distance 0.674743
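The analogy queries above follow the usual vector-offset pattern: for "a is to b as c is to ?", candidates are ranked by cosine similarity to the vector b - a + c, with the three query words excluded. The sketch below illustrates that idea in plain Python. It is not the GloVe evaluation script, and the function names and toy vectors are invented for illustration only.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def analogy(
    vectors: Dict[str, Vector], a: str, b: str, c: str, top_n: int = 1
) -> List[Tuple[str, float]]:
    """Rank answers to 'a is to b as c is to ?' by similarity to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    scored = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)  # the query words themselves are not valid answers
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy 3-dimensional vectors, invented purely for illustration.
toy = {
    "finite": [1.0, 0.0, 0.0],
    "infinite": [1.0, 1.0, 0.0],
    "abelian": [0.0, 0.0, 1.0],
    "nonabelian": [0.0, 1.0, 1.0],
    "group": [0.5, 0.0, 0.9],
}
print(analogy(toy, "finite", "infinite", "abelian"))
```

Although the listings label the score "cosine distance", higher values indicate closer matches, so the sketch treats it as a similarity and sorts in descending order.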
In a cloned GloVe repository, start via:
python eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.5B.300d.txt
lattice
Word: lattice Position in vocabulary: 311
Word Cosine distance
-----------------------------------------------------
lattices 0.811057
honeycomb 0.657262
finite 0.625146
triangular 0.608218
spacing 0.605435
entanglement
Word: entanglement Position in vocabulary: 1293
Word Cosine distance
-----------------------------------------------------
entangled 0.763964
multipartite 0.730231
fidelity 0.653443
concurrence 0.652454
environemtnal 0.646705
negativity 0.646165
quantum 0.639032
discord 0.624222
nonlocality 0.610661
tripartite 0.609896
Word: forgetful Position in vocabulary: 10697
Word Cosine distance
-----------------------------------------------------
functor 0.723019
functors 0.653969
morphism 0.626222
Word: eigenvalue Position in vocabulary: 1212
Word Cosine distance
-----------------------------------------------------
eigenvalues 0.878527
eigenvector 0.766371
eigenfunction 0.761923
eigenvectors 0.747451
eigenfunctions 0.707346
eigenspace 0.661539
corresponding 0.629746
laplacian 0.627187
operator 0.627130
eigen 0.620933
Word: riemannian Position in vocabulary: 2026
Word Cosine distance
-----------------------------------------------------
manifold 0.766196
manifolds 0.745785
metric 0.714120
curvature 0.672975
metrics 0.670006
finsler 0.665079
ricci 0.657058
euclidean 0.650198
endowed 0.626307
riemmanian 0.621626
riemanian 0.618022
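Neighbour listings like the ones above can also be reproduced outside the GloVe repository. The following minimal sketch assumes the standard GloVe text format (one word per line followed by its space-separated vector components); the function names `parse_glove` and `nearest` are hypothetical, and the three sample lines are invented rather than taken from the real model.

```python
import math
from typing import Dict, Iterable, List, Tuple


def parse_glove(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse lines of the form 'word v1 v2 ... vN' into a word-to-vector mapping."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def nearest(vectors: Dict[str, List[float]], word: str, k: int = 10) -> List[Tuple[str, float]]:
    """Return the k words most similar to `word`, most similar first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Invented 2-dimensional sample in GloVe text format.
sample_lines = ["lattice 1.0 0.0", "lattices 0.9 0.1", "functor 0.0 1.0"]
for neighbour, score in nearest(parse_glove(sample_lines), "lattice", k=2):
    print(f"{neighbour} {score:.6f}")
```

With the distributed model, an open handle on `glove.arxmliv.5B.300d.txt` could be passed to `parse_glove`. Plain Python lists are memory-hungry at that scale, so an array library would likely be preferable for the full vocabulary.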