Part of the arXMLiv project at the KWARC research group
The content of this Dataset is licensed to SIGMathLing members for research and tool development purposes.
Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure Agreement because, for most arXiv articles, the right of distribution was only given (or assumed) to arXiv itself.
token_model.zip, glove.arxmliv.15B.300d.zip and vocab.arxmliv.zip
subset | documents | paragraphs |
---|---|---|
no_problem | 150,701 | 6,071,920 |
warning_1 | 500,000 | 36,130,694 |
warning_2 | 328,127 | 24,285,351 |
error | 395,711 | 31,155,136 |
complete | 1,374,539 | 97,643,101 |
subset | words | formulas | inline cite |
---|---|---|---|
no_problem | 619,051,536 | 25,210,637 | 4,248,840 |
warning_1 | 2,917,283,935 | 212,113,899 | 18,553,611 |
warning_2 | 1,937,516,458 | 140,094,708 | 12,590,335 |
error | 2,307,007,544 | 163,290,748 | 14,200,445 |
complete | 7,780,859,473 | 540,709,992 | 49,593,231 |
subset | tokens | unique words | unique words (freq 5+) |
---|---|---|---|
complete | 15,192,564,807 | 2,782,667 | 989,136 |
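As a quick sanity check on the statistics above, the per-subset counts can be summed and compared against the `complete` row. The sketch below simply copies the numbers from the tables; the dictionary layout and the name `column_totals` are illustrative assumptions, not part of the dataset tooling.

```python
# Per-subset counts copied from the tables above.
subsets = {
    "no_problem": {"documents": 150_701, "paragraphs": 6_071_920,
                   "words": 619_051_536, "formulas": 25_210_637,
                   "inline_cite": 4_248_840},
    "warning_1": {"documents": 500_000, "paragraphs": 36_130_694,
                  "words": 2_917_283_935, "formulas": 212_113_899,
                  "inline_cite": 18_553_611},
    "warning_2": {"documents": 328_127, "paragraphs": 24_285_351,
                  "words": 1_937_516_458, "formulas": 140_094_708,
                  "inline_cite": 12_590_335},
    "error": {"documents": 395_711, "paragraphs": 31_155_136,
              "words": 2_307_007_544, "formulas": 163_290_748,
              "inline_cite": 14_200_445},
}

# The "complete" row as reported in the tables.
complete = {"documents": 1_374_539, "paragraphs": 97_643_101,
            "words": 7_780_859_473, "formulas": 540_709_992,
            "inline_cite": 49_593_231}


def column_totals(rows):
    """Sum every column over all subsets."""
    totals = {}
    for counts in rows.values():
        for column, value in counts.items():
            totals[column] = totals.get(column, 0) + value
    return totals


totals = column_totals(subsets)
for column, value in totals.items():
    status = "ok" if value == complete[column] else "MISMATCH"
    print(f"{column:12s} {value:>15,d} {status}")
```

Every column of the four subsets adds up to the reported `complete` figure.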
Please cite the main dataset when using the word embeddings, as they are generated and distributed jointly. Instructions here
The token model is distributed as 4 subsets: no_problem, warning_1, warning_2 and error. The complete model is derived via:
cat token_model_no_problem.txt \
token_model_warning_1.txt token_model_warning_2.txt \
token_model_error.txt > token_model_complete.txt
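Where `cat` is not available, the same concatenation can be sketched in Python. This is a minimal equivalent that streams each part in chunks rather than loading it into memory; the function name `concat_token_models` is an illustrative assumption, and the file names follow the command above.

```python
import shutil

SUBSETS = ["no_problem", "warning_1", "warning_2", "error"]


def concat_token_models(part_paths, out_path):
    """Append each part to out_path in the given order, like `cat a b > out`."""
    with open(out_path, "wb") as out:
        for path in part_paths:
            with open(path, "rb") as part:
                # copyfileobj streams in fixed-size chunks, so large
                # token model files never have to fit in memory.
                shutil.copyfileobj(part, out)


# Usage with the file names from the command above (not executed here):
# concat_token_models([f"token_model_{name}.txt" for name in SUBSETS],
#                     "token_model_complete.txt")
```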
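The corpus_token_model example was used for token model extraction:
- paragraphs with errors (marked via the ltx_ERROR HTML class) were excluded; paragraphs were also excluded when words over 25 characters were encountered
- inline citations are replaced with a citationelement token
- numbers are replaced with a NUM token (both in text and formulas)
- references are replaced with a ref token (e.g. Figure ref)
- … 's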
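To illustrate two of the conventions above (the NUM token and the 25-character word limit), here is a small Python sketch. It is not the corpus_token_model code: the function name `normalize_paragraph` and the regular expression deciding what counts as a number are assumptions made purely for illustration.

```python
import re

MAX_WORD_LENGTH = 25  # threshold named in the text above

# Assumption: a "number" is an optionally signed run of digits with
# optional . or , separated digit groups (e.g. 42, 3.14, 1,000).
NUMERIC_LITERAL = re.compile(r"^[+-]?\d+(?:[.,]\d+)*$")


def normalize_paragraph(words):
    """Return the normalized word list, or None if the paragraph is excluded."""
    # Exclude the whole paragraph when any word is longer than 25 characters.
    if any(len(word) > MAX_WORD_LENGTH for word in words):
        return None
    # Replace every numeric literal with the NUM token.
    return ["NUM" if NUMERIC_LITERAL.match(word) else word for word in words]


print(normalize_paragraph(["see", "Figure", "ref", "with", "3.14", "and", "42"]))
```

Citation and reference replacement are left out because they depend on document markup rather than on the surface form of a word.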
In a cloned GloVe repository, start via:
python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.15B.300d.txt
abelian is to group as disjoint is to ? union, cosine distance 0.649029
convex is to concave as positive is to ? negative, cosine distance 0.812031
finite is to infinite as abelian is to ? nonabelian, cosine distance 0.689419
quantum is to classical as bottom is to ? middle, cosine distance 0.770132; top, cosine distance 0.758245
eq is to proves as figure is to ? shows, cosine distance 0.675003
italic_x is to italic_y as italic_a is to ? italic_b, cosine distance 0.912827
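Analogy queries of this kind are commonly answered with the vector-offset method: for "a is to b as c is to ?", rank all other words by cosine similarity to vec(b) - vec(a) + vec(c). The sketch below shows that idea on hand-made toy vectors; it is not the GloVe script itself, the names `cosine` and `analogy` are illustrative, and the toy values are invented. Note that the scores listed above behave like similarities (higher means closer) despite the "cosine distance" label.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, a, b, c, topn=1):
    """Rank words by similarity to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word not in (a, b, c)]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:topn]]


# Toy 3-dimensional vectors, invented for illustration only.
toy = {
    "convex": [1.0, 0.0, 1.0],
    "concave": [1.0, 0.0, -1.0],
    "positive": [0.0, 1.0, 1.0],
    "negative": [0.0, 1.0, -1.0],
    "lattice": [1.0, 1.0, 0.0],
}

# Ranks "negative" first for: convex is to concave as positive is to ?
print(analogy(toy, "convex", "concave", "positive"))
```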
In a cloned GloVe repository, start via:
python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.15B.300d.txt
Word: lattice Position in vocabulary: 515
Word Cosine distance
---------------------------------------------------------
lattices 0.860839
finite 0.662110
honeycomb 0.657155
crystal 0.635061
triangular 0.632298
spacing 0.619840
square 0.613936
sublattice 0.612161
hexagonal 0.606321
hypercubic 0.606101
latter 0.602747
symmetry 0.601192
cubic 0.601035
Word: entanglement Position in vocabulary: 1605
Word Cosine distance
---------------------------------------------------------
entangled 0.795067
multipartite 0.745711
concurrence 0.708164
negativity 0.695089
quantum 0.666254
tripartite 0.653771
fidelity 0.651990
teleportation 0.639430
nonlocality 0.626717
discord 0.622995
qubit 0.622836
bipartite 0.614907
qubits 0.613029
entropy 0.612276
Word: forgetful Position in vocabulary: 12229
Word Cosine distance
---------------------------------------------------------
functor 0.731004
functors 0.667090
morphisms 0.605955
morphism 0.604947
Word: eigenvalue Position in vocabulary: 1527
Word Cosine distance
---------------------------------------------------------
eigenvalues 0.894346
eigenvector 0.775584
eigenfunction 0.772961
eigenvectors 0.762914
eigenfunctions 0.700270
eigenspace 0.686408
eigen 0.664881
laplacian 0.646244
eigenstate 0.629338
eigenmode 0.626229
largest 0.620355
matrix 0.618085
eigenmodes 0.605928
operator 0.602806
smallest 0.600443
Word: riemannian Position in vocabulary: 2428
Word Cosine distance
---------------------------------------------------------
manifolds 0.780788
manifold 0.771704
metric 0.725227
finsler 0.686441
ricci 0.678393
curvature 0.677207
metrics 0.660825
euclidean 0.659125
noncompact 0.647109
conformally 0.643647
riemmanian 0.641671
submanifold 0.632707
kahler 0.623857
geodesic 0.621973
submanifolds 0.617170
endowed 0.616036
riemanian 0.608523
hyperbolic 0.603709
submersion 0.600120
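Neighbour lists like the ones above can also be reproduced outside the GloVe scripts. The sketch below assumes the vectors file uses the usual GloVe text layout (one word per line followed by its space-separated components); the function names `load_glove_text` and `nearest` are illustrative, and the three sample lines are invented toy data.

```python
import heapq
import math


def load_glove_text(lines):
    """Parse lines of the form 'word v1 v2 ...' into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def nearest(vectors, word, topn=10):
    """Return the topn (word, cosine similarity) pairs closest to `word`."""
    query = vectors[word]
    qnorm = math.sqrt(sum(x * x for x in query))

    def similarity(vec):
        norm = qnorm * math.sqrt(sum(x * x for x in vec))
        return sum(a * b for a, b in zip(query, vec)) / norm if norm else 0.0

    candidates = ((similarity(vec), other)
                  for other, vec in vectors.items() if other != word)
    return [(other, score) for score, other in heapq.nlargest(topn, candidates)]


# Invented toy data in the assumed file layout.
sample = ["lattice 1.0 0.0", "lattices 0.9 0.1", "functor 0.0 1.0"]
toy_vectors = load_glove_text(sample)
for other, score in nearest(toy_vectors, "lattice", topn=2):
    print(f"{other} {score:.6f}")

# For the real file (not executed here):
# with open("glove.arxmliv.15B.300d.txt", encoding="utf-8") as f:
#     vectors = load_glove_text(f)
```

Plain Python lists keep the sketch dependency-free, but for the full 300-dimensional vocabulary an array library would likely be needed to keep memory use and query time practical.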