arXMLiv 08.2019 - Word Embeddings; Token Model

Part of the arXMLiv project at the KWARC research group

Author

Release

Accessibility and License

The content of this Dataset is licensed to SIGMathLing members for research and tool development purposes.

Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement because, for most arXiv articles, the right of distribution was only given (or assumed) to arXiv itself.

Contents

Token Model Statistics

subset        documents    paragraphs
no_problem      150,701     6,071,920
warning_1       500,000    36,130,694
warning_2       328,127    24,285,351
error           395,711    31,155,136
complete      1,374,539    97,643,101

subset                words       formulas    inline cite
no_problem      619,051,536     25,210,637      4,248,840
warning_1     2,917,283,935    212,113,899     18,553,611
warning_2     1,937,516,458    140,094,708     12,590,335
error         2,307,007,544    163,290,748     14,200,445
complete      7,780,859,473    540,709,992     49,593,231

GloVe Model Statistics

subset              tokens    unique words    unique words (freq 5+)
complete    15,192,564,807       2,782,667                   989,136

Citing this Resource

Please cite the main dataset when using the word embeddings, as they are generated and distributed jointly. Instructions here

Download

Generated via

Generation Parameters

Examples and baselines

GloVe in-built evaluation (non-expert tasks e.g. language, relationships, geography)

  1. NEW; 2019 model
    • Total accuracy: 37.76% (7017/18322)
    • Highest score: “gram3-comparative.txt”, 77.33% (1047/1332)
  2. 2018 GloVe embeddings
    • Total accuracy: 35.48% (6298/17750)
    • Highest score: “gram3-comparative.txt”, 76.65% (1021/1332)
  3. demo baseline: text8 demo (first 100M characters of Wikipedia)
    • Total accuracy: 23.62% (4211/17827)
    • Highest score: “gram6-nationality-adjective.txt”, 58.65% (892/1521)

Measuring word analogy

In a cloned GloVe repository, start via:

python2 eval/python/word_analogy.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.15B.300d.txt
  1. abelian is to group as disjoint is to ?
    • Top hit: union, cosine distance 0.649029
  2. convex is to concave as positive is to ?
    • Top hit: negative, cosine distance 0.812031
  3. finite is to infinite as abelian is to ?
    • Top hit: nonabelian, cosine distance 0.689419
  4. quantum is to classical as bottom is to ?
    • Top hit: middle, cosine distance 0.770132
    • Close second: top, cosine distance 0.758245
  5. eq is to proves as figure is to ?
    • Top hit: shows, cosine distance 0.675003
  6. italic_x is to italic_y as italic_a is to ?
    • Top hit: italic_b, cosine distance 0.912827

Nearest word vectors

In a cloned GloVe repository, start via:

python2 eval/python/distance.py --vocab_file vocab.arxmliv.txt --vectors_file glove.arxmliv.15B.300d.txt
  1. lattice
     Word: lattice  Position in vocabulary: 515
                 Word    Cosine distance
     -----------------------------------
             lattices    0.860839
               finite    0.662110
            honeycomb    0.657155
              crystal    0.635061
           triangular    0.632298
              spacing    0.619840
               square    0.613936
           sublattice    0.612161
            hexagonal    0.606321
           hypercubic    0.606101
               latter    0.602747
             symmetry    0.601192
                cubic    0.601035
  2. entanglement
     Word: entanglement  Position in vocabulary: 1605
                 Word    Cosine distance
     -----------------------------------
            entangled    0.795067
         multipartite    0.745711
          concurrence    0.708164
           negativity    0.695089
              quantum    0.666254
           tripartite    0.653771
             fidelity    0.651990
        teleportation    0.639430
          nonlocality    0.626717
              discord    0.622995
                qubit    0.622836
            bipartite    0.614907
               qubits    0.613029
              entropy    0.612276
  3. forgetful
     Word: forgetful  Position in vocabulary: 12229
                 Word    Cosine distance
     -----------------------------------
              functor    0.731004
             functors    0.667090
            morphisms    0.605955
             morphism    0.604947
  4. eigenvalue
     Word: eigenvalue  Position in vocabulary: 1527
                 Word    Cosine distance
     -----------------------------------
          eigenvalues    0.894346
          eigenvector    0.775584
        eigenfunction    0.772961
         eigenvectors    0.762914
       eigenfunctions    0.700270
           eigenspace    0.686408
                eigen    0.664881
            laplacian    0.646244
           eigenstate    0.629338
            eigenmode    0.626229
              largest    0.620355
               matrix    0.618085
           eigenmodes    0.605928
             operator    0.602806
             smallest    0.600443
  5. riemannian
     Word: riemannian  Position in vocabulary: 2428
                 Word    Cosine distance
     -----------------------------------
            manifolds    0.780788
             manifold    0.771704
               metric    0.725227
              finsler    0.686441
                ricci    0.678393
            curvature    0.677207
              metrics    0.660825
            euclidean    0.659125
           noncompact    0.647109
          conformally    0.643647
           riemmanian    0.641671
          submanifold    0.632707
               kahler    0.623857
             geodesic    0.621973
         submanifolds    0.617170
              endowed    0.616036
            riemanian    0.608523
           hyperbolic    0.603709
           submersion    0.600120
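
For readers without a GloVe checkout, nearest-neighbour listings like the ones above can be approximated directly from the vectors file. The following is a minimal Python 3 sketch, assuming the file uses GloVe's plain-text layout (one token per line, followed by its space-separated vector components); `load_glove_text`, `nearest` and the `limit` parameter are hypothetical helper names, not part of the GloVe distribution.

```python
import numpy as np


def load_glove_text(path, limit=None):
    """Read a GloVe-style text file: token, then its floats, per line.

    `limit` (a convenience assumed here) stops after that many lines
    to cap memory use on large models.
    """
    words, rows = [], []
    with open(path, encoding="utf-8") as handle:
        for i, line in enumerate(handle):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip("\n").split(" ")
            words.append(parts[0])
            rows.append(np.array([float(x) for x in parts[1:]], dtype=np.float32))
    return words, np.vstack(rows)


def nearest(words, vectors, query, topn=10):
    """Return the topn words most cosine-similar to `query`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = words.index(query)
    scores = unit @ unit[q]
    scores[q] = -np.inf  # never return the query word itself
    order = np.argsort(-scores)[:topn]
    return [(words[i], float(scores[i])) for i in order]


# Usage (requires the downloaded vectors file):
# words, vectors = load_glove_text("glove.arxmliv.15B.300d.txt", limit=200000)
# for word, score in nearest(words, vectors, "lattice"):
#     print(f"{word:>20}    {score:.6f}")
```

The scores are cosine similarities, matching the column labelled "Cosine distance" in the listings, so higher values indicate closer neighbours.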