SIGMathLing - Technical Concerns

Recall that SIGMathLing maintains a bouquet of services; here we air some technical concerns and ideas.

Resource Repositories

We have a SIGMathLing group on the GitLab server gl.kwarc.info, where we will start creating repositories. This lets us use Git permissions for access control and the GitLab permission UI for management. We estimate that for the first two years SIGMathLing will have fewer than 25 members (which keeps traffic low) and less than 5 TB of data sets. gl.kwarc.info should be able to serve that, given that most data sets will be served via Git LFS. Should space or traffic become more than the KWARC servers can handle, we will try to raise money for a more scalable solution.

We will also have a close look at Zenodo and see whether we can delegate hosting to them.

Standardizing Datasets and Resources

We will need to develop standards for representing, classifying, describing, and citing data sets and resources.

  1. Representation: file formats, repository layout, data models
  2. Classification/description: is the dataset
    • a corpus (raw, processed, …),
    • a set of annotations to a corpus,
    • manually/automatically created, and by which process/system?
    • an evaluation data set (gold standard)?
    • what is the quality (e.g. f-measure)?
    • what is the license?
  3. Citation: the idea is to have a “landing page” per resource that addresses all the points in 1. and 2. and names the authors that can be cited. The landing page should also offer pre-made BibTeX (and possibly EndNote) entries to make citation easier.
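
To make points 1. to 3. more concrete, here is a minimal sketch of what a per-resource landing-page record and its pre-made BibTeX entry might look like. Nothing here is a decided standard: the class ResourceRecord, all of its field names, the function to_bibtex, and the choice of a @misc entry are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResourceRecord:
    """Hypothetical landing-page record; every field name is an assumption."""
    key: str                            # citation key for the BibTeX entry
    title: str
    authors: List[str]                  # the authors that can be cited
    year: int
    url: str                            # address of the landing page
    kind: str                           # "corpus", "annotations", or "evaluation"
    creation: str                       # "manual" or "automatic"
    created_by: Optional[str] = None    # process/system, if created automatically
    f_measure: Optional[float] = None   # quality, where one has been measured
    license: str = "unspecified"
    file_formats: List[str] = field(default_factory=list)


def to_bibtex(rec: ResourceRecord) -> str:
    """Render a @misc BibTeX entry for the given record."""
    fields = [
        ("title", "{" + rec.title + "}"),   # extra braces preserve capitalization
        ("author", " and ".join(rec.authors)),
        ("year", str(rec.year)),
        ("howpublished", "\\url{" + rec.url + "}"),
        ("note", f"{rec.kind}, license: {rec.license}"),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{rec.key},\n{body}\n}}"


if __name__ == "__main__":
    # Placeholder values only; this is not a real SIGMathLing resource.
    example = ResourceRecord(
        key="example-corpus",
        title="Example Corpus",
        authors=["A. Author", "B. Author"],
        year=2000,
        url="https://example.org/example-corpus",
        kind="corpus",
        creation="automatic",
        created_by="hypothetical conversion pipeline",
        license="CC0",
    )
    print(to_bibtex(example))
```

Keeping the description in one structured record means the landing page, the BibTeX entry, and any later EndNote export could all be derived from the same source.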

Resource Reference Page

Currently, this is just a manually curated page on the SIGMathLing web site; eventually we will generate it statically from an internal database of resources and/or from information harvested from the repositories. Licensing should be made transparent.
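
As a rough illustration of the static generation step, the following sketch renders a list of resource entries into an HTML fragment and shows the license next to every entry. The function name render_reference_page, the dictionary keys "title", "url", and "license", and the HTML layout are all assumptions; the actual database schema and page design are not yet decided.

```python
import html
from typing import Dict, List


def render_reference_page(resources: List[Dict[str, str]]) -> str:
    """Render a static HTML list of resources, sorted by title."""
    items = []
    for res in sorted(resources, key=lambda r: r["title"].lower()):
        title = html.escape(res["title"])
        url = html.escape(res["url"], quote=True)
        # Show the license with every entry so that licensing stays transparent.
        license_name = html.escape(res.get("license", "unspecified"))
        items.append(
            f'  <li><a href="{url}">{title}</a> (license: {license_name})</li>'
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>\n"


if __name__ == "__main__":
    # Placeholder entries only; these are not real SIGMathLing resources.
    print(render_reference_page([
        {"title": "Example Corpus", "url": "https://example.org/corpus",
         "license": "CC0"},
        {"title": "Example Annotations", "url": "https://example.org/annotations"},
    ]))
```

The same function could be fed either from a database query or from metadata files harvested from the repositories, as long as both produce entries of this shape.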

Suite of Systems and Libraries

Currently, this is just a manually curated page on the SIGMathLing web site; eventually we will generate it statically from an internal database of resources and/or from information harvested from the repositories. Licensing should be made transparent.

Math Analysis Blackboard

MK would like to develop and publish an annotation schema (using the KAT schema as a starting point) and establish a math result triple store that manages all of these. How best to do this technically is still open, but Deyan is quite skeptical.