artifpar - a data set for research artifact parameters

Release

Contents

Download

Description

This is the first release of the (work-in-progress) artifpar dataset.

It contains text segments from parsed arXiv papers with annotated entities and their relations. Specifically, the annotations comprise (1) research artifacts, (2) parameters of those artifacts, and (3) values of those parameters. Annotations are provided in three formats:

The data set in its current form is the result of a first attempt at defining and studying a new information extraction task. The goal of the task is to shed light on how authors of academic literature describe their use of research artifacts, and to develop information extraction models for this purpose.

To define the scope of the task, we propose the following definitions.

With annotation guidelines based on the above definitions, we heuristically pre-filtered arXiv papers and annotated 151 text segments. Basic statistics on the resulting data set are shown below.

Statistics

Future considerations

Based on the lessons learned during the creation and analysis of the data set, as well as on early feedback, we have the following considerations for future versions.

Citing this Resource

BibTeX

@misc{SML:artifpar:2022-1,
  author       = {Saier, Tarek and Asakura, Takuto and F{\"{a}}rber, Michael},
  title        = {{artifpar - a data set for research artifact parameters}},
  url          = {https://sigmathling.kwarc.info/resources/artifact-parameter-dataset/},
  note         = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year         = {2022},
}

Accessibility and License

The content of this dataset is licensed to SIGMathLing members for research and tool development purposes.

Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure Agreement, because for most arXiv articles the right of distribution was granted (or assumed) only to arXiv itself.

Generated from

About

Part of the [KOM,BI] project, with support from KHYS.

Appendix

Data format example:

{
 "context":      "We use the Adam optimizer with a fixed learning rate of \\(10^{-5}\\).",
 "in_doc":       "2012.01573",
 "annotator_id": "tsa",
 "annotation":   {"entity_dict": {"A1": {"entity_group": "a_g1",
                                         "offset": [11, 15],
                                         "surface_text": "Adam",
                                         "type": "a"},
                                  "P1": {"entity_group": "p_g1",
                                         "offset": [39, 52],
                                         "surface_text": "learning rate",
                                         "type": "p"},
                                  "V1": {"entity_group": "v_g1",
                                         "offset": [58, 65],
                                         "surface_text": "10^{-5}",
                                         "type": "v"}},
                  "relation_tuples": [["V1", "P1"],
                                      ["P1", "A1"]]
                  },
 "annotation_w3c_data_model": [{"@context": "http://www.w3.org/ns/anno.jsonld",
                                "type":     "Annotation"
                                ...
                              }],
                              ...
}
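
To illustrate how the annotations can be consumed programmatically, the following is a minimal Python sketch that chains the relation tuples into (artifact, parameter, value) triples. It assumes the segments are distributed as JSON Lines (one object per line) in a hypothetical file named segments.jsonl, and that each relation tuple points from a value to its parameter or from a parameter to its artifact, as in the example above; the actual file names and layout of the release may differ.

import json
from collections import Counter

def load_segments(path):
    # Assumption: one JSON object per line (JSON Lines); adjust if the
    # release uses a different layout.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def artifact_parameter_values(segment):
    # Chain value -> parameter -> artifact relations into triples of
    # surface texts, e.g. ("Adam", "learning rate", "10^{-5}").
    annotation = segment["annotation"]
    entities = annotation["entity_dict"]
    # Assumption: each relation tuple is (source, target), e.g. ["V1", "P1"].
    target_of = {src: tgt for src, tgt in annotation["relation_tuples"]}
    triples = []
    for ent_id, ent in entities.items():
        if ent["type"] != "v":                 # start from value entities
            continue
        param_id = target_of.get(ent_id)       # value -> parameter
        artif_id = target_of.get(param_id)     # parameter -> artifact
        if param_id and artif_id:
            triples.append((entities[artif_id]["surface_text"],
                            entities[param_id]["surface_text"],
                            ent["surface_text"]))
    return triples

if __name__ == "__main__":
    entity_type_counts = Counter()
    for segment in load_segments("segments.jsonl"):  # hypothetical file name
        entities = segment["annotation"]["entity_dict"]
        entity_type_counts.update(e["type"] for e in entities.values())
        for artifact, parameter, value in artifact_parameter_values(segment):
            print(f"{segment['in_doc']}: {artifact} | {parameter} | {value}")
    print(entity_type_counts)                  # counts of types a, p, v

For the segment shown above, this yields a single triple and prints 2012.01573: Adam | learning rate | 10^{-5}.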