artifpar - a data set for research artifact parameters

Release

Contents

Download

Description

This is the first release of the (work-in-progress) artifpar dataset.

It contains text segments from parsed arXiv papers with annotated entities and their relations. Specifically, the annotations comprise (1) research artifacts, (2) parameters of those artifacts, and (3) values of those parameters. Annotations are provided in three formats:

The data set in its current form is the result of a first attempt at defining and studying a new information extraction task. The goal of the task is to shed light on how authors of academic literature describe their use of research artifacts, and to develop information extraction models for this purpose.

To define the scope of the task, we propose the following definitions.

With annotation guidelines based on the above definitions, we heuristically pre-filtered arXiv papers and annotated 151 text segments. Basic statistics on the resulting data set are shown below.

Statistics

Future considerations

Based on the lessons learned during the creation and analysis of the data set, as well as on early feedback, we have the following considerations for future versions.

Citing this Resource

BibTeX

@misc{SML:artifpar:2022-1,
  author       = {Saier, Tarek and Asakura, Takuto and F{\"{a}}rber, Michael},
  title        = {{artifpar - a data set for research artifact parameters}},
  url          = {https://sigmathling.kwarc.info/resources/artifact-parameter-dataset/},
  note         = {SIGMathLing -- Special Interest Group on Math Linguistics},
  year         = {2022},
}

Accessibility and License

The content of this dataset is licensed to SIGMathLing members for research and tool development purposes.

Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure Agreement, because for most arXiv articles the right of distribution was granted (or assumed) only to arXiv itself.

Generated from

About

Part of the [KOM,BI] project, with support from KHYS.

Appendix

Data format example:

{
 "context":      "We use the Adam optimizer with a fixed learning rate of \\(10^{-5}\\).",
 "in_doc":       "2012.01573",
 "annotator_id": "tsa",
 "annotation":   {"entity_dict": {"A1": {"entity_group": "a_g1",
                                         "offset": [11, 15],
                                         "surface_text": "Adam",
                                         "type": "a"},
                                  "P1": {"entity_group": "p_g1",
                                         "offset": [39, 52],
                                         "surface_text": "learning rate",
                                         "type": "p"},
                                  "V1": {"entity_group": "v_g1",
                                         "offset": [58, 65],
                                         "surface_text": "10^{-5}",
                                         "type": "v"}},
                  "relation_tuples": [["V1", "P1"],
                                      ["P1", "A1"]]
                  },
 "annotation_w3c_data_model": [{"@context": "http://www.w3.org/ns/anno.jsonld",
                                "type":     "Annotation"
                                ...
                              }],
                              ...
}
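
To illustrate how the annotations can be consumed programmatically, the following is a minimal Python sketch that chains the relation tuples into (artifact, parameter, value) triples. It assumes the segments are distributed as JSON Lines (one object per line) in a hypothetical file named segments.jsonl, and that each relation tuple points from a value to its parameter or from a parameter to its artifact, as in the example above; the actual file names and layout of the release may differ.

import json
from collections import Counter

def load_segments(path):
    # Assumption: one JSON object per line (JSON Lines); adjust if the
    # release uses a different layout.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def artifact_parameter_values(segment):
    # Chain value -> parameter -> artifact relations into triples of
    # surface texts, e.g. ("Adam", "learning rate", "10^{-5}").
    annotation = segment["annotation"]
    entities = annotation["entity_dict"]
    # Assumption: each relation tuple is (source, target), e.g. ["V1", "P1"].
    target_of = {src: tgt for src, tgt in annotation["relation_tuples"]}
    triples = []
    for ent_id, ent in entities.items():
        if ent["type"] != "v":                 # start from value entities
            continue
        param_id = target_of.get(ent_id)       # value -> parameter
        artif_id = target_of.get(param_id)     # parameter -> artifact
        if param_id and artif_id:
            triples.append((entities[artif_id]["surface_text"],
                            entities[param_id]["surface_text"],
                            ent["surface_text"]))
    return triples

if __name__ == "__main__":
    entity_type_counts = Counter()
    for segment in load_segments("segments.jsonl"):  # hypothetical file name
        entities = segment["annotation"]["entity_dict"]
        entity_type_counts.update(e["type"] for e in entities.values())
        for artifact, parameter, value in artifact_parameter_values(segment):
            print(f"{segment['in_doc']}: {artifact} | {parameter} | {value}")
    print(entity_type_counts)                  # counts of types a, p, v

For the segment shown above, this yields a single triple and prints 2012.01573: Adam | learning rate | 10^{-5}.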