annotated_text_segments.json
annotated_text_segments-semeval2022_task12_format.json
doc/
This is the first release of the (work-in-progress) artifpar dataset.
It contains text segments from parsed arXiv papers with annotated entities and their relations. Specifically, the annotations comprise (1) research artifacts, (2) parameters of those artifacts, and (3) values of those parameters. Annotations are provided in three formats:
The data set in its current form is the result of a first attempt at defining and studying a new information extraction task. The goal of the task is to shed light on how authors of academic literature describe their usage of research artifacts, and to develop corresponding information extraction models.
To define the scope of the task, we propose the following definitions.
With annotation guidelines based on the above definitions, we heuristically pre-filtered arXiv papers and annotated 151 text segments. Basic statistics on the resulting data set are shown below.
Statistics
Future considerations
Based on the lessons learned during the creation and analysis of the data set, and on early feedback for it, we have the following considerations for future versions.
@misc{SML:artifpar:2022-1,
author = {Saier, Tarek and Asakura, Takuto and F{\"{a}}rber, Michael},
title = {{artifpar - a data set for research artifact parameters}},
url = {https://sigmathling.kwarc.info/resources/artifact-parameter-dataset/},
note = {SIGMathLing -- Special Interest Group on Math Linguistics},
year = {2022},
}
The content of this dataset is licensed to SIGMathLing members for research and tool development purposes.
Access is restricted to SIGMathLing members under the SIGMathLing Non-Disclosure-Agreement because, for most arXiv articles, the right of distribution was only given (or assumed) to arXiv itself.
Part of the [KOM,BI] project, with support from KHYS.
Data format example:
{
  "context": "We use the Adam optimizer with a fixed learning rate of \\(10^{-5}\\).",
  "in_doc": "2012.01573",
  "annotator_id": "tsa",
  "annotation": {
    "entity_dict": {
      "A1": {"entity_group": "a_g1",
             "offset": [11, 15],
             "surface_text": "Adam",
             "type": "a"},
      "P1": {"entity_group": "p_g1",
             "offset": [39, 52],
             "surface_text": "learning rate",
             "type": "p"},
      "V1": {"entity_group": "v_g1",
             "offset": [58, 65],
             "surface_text": "10^{-5}",
             "type": "v"}
    },
    "relation_tuples": [["V1", "P1"],
                        ["P1", "A1"]]
  },
  "annotation_w3c_data_model": [{"@context": "http://www.w3.org/ns/anno.jsonld",
                                 "type": "Annotation"
                                 ...
                                }],
  ...
}
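The record layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the release: the helper names `check_offsets` and `resolve_relations` are hypothetical, and the sample record contains only the fields shown in the example (the elided parts are left out). It assumes that `offset` is a `[start, end)` character range into the decoded `context` string and that each relation tuple points from a child entity to its parent (value to parameter, parameter to artifact). Both assumptions are inferred from this single example rather than from the annotation guidelines.

```python
import json

# Sample record copied from the example above; elided fields are omitted.
# The raw string keeps the JSON escape "\\(" intact, which decodes to "\(".
RECORD_JSON = r'''
{
  "context": "We use the Adam optimizer with a fixed learning rate of \\(10^{-5}\\).",
  "in_doc": "2012.01573",
  "annotator_id": "tsa",
  "annotation": {
    "entity_dict": {
      "A1": {"entity_group": "a_g1", "offset": [11, 15],
             "surface_text": "Adam", "type": "a"},
      "P1": {"entity_group": "p_g1", "offset": [39, 52],
             "surface_text": "learning rate", "type": "p"},
      "V1": {"entity_group": "v_g1", "offset": [58, 65],
             "surface_text": "10^{-5}", "type": "v"}
    },
    "relation_tuples": [["V1", "P1"], ["P1", "A1"]]
  }
}
'''


def check_offsets(record):
    """Return IDs of entities whose offset span does not equal surface_text.

    Assumption: offsets are [start, end) indices into the decoded context.
    """
    context = record["context"]
    mismatches = []
    for entity_id, entity in record["annotation"]["entity_dict"].items():
        start, end = entity["offset"]
        if context[start:end] != entity["surface_text"]:
            mismatches.append(entity_id)
    return mismatches


def resolve_relations(record):
    """Map each relation tuple of entity IDs to a pair of surface texts.

    Assumption: tuples are ordered (child, parent), e.g. value -> parameter.
    """
    entities = record["annotation"]["entity_dict"]
    return [
        (entities[child]["surface_text"], entities[parent]["surface_text"])
        for child, parent in record["annotation"]["relation_tuples"]
    ]


record = json.loads(RECORD_JSON)
print(check_offsets(record))
print(resolve_relations(record))
```

For the sample record, the offset check reports no mismatches, and the relations resolve to the value "10^{-5}" attached to "learning rate", which in turn is attached to "Adam".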