Skip to content

Commit df0ee69

Browse files
authored
Merge pull request #1 from EngineeringSoftware/reorg-metadata
reorganize files for release and remove Roosterize-specific data
2 parents c7ac025 + 79e6e6b commit df0ee69

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

46 files changed

+21
-2452
lines changed

LICENSE

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
Files: data-indexes/*
1+
Files: projects.yml
22
License: MIT
33

44
Files: definitions/*

README.md

Lines changed: 20 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -4,26 +4,32 @@ A corpus of code for the [Coq proof assistant](https://coq.inria.fr) along with
44
machine-readable representations, derived from verification
55
projects related to the [Mathematical Components][math-comp-website] (MathComp) library.
66
The corpus can be used, e.g., in machine learning and data mining applications.
7-
Machine-readable representations were created using the [SerAPI library][serapi-website].
7+
Machine-readable representations are in the form of [S-expressions][sexp-link] (sexps),
8+
and were created using the [SerAPI library][serapi-website].
89

910
[math-comp-website]: https://math-comp.github.io
1011
[serapi-website]: https://github.com/ejgallego/coq-serapi
12+
[sexp-link]: https://en.wikipedia.org/wiki/S-expression
1113

1214
## Obtaining the corpus
1315

1416
The latest released version of the corpus can be [downloaded][download-link]
15-
from GitHub. The archive includes both the source files written in Coq and files containing
16-
machine-readable representations; the latter were created using [Coq 8.10.2][coq-8102]
17-
and [SerAPI 0.7.1][serapi-071], via the SerAPI programs `sercomp`, `sertok`, and `sername`.
17+
from GitHub. The archive includes both the Coq source files and corresponding
18+
machine-readable representations.
1819

1920
[download-link]: https://github.com/EngineeringSoftware/math-comp-corpus/releases
20-
[coq-8102]: https://github.com/coq/coq/releases/tag/V8.10.2
21-
[serapi-071]: https://github.com/ejgallego/coq-serapi/releases
2221

2322
## Corpus contents
2423

25-
The latest release of the corpus, based on MathComp 1.9.0, consists of 449
26-
source files from 21 Coq projects --- in total over 297k lines of code (LOC).
24+
The latest corpus release is based on [Coq 8.10.2][coq-8102] and [MathComp 1.9.0][mathcomp-190],
25+
and contains 449 source files from 21 Coq projects --- in total over 297k lines of code (LOC).
26+
For each source file (e.g., `theory.v`), there are two corresponding files with lists of
27+
sexps, organized at the Coq sentence level, for tokens (`theory.tok.sexp`)
28+
and abstract syntax trees (`theory.ast.sexp`). Moreover, three sexp representations
29+
are provided for each Coq lemma statement in the corpus: tokens, abstract syntax tree, and
30+
elaborated term. All machine-readable representations were created using [SerAPI 0.7.1][serapi-071],
31+
via the SerAPI-bundled programs `sercomp`, `sertok`, and `sername`.
32+
2733
A [research paper][arxiv-paper] describes the corpus in more detail
2834
and provides additional statistics. The corpus is divided into three tiers based
2935
on how well projects conform to the [MathComp conventions][math-comp-contrib].
@@ -56,12 +62,11 @@ The structure of the corpus is as follows:
5662

5763
| File/directory | Contents |
5864
|:-------------------------------|:---------------------------------------------------------------------------------------|
59-
| `projects-standalone-8.10.yml` | List of projects, along with their URL, SHA, build command, installation command, etc. |
65+
| `projects.yml` | List of projects, along with their URL, SHA, build command, installation command, etc. |
6066
| `raw-files` | Project source files and their machine-readable representations. |
61-
| `lemmas` | Lemmas extracted from the corpus. |
62-
| `lemmas-filtered` | Subset of lemmas obeying restrictions on the maximum sizes of their elaborated terms. |
67+
| `lemmas` | Lemmas for all projects in the corpus and their machine-readable statements. |
68+
| `lemmas-filtered` | Subset of lemmas obeying restrictions on the maximum sizes of elaborated terms. |
6369
| `definitions` | Definitions extracted from the corpus. |
64-
| `data-indexes` | Data splitting used in the research paper. |
6570

6671
[finmap]: https://github.com/math-comp/finmap
6772
[fourcolor]: https://github.com/math-comp/fourcolor
@@ -92,6 +97,9 @@ The structure of the corpus is as follows:
9297
[apache2]: https://spdx.org/licenses/Apache-2.0.html
9398
[bsd2]: https://spdx.org/licenses/BSD-2-Clause.html
9499

100+
[coq-8102]: https://github.com/coq/coq/releases/tag/V8.10.2
101+
[mathcomp-190]: https://github.com/math-comp/math-comp/releases/tag/mathcomp-1.9.0
102+
[serapi-071]: https://github.com/ejgallego/coq-serapi/releases/tag/8.10.0%2B0.7.1
95103
[arxiv-paper]: https://arxiv.org/abs/2004.07761
96104
[math-comp-contrib]: https://github.com/math-comp/math-comp/blob/mathcomp-1.9.0/CONTRIBUTING.md
97105

0 commit comments

Comments
 (0)