bug extracting readthedocs badges with regexp. Fixes #860, Fixes #857 #863
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,163 @@ | ||
| The following metadata fields can be extracted from a README.md file. | ||
| Unlike other file formats (pom, cargo, cabal...), README documents do not follow a formal specification. They are free-form text files, usually written in Markdown or reStructuredText, and their structure varies widely across projects. SOMEF applies heuristics to identify common sections (e.g., Title, Description, Installation, Usage, License) and extracts metadata accordingly. | ||
|
|
||
| | Software metadata category | SOMEF metadata JSON path | README.MD metadata file field | | ||
| |--------------------------------|----------------------------------------|----------------------------------------| | ||
| | acknowledgement                | acknowledgement[i].result.value     | headers with acknowledgement   | | ||
| | citation                       | citation[i].result.value     | headers with citation, reference, cite. Extracts BibTeX **(1)**  | | ||
| | contact | contact[i].result.value | headers with contact | | ||
| | contributing_guidelines | contributing_guidelines[i].result.value | headers with contributing | | ||
| | contributors | contributors[i].result.value | headers with contributor | | ||
| | description | description[i].result.value | headers with description, introduction, basics, initiation, overview | | ||
| | documentation                  | documentation[i].result.value    | GitHub or GitLab documentation URL **(2)**, headers with documentation, readthedocs project with the same name, readthedocs in badges, wiki links in badges and text   | | ||
| | download | download[i].result.value | headers with download | | ||
| | executable_example             | executable_example[i].result.value     | extracts Binder links from badges **(3)**  | | ||
| | faq | faq[i].result.value | headers with faq, errors, problems | | ||
| | full_title | full_title[i].result.value | extract full title **(4)** | | ||
| | homepage                       | homepage[i].result.value     | homepage from badges **(5)** | | ||
| | identifier                     | identifier[i].result.value     | extracted directly from badges or obtained from Zenodo with the latest DOI **(6)**, SWH identifiers **(7)**  | | ||
| | images | images[i].result.value | other images in the README apart from the logo | | ||
| | installation | installation[i].result.value | headers with installation, install, setup, prepare, preparation, manual, guide | | ||
| | license | license[i].result.value | headers with license | | ||
| | logo                           | logo[i].result.value     | looks for images in badges and text **(8)** | | ||
| | package_distribution           | package_distribution[i].result.value     | PyPI or latest PyPI version in badges **(9)** | | ||
| | related_documentation          | related_documentation[i].result.value     | readthedocs project with a different name  | | ||
| | run | run[i].result.value | headers with run, execute | | ||
| | readme_url                     | readme_url[i].result.value     | URL on raw.githubusercontent.com **(11)** | | ||
| | related_papers                 | related_papers[i].result.value     | looks for arXiv references in all the text **(10)** | | ||
| | repository_status | repository_status[i].result.value | badges with Project status **(12)** | | ||
| | requirements | requirements[i].result.value | headers with requirement, prerequisite, dependency, dependent | | ||
| | support | support[i].result.value | headers with support, help, report | | ||
| | support_channels               | support_channels[i].result.value     | extracts Gitter, Reddit and Discord links from badges and text **(13)**  | | ||
| | usage | usage[i].result.value | headers with usage, example, implement, implementation, demo, tutorial, start, started | | ||
|
|
||
|
|
||
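The header-based rows of the table can be read as a keyword lookup. The following is a minimal sketch of that idea, not SOMEF's actual implementation: `HEADER_KEYWORDS` and `classify_header` are hypothetical names, only a few rows of the table are copied in, and word-prefix matching is an assumption made for illustration.

```python
import re

# Keyword lists copied from a few rows of the table above (illustrative subset).
HEADER_KEYWORDS = {
    "installation": ["installation", "install", "setup", "prepare",
                     "preparation", "manual", "guide"],
    "usage": ["usage", "example", "implement", "implementation", "demo",
              "tutorial", "start", "started"],
    "requirements": ["requirement", "prerequisite", "dependency", "dependent"],
    "faq": ["faq", "errors", "problems"],
    "license": ["license"],
}


def classify_header(header: str) -> list[str]:
    """Return every category with a keyword that prefixes a word of the header."""
    words = re.findall(r"[a-z]+", header.lower())
    return [
        category
        for category, keywords in HEADER_KEYWORDS.items()
        if any(word.startswith(keyword) for word in words for keyword in keywords)
    ]
```

With this sketch, `classify_header("## Getting Started")` yields `["usage"]`, and a header such as `Authors` matches no category.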
| ------ | ||
|
|
||
| **(1)** | ||
| - Example: | ||
| ```bib | ||
| @inproceedings{garijo2017widoco, | ||
| title={WIDOCO: a wizard for documenting ontologies}, | ||
| author={Garijo, Daniel}, | ||
| booktitle={International Semantic Web Conference}, | ||
| pages={94--102}, | ||
| year={2017}, | ||
| organization={Springer, Cham}, | ||
| doi = {10.1007/978-3-319-68204-4_9}, | ||
| funding = {USNSF ICER-1541029, NIH 1R01GM117097-01}, | ||
| url={http://dgarijo.com/papers/widoco-iswc2017.pdf} | ||
| } | ||
| ``` | ||
| - Result: | ||
| ``` | ||
| { | ||
| "result": { | ||
| "value": "@inproceedings{garijo2017widoco,\n url = {http://dgarijo.com/papers/widoco-iswc2017.pdf},\n funding = {USNSF ICER-1541029, NIH 1R01GM117097-01},\n doi = {10.1007/978-3-319-68204-4_9},\n organization = {Springer, Cham},\n year = {2017},\n pages = {94--102},\n booktitle = {International Semantic Web Conference},\n author = {Garijo, Daniel},\n title = {WIDOCO: a wizard for documenting ontologies},\n}", | ||
| "type": "Text_excerpt", | ||
| "format": "bibtex", | ||
| "doi": "10.1007/978-3-319-68204-4_9", | ||
| "title": "WIDOCO: a wizard for documenting ontologies", | ||
| "author": "Garijo, Daniel", | ||
| "url": "http://dgarijo.com/papers/widoco-iswc2017.pdf" | ||
| }, | ||
| } | ||
| ``` | ||
|
|
||
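The `doi`, `title`, `author` and `url` fields in the result above mirror fields of the BibTeX entry. As a rough sketch of how such fields could be pulled out, assuming flat `key={value}` pairs without nested braces (`bibtex_fields` is a hypothetical helper; a real implementation may well rely on a dedicated BibTeX parser):

```python
import re


def bibtex_fields(entry: str) -> dict[str, str]:
    """Collect `key = {value}` pairs from one BibTeX entry (no nested braces)."""
    return {
        key.lower(): value.strip()
        for key, value in re.findall(r"(\w+)\s*=\s*\{([^{}]*)\}", entry)
    }


entry = """@inproceedings{garijo2017widoco,
  title={WIDOCO: a wizard for documenting ontologies},
  author={Garijo, Daniel},
  doi = {10.1007/978-3-319-68204-4_9},
  url={http://dgarijo.com/papers/widoco-iswc2017.pdf}
}"""
fields = bibtex_fields(entry)
```

The entry key line (`@inproceedings{garijo2017widoco,`) contains no `=` and is therefore skipped by the pattern.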
|
|
||
| **(2)** | ||
| - Example if github: | ||
| ``` | ||
| f"https://github.com/{owner}/{repo_name}/tree/{urllib.parse.quote(repo_default_branch)}/{docs_path}" | ||
| ``` | ||
| - Example if gitlab: | ||
| ``` | ||
| f"https://{domain_gitlab}/{owner}/{repo_name}/-/tree/{urllib.parse.quote(repo_default_branch)}/{docs_path}" | ||
| ``` | ||
|
|
||
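The two templates above differ only in the host and the `/-/tree/` segment. A small sketch that wraps them in one function, assuming the same variable names as the templates (`docs_tree_url` itself is a hypothetical helper name):

```python
from urllib.parse import quote


def docs_tree_url(owner, repo_name, repo_default_branch, docs_path, domain_gitlab=None):
    """Build the documentation folder URL for GitHub, or GitLab if a domain is given."""
    branch = quote(repo_default_branch)  # percent-encode spaces etc., keep "/"
    if domain_gitlab:
        return f"https://{domain_gitlab}/{owner}/{repo_name}/-/tree/{branch}/{docs_path}"
    return f"https://github.com/{owner}/{repo_name}/tree/{branch}/{docs_path}"
```

For example, `docs_tree_url("dgarijo", "Widoco", "master", "docs")` gives `https://github.com/dgarijo/Widoco/tree/master/docs`.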
| **(3)** | ||
| - Example: `[](https://mybinder.org/v2/gh/user/repo/HEAD)` | ||
| - Result: `"value": "https://mybinder.org/v2/gh/user/repo/HEAD"` | ||
|
|
||
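A badge is usually an image wrapped in a link, so the interesting part is the link target. The sketch below shows one possible regular expression for that, assuming Markdown badge syntax; `BADGE_LINK` and `binder_links` are hypothetical names, and the badge image URLs in the sample are illustrative only.

```python
import re
from urllib.parse import urlparse

# Matches [![alt](image-url)](target) and the bare form [](target).
BADGE_LINK = re.compile(r"\[(?:!\[[^\]]*\]\([^)]*\))?\]\((https?://[^)\s]+)\)")


def binder_links(readme_text: str) -> list[str]:
    """Return badge targets hosted on mybinder.org."""
    return [
        target
        for target in BADGE_LINK.findall(readme_text)
        if urlparse(target).hostname == "mybinder.org"
    ]


sample = (
    "[![Binder](https://mybinder.org/badge_logo.svg)]"
    "(https://mybinder.org/v2/gh/user/repo/HEAD)\n"
    "[![Site](https://img.shields.io/badge/site-up-green)](https://myproject.org)\n"
)
links = binder_links(sample)
```

Ordinary text links such as `[docs](https://...)` are not matched, because the pattern requires the link text to be empty or an image.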
| **(4)** | ||
| - Example: `# WIzard for DOCumenting Ontologies (WIDOCO)` | ||
| - Result: | ||
| ``` | ||
| "full_title": [ | ||
| { | ||
| "result": { | ||
| "type": "String", | ||
| "value": "WIzard for DOCumenting Ontologies (WIDOCO)" | ||
| }, | ||
| "confidence": 1, | ||
| "technique": "regular_expression", | ||
| "source": "https://raw.githubusercontent.com/dgarijo/Widoco/master/README.md" | ||
| } | ||
| ] | ||
| ``` | ||
|
|
||
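Since the result reports `regular_expression` as the technique, a level-1 Markdown header is presumably what gets matched. A minimal sketch under that assumption (`full_title` is a hypothetical helper, and this version does not skip `#` comments inside fenced code blocks):

```python
import re


def full_title(readme_text):
    """Return the text of the first level-1 Markdown header, or None."""
    match = re.search(r"^#[ \t]+(.+?)[ \t]*$", readme_text, flags=re.MULTILINE)
    return match.group(1) if match else None
```

Headers of level two and deeper (`##`, `###`) are ignored because the pattern requires whitespace right after the first `#`.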
| **(5)** | ||
| - Example: `[](https://myproject.org)` | ||
| - Result: `"value": "https://myproject.org"` | ||
|
|
||
|
|
||
| **(6)** | ||
| - Example: `[](https://doi.org/10.5281/zenodo.11093793)` | ||
| - Result: `"value": "https://doi.org/10.5281/zenodo.11093793"` | ||
|
|
||
| **(7)** | ||
| - Example: `[](https://archive.softwareheritage.org/swh:1:dir:40d462bbecefc3a9c3e810567d1f0d7606e0fae7;origin=...)` | ||
| - Result: `"value": "https://archive.softwareheritage.org/swh:1:dir:40d462bbecefc3a9c3e810567d1f0d7606e0fae7"` | ||
|
|
||
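In the example above, the qualifiers after `;` are dropped and only the core identifier is kept. One way to sketch this is to search for the core SWHID pattern (`swh:1:<type>:<40 hex digits>`) and rebuild the URL; `swh_identifier_url` is a hypothetical helper, and the `origin` value in the test input is made up for illustration.

```python
import re

# Core SWHID: scheme version 1, an object type, and a 40-character hex hash.
SWHID = re.compile(r"swh:1:(?:cnt|dir|rel|rev|snp):[0-9a-f]{40}")


def swh_identifier_url(badge_target):
    """Return the archive URL for the core SWHID, without qualifiers, or None."""
    match = SWHID.search(badge_target)
    if match is None:
        return None
    return f"https://archive.softwareheritage.org/{match.group(0)}"
```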
|
|
||
| **(8)** | ||
| - Example: `` | ||
| - Result: `"value": "https://raw.githubusercontent.com/dgarijo/Widoco/master/src/main/resources/logo/logo2.png"` | ||
|
|
||
| **(9)** | ||
| - Example: `[](https://badge.fury.io/py/somef)` | ||
| - Result: `"value": "https://pypi.org/project/somef"` | ||
|
|
||
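The example maps a badge target on `badge.fury.io` to the canonical PyPI project page. A short sketch of that normalization, assuming only these two hosts need to be recognized (`PYPI_TARGET` and `pypi_project_url` are hypothetical names):

```python
import re

# Package name taken from a badge.fury.io or pypi.org link target.
PYPI_TARGET = re.compile(
    r"https?://(?:badge\.fury\.io/py|pypi\.org/project)/([A-Za-z0-9._-]+)"
)


def pypi_project_url(badge_target):
    """Normalize a PyPI-related badge target to https://pypi.org/project/<name>."""
    match = PYPI_TARGET.search(badge_target)
    return f"https://pypi.org/project/{match.group(1)}" if match else None
```

The function expects the link target of the badge, not the badge image URL, which typically ends in `.svg`.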
|
|
||
| **(10)** | ||
| - Example: | ||
| ``` | ||
| [Yulun Zhang](http://yulunzhang.com/), [Yapeng Tian](http://yapengtian.org/), [Yu Kong](http://www1.ece.neu.edu/~yukong/), [Bineng Zhong](https://scholar.google.de/citations?user=hvRBydsAAAAJ&hl=en), and [Yun Fu](http://www1.ece.neu.edu/~yunfu/), "Residual Dense Network for Image Super-Resolution", CVPR 2018 (spotlight), [[arXiv]](https://arxiv.org/abs/1802.08797) | ||
| ``` | ||
| - Result: `"value": "https://arxiv.org/abs/1802.08797"` | ||
|
|
||
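Searching the whole text for arXiv references can be approximated with a single pattern. This is a sketch that assumes new-style arXiv identifiers (`YYMM.NNNNN`, optionally with a version suffix); `ARXIV_URL` and `arxiv_links` are hypothetical names.

```python
import re

ARXIV_URL = re.compile(r"https?://arxiv\.org/(?:abs|pdf)/\d{4}\.\d{4,5}(?:v\d+)?")


def arxiv_links(text: str) -> list[str]:
    """Return distinct arXiv URLs in order of first appearance."""
    return list(dict.fromkeys(ARXIV_URL.findall(text)))


line = (
    '"Residual Dense Network for Image Super-Resolution", CVPR 2018 (spotlight), '
    "[[arXiv]](https://arxiv.org/abs/1802.08797)"
)
found = arxiv_links(line)
```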
|
|
||
| **(11)** | ||
| - Example: | ||
| ``` | ||
| f"https://raw.githubusercontent.com/{owner}/{repo_name}/{repo_ref}/{urllib.parse.quote(partial)}" | ||
| ``` | ||
|
|
||
| **(12)** | ||
| - Example: | ||
| ``` | ||
| [](https://www.repostatus.org/#active) | ||
| ``` | ||
| - Result: | ||
| ``` | ||
| "value": "https://www.repostatus.org/#active", | ||
| "description": "Active \u2013 The project has reached a stable, usable state and is being actively developed." | ||
| ``` | ||
|
|
||
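The status name sits in the URL fragment of the repostatus.org link, so the lookup can be sketched as follows. `STATUS_DESCRIPTIONS` and `repository_status` are hypothetical names, and only the `active` description shown above is included; other statuses would need their own entries.

```python
from urllib.parse import urlparse

# Only the description that appears in the example above; others are omitted.
STATUS_DESCRIPTIONS = {
    "active": "Active \u2013 The project has reached a stable, usable state "
              "and is being actively developed.",
}


def repository_status(badge_target):
    """Return value/description for a repostatus.org badge target, or None."""
    parsed = urlparse(badge_target)
    if parsed.hostname not in {"repostatus.org", "www.repostatus.org"}:
        return None
    if not parsed.fragment:
        return None
    result = {"value": badge_target}
    description = STATUS_DESCRIPTIONS.get(parsed.fragment)
    if description:
        result["description"] = description
    return result
```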
| **(13)** | ||
| - Example: | ||
| ``` | ||
| [](https://gitter.im/myproject/community) | ||
| [Reddit](https://www.reddit.com/r/myproject) | ||
| [Discord](https://discord.com/invite/xyz789) | ||
| ``` | ||
| - Result: | ||
| ``` | ||
| "value": "https://gitter.im/myproject/community" | ||
| .... | ||
| "value": "https://www.reddit.com/r/myproject" | ||
| ..... | ||
| "value": "https://discord.com/invite/xyz789" | ||
| ``` | ||
|
|
||
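Because these links are looked up in both badges and plain text, a URL-level search is enough for a first approximation. The sketch below assumes the three hosts shown in the example; `CHANNELS` and `support_channels` are hypothetical names.

```python
import re

# One URL pattern per channel; a URL ends at whitespace or a closing parenthesis.
CHANNELS = {
    "gitter": re.compile(r"https?://gitter\.im/[^\s)]+"),
    "reddit": re.compile(r"https?://(?:www\.)?reddit\.com/r/[^\s)]+"),
    "discord": re.compile(r"https?://discord\.(?:com/invite|gg)/[^\s)]+"),
}


def support_channels(text: str) -> dict[str, list[str]]:
    """Map each channel name to the URLs found for it in the text."""
    return {name: pattern.findall(text) for name, pattern in CHANNELS.items()}


readme = (
    "[](https://gitter.im/myproject/community)\n"
    "[Reddit](https://www.reddit.com/r/myproject)\n"
    "[Discord](https://discord.com/invite/xyz789)\n"
)
channels = support_channels(readme)
```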
|
|
File renamed without changes.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
|
|
||
| SOMEF recognizes the programming languages used in a software repository by inspecting | ||
| well-known configuration files, dependency descriptors and executable artifacts. | ||
| For the extraction details of each file type, follow its link in the table below. | ||
|
|
||
|
|
||
|
|
||
| | Language | Supported Files | | ||
| |-----------|----------------------------| | ||
| | Haskell | [`*.cabal`](./cabal.md) | | ||
| | Java | [`pom.xml`](./pom.md) | | ||
| | JavaScript | [`package.json`](./packagejson.md), [`bower.json`](./bower.md) | | ||
| | Julia | [`Project.toml`](./julia.md) | | ||
| | PHP | [`composer.json`](./composer.md) | | ||
| | Python | [`setup.py`](./setuppy.md), [`pyproject.toml`](./pyprojecttoml.md), [`requirements.txt`](./requirementstxt.md) | | ||
| | R | [`DESCRIPTION`](./description.md) | | ||
| | Ruby | [`*.gemspec`](./gemspec.md) | | ||
| | Rust | [`Cargo.toml`](./cargo.md) | | ||
|
|
||
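The table above amounts to a mapping from file-name patterns to languages. A minimal sketch of such a lookup, assuming case-sensitive matching on the file name only (`LANGUAGE_FILES` and `detect_languages` are hypothetical names, not SOMEF's API):

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns copied from the table above.
LANGUAGE_FILES = {
    "Haskell": ["*.cabal"],
    "Java": ["pom.xml"],
    "JavaScript": ["package.json", "bower.json"],
    "Julia": ["Project.toml"],
    "PHP": ["composer.json"],
    "Python": ["setup.py", "pyproject.toml", "requirements.txt"],
    "R": ["DESCRIPTION"],
    "Ruby": ["*.gemspec"],
    "Rust": ["Cargo.toml"],
}


def detect_languages(paths):
    """Return the sorted languages whose descriptor files appear in `paths`."""
    found = set()
    for path in paths:
        name = PurePosixPath(path).name
        for language, patterns in LANGUAGE_FILES.items():
            if any(fnmatchcase(name, pattern) for pattern in patterns):
                found.add(language)
    return sorted(found)
```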
| --- | ||
|
|
||
| SOMEF also detects the following files to recognize build instructions, workflows or executable examples: | ||
|
|
||
|
|
||
| | Language | Supported Files | Software metadata category | | ||
| |-----------|------------------------------------|-----------------------------| | ||
| | Docker | `Dockerfile`, `docker-compose.yml` | has_build_file | | ||
| | Jupyter Notebook | `*.ipynb` | executable_example | | ||
| | Ontologies | `*.ttl`, `*.owl`, `*.nt`, `*.xml`, `*.jsonld` | ontologies | | ||
| | Shell | `*.sh` | has_script_file | | ||
| | YAML | `*.yml`, `*.yaml` | continuous_integration, workflows | | ||
|
|
||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,82 +2,6 @@ | |
|
|
||
| This project supports extracting metadata from specific types of files commonly used to declare authorship and contribution in open source repositories. | ||
|
|
||
| ## Supported author files | ||
|
|
||
| The following filenames are recognized and processed automatically: | ||
|
|
||
| * `AUTHORS` | ||
| * `AUTHORS.md` | ||
| * `AUTHORS.txt` | ||
|
|
||
| These files are expected to be located at the root of the repository. Filenames are matched case-insensitively. | ||
|
|
||
| ## Purpose and Format | ||
|
|
||
| These files typically contain a list of individuals and/or organizations that have contributed to the project. While there is no universal standard for formatting, a widely referenced convention is Google's guidance: | ||
|
|
||
| 🔗 [Google Open Source: Authors Files Protocol](https://opensource.google/documentation/reference/releasing/authors/) | ||
|
|
||
| The content may be structured as: | ||
|
|
||
| * Simple plain text, with one contributor per line. | ||
| * Markdown-formatted text (`.md` files). | ||
| * Lines including contributor names, emails (e.g., `Name <email>`), and sometimes affiliations. | ||
|
|
||
| ### Examples of Valid Entries | ||
|
|
||
| ```text | ||
| Jane Doe <[email protected]> | ||
| John Smith | ||
| Acme Corporation <[email protected]> | ||
| Google Inc. | ||
| ``` | ||
|
|
||
| ### Examples of Invalid Entries | ||
|
|
||
| ```text | ||
| JetBrains <> | ||
| Microsoft | ||
| Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung | ||
| scrawl - Top contributor | ||
| Tom | ||
| ``` | ||
| ## What Is Read vs. Discarded | ||
|
|
||
| When processing these files, the parser will: | ||
|
|
||
| **Include** lines that: | ||
|
|
||
| * Contain person names, optionally with emails (`Name <email>`). | ||
| * Clearly refer to organizations (e.g., "Google LLC", "OpenAI Inc."). | ||
|
|
||
| **Discard** lines that: | ||
|
|
||
| * Are headers, decorative separators, or markdown formatting (`#`, `*`, `=`, etc.). | ||
| * Contain only URLs or links. | ||
| * Are single words with no email and no organizational keyword (e.g., `JetBrains <>`). | ||
| * Are markdown or structured noise (`---`, `{}`, etc.). | ||
| * Contain more than four words and are not recognized as organizations. This avoids capturing generic or descriptive sentences (e.g., `This line is not an author`). | ||
|
|
||
| ### Special Cases | ||
|
|
||
| * Entries with only a first name and an email are accepted; in that case no empty `last_name` is assigned. | ||
| * Lines starting with `-` or `*` are considered lists, but only parsed if the content matches expected author patterns. | ||
| * Blocks enclosed in `{}` are stripped before parsing. | ||
| * Any line matching known organization suffixes (`Inc.`, `LLC`, `Ltd.`, `Corporation`) is treated as an organization, even if no email is present. | ||
| * Some organization names (e.g., Open Source Initiative) may be mistakenly treated as person names if they do not contain a company designator or email. To improve detection, it is recommended to use names like Open Source Initiative Inc. | ||
| * When a line carries descriptive annotations, only the meaningful part (typically the name) is extracted and the annotations are dropped. | ||
| For example, the line: | ||
| `Tom Smith (Tom) - Project leader 2010-2018` | ||
| will be interpreted as: | ||
| { | ||
| "type": "Person", | ||
| "name": "Tom Smith", | ||
| "value": "Tom Smith", | ||
| "given_name": "Tom", | ||
| "last_name": "Smith" | ||
| } | ||
|
|
||
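The include and discard rules above can be sketched as a single line parser. This is an illustrative approximation, not the project's parser: `parse_author_line`, `ORG_SUFFIXES` and `ENTRY` are hypothetical names, the email addresses in the usage note are placeholders, and the output omits the `value` field shown in the example.

```python
import re

ORG_SUFFIXES = ("Inc.", "LLC", "Ltd.", "Corporation")
ENTRY = re.compile(r"^(?P<name>[^<>]*?)\s*(?:<(?P<email>[^<>\s]*)>)?\s*$")


def parse_author_line(line):
    """Return a Person/Organization dict for one AUTHORS line, or None if discarded."""
    # Strip {...} blocks and a leading list marker.
    text = re.sub(r"\{[^{}]*\}", "", line).strip()
    text = re.sub(r"^[-*]\s+", "", text)
    # Discard blanks, headers, separators and URL-only lines.
    if not text or text[0] in "#=-*" or re.fullmatch(r"https?://\S+", text):
        return None
    match = ENTRY.match(text)
    if match is None:
        return None
    email = match.group("email") or None  # "<>" counts as no email
    # Keep only the part before annotations such as "(Tom) - Project leader".
    name = re.split(r"\s+\(|\s+-\s+", match.group("name"), maxsplit=1)[0].strip()
    words = name.split()
    if not words:
        return None
    if any(word in ORG_SUFFIXES for word in words):
        entry = {"type": "Organization", "name": name}
    elif len(words) > 4 or (len(words) == 1 and email is None):
        return None  # descriptive sentence, or a single word without email
    else:
        entry = {"type": "Person", "name": name, "given_name": words[0]}
        if len(words) > 1:  # never emit an empty last_name
            entry["last_name"] = " ".join(words[1:])
    if email:
        entry["email"] = email
    return entry
```

Under these assumptions, `parse_author_line("Tom Smith (Tom) - Project leader 2010-2018")` returns a `Person` named `Tom Smith`, `parse_author_line("Google Inc.")` returns an `Organization`, and `JetBrains <>`, `Microsoft` and `scrawl - Top contributor` are all discarded.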
|
|
||
| ## Supported Metadata Files in SOMEF | ||
|
|
||
|
|
@@ -90,6 +14,7 @@ SOMEF can extract metadata from a wide range of files commonly found in software | |
| | `bower.json` | JavaScript (Bower) | Package descriptor used for configuring packages that can be used as a dependency for Bower-managed front-end projects. | <div align="center">[🔍](./bower.md)</div>| [📄](https://github.com/bower/spec/blob/master/json.md)| |[Example](https://github.com/juanjemdIos/somef/blob/master/src/somef/test/test_data/repositories/js-template/bower.json) | | ||
| | `package.json` | JavaScript / Node.js | Defines metadata, scripts, and dependencies for Node.js projects | <div align="center">[🔍](./packagejson.md)</div>| [📄](https://docs.npmjs.com/cli/v10/configuring-npm/package-json)| 10.9.4|[Example](https://github.com/npm/cli/blob/latest/package.json) | | ||
| | `codemeta.json` | JSON-LD | Metadata file for research software using JSON-LD vocabulary | <div align="center">[🔍](./codemetajson.md)</div> | [📄](https://github.com/codemeta/codemeta/blob/master/crosswalk.csv)| [v3.0](https://w3id.org/codemeta/3.0)|[Example](https://github.com/codemeta/codemeta/blob/master/codemeta.json) | | ||
| | `README.md` | Markdown | Main documentation file of the repository | <div align="center">[🔍](./readmefile.md)</div>| | |[Example](https://github.com/KnowledgeCaptureAndDiscovery/somef/blob/master/README.md) | | ||
| | `composer.json` | PHP | Manifest file serves as the package descriptor used in PHP projects. | <div align="center">[🔍](./composer.md)</div>| [📄](https://getcomposer.org/doc/04-schema.md)| [2.8.12](https://getcomposer.org/changelog/2.8.12)|[Example](https://github.com/composer/composer/blob/main/composer.json) | | ||
| | `Project.toml` | Julia | Defines the package metadata and dependencies for Julia projects, used by the Pkg package manager.| <div align="center">[🔍](./julia.md)</div>| [📄](https://docs.julialang.org/en/v1/)| |[Example](https://github.com/JuliaLang/TOML.jl/blob/master/Project.toml) | | ||
| | `pyproject.toml` | Python | Modern Python project configuration file used by tools like Poetry and Flit | <div align="center">[🔍](./pyprojecttoml.md)</div>| [📄](https://packaging.python.org/en/latest/guides/writing-pyproject-toml/)| |[Example](https://github.com/KnowledgeCaptureAndDiscovery/somef/blob/master/pyproject.toml) | | ||
|
|
||
What is this file? Shouldn't it be requirements.txt?