Commit 7bb91b4

Merge pull request #888 from CenterForOpenScience/release/25.6.0
release 25.6.0
2 parents 150997a + 064971d commit 7bb91b4

91 files changed

+1881
-1236
lines changed

ARCHITECTURE.md

Lines changed: 82 additions & 87 deletions
@@ -1,112 +1,107 @@
1-
# Architecture of SHARE/Trove
2-
> NOTE: this document requires update (big ol' TODO)
3-
1+
# Architecture of SHARE/trove
42

53
This document is a starting point and reference to familiarize yourself with this codebase.
64

75
## Bird's eye view
8-
In short, SHARE/Trove takes metadata records (in any supported input format),
9-
ingests them, and makes them available in any supported output format.
10-
```
11-
┌───────────────────────────────────────────┐
12-
│ Ingest │
13-
│ ┌──────┐ │
14-
│ ┌─────────────────────────┐ ┌──►Format├─┼────┐
15-
│ │ Normalize │ │ └──────┘ │ │
16-
│ │ │ │ │ ▼
17-
┌───────┐ │ │ ┌─────────┐ ┌────────┐ │ │ ┌──────┐ │ save as
18-
│Harvest├─┬─┼─┼─►Transform├──►Regulate├─┼─┬─┼──►Format├─┼─┬─►FormattedMetadataRecord
19-
└───────┘ │ │ │ └─────────┘ └────────┘ │ │ │ └──────┘ │ │
20-
│ │ │ │ │ . │ │ ┌───────┐
21-
│ │ └─────────────────────────┘ │ . │ └──►Indexer│
22-
│ │ │ . │ └───────┘
23-
│ └─────────────────────────────┼─────────────┘ some formats also
24-
│ │ indexed separately
25-
▼ ▼
26-
save as save as
27-
RawDatum NormalizedData
6+
In short, SHARE/trove holds metadata records that describe things and makes those records available for searching, browsing, and subscribing.
7+
8+
![overview of shtrove: metadata records in, search/browse/subscribe out](./project/static/img/shtroverview.png)
9+
10+
11+
## Parts
12+
a look at the tangles of communication between different parts of the system:
13+
14+
```mermaid
15+
graph LR;
16+
subgraph shtrove;
17+
subgraph web[api/web server];
18+
ingest;
19+
search;
20+
browse;
21+
rss;
22+
atom;
23+
oaipmh;
24+
end;
25+
worker["background worker (celery)"];
26+
indexer["indexer daemon"];
27+
rabbitmq["task queue (rabbitmq)"];
28+
postgres["database (postgres)"];
29+
elasticsearch;
30+
web---rabbitmq;
31+
web---postgres;
32+
web---elasticsearch;
33+
worker---rabbitmq;
34+
worker---postgres;
35+
worker---elasticsearch;
36+
indexer---rabbitmq;
37+
indexer---postgres;
38+
indexer---elasticsearch;
39+
end;
40+
source["metadata source (e.g. osf.io backend)"];
41+
user["web user, either by browsing directly or via web app (like osf.io)"];
42+
subscribers["feed subscription tools"];
43+
source-->ingest;
44+
user-->search;
45+
user-->browse;
46+
subscribers-->rss;
47+
subscribers-->atom;
48+
subscribers-->oaipmh;
2849
```
2950

3051
## Code map
3152

3253
A brief look at important areas of code as they happen to exist now.
3354

34-
### Static configuration
35-
36-
`share/schema/` describes the "normalized" metadata schema/format that all
37-
metadata records are converted into when ingested.
38-
39-
`share/sources/` describes a starting set of metadata sources that the system
40-
could harvest metadata from -- these will be put in the database and can be
41-
updated or added to over time.
42-
43-
`project/settings.py` describes system-level settings which can be set by
44-
environment variables (and their default values), as well as settings
45-
which cannot.
46-
47-
`share/models/` describes the data layer using the [Django](https://www.djangoproject.com/) ORM.
48-
49-
`share/subjects.yaml` describes the "central taxonomy" of subjects allowed
50-
in `Subject.name` fields of `NormalizedData`.
51-
52-
### Harvest and ingest
53-
54-
`share/harvest/` and `share/harvesters/` describe how metadata records
55-
are pulled from other metadata repositories.
56-
57-
`share/transform/` and `share/transformers/` describe how raw data (possibly
58-
in any format) are transformed to the "normalized" schema.
55+
- `trove`: django app for rdf-based apis
56+
- `trove.digestive_tract`: most of what happens after ingestion
57+
- stores records and identifiers in the database
58+
- initiates indexing
59+
- `trove.extract`: parsing ingested metadata records into resource descriptions
60+
- `trove.derive`: from a given resource description, create special non-rdf serializations
61+
- `trove.render`: from an api response modeled as rdf graph, render the requested mediatype
62+
- `trove.models`: database models for identifiers and resource descriptions
63+
- `trove.trovesearch`: builds rdf-graph responses for trove search apis (using `IndexStrategy` implementations from `share.search`)
64+
- `trove.vocab`: identifies and describes concepts used elsewhere
65+
- `trove.vocab.trove`: describes types, properties, and api paths in the trove api
66+
- `trove.vocab.osfmap`: describes metadata from osf.io (currently the only metadata ingested)
67+
- `trove.openapi`: generate openapi json for the trove api from thesaurus in `trove.vocab.trove`
68+
- `share`: django app with search indexes and remnants of sharev2
69+
- `share.models`: database models for external sources, users, and other system book-keeping
70+
- `share.oaipmh`: provide data via [OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html)
71+
- `share.search`: all interaction with elasticsearch
72+
- `share.search.index_strategy`: abstract base class `IndexStrategy` with multiple implementations, for different approaches to indexing the same data
73+
- `share.search.daemon`: the "indexer daemon", an optimized background worker for batch-processing updates and sending to all active index strategies
74+
- `share.search.index_messenger`: for sending messages to the indexer daemon
75+
- `api`: django app with remnants of the legacy sharev2 api
76+
- `api.views.feeds`: allows custom RSS and Atom feeds
77+
- otherwise, subject to possible deprecation
78+
- `osf_oauth2_adapter`: django app for login via osf.io
79+
- `project`: the actual django project
80+
- default settings at `project.settings`
81+
- pulls together code from other directories implemented as django apps (`share`, `trove`, `api`, and `osf_oauth2_adapter`)
5982

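To illustrate the idea behind `share.search.index_strategy` (one abstract base class, several implementations that index the same data differently, and a daemon that sends each batch to all active strategies), here is a minimal sketch. It is not the real `IndexStrategy` API: the class names, the `build_docs` method, and `send_to_all` are all hypothetical, chosen only for illustration.

```python
import abc


class SketchIndexStrategy(abc.ABC):
    """Hypothetical stand-in for an index strategy (NOT the real `IndexStrategy` API)."""

    @abc.abstractmethod
    def build_docs(self, records: list[dict[str, str]]) -> list[dict[str, str]]:
        """Turn a batch of records into documents for this strategy's index."""


class TitleOnlyStrategy(SketchIndexStrategy):
    """One approach: index only the title of each record."""

    def build_docs(self, records: list[dict[str, str]]) -> list[dict[str, str]]:
        return [{'title': _record['title']} for _record in records]


class FullTextStrategy(SketchIndexStrategy):
    """Another approach: index all values of each record as one text field."""

    def build_docs(self, records: list[dict[str, str]]) -> list[dict[str, str]]:
        return [{'text': ' '.join(_record.values())} for _record in records]


def send_to_all(
    strategies: list[SketchIndexStrategy],
    records: list[dict[str, str]],
) -> dict[str, list[dict[str, str]]]:
    """Send the same batch to every active strategy (hypothetical daemon step)."""
    return {type(_strategy).__name__: _strategy.build_docs(records) for _strategy in strategies}


docs_by_strategy = send_to_all(
    [TitleOnlyStrategy(), FullTextStrategy()],
    [{'title': 'blarg', 'description': 'a thing'}],
)
```

The point of the pattern is that the same batch of updates can feed several differently shaped indexes without the caller knowing how each one is built.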
60-
`share/regulate/` describes rules which are applied to every normalized datum,
61-
regardless of where or what format it originally came from.
6283

63-
`share/metadata_formats/` describes how a normalized datum can be formatted
64-
into any supported output format.
65-
66-
`share/tasks/` runs the harvest/ingest pipeline and stores each task's status
67-
(including debugging info, if errored) as a `HarvestJob` or `IngestJob`.
68-
69-
### Outward-facing views
70-
71-
`share/search/` describes how the search indexes are structured, managed, and
72-
updated when new metadata records are introduced -- this provides a view for
73-
discovering items based on whatever search criteria.
74-
75-
`share/oaipmh/` describes the [OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html)
76-
view for harvesting metadata from SHARE/Trove in bulk.
77-
78-
`api/` describes a mostly REST-ful API that's useful for inspecting records for
79-
a specific item of interest.
80-
81-
### Internals
82-
83-
`share/admin/` is a Django-app for administrative access to the SHARE database
84-
and pipeline logs
85-
86-
`osf_oauth2_adapter/` is a Django app to support logging in to SHARE via OSF
84+
## Cross-cutting concerns
8785

88-
### Testing
86+
### Resource descriptions
8987

90-
`tests/` are tests.
88+
SHARE/trove uses the [resource description framework](https://www.w3.org/TR/rdf11-primer/#section-Introduction):
89+
- the content of each ingested metadata record is an rdf graph focused on a specific resource
90+
- all api responses from `trove` views are (experimentally) modeled as rdf graphs, which may be rendered a variety of ways
9191

92-
## Cross-cutting concerns
92+
### Identifiers
9393

94-
### Immutable metadata
94+
Whenever feasible, use full URI strings to identify resources, concepts, types, and properties that may be exposed outwardly.
9595

96-
Metadata records at all stages of the pipeline (`RawDatum`, `NormalizedData`,
97-
`FormattedMetadataRecord`) should be considered immutable -- any updates
98-
result in a new record being created, not an old record being altered.
96+
Prefer using open, standard, well-defined namespaces wherever possible ([DCAT](https://www.w3.org/TR/vocab-dcat-3/) is a good place to start; see `trove.vocab.namespaces` for others already in use). When app-specific concepts must be defined, use the `TROVE` namespace (`https://share.osf.io/vocab/2023/trove/`).
9997

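As a rough sketch of what "full URI strings" in a namespace can look like in python: the `IriNamespace` helper below is hypothetical (it is not the code in `trove.vocab.namespaces`, whose contents are not shown here) and uses only the standard library. The `TROVE` namespace URI comes from the text above; the `DCAT` namespace URI is the one published by W3C.

```python
class IriNamespace:
    """Hypothetical helper: builds full URI strings from a shared namespace prefix."""

    def __init__(self, base_iri: str) -> None:
        self._base_iri = base_iri

    def __getattr__(self, name: str) -> str:
        # only called for names not found normally; skip private/dunder lookups
        if name.startswith('_'):
            raise AttributeError(name)
        return f'{self._base_iri}{name}'


# app-specific concepts live in the TROVE namespace
TROVE = IriNamespace('https://share.osf.io/vocab/2023/trove/')
# open, standard namespaces are preferred where they fit
DCAT = IriNamespace('http://www.w3.org/ns/dcat#')

indexcard_type = TROVE.Indexcard
catalog_record_type = DCAT.CatalogRecord
```

Here `indexcard_type` and `catalog_record_type` are plain strings holding complete URIs, so they can be exposed outwardly without any knowledge of the prefix that produced them.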
100-
Multiple records which describe the same item/object are grouped by a
101-
"source-unique identifier" or "suid" -- essentially a two-tuple
102-
`(source, identifier)` that uniquely and persistently identifies an item in
103-
the source repository. In most outward-facing views, default to showing only
104-
the most recent record for each suid.
98+
A notable exception (non-URI identifier) is the "source-unique identifier" or "suid" -- essentially a two-tuple `(source, identifier)` that uniquely and persistently identifies a metadata record in a source repository. This `identifier` may be any string value, provided by the external source.
10599

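The suid two-tuple can be pictured as a small immutable value that works as a lookup key. The `Suid` class below is a hypothetical illustration, not the database model used in this codebase, and the example values are made up.

```python
from typing import NamedTuple


class Suid(NamedTuple):
    """Hypothetical sketch of a source-unique identifier: `(source, identifier)`."""
    source: str      # which external source provided the record
    identifier: str  # any string value, chosen by that source


# records from the same source with the same identifier share one suid,
# so a later record replaces the earlier one under the same key
records_by_suid: dict[Suid, str] = {}
records_by_suid[Suid('osf.example', 'blarg')] = 'first description'
records_by_suid[Suid('osf.example', 'blarg')] = 'updated description'
records_by_suid[Suid('other.example', 'blarg')] = 'unrelated record'
```

Because the identifier is only unique within its source, the same `identifier` string from two different sources yields two distinct suids.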
106100
### Conventions
107101
(an incomplete list)
108102

109-
- functions prefixed `pls_` ("please") are a request for something to happen
103+
- local variables prefixed with underscore (to consistently distinguish between internal-only names and those imported/built-in)
104+
- prefer full type annotations in python code, wherever reasonably feasible
110105

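A small sketch of the two conventions listed above, using a made-up function (`count_words` is not from this codebase). Leaving parameters unprefixed while prefixing names assigned inside the function is an assumption about how the convention is applied.

```python
from collections.abc import Iterable


def count_words(lines: Iterable[str]) -> dict[str, int]:
    """Count how often each whitespace-separated word appears."""
    # underscore-prefixed names are local to this function;
    # unprefixed names (`Iterable`, `dict`, `str`) are imported or built-in
    _counts: dict[str, int] = {}
    for _line in lines:
        for _word in _line.split():
            _counts[_word] = _counts.get(_word, 0) + 1
    return _counts
```

With the prefix in place, a reader can tell at a glance which names come from outside the function, and the full annotations let a type checker such as mypy verify callers.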
111106
## Why this?
112107
inspired by [this writeup](https://matklad.github.io/2021/02/06/ARCHITECTURE.md.html)

CHANGELOG.md

Lines changed: 20 additions & 0 deletions
@@ -1,5 +1,25 @@
11
# Change Log
22

3+
# [25.6.0] - 2025-10-30
4+
- bump dependencies
5+
- `celery` to 5.5.3
6+
- `kombu` to 5.5.4
7+
- improve error handling in celery task-result backend
8+
- use logging config in celery worker
9+
- improve code docs (README.md et al.)
10+
- add cardsearch feeds (rss and atom)
11+
- /trove/index-card-search/rss.xml
12+
- /trove/index-card-search/atom.xml
13+
- fix: render >1 result in streamed index-value-search (csv, tsv, json)
14+
- when browsing trove api in browser, wrap non-browser-friendly mediatypes in html (unless `withFileName`, which requests download)
15+
- better trove.render test coverage
16+
- code cleanliness
17+
- de-collide "simple" names
18+
- SimpleRendering => EntireRendering
19+
- SimpleTrovesearchRenderer => TrovesearchCardOnlyRenderer
20+
- consolidate more shared logic into trove.util
21+
- more accurate type annotations
22+
323
# [25.5.0] - 2025-07-15
424
- use python 3.13
525
- use `poetry` to manage dependencies

CONTRIBUTING.md

Lines changed: 13 additions & 2 deletions
@@ -1,7 +1,18 @@
11
# CONTRIBUTING
22

3-
TODO: how do we want to guide community contributors?
3+
> note: this codebase is currently (and historically) rather entangled with [osf.io](https://osf.io), which has its shtrove at https://share.osf.io -- stay tuned for more-reusable open-source libraries and tools that should be more accessible to community contribution
44
5-
For now, if you're interested in contributing to SHARE/Trove, feel free to
5+
For now, if you're interested in contributing to SHARE/trove, feel free to
66
[open an issue on github](https://github.com/CenterForOpenScience/SHARE/issues)
77
and start a conversation.
8+
9+
## Required checks
10+
11+
All changes must pass the following checks with no errors:
12+
- linting: `python -m flake8`
13+
- static type-checking (on `trove/` code only, for now): `python -m mypy trove`
14+
- tests: `python -m pytest -x tests/`
15+
- note: some tests require other services to be running -- if [using the provided docker-compose.yml](./how-to/run-locally.md), we recommend starting the services in the background (starting `worker` also starts everything else: `docker compose up -d worker`) and running the tests from within one of the python containers (`indexer`, `worker`, or `web`):
16+
`docker compose exec indexer python -m pytest -x tests/`
17+
18+
All new changes should also avoid decreasing test coverage, when reasonably possible (currently checked on github pull requests).

README.md

Lines changed: 9 additions & 25 deletions
@@ -1,33 +1,17 @@
1-
# SHARE/Trove
1+
# SHARE/trove (aka SHARtrove, shtrove)
22

3-
SHARE is creating a free, open dataset of research (meta)data.
3+
> share (verb): to have or use in common.
44
5-
> **Note**: SHARE’s open API tools and services help bring together scholarship distributed across research ecosystems for the purpose of greater discoverability. However, SHARE does not guarantee a complete aggregation of searched outputs. For this reason, SHARE results should not be used for methodological analyses, such as systematic reviews.
5+
> trove (noun): a store of valuable or delightful things.
66
7-
[![Coverage Status](https://coveralls.io/repos/github/CenterForOpenScience/SHARE/badge.svg?branch=develop)](https://coveralls.io/github/CenterForOpenScience/SHARE?branch=develop)
7+
SHARE/trove (aka SHARtrove, shtrove) is a service meant to store (meta)data you wish to keep and offer openly.
88

9-
## Documentation
9+
note: this codebase is currently (and historically) rather entangled with [osf.io](https://osf.io), which has its shtrove at https://share.osf.io -- stay tuned for more-reusable open-source libraries and tools for working with (meta)data
1010

11-
### What is this?
12-
see [WHAT-IS-THIS-EVEN.md](./WHAT-IS-THIS-EVEN.md)
11+
see [ARCHITECTURE.md](./ARCHITECTURE.md) for help navigating this codebase
1312

14-
### How can I use it?
15-
see [how-to/use-the-api.md](./how-to/use-the-api.md)
13+
see [CONTRIBUTING.md](./CONTRIBUTING.md) for info about contributing changes
1614

17-
### How do I navigate this codebase?
18-
see [ARCHITECTURE.md](./ARCHITECTURE.md)
19-
20-
### How do I run a copy locally?
21-
see [how-to/run-locally.md](./how-to/run-locally.md)
22-
23-
24-
## Running Tests
25-
26-
### Unit test suite
27-
28-
py.test
29-
30-
### BDD Suite
31-
32-
behave
15+
see [how-to/use-the-api.md](./how-to/use-the-api.md) for help using the api to add and access (meta)data
3316

17+
see [how-to/run-locally.md](./how-to/run-locally.md) for help running a shtrove instance for local development

TODO.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
1+
# TODO:
2+
ways to better this mess
3+
4+
## better shtrove api experience
5+
6+
- better web-browsing experience
7+
- include more explanatory docs (and better fill out those explanations)
8+
- even more helpful (less erratic) visual design
9+
- in each html rendering of an api response, include a `<form>` for adding/editing/viewing query params
10+
- in browsable html, replace json literals with rdf rendered like the rest of the page
11+
- (perf) add bare-minimal IndexcardDeriver (iris, types, namelikes); use for search-result display
12+
- better tsv/csv experience
13+
- set default columns for `index-value-search` (and/or broadly improve `fields` handling)
14+
- better turtle experience
15+
- quoted literal graphs also turtle
16+
- omit unnecessary `^^rdf:string`
17+
- better jsonld experience
18+
- provide `@context` (via header, at least)
19+
- accept jsonld at `/trove/ingest` (or at each `ldp:inbox`...)
20+
21+
22+
## modular packaging
23+
move actually-helpful logic into separate packages that can be used and maintained independently of
24+
any particular web app/api/framework (and then use those packages in shtrove and osf)
25+
26+
- `osfmap`: standalone OSFMAP definition
27+
- define osfmap properties and shapes (following DCTAP) in static tsv files
28+
- use `tapshoes` (below) to generate docs and helpful utility functions
29+
- may replace/simplify:
30+
- `osf.metadata.osf_gathering.OSFMAP` (and related constants)
31+
- `trove.vocab.osfmap`
32+
- `trove.derive.osfmap_json`
33+
- `tapshoes`: for using and packaging [tabular application profiles](https://dcmi.github.io/dctap/) in python
34+
- take a set of tsv/csv files as input
35+
- should support any valid DCTAP (aim to be worth community interest)
36+
- initial/immediate use case `osfmap`
37+
- generate more human-readable docs of properties and shapes/types
38+
- validate a given record (rdf graph) against a profile
39+
- serialize a valid record in a consistent/stable way (according to the profile)
40+
- enable publishing "official" application profiles as installable python packages
41+
- learn from and consider using prior dctap work:
42+
- dctap-python: https://pypi.org/project/dctap/
43+
- loads tabular files into more immediately usable form
44+
- tap2shacl: https://pypi.org/project/tap2shacl/
45+
- builds shacl constraints from application profile
46+
- could then validate a given graph with pyshacl: https://pypi.org/project/pyshacl/
47+
- metadata record crosswalk/serialization
48+
- given a record (as rdf graph) and application profile to which it conforms (like OSFMAP), offer:
49+
- crosswalking to a standard vocab (DCAT, schema.org, ...)
50+
- stable rdf serialization (json-ld, turtle, xml, ...)
51+
- special bespoke serialization (datacite xml/json, oai_dc, ...)
52+
- may replace/simplify:
53+
- `osf.metadata.serializers`
54+
- `trove.derive`
55+
- `shtrove`: reusable package with the good parts of share/trove
56+
- python api and command-line tools
57+
- given application profile
58+
- digestive tract with pluggable storage/indexing interfaces
59+
- methods for ingest, search, browse, subscribe
60+
- `django-shtrove`: django wrapper for `shtrove` functionality
61+
- set application profile via django setting
62+
- django models for storage, elasticsearch for indexing
63+
- django views for ingest, search, browse, subscribe
64+
65+
66+
## open web standards
67+
- data catalog vocabulary (DCAT) https://www.w3.org/TR/vocab-dcat-3/
68+
- an appropriate (and better thought-thru) vocab for a lot of what shtrove does
69+
- already used in some ways, but would benefit from adopting more thoroughly
70+
- replace bespoke types (like `trove:Indexcard`) with better-defined dcat equivalents (like `dcat:CatalogRecord`)
71+
- rename various properties/types/variables similarly
72+
- "catalog" vs "index"
73+
- "record" vs "card"
74+
- replace checksum-iris with `spdx:checksum` (added in dcat 3)
75+
- linked data notifications (LDN) https://www.w3.org/TR/ldn/
76+
- shtrove incidentally (partially) aligns with linked-data principles -- could lean into that
77+
- replace `/trove/ingest` with one or more `ldp:inbox` urls
78+
- trove index-card like an inbox containing current/past resource descriptions
79+
```
80+
<://osf.example/blarg> ldp:inbox <://shtrove.example/index-card/0000-00...> .
81+
<://shtrove.example/index-card/0000-00...> ldp:contains <://shtrove.example/description/0000-00...> .
82+
<://shtrove.example/description/0000-00...> foaf:primaryTopic <://osf.example/blarg>
83+
```
84+
(might consider renaming "index-card" for consistency/clarity)
