|
1 | | -# Architecture of SHARE/Trove |
2 | | -> NOTE: this document requires update (big ol' TODO) |
3 | | -
|
| 1 | +# Architecture of SHARE/trove |
4 | 2 |
|
5 | 3 | This document is a starting point and reference to familiarize yourself with this codebase. |
6 | 4 |
|
7 | 5 | ## Bird's eye view |
8 | | -In short, SHARE/Trove takes metadata records (in any supported input format), |
9 | | -ingests them, and makes them available in any supported output format. |
10 | | -``` |
11 | | - ┌───────────────────────────────────────────┐ |
12 | | - │ Ingest │ |
13 | | - │ ┌──────┐ │ |
14 | | - │ ┌─────────────────────────┐ ┌──►Format├─┼────┐ |
15 | | - │ │ Normalize │ │ └──────┘ │ │ |
16 | | - │ │ │ │ │ ▼ |
17 | | -┌───────┐ │ │ ┌─────────┐ ┌────────┐ │ │ ┌──────┐ │ save as |
18 | | -│Harvest├─┬─┼─┼─►Transform├──►Regulate├─┼─┬─┼──►Format├─┼─┬─►FormattedMetadataRecord |
19 | | -└───────┘ │ │ │ └─────────┘ └────────┘ │ │ │ └──────┘ │ │ |
20 | | - │ │ │ │ │ . │ │ ┌───────┐ |
21 | | - │ │ └─────────────────────────┘ │ . │ └──►Indexer│ |
22 | | - │ │ │ . │ └───────┘ |
23 | | - │ └─────────────────────────────┼─────────────┘ some formats also |
24 | | - │ │ indexed separately |
25 | | - ▼ ▼ |
26 | | - save as save as |
27 | | - RawDatum NormalizedData |
| 6 | +In short, SHARE/trove holds metadata records that describe things and makes those records available for searching, browsing, and subscribing. |
| 7 | + |
| 8 | + |
| 9 | + |
| 10 | + |
| 11 | +## Parts |
| 12 | +a look at the tangles of communication between different parts of the system: |
| 13 | + |
| 14 | +```mermaid |
| 15 | +graph LR; |
| 16 | + subgraph shtrove; |
| 17 | + subgraph web[api/web server]; |
| 18 | + ingest; |
| 19 | + search; |
| 20 | + browse; |
| 21 | + rss; |
| 22 | + atom; |
| 23 | + oaipmh; |
| 24 | + end; |
| 25 | + worker["background worker (celery)"]; |
| 26 | + indexer["indexer daemon"]; |
| 27 | + rabbitmq["task queue (rabbitmq)"]; |
| 28 | + postgres["database (postgres)"]; |
| 29 | + elasticsearch; |
| 30 | + web---rabbitmq; |
| 31 | + web---postgres; |
| 32 | + web---elasticsearch; |
| 33 | + worker---rabbitmq; |
| 34 | + worker---postgres; |
| 35 | + worker---elasticsearch; |
| 36 | + indexer---rabbitmq; |
| 37 | + indexer---postgres; |
| 38 | + indexer---elasticsearch; |
| 39 | + end; |
| 40 | + source["metadata source (e.g. osf.io backend)"]; |
| 41 | + user["web user, either by browsing directly or via web app (like osf.io)"]; |
| 42 | + subscribers["feed subscription tools"]; |
| 43 | + source-->ingest; |
| 44 | + user-->search; |
| 45 | + user-->browse; |
| 46 | + subscribers-->rss; |
| 47 | + subscribers-->atom; |
| 48 | + subscribers-->oaipmh; |
28 | 49 | ``` |
29 | 50 |
|
30 | 51 | ## Code map |
31 | 52 |
|
32 | 53 | A brief look at important areas of code as they happen to exist now. |
33 | 54 |
|
34 | | -### Static configuration |
35 | | - |
36 | | -`share/schema/` describes the "normalized" metadata schema/format that all |
37 | | -metadata records are converted into when ingested. |
38 | | - |
39 | | -`share/sources/` describes a starting set of metadata sources that the system |
40 | | -could harvest metadata from -- these will be put in the database and can be |
41 | | -updated or added to over time. |
42 | | - |
43 | | -`project/settings.py` describes system-level settings which can be set by |
44 | | -environment variables (and their default values), as well as settings |
45 | | -which cannot. |
46 | | - |
47 | | -`share/models/` describes the data layer using the [Django](https://www.djangoproject.com/) ORM. |
48 | | - |
49 | | -`share/subjects.yaml` describes the "central taxonomy" of subjects allowed |
50 | | -in `Subject.name` fields of `NormalizedData`. |
51 | | - |
52 | | -### Harvest and ingest |
53 | | - |
54 | | -`share/harvest/` and `share/harvesters/` describe how metadata records |
55 | | -are pulled from other metadata repositories. |
56 | | - |
57 | | -`share/transform/` and `share/transformers/` describe how raw data (possibly |
58 | | -in any format) are transformed to the "normalized" schema. |
| 55 | +- `trove`: django app for rdf-based apis |
| 56 | + - `trove.digestive_tract`: most of what happens after ingestion |
| 57 | + - stores records and identifiers in the database |
| 58 | + - initiates indexing |
| 59 | + - `trove.extract`: parsing ingested metadata records into resource descriptions |
| 60 | + - `trove.derive`: from a given resource description, create special non-rdf serializations |
| 61 | + - `trove.render`: from an api response modeled as rdf graph, render the requested mediatype |
| 62 | + - `trove.models`: database models for identifiers and resource descriptions |
| 63 | + - `trove.trovesearch`: builds rdf-graph responses for trove search apis (using `IndexStrategy` implementations from `share.search`) |
| 64 | + - `trove.vocab`: identifies and describes concepts used elsewhere |
| 65 | + - `trove.vocab.trove`: describes types, properties, and api paths in the trove api |
| 66 | + - `trove.vocab.osfmap`: describes metadata from osf.io (currently the only metadata ingested) |
| 67 | + - `trove.openapi`: generate openapi json for the trove api from the thesaurus in `trove.vocab.trove` |
| 68 | +- `share`: django app with search indexes and remnants of sharev2 |
| 69 | + - `share.models`: database models for external sources, users, and other system book-keeping |
| 70 | + - `share.oaipmh`: provide data via [OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html) |
| 71 | + - `share.search`: all interaction with elasticsearch |
| 72 | + - `share.search.index_strategy`: abstract base class `IndexStrategy` with multiple implementations, for different approaches to indexing the same data |
| 73 | + - `share.search.daemon`: the "indexer daemon", an optimized background worker for batch-processing updates and sending to all active index strategies |
| 74 | + - `share.search.index_messenger`: for sending messages to the indexer daemon |
| 75 | +- `api`: django app with remnants of the legacy sharev2 api |
| 76 | + - `api.views.feeds`: allows custom RSS and Atom feeds |
| 77 | + - otherwise, subject to possible deprecation |
| 78 | +- `osf_oauth2_adapter`: django app for login via osf.io |
| 79 | +- `project`: the actual django project |
| 80 | + - default settings at `project.settings` |
| 81 | + - pulls together code from other directories implemented as django apps (`share`, `trove`, `api`, and `osf_oauth2_adapter`) |
59 | 82 |
|
60 | | -`share/regulate/` describes rules which are applied to every normalized datum, |
61 | | -regardless of where or what format it originally came from. |
62 | 83 |
|
63 | | -`share/metadata_formats/` describes how a normalized datum can be formatted |
64 | | -into any supported output format. |
65 | | - |
66 | | -`share/tasks/` runs the harvest/ingest pipeline and stores each task's status |
67 | | -(including debugging info, if errored) as a `HarvestJob` or `IngestJob`. |
68 | | - |
69 | | -### Outward-facing views |
70 | | - |
71 | | -`share/search/` describes how the search indexes are structured, managed, and |
72 | | -updated when new metadata records are introduced -- this provides a view for |
73 | | -discovering items based on whatever search criteria. |
74 | | - |
75 | | -`share/oaipmh/` describes the [OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html) |
76 | | -view for harvesting metadata from SHARE/Trove in bulk. |
77 | | - |
78 | | -`api/` describes a mostly REST-ful API that's useful for inspecting records for |
79 | | -a specific item of interest. |
80 | | - |
81 | | -### Internals |
82 | | - |
83 | | -`share/admin/` is a Django-app for administrative access to the SHARE database |
84 | | -and pipeline logs |
85 | | - |
86 | | -`osf_oauth2_adapter/` is a Django app to support logging in to SHARE via OSF |
| 84 | +## Cross-cutting concerns |
87 | 85 |
|
88 | | -### Testing |
| 86 | +### Resource descriptions |
89 | 87 |
|
90 | | -`tests/` are tests. |
| 88 | +SHARE/trove uses the [resource description framework](https://www.w3.org/TR/rdf11-primer/#section-Introduction): |
| 89 | +- the content of each ingested metadata record is an rdf graph focused on a specific resource |
| 90 | +- all api responses from `trove` views are (experimentally) modeled as rdf graphs, which may be rendered a variety of ways |
91 | 91 |
|
92 | | -## Cross-cutting concerns |
| 92 | +### Identifiers |
93 | 93 |
|
94 | | -### Immutable metadata |
| 94 | +Whenever feasible, use full URI strings to identify resources, concepts, types, and properties that may be exposed outwardly. |
95 | 95 |
|
96 | | -Metadata records at all stages of the pipeline (`RawDatum`, `NormalizedData`, |
97 | | -`FormattedMetadataRecord`) should be considered immutable -- any updates |
98 | | -result in a new record being created, not an old record being altered. |
| 96 | +Prefer using open, standard, well-defined namespaces wherever possible ([DCAT](https://www.w3.org/TR/vocab-dcat-3/) is a good place to start; see `trove.vocab.namespaces` for others already in use). When app-specific concepts must be defined, use the `TROVE` namespace (`https://share.osf.io/vocab/2023/trove/`). |
99 | 97 |
|
100 | | -Multiple records which describe the same item/object are grouped by a |
101 | | -"source-unique identifier" or "suid" -- essentially a two-tuple |
102 | | -`(source, identifier)` that uniquely and persistently identifies an item in |
103 | | -the source repository. In most outward-facing views, default to showing only |
104 | | -the most recent record for each suid. |
| 98 | +A notable exception (non-URI identifier) is the "source-unique identifier" or "suid" -- essentially a two-tuple `(source, identifier)` that uniquely and persistently identifies a metadata record in a source repository. This `identifier` may be any string value, provided by the external source. |
105 | 99 |
|
106 | 100 | ### Conventions |
107 | 101 | (an incomplete list) |
108 | 102 |
|
109 | | -- functions prefixed `pls_` ("please") are a request for something to happen |
| 103 | +- local variables prefixed with underscore (to consistently distinguish between internal-only names and those imported/built-in) |
| 104 | +- prefer full type annotations in python code, wherever reasonably feasible |
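A short sketch of what the two conventions above might look like together. The function `count_words` is a made-up example, not code from this repository.

```python
def count_words(text: str) -> dict[str, int]:
    """Count occurrences of each whitespace-separated word."""
    # Local names carry a leading underscore, so they are easy to tell
    # apart from built-ins (`dict`, `str`) and imported names.
    _counts: dict[str, int] = {}
    for _word in text.split():
        _counts[_word] = _counts.get(_word, 0) + 1
    return _counts
```

Parameters and the return value are fully annotated, and the local `_counts` is annotated as well since its type cannot be inferred from the empty literal alone.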
110 | 105 |
|
111 | 106 | ## Why this? |
112 | 107 | inspired by [this writeup](https://matklad.github.io/2021/02/06/ARCHITECTURE.md.html) |
|