Advanced resource entity implementation
Elaborated on #184:
Summary
This describes resource provenance using two attributes that relate the ORI resource to the original resource, so that the relation can be used as metadata.
- Internally we use `canonical_id` and `canonical_iri`, which can be serialized and made public in a different form.
- `entity` has been replaced by `canonical_id` and `canonical_iri`, see history.
- Both `canonical_id` and `canonical_iri` may be specified in the same resource.
- `canonical_id` can appear in the `used_file` to designate a subsection if it contains multiple nested resources.
- `canonical_iri` should designate as closely as possible which resource was used. In the simplest case this is the URL of the resource that was retrieved. However, if that URL contains multiple resources, we can 'guess' what the direct URL to the resource would be.
Description
Different resources can be derived from one entity, e.g. a meeting has multiple nested documents. These documents may be resolvable by their own URL, but the original source of such a resource is still the same as that of its parent, since the parent is our actual source. (Resources that can be resolved, i.e. that have their own identifier, should use that URL instead; see the comment below.)
If possible, it should be identified by a URL, including scheme and query parameters. It should represent the supplier's resource as they specify it: it should include a scheme (https:// by default) but no additional parameters. If the supplier does not specify it but we can assume the resource exists, we can construct the more specific URL ourselves. This makes these identifiers IRIs, which are most often URLs, so we cannot assume that they always resolve.
The canonical creates the bridge between the mapping IRI and the supplier's resource. In SOAP it is not possible to use URLs to identify a specific resource. In that case we have no more information than the identifier itself, so we use canonical_id; it would be something like '8984124'. The used_file would be the URL to our cached version of the SOAP response. In a later iteration we can use URL fragments to designate the identifier within the context of the cached version (this proves to be a problem with Google Storage document revision). We use separate canonical_id and canonical_iri fields since we need to serialize them as different attributes.
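The SOAP case above can be sketched as a small record type. This is only an illustration: the `Provenance` class, the rule that at least one canonical attribute must be present, and the cache URL are assumptions made for the example, not part of the actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """Hypothetical container for the provenance attributes of one resource."""

    used_file: str                       # URL of our cached copy of the source
    canonical_id: Optional[str] = None   # supplier identifier, scoped to used_file
    canonical_iri: Optional[str] = None  # IRI of the supplier's resource, if known

    def __post_init__(self):
        # Assumption for this sketch: a record needs at least one of the two.
        if self.canonical_id is None and self.canonical_iri is None:
            raise ValueError("expected canonical_id, canonical_iri, or both")


# SOAP case: there is no URL for the specific resource, only an identifier.
soap_item = Provenance(
    used_file="https://cache.example.org/soap/response.xml",  # hypothetical URL
    canonical_id="8984124",
)
```

Keeping the two attributes as separate fields mirrors the point that they are serialized as different attributes, and it allows both to be set on the same resource.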
Some considerations:
- When a subresource has its own URL, `canonical_iri` is used to specify it. There is no direct relation between `canonical_iri` and `used_file`: the canonical refers to the specific resource, while the `used_file` should be the cached version of the resource's parent.
- When a subresource doesn't have its own URL, `canonical_id` is used to designate the subresource within the resource. There is a direct relation between `canonical_id` and `used_file`, since the id will always be in the scope of the cached file.
- A downloadable document has a `schema:contentUrl` to the resolver, so `used_file` shouldn't refer to the same cache URL. Instead it should refer to the file where the URL to the document was originally specified. Also, `schema:isBasedOn`, set by the enricher, refers to the document's original download URL. Canonical should refer to the same URL, except when the following applies:
- Some suppliers distinguish between a document resource URL and a document download URL. If this is the case, `canonical_iri` should be the resource URL and `schema:isBasedOn` should be the download URL.
- Note that for `canonical_iri` the document resource URL is specified here as `"self": "api.notubiz.nl/document/780972",` ~~without~~ with the `?format=json&version=1.10.8`. ~~However we cannot add this information, it is up to the user to make the decision about which version and format to use.~~ If possible, we want to give the user as much information as possible on how to find the actual resource we used, so it will include at least the `version` query parameter, and it would be wise to include `format` as well. Sensitive query parameters like authentication should be left out.
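The last consideration can be sketched in Python with the standard `urllib.parse` module. The function name `build_canonical_iri`, the allow-list `KEPT_PARAMS`, and the `token` parameter in the usage line are assumptions made for illustration; which parameters a given supplier treats as sensitive would have to be checked per supplier.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only these query parameters help a user find the resource again.
KEPT_PARAMS = ("format", "version")


def build_canonical_iri(supplier_url: str, default_scheme: str = "https") -> str:
    """Add a scheme if it is missing and keep only allow-listed query parameters."""
    if "://" not in supplier_url:
        supplier_url = f"{default_scheme}://{supplier_url}"
    parts = urlsplit(supplier_url)
    # Everything not on the allow-list (e.g. authentication) is dropped.
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in KEPT_PARAMS]
    # The fragment is left empty in this sketch.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(build_canonical_iri(
    "api.notubiz.nl/document/780972?format=json&version=1.10.8&token=secret"
))
# prints "https://api.notubiz.nl/document/780972?format=json&version=1.10.8"
```

An allow-list is used rather than a block-list so that an unknown parameter is left out by default, which seems the safer choice when the goal is to avoid leaking credentials.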