Advanced resource entity implementation

Jurrian Tromp edited this page Nov 19, 2019 · 2 revisions

As discussed in #184:

entity is used to designate the origin of a resource. Different resources can be derived from one entity, e.g. a meeting has multiple nested documents. These documents may be resolvable by their own URL, but the original source of the resource is still the same entity as that of its parent, since this is our actual source. Resources that can be resolved (have their own identifier) should use that URL instead, see the comment below.

However, some resources can be enriched. A document can be downloaded and processed by an enricher; this changes the original source of the document, and this should be reflected in entity. This should be implemented.

Some extra explanation: entity would be better renamed to canonical_iri and canonical_id. entity_type can be dropped. If possible, the resource should be identified with a URL, including scheme and query parameters. It should represent the supplier's resource as they specify it: it should include a scheme (https:// by default) but no additional parameters. If the supplier does not specify it but we can assume the resource exists, we can construct the more specific URL ourselves. This makes them IRIs, which are most often URLs, and it implies that we cannot assume they always resolve.
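A minimal sketch of the "https:// by default" rule, assuming the supplier hands us a URL that may or may not carry a scheme. The function name ensure_scheme is hypothetical and not part of the existing code base.

```python
import re

# Matches a leading URL scheme such as "https://" or "ftp://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def ensure_scheme(supplier_url: str, default_scheme: str = "https") -> str:
    """Return supplier_url with a scheme, adding default_scheme when it has none."""
    url = supplier_url.strip()
    if url.startswith("//"):
        # Scheme-relative reference: only the scheme itself is missing.
        return f"{default_scheme}:{url}"
    if not _SCHEME_RE.match(url):
        return f"{default_scheme}://{url}"
    # The supplier already specified a scheme, so keep it as they wrote it.
    return url


print(ensure_scheme("api.notubiz.nl/document/780972"))
# prints "https://api.notubiz.nl/document/780972"
```

The result is treated as an identifier only; nothing here checks that the IRI actually resolves.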

The canonical creates the bridge between the mapping IRI and the supplier's resource. In SOAP it is not possible to use URLs to identify a specific resource; in that case we have no more information than the identifier itself, so we use canonical_id, which would be something like '8984124'. The used_file would be the URL of our cached version of the SOAP response. In a later iteration we can use URL fragments to designate the identifier within the context of the cached version. We use separate canonical_id and canonical_iri fields since we need to serialize them as different attributes.
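As an illustration of keeping the two canonical fields separate, here is a sketch of a small record that serializes canonical_id and canonical_iri as different attributes. The class name SourceReference, the serialize method and the cache URL are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceReference:
    """Hypothetical container for the source fields discussed above."""

    used_file: str                       # URL of our cached copy of the supplier response
    canonical_iri: Optional[str] = None  # set when the resource has its own URL
    canonical_id: Optional[str] = None   # set when only an identifier is known (e.g. SOAP)

    def serialize(self) -> dict:
        """Emit each field that is set as its own attribute."""
        data = {"used_file": self.used_file}
        if self.canonical_iri is not None:
            data["canonical_iri"] = self.canonical_iri
        if self.canonical_id is not None:
            data["canonical_id"] = self.canonical_id
        return data


# SOAP case: no URL for the resource itself, only an identifier
# scoped to the cached response (the cache URL below is made up).
soap_ref = SourceReference(
    used_file="https://cache.example.org/soap/response.xml",
    canonical_id="8984124",
)
print(soap_ref.serialize())
```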

Some considerations:

  • When a subresource has its own URL, canonical_iri is used to specify it. There is no direct relation between canonical_iri and used_file: the canonical refers to the specific resource, while used_file should be the cached version of the resource's parent.
  • When a subresource doesn't have its own URL, canonical_id is used to designate the subresource within the resource. There is a direct relation between canonical_id and used_file, since the id will always be in the scope of the cached file.
  • A downloadable document has a schema:contentUrl pointing to the resolver, so used_file shouldn't refer to the same cache URL. Instead it should refer to the file where the URL of the document was originally specified. Also, schema:isBasedOn, set by the enricher, refers to the document's original download URL. The canonical should refer to the same URL, except when the following applies:
  • Some suppliers distinguish between a document resource URL and a document download URL. If this is the case, canonical_iri should be the resource URL and schema:isBasedOn should be the download URL.
  • Note that for canonical_iri the document resource URL is specified here as "self": "api.notubiz.nl/document/780972", without the ?format=json&version=1.10.8. Strictly speaking we cannot add this information ourselves, since it is up to the user to decide which version and format to use. Still, we want to give the user as much information as possible about how to find the actual resource we used, so canonical_iri will include at least the version query parameter, and it would be wise to include format as well. Sensitive query parameters like authentication should be left out.
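The document-related considerations above could be sketched roughly as follows. This is an illustration under assumptions, not the project's implementation: the helper names, the parameter names in SENSITIVE_PARAMS and the download URL are all hypothetical, since the source does not say which query parameter carries authentication.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names of authentication-like query parameters.
SENSITIVE_PARAMS = frozenset({"token", "api_key", "access_token"})


def strip_sensitive_params(url: str, sensitive=SENSITIVE_PARAMS) -> str:
    """Drop sensitive query parameters, keeping others such as version and format."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in sensitive
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def document_references(download_url: str, resource_url: Optional[str] = None) -> dict:
    """Pick canonical_iri and schema:isBasedOn for a downloadable document."""
    # If the supplier has a separate resource URL, that is the canonical;
    # otherwise the canonical falls back to the download URL.
    canonical = resource_url if resource_url else download_url
    return {
        "canonical_iri": strip_sensitive_params(canonical),
        # Passed through unchanged here; the enricher is the one setting it.
        "schema:isBasedOn": download_url,
    }


refs = document_references(
    download_url="https://example.org/files/780972.pdf",  # made-up download URL
    resource_url="https://api.notubiz.nl/document/780972?format=json&version=1.10.8&token=secret",
)
print(refs["canonical_iri"])
# prints "https://api.notubiz.nl/document/780972?format=json&version=1.10.8"
```

Filtering with a deny-list keeps version and format visible to the user; an allow-list of known-safe parameters would be the stricter alternative.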