@hannes-ucsc
Contributor

@hannes-ucsc hannes-ucsc commented Nov 5, 2025

Release notes

This is needed by LungMAP.

For system/file_descriptor.json schema:

  • Add support for compact identifier-based DRS URIs in file descriptors

Since we added the optional drs_uri field (see DCP/2 spec), the DRS specification was amended to allow for a new kind of DRS URI, one that doesn't use a DNS host name but a proprietary resolution mechanism implemented by identifiers.org.

Unfortunately, that change introduced a major defect in the DRS specification. The new compact identifier-based URIs are not URIs according to RFC 3986. Standard URIs require a numeric port number after a colon in the authority part of the URI whereas the DRS spec allows alphanumeric characters after the colon. Consequently, the uri format that we currently use in the schema rejects compact identifier-based URIs.
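
The port problem described above can be reproduced with Python's standard `urllib.parse`, whose `port` property raises `ValueError` when the text after the colon is not numeric. This is only a minimal sketch: the helper name `has_rfc3986_port` and the host `drs.example.org` are hypothetical, and the check is not the one used by the schema.

```python
from urllib.parse import urlsplit


def has_rfc3986_port(uri: str) -> bool:
    """Return True if the authority has no port or a numeric, in-range one."""
    try:
        # Accessing .port raises ValueError for a non-numeric port.
        urlsplit(uri).port
    except ValueError:
        return False
    return True


# Hypothetical hostname-based DRS URI (host and ID are made up).
hostname_based = "drs://drs.example.org/314159"
# Compact identifier-based DRS URI of the kind discussed in this PR.
compact = "drs://dg.4503:44c5fa8e-c465-4187-8565-734b3ac0a32d"

print(has_rfc3986_port(hostname_based))  # True
print(has_rfc3986_port(compact))         # False
```

A strict `uri` format check fails on the second input for the same reason: what follows the colon is read as a port and is not a number.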

This PR attempts to fix that. Please see the $comment section in the schema change for details.

This PR also requires the value of the drs_uri property to begin with drs://. This requirement has always been part of the DCP/2 spec but hadn't been included in PR #1575, which introduced the drs_uri field.

@hannes-ucsc hannes-ucsc changed the base branch from master to staging November 5, 2025 07:41
@hannes-ucsc hannes-ucsc requested review from NoopDog, amnonkhen, arschat, idazucchi and ncalvanese1 and removed request for arschat and idazucchi November 5, 2025 07:43
@hannes-ucsc hannes-ucsc changed the title Add support for compact indentifier-based DRS URIs in file descriptors Add support for compact identifier-based DRS URIs in file descriptors Nov 5, 2025
@hannes-ucsc
Contributor Author

My latest push rebases on staging, only resolving the conflicts in changelog.md and versions.json introduced by another recently merged but unrelated PR.

@JoshuaFortriede

Using jsonschemavalidator.net, I validated the following JSON data against the proposed schema:

{
  "content_type": ".gz",
  "crc32c": "c160990a",
  "describedBy": "https://schema.humancellatlas.org/system/2.1.0/file_descriptor",
  "drs_uri": "drs://dg.4503/44c5fa8e-c465-4187-8565-734b3ac0a32d1",
  "file_id": "13132ac2-0d81-404f-9f3a-bc000c4e64ad",
  "file_name": "JB_515_2_S5_L002_R2_001.fastq.gz",
  "file_version": "2025-01-07T21:16:04.497136Z",
  "schema_type": "file_descriptor",
  "schema_version": "2.1.0",
  "sha256": "e842eeed9309978b6649a49afd300871dba3771c07f294c98e2647b76a086ce6",
  "size": 6055979177
}

This data blob validates, but I contend that the drs_uri value is not a valid DRS URI.

The schema treats the drs_uri value as a hostname-based DRS URI, but I contend that dg.4503 is not a valid hostname. While the rules for TLDs have changed over the years, the current ICANN/IANA standards do not allow public TLDs to start with digits. As such, I do not believe we should allow digit-based TLDs. Another main reason is that it is fairly easy to construct a DRS URI that looks valid and validates, but is really a compact identifier written with a "/" delimiter, like the one above. Additionally, hostnames are subject to character rules that do not apply to URIs in general and that restrict the allowed characters beyond merely excluding "/" and ":".

The comment I made on this thread provides a regex-based approach to validating both types of DRS URIs. The two expressions could be used independently or combined into a single regex, as I have done in that comment.

I would recommend switching to the more complicated, but more robust, regex that I have provided. I have tested it with valid and invalid DRS URIs and have not found any failures.

@hannes-ucsc
Contributor Author

The motivation of the example above appears to be to show that this PR doesn't catch cases where the input DRS URI was intended to be a compact identifier-based URI but uses the wrong separator, i.e., a slash instead of a colon. That is true. This PR does not catch that mistake, because the URI is regarded as a valid hostname-based DRS URI, which it is. This ambiguity is a difficult-to-avoid consequence of the poor design/implementation of compact identifier-based URIs, as I described at the top of the PR.
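
To illustrate the ambiguity, here is a minimal sketch using Python's standard `urllib.parse`. The helper name `split_hostname_based` is hypothetical and this is not the schema's own check. A generic URI parser reads the example from the earlier comment as a host name followed by a path, with nothing to suggest that a compact identifier was intended.

```python
from urllib.parse import urlsplit


def split_hostname_based(uri: str) -> tuple[str, str]:
    """Split a DRS URI into (host, object ID) the way a generic URI parser sees it."""
    parts = urlsplit(uri)
    return parts.hostname or "", parts.path.lstrip("/")


# The example from the earlier comment: slash-separated, possibly meant as a compact identifier.
host, object_id = split_hostname_based(
    "drs://dg.4503/44c5fa8e-c465-4187-8565-734b3ac0a32d1"
)
print(host)       # dg.4503
print(object_id)  # 44c5fa8e-c465-4187-8565-734b3ac0a32d1
```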

While .4503 is not a valid TLD, and dg.4503 is therefore not a valid FQDN, dg and 4503 are valid DNS labels (the parts separated by a dot), and dg.4503 is therefore a valid, locally resolvable host name. I'm not aware of any part of the DRS spec or RFC 3986 that requires the host name to be fully qualified. We shouldn't assume that it is, because doing so would deny submitters the option of creating locally resolvable DRS URIs. Besides, even if we wanted to require full qualification, it would be more complicated to incorporate the rules for valid FQDNs in a regular expression. Also see https://www.rfc-editor.org/rfc/rfc2181#section-11 for a clarification of what constitutes a valid label; to my knowledge, it is the most recent RFC on that subject.
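
As a rough illustration of the label argument, the sketch below checks each dot-separated label against the conventional letters-digits-hyphen rule for host names. That rule, the 253-character limit, and the helper name `is_valid_hostname` are assumptions made for this example; they are not taken from the schema change in this PR.

```python
import re

# Assumed rule: 1-63 characters, letters/digits/hyphens, no leading or trailing hyphen.
LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_hostname(name: str) -> bool:
    """Return True if every dot-separated label satisfies the assumed rule."""
    labels = name.split(".")
    return len(name) <= 253 and all(LABEL.fullmatch(label) for label in labels)


print(is_valid_hostname("dg.4503"))       # True: all-digit labels are allowed
print(is_valid_hostname("dg.edu"))        # True
print(is_valid_hostname("-bad.example"))  # False: label starts with a hyphen
```

Under this rule, nothing distinguishes dg.4503 from a fully qualified name; requiring an FQDN would need an additional check on the last label.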

I tried the regex you proposed

drs:\/\/((?=.{1,253}$)(([a-zA-Z0-9]{1,63}|[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9])\\.)+([A-Za-z]{1,63}|[A-Za-z][A-Za-z0-9-]{1,61}[A-Za-z0-9])\/[A-Za-z0-9.-_~]+|([a-z0-9.-]+\/)?[a-z0-9.-]+:.+)

and can't get it to match even some simple cases, suggesting that it would need more work.

To summarize, this PR solves the problem LungMAP is running into today. It may not be perfect, but achieving perfection would be prohibitively complicated for our purposes. I think we should do the pragmatic thing here and merge this PR. It has the required approvals as per the DCP/2 SOP.

@JoshuaFortriede

JoshuaFortriede commented Nov 13, 2025

In my original post, I mentioned: "Note, if you check this in a regex checker, you will want to remove one of the \ in \\. so it is \." When putting the regex in JSON, you need to escape the \ in \., but in checkers like regex101.com, you must remove the extra \.

To test in regex101.com use the following:
drs:\/\/((?=.{1,253}$)(([a-zA-Z0-9]{1,63}|[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9])\.)+([A-Za-z]{1,63}|[A-Za-z][A-Za-z0-9-]{1,61}[A-Za-z0-9])\/[A-Za-z0-9.-_~]+|([a-z0-9.-]+\/)?[a-z0-9.-]+:.+)

Can you confirm that this simple case works for you if following these instructions?
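
The escaping difference described above can be illustrated with a much shorter pattern. This is only a sketch: the pattern and the `pattern` key are made up for the example and are not the regex proposed in this thread. It shows that a pattern copied verbatim from JSON source behaves differently from the same pattern after JSON decoding.

```python
import json
import re

# JSON source text: the backslash before the dot is doubled, as JSON requires.
json_source = r'{"pattern": "[a-z]+\\.[a-z]+"}'

# After decoding, the regex engine sees a single backslash: an escaped literal dot.
decoded = json.loads(json_source)["pattern"]
print(decoded)  # [a-z]+\.[a-z]+

# Pasting the JSON form into a regex tester keeps both backslashes, which means
# "a literal backslash followed by any character".
pasted_verbatim = r"[a-z]+\\.[a-z]+"

print(bool(re.fullmatch(decoded, "example.org")))          # True
print(bool(re.fullmatch(pasted_verbatim, "example.org")))  # False
```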

Regarding FQDN vs. any valid, even locally resolvable, host name: I agree that the DRS URI standard does not require the host name in a DRS URI to be an FQDN. With that said, I do believe that is a valid assumption for the purposes of this schema. Similar to how this schema allows for a null value for a DRS URI, which is used in the implementation step and has no direct connection to DRS URI standards, I would submit that requiring an FQDN is a valid, and possibly even required, assumption for specifying the location of files that should be publicly used (given appropriate access authorization).

@hannes-ucsc
Contributor Author

hannes-ucsc commented Nov 13, 2025

In my original post, I mentioned "Note, if you check this in a regex checker, you will want to remove one of the \ in .". For putting the REGEX in JSON, you need to escape the "" in ".", but in checkers like regex101.com, you must remove it.

Apologies, I missed the mention in your original post. Unfortunately, I don't understand the instructions. For example, I don't understand "you need to escape the QUOTE QUOTE in QUOTE DOT QUOTE". That makes no sense to me. You may want to adopt the use of single and triple backquotes to embed literals in GitHub comment markdown. But for simplicity and everyone's convenience, would you mind posting the proposed regex here, and in a form that can be readily used on regex101?

Regarding FQDN vs any valid, even locally resolvable host name, I agree that the DRS URI standard does not require DRS URIs to be FDQN. With that said, I do believe that is a valid assumption for the purposes of this schema.

I disagree. I'd like this to only enforce restrictions explicitly mentioned in the spec. This is more important to me than catching some mistakes that, due to the aforementioned spec insufficiencies, end up passing as valid. I can easily change your example dg.4503/44c5fa8e-c465-4187-8565-734b3ac0a32d1 to dg.edu/44c5fa8e-c465-4187-8565-734b3ac0a32d1, which could also have been an erroneously formatted compact identifier-based URI. No amount of regex validation is going to catch every mistake.

Similar to how this schema allows for a null value for a DRS URI, which is used in the implementation step and has no direct connection to DRS URI standards, I would submit that requiring FQDN is a valid, and possibly even required assumption for specifying the location for files that should be publicly used (given appropriate access authorization).

Apples and oranges. Allowing null (not "null") says nothing about the syntax of the string when a string is provided. Specifically, adding null makes the schema more permissive, which we needed to represent phantom files. You are proposing to make it more restrictive, in a potentially error-prone and limiting way.

@JoshuaFortriede

I updated my comment above to be correct. Sorry, I don't do many comments in GitHub and I didn't check the Preview bit.

@hannes-ucsc
Contributor Author

Thank you for providing that regex. I can confirm that it now works for trivial cases, but it doesn't for more complicated ones. However, I don't want this discussion to be side-tracked into a debugging session for some alternative solution. Additionally, your regex is excessively restrictive, as outlined in my prior comments. This would be a show stopper for me. As one of the designated reviewers of HCA schema PRs, I am not persuaded that your alternative is superior.

This PR has the required approvals and, unless you can show that it does not solve the problem at hand, the one that's blocking LungMAP today, I move that the PR be merged. I will leave it open a few more days, to give an opportunity for others to chime in, but on Wednesday 11/19/2025 the official two-week review period ends and the PR can and should be merged, unless any of the designated reviewers change their mind.

@JoshuaFortriede

I believe there are still differences in opinion on the completeness and accuracy of the competing methods. With that said, the method in the PR should technically work for all scenarios that LungMAP will see, so I will not hold up a PR. This PR has been reviewed by the appropriate number of people, so they believe that the solution should work as well.

While this will allow the requested DRS URIs to validate, it appears to me that the underlying issue is not resolved, and most likely should NOT be resolved in this repository. The issue is not whether a DRS URI appears valid, but whether it IS valid. As such, I suggest that the HCA Import Validation scripts be modified to resolve DRS URIs and confirm that the resource is valid. I have tried doing this with the compact identifier-based DRS URIs, but I am unsure whether I am using a correct and compliant method based on Azul.

I am using http://identifiers.org/{compact_drs_uri} to resolve the DRS URI. So, given drs://dg.4503:44c5fa8e-c465-4187-8565-734b3ac0a32d, I remove the first 6 characters and get the compact DRS URI, dg.4503:44c5fa8e-c465-4187-8565-734b3ac0a32d. Using this (http://identifiers.org/dg.4503:44c5fa8e-c465-4187-8565-734b3ac0a32d), I can resolve appropriately. However, if I use a / instead of a : in the compact DRS URI (http://identifiers.org/dg.4503/44c5fa8e-c465-4187-8565-734b3ac0a32d), it still resolves to the same object. Can you let me know if there is a different method that I should use to resolve the DRS URIs so I can validate that they point to a valid object?
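
The string manipulation described above can be sketched as follows. The function name `resolver_url` is hypothetical, the code only builds the URL without making any network request, and whether this is the correct or compliant way to resolve compact identifier-based DRS URIs is exactly the open question raised here.

```python
DRS_PREFIX = "drs://"                  # the 6 characters removed in the description above
RESOLVER = "http://identifiers.org/"   # resolver base URL as used in the comment


def resolver_url(drs_uri: str) -> str:
    """Build the identifiers.org URL for a DRS URI by stripping the drs:// prefix."""
    if not drs_uri.startswith(DRS_PREFIX):
        raise ValueError(f"not a DRS URI: {drs_uri!r}")
    return RESOLVER + drs_uri[len(DRS_PREFIX):]


print(resolver_url("drs://dg.4503:44c5fa8e-c465-4187-8565-734b3ac0a32d"))
# http://identifiers.org/dg.4503:44c5fa8e-c465-4187-8565-734b3ac0a32d
```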

@hannes-ucsc
Contributor Author

@JoshuaFortriede, yes, this PR is about whether the value of the drs_uri property is syntactically valid. Syntax validation is what schemas are for.

Any discussion about the resolution of compact identifier-based DRS URIs should be had elsewhere. For questions, I recommend moving back to DataBiosphere/azul#3631. If you'd like to improve the staging area validator, you are more than welcome to file a PR with your improvements against https://github.com/DataBiosphere/hca-import-validation.

@hannes-ucsc hannes-ucsc merged commit 84f3b4b into staging Nov 20, 2025
2 of 5 checks passed