-
Notifications
You must be signed in to change notification settings - Fork 32
Add support for compact identifier-based DRS URIs in file descriptors #1632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
699aefc to
ba72595
Compare
|
My latest push rebases on staging, only resolving the conflicts in changelog.md and versions.json introduced by another recently merged but unrelated PR. |
|
Using jsonschemavalidator.net, I used this proposed schema to test the following JSON data: { This data blob validates, but I contend that the drs_uri is not a valid drs_uri. The schema is identifying the drs_uri value as a hostname based DRS URI, but I contend that dg.4503 is not a valid hostname. While the standards on TLD have changed throughout the years, the current ICANN/IANA standards do not allow public TLDs to start with digits. As such, I do not believe we should allow digit based TLDs. Another main reason is that it would be fairly easy to construct a DRS URI that looks valid and would validate, but is using a "/" delimited compact identifier, like above. Additionally, there are other character rules in place for hostnames that are not imposed for URIs that limit the allowed characters more than not being "/" or ":". The comment I made on this thread provides a regex based approach to validate both types of DRS URIs. These could be used independently or combined into the single REGEX that I have done in the post. I would recommend switching to the more complicated, but more robust REGEX that I have provided. I have tested it with valid and invalid DRS URIs and I have not found any failures. |
|
The motivation of the example above appears to be to show that this PR doesn't catch cases where the input DRS URI was intended to be a compact-identifier based URI, but uses the wrong separator, i.e., slash instead of colon. That is true. This PR does not catch that mistake, because the URI is regarded as a valid hostname-based DRS URI, which it is. This ambiguity is a difficult to avoid consequence of the poor design/implementation of compact identifier-based URIs, as I described at the top of the PR. While I tried the regex you proposed
and can't get it to match even some simple cases, suggesting that it would need more work.
To summarize, this PR solves the problem LungMAP is running into today. It may not be perfect, but achieving perfection would be prohibitively complicated for our purposes. I think we should do the pragmatic thing here and merge this PR. It has the required approvals as per the DCP/2 SOP. |
Apologies, I missed the mention in your original post. Unfortunately, I don't understand the instructions. For example, I don't understand "you need to escape the QUOTE QUOTE in QUOTE DOT QUOTE". That makes no sense to me. You may want to adopt the use of single and triple backquotes to embed literals in GitHub comment markdown. But for simplicity and everyone's convenience, would you mind posting the proposed regex here, and in a form that can be readily used on regex101?
I disagree. I'd like this to only enforce restrictions explicitly mentioned in the spec. This is more important to me than to catch some mistakes that, due to the aforementioned spec insufficiencies, end up passing as valid. I can easily change your example
Apples and oranges. Allowing |
|
I updated my comment above to be correct. Sorry, I don't do many comments in GitHub and I didn't check the Preview bit. |
|
Thank you for providing that regex. I can confirm that it now works for trivial cases, but it doesn't for more complicated ones. However, I don't want this discussion to be side-tracked into a debugging session for some alternative solution. Additionally, your regex is excessively restrictive, as outlined in my prior comments. This would be a show stopper for me. As one of the designated reviewers of HCA schema PRs, I am not persuaded that your alternative is superior. This PR has the required approvals and, unless you can show that it does not solve the problem at hand, the one that's blocking LungMAP today, I move that the PR be merged. I will leave it open a few more days, to give an opportunity for others to chime in, but on Wednesday 11/19/2025 the official two-week review period ends and the PR can and should be merged, unless any of the designated reviewers change their mind. |
|
I believe there are still differences in opinion on the completeness and accuracy of the competing methods. With that said, the method in the PR should technically work for all scenarios that LungMAP will see, so I will not hold up a PR. This PR has been reviewed by the appropriate number of people, so they believe that the solution should work as well. While this will allow the requested DRS URIs to now validate, it appears to me that the truly underlying issue is not resolved, and most likely should NOT be resolved in this repository. The issue is not whether a DRS URI appears valid, but IS valid. As such, I suggest that the HCA Import Validation scripts be modified to resolve DRS URIs and confirm that the resource is valid. I have tried doing this with the compact identifier-based DRS URIs, but I am unsure if I am using a correct and compliant method based on Azul. I am using http://identifiers.org/{compact_drs_uri} to resolve the DRS URI. So, given |
|
@JoshuaFortriede, yes this PR is about whether the value of the Any discussion about the resolution of compact identifier-based DRS URI should be had elsewhere. For questions, I recommend moving back to DataBiosphere/azul#3631. If you'd like to improve the staging area validator, you are more than welcome to file a PR with your improvements against https://github.com/DataBiosphere/hca-import-validation. |


Release notes
This is needed by LungMAP.
For
system/file_descriptor.jsonschema:Since we added the optional
drs_urifield (see DCP/2 spec) the DRS specification was amended to a allow for a new kind of DRS URI, one that doesn't use a DNS host name, but a proprietary resolution mechanism implemented by identifiers.org.Unfortunately, that change introduced a major defect in the DRS specification. The new compact identifier-based URIs are not URIs according to RFC 3986. Standard URIs require a numeric port number after a colon in the
authoritypart of the URI whereas the DRS spec allows alphanumeric characters after the colon. Consequently, theuriformat that we currently use in the schema rejects compact identifier-based URIs.This PR attempts to fix that. Please see the
$commentsection in the schema change for details.This PR also requires the value of the
drs_uriproperty to begin withdrs://. This requirement has always been part of the DCP/2 spec but hadn't been included in PR #1575 which introduced thedrs_urifield