Add syntax tests for codepoint escaping. #151

Open: wants to merge 1 commit into main

kasei
Contributor

@kasei kasei commented Oct 28, 2024

Adds new tests for some interesting cases of Unicode codepoint escaping, addressing w3c/sparql-query#164. Two tests (codepoint-esc-01 and codepoint-esc-10) are marked in the manifest with TODO markers because they depend on decisions about how systems should handle invalid escape sequences. I believe the others accurately test the existing spec text of SPARQL 1.1.

I think many of these cases should also be turned into evaluation tests, to ensure the unescaping is being performed correctly, but I'll leave that for another PR (or a subsequent update to this PR).

@kasei kasei requested a review from afs October 28, 2024 18:07
@gkellogg
Member

If you create the branch for the PR in the rdf-tests repo, the automatic report generation should work properly. It's conceivable that there is a different package that allows pushing the changes to a remote repo, or some filter that would prevent running that action if the repo is not local.

@afs
Contributor

afs commented Oct 30, 2024

Turtle handles Unicode escape sequences differently - it has UCHAR in the grammar and it can occur only in strings and URIs. Personally, I think this is a better design - a more common pattern, and it makes it clear what happens when the codepoint itself is meaningful near an escape sequence. I believe this should be "good practice".

The fact that obfuscated queries can be written in SPARQL is not good.
\u0041\u0053\u004B\u0020\u007B\u007D (codepoint-esc-09.rq) (that's ASK {})

And it is bad for streaming (SPARQL Update more than SPARQL Query).
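To make the obfuscation concrete, here is a minimal Python sketch of codepoint-escape replacement applied to the whole query string before any grammar parsing, which is how I read 19.2. The function name `decode_codepoint_escapes` is hypothetical, and the sketch makes no attempt to settle how invalid escape sequences should be handled.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_CODEPOINT_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_codepoint_escapes(query):
    """Replace every codepoint escape in the raw query text in a single pass.

    Hypothetical helper for illustration only. Invalid or incomplete escapes
    are left untouched, and a \\U value above 0x10FFFF raises ValueError.
    """
    def _replace(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _CODEPOINT_ESCAPE.sub(_replace, query)


obfuscated = r"\u0041\u0053\u004B\u0020\u007B\u007D"
print(decode_codepoint_escapes(obfuscated))  # prints: ASK {}
```

Because the replacement runs before tokenizing, keywords and punctuation can be spelled entirely with escapes. A Turtle-style design would instead decode only inside strings and IRIs, so the text above would never become a query.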

The text of 19.2 Codepoint Escape Sequences isn't precise about how replacement happens. These are errata that need to be addressed in the spec.
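As one illustration of why the replacement procedure needs to be pinned down (this particular case is my own assumption, not necessarily one of the cases meant above): an escape that decodes to a backslash can produce a new escape sequence, so a single pass and a repeat-until-stable pass give different results. The function names below are hypothetical.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_once(text):
    """Single left-to-right pass; replaced text is not rescanned."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


def decode_until_stable(text):
    """Repeat the pass until nothing changes (a different, hypothetical reading)."""
    while True:
        decoded = decode_once(text)
        if decoded == text:
            return decoded
        text = decoded


nested = r"\u005Cu0041"  # \u005C is a backslash
print(decode_once(nested))          # prints: \u0041
print(decode_until_stable(nested))  # prints: A
```

Whichever behaviour is intended, the spec text and the tests should say so explicitly.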

We could split tests into two: "what we want", that is good practice (to be agreed), and "full spec".

Surveying existing systems:

  • Codemirror/YASGUI for SPARQL does not seem to support this.

@gkellogg gkellogg added the SPARQL label Nov 7, 2024
@kasei kasei force-pushed the sparql-syntax-codepoint-escaping branch from f8013f6 to b55db9e Compare April 6, 2025 21:49
@kasei
Contributor Author

kasei commented Apr 6, 2025

I've updated the PR so that it only contains tests that I think align with agreed-upon spec text.

Despite what I think is agreement that we shouldn't support things like the obfuscated \u0041\u0053\u004B\u0020\u007B\u007D (renumbered in the latest update to :codepoint-esc-01), I think the spec is clear that this is currently supported and unambiguous.

I'll file a new issue with the two removed tests that probably need WG agreement before it is clear what their results should be. I had forgotten that this PR came after w3c/sparql-query#164, which seems like a better place to follow up.

@prefix dawgt: <http://www.w3.org/2001/sw/DataAccess/tests/test-dawg#> .

:manifest rdf:type mf:Manifest ;
    rdfs:label "SPARQL Codepoint-Escaping Tests" ;
Contributor

We should indicate that these are related to the character stream processing. Calling them "escaping" is too general.

Suggestion:

  1. Rename syntax-escaping/ as syntax-char-stream-processing/
  2. rdfs:label "Character Stream Processing Tests" ;

Contributor Author

@kasei kasei Jul 3, 2025

@afs I can rename as requested, but unsure about the exact naming (and rdfs:label) here. "Codepoint Escape" is a term used in the spec, but I don't think we use anything similar to "Character Stream Processing". I'm obviously biased as the author here, but I think without context I'd be confused by what "Character Stream Processing Tests" were, but would have a pretty good idea about "Codepoint-Escaping Tests". Thoughts?

3 participants