JenaTextSparql: multi-word search returns 0 results (space escaped as Lucene special character) #1930
Description
Summary
Multi-word search queries return 0 results when using sparqlDialect "JenaText", even though each word individually matches correctly. This affects both the UI search and the REST API.
Reproduction
Using stock Skosmos 3.1 with Fuseki 5.4.0 (StandardAnalyzer, default config).
Test vocabulary — a simple 9-concept SKOS vocabulary with multi-word labels:
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex: <http://example.org/test/> .
ex:scheme a skos:ConceptScheme ;
skos:hasTopConcept ex:cat, ex:dog .
ex:cat a skos:Concept ;
skos:inScheme ex:scheme ;
skos:prefLabel "Cat"@en ;
skos:altLabel "Domestic cat"@en .
ex:siamese a skos:Concept ;
skos:inScheme ex:scheme ;
skos:prefLabel "Siamese cat"@en ;
skos:broader ex:cat .
ex:dog a skos:Concept ;
skos:inScheme ex:scheme ;
skos:prefLabel "Dog"@en .
ex:labrador a skos:Concept ;
skos:inScheme ex:scheme ;
skos:prefLabel "Labrador retriever"@en ;
skos:broader ex:dog .
Results via REST API:
| Search query | Expected | Actual |
|---|---|---|
| Siamese* | 1 result (Siamese cat) | 1 result ✅ |
| Siamese cat* | 1 result (Siamese cat) | 0 results ❌ |
| Labrador retriever* | 1 result | 0 results ❌ |
| Domestic cat* | 1 result (altLabel match) | 0 results ❌ |
# Works:
curl -s 'http://localhost:9090/rest/v1/test/search?query=Siamese*&lang=en'
# → {"results":[{"prefLabel":"Siamese cat",...}]}
# Broken:
curl -s 'http://localhost:9090/rest/v1/test/search?query=Siamese+cat*&lang=en'
# → {"results":[]}
Root Cause
In src/model/sparql/JenaTextSparql.php, the LUCENE_ESCAPE_CHARS constant includes a space character at position 0:
public const LUCENE_ESCAPE_CHARS = ' +-&|!(){}[]^"~?:\\/';
//                                  ^ space here
The createTextQueryCondition() method escapes every character in that list:
$lucenemap[$char] = '\\' . $char;
$term = strtr($term, $lucenemap);
This transforms "Siamese cat*" into "Siamese\ cat*", telling Lucene to treat the space as a literal character rather than a word separator.
With StandardAnalyzer (the default for Jena Text indexes), labels are tokenized into individual words. No indexed token ever contains a literal space, so the escaped query "Siamese\ cat*" never matches anything.
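The escaping step can be reproduced in isolation. The following is a minimal sketch, not the actual Skosmos method: the helper name escapeLuceneChars is hypothetical, and the loop over the constant is assumed from the two lines quoted above.

```php
<?php
// Escape list as quoted above, including the leading space.
const LUCENE_ESCAPE_CHARS = ' +-&|!(){}[]^"~?:\\/';

// Hypothetical helper (not part of Skosmos): backslash-escape every
// character of $term that appears in $chars.
function escapeLuceneChars(string $term, string $chars): string
{
    $lucenemap = [];
    foreach (str_split($chars) as $char) {
        $lucenemap[$char] = '\\' . $char;
    }
    return strtr($term, $lucenemap);
}

// With the space in the list, the word separator becomes a literal character.
echo escapeLuceneChars('Siamese cat*', LUCENE_ESCAPE_CHARS), "\n";        // Siamese\ cat*

// With the leading space stripped from the list, the query is left intact.
echo escapeLuceneChars('Siamese cat*', ltrim(LUCENE_ESCAPE_CHARS)), "\n"; // Siamese cat*
```

Single-word queries such as "Siamese*" contain no space, so they pass through unchanged in both cases, which is consistent with the working row in the table above.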
Why the space was included
The space was presumably added because whitespace is significant in the Lucene classic query syntax, where it separates terms. However, escaping it is incorrect when using word-level analyzers like StandardAnalyzer: the space should act as a term separator, not be escaped into a literal character.
Proposed Fix
- Remove space from LUCENE_ESCAPE_CHARS
- Split multi-word queries into individual required Lucene terms using the + (required) operator
public const LUCENE_ESCAPE_CHARS = '+-&|!(){}[]^"~?:\\/';
// (no leading space)
Transform multi-word queries by splitting on whitespace and prefixing each word with +:
"Siamese cat*" → "+Siamese +cat*"
"Labrador retriever*" → "+Labrador +retriever*"
This ensures each word must match independently, which works correctly with StandardAnalyzer's word-level tokenization. The wildcard suffix is preserved on the last (or any) term.
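One possible shape for that transformation is sketched below. It is an illustration under assumptions, not a patch against JenaTextSparql.php: the function name buildRequiredTermsQuery and the constant name LUCENE_ESCAPE_CHARS_NO_SPACE are hypothetical, and leaving single-word queries without a + prefix is a choice made here to keep their current behaviour.

```php
<?php
// Escape list without the leading space, as proposed above.
const LUCENE_ESCAPE_CHARS_NO_SPACE = '+-&|!(){}[]^"~?:\\/';

// Hypothetical helper (not part of Skosmos): escape each word, then mark
// every word of a multi-word query as required with a leading "+".
function buildRequiredTermsQuery(string $query): string
{
    $lucenemap = [];
    foreach (str_split(LUCENE_ESCAPE_CHARS_NO_SPACE) as $char) {
        $lucenemap[$char] = '\\' . $char;
    }

    // Split on runs of whitespace, dropping empty pieces.
    $words = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) < 2) {
        // Single-word (or empty) query: escape only, no operator added.
        return strtr(trim($query), $lucenemap);
    }

    $terms = [];
    foreach ($words as $word) {
        // Escape first, so the "+" operator added here is not itself escaped.
        $terms[] = '+' . strtr($word, $lucenemap);
    }
    return implode(' ', $terms);
}

echo buildRequiredTermsQuery('Siamese cat*'), "\n";        // +Siamese +cat*
echo buildRequiredTermsQuery('Labrador retriever*'), "\n"; // +Labrador +retriever*
echo buildRequiredTermsQuery('Siamese*'), "\n";            // Siamese*
```

Escaping each word before adding the prefix matters: a literal + typed by the user is still escaped, while the + operator inserted by the helper reaches Lucene unescaped.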
Environment
- Skosmos 3.1 (also present in v2.18-maintenance)
- Apache Jena Fuseki 5.4.0
- Jena Text with Lucene (StandardAnalyzer, default config)
- sparqlDialect "JenaText", searchByNotation true
This bug has existed since at least 2016 (the space has been in LUCENE_ESCAPE_CHARS since the early versions of JenaTextSparql.php).