JenaTextSparql: multi-word search returns 0 results (space escaped as Lucene special character) #1930

@fvogel

Description

Summary

Multi-word search queries return 0 results when using sparqlDialect "JenaText", even though each word individually matches correctly. This affects both the UI search and the REST API.

Reproduction

Using stock Skosmos 3.1 with Fuseki 5.4.0 (StandardAnalyzer, default config).

Test vocabulary — a simple 9-concept SKOS vocabulary with multi-word labels:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/test/> .

ex:scheme a skos:ConceptScheme ;
    skos:hasTopConcept ex:cat, ex:dog .

ex:cat a skos:Concept ;
    skos:inScheme ex:scheme ;
    skos:prefLabel "Cat"@en ;
    skos:altLabel "Domestic cat"@en .

ex:siamese a skos:Concept ;
    skos:inScheme ex:scheme ;
    skos:prefLabel "Siamese cat"@en ;
    skos:broader ex:cat .

ex:dog a skos:Concept ;
    skos:inScheme ex:scheme ;
    skos:prefLabel "Dog"@en .

ex:labrador a skos:Concept ;
    skos:inScheme ex:scheme ;
    skos:prefLabel "Labrador retriever"@en ;
    skos:broader ex:dog .

Results via REST API:

| Search query | Expected | Actual |
| --- | --- | --- |
| `Siamese*` | 1 result (Siamese cat) | 1 result ✅ |
| `Siamese cat*` | 1 result (Siamese cat) | 0 results |
| `Labrador retriever*` | 1 result | 0 results |
| `Domestic cat*` | 1 result (altLabel match) | 0 results |

# Works:
curl -s 'http://localhost:9090/rest/v1/test/search?query=Siamese*&lang=en'
# → {"results":[{"prefLabel":"Siamese cat",...}]}

# Broken:
curl -s 'http://localhost:9090/rest/v1/test/search?query=Siamese+cat*&lang=en'
# → {"results":[]}

Root Cause

In src/model/sparql/JenaTextSparql.php, the LUCENE_ESCAPE_CHARS constant includes a space character at position 0:

public const LUCENE_ESCAPE_CHARS = ' +-&|!(){}[]^"~?:\\/';
//                                  ^ space here

The createTextQueryCondition() method escapes every character in that list:

foreach (str_split(self::LUCENE_ESCAPE_CHARS) as $char) {
    $lucenemap[$char] = '\\' . $char;
}
$term = strtr($term, $lucenemap);

This transforms "Siamese cat*" into "Siamese\ cat*", telling Lucene to treat the space as a literal character rather than a word separator.

With StandardAnalyzer (the default for Jena Text indexes), labels are tokenized into individual words. No indexed token ever contains a literal space, so the escaped query "Siamese\ cat*" never matches anything.
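The faulty transformation is easy to reproduce outside Skosmos. The following is a standalone sketch mirroring the constant and the `strtr()` approach described above (it is not the actual `createTextQueryCondition()` code):

```php
<?php
// Standalone sketch of the buggy escaping described above
// (mirrors the strtr() approach; not the actual Skosmos method).
const LUCENE_ESCAPE_CHARS = ' +-&|!(){}[]^"~?:\\/';

function escapeBuggy(string $term): string
{
    $lucenemap = [];
    foreach (str_split(LUCENE_ESCAPE_CHARS) as $char) {
        $lucenemap[$char] = '\\' . $char;
    }
    return strtr($term, $lucenemap);
}

// The space is escaped into a literal character...
echo escapeBuggy('Siamese cat*'), PHP_EOL; // Siamese\ cat*

// ...but word-level tokenization (as StandardAnalyzer performs) never
// produces an index token containing a space, so the escaped query
// cannot match any token.
$tokens = preg_split('/\s+/', strtolower('Siamese cat'));
echo implode(', ', $tokens), PHP_EOL;      // siamese, cat
```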

Why the space was included

Space is listed in the Lucene Classic Query Parser documentation as a special character. However, escaping it is incorrect when using word-level analyzers like StandardAnalyzer. The space should act as a term separator, not be escaped into a literal character.

Proposed Fix

  1. Remove space from LUCENE_ESCAPE_CHARS
  2. Split multi-word queries into individual required Lucene terms using the + (required) operator

public const LUCENE_ESCAPE_CHARS = '+-&|!(){}[]^"~?:\\/';
//                                  (no leading space)

Transform multi-word queries by splitting on whitespace and prefixing each word with +:

"Siamese cat*"        → "+Siamese +cat*"
"Labrador retriever*" → "+Labrador +retriever*"

This ensures each word must match independently, which works correctly with StandardAnalyzer's word-level tokenization. Any wildcard suffix the user typed (typically on the last word) is preserved on its term.
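A minimal sketch of the proposed transformation follows. The function names are illustrative, not the actual patch:

```php
<?php
// Sketch of the proposed fix (illustrative names, not the actual patch):
// escape Lucene specials *except* space, then require every word with '+'.
const LUCENE_ESCAPE_CHARS = '+-&|!(){}[]^"~?:\\/'; // no leading space

function escapeTerm(string $term): string
{
    $lucenemap = [];
    foreach (str_split(LUCENE_ESCAPE_CHARS) as $char) {
        $lucenemap[$char] = '\\' . $char;
    }
    return strtr($term, $lucenemap);
}

function buildLuceneQuery(string $query): string
{
    $words = preg_split('/\s+/', trim($query));
    if (count($words) === 1) {
        return escapeTerm($words[0]); // single word: behavior unchanged
    }
    // Multi-word: every term is required; wildcards (*) pass through
    // untouched because '*' is not in the escape list.
    return implode(' ', array_map(
        fn (string $w): string => '+' . escapeTerm($w),
        $words
    ));
}

echo buildLuceneQuery('Siamese cat*'), PHP_EOL;        // +Siamese +cat*
echo buildLuceneQuery('Labrador retriever*'), PHP_EOL; // +Labrador +retriever*
```

Single-word queries keep their current behavior, so existing searches like `Siamese*` are unaffected.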

Environment

  • Skosmos 3.1 (also present in v2.18-maintenance)
  • Apache Jena Fuseki 5.4.0
  • Jena Text with Lucene (StandardAnalyzer, default config)
  • sparqlDialect "JenaText", searchByNotation true

This bug has existed since at least 2016 (the space has been in LUCENE_ESCAPE_CHARS since the early versions of JenaTextSparql.php).
