Problem with some Unicode chars #4

@vityok

It looks like Wilbur has a problem with certain Unicode chars in certain circumstances.

Code to reproduce:

  1. Download the RDF/XML data from DBpedia:
wget http://dbpedia.org/data/Semantic_Web.rdf
  2. Parse it with the external format explicitly specified:
(defvar stream (open #P"Semantic_Web.rdf"
             :direction :input
             :external-format :utf-8))
(setf wilbur:*db*
      (wilbur:parse-db-from-stream stream "http://dbpedia.org/page/Semantic_Web"))

This produces an error on both CCL and SBCL:

> Error: Cannot decode this: (#\U+30BB #\U+30DE #\U+30F3 #\U+30C6 #\U+30A3 #\U+30C3 #\U+30AF #\U+30FB #\U+30A6 #\U+30A7 #\U+30D6)
> While executing: (:INTERNAL WILBUR::COLLAPSE WILBUR:COLLAPSE-WHITESPACE), in process listener(1).
debugger invoked on a SIMPLE-ERROR in thread
#<THREAD "main thread" RUNNING {AB2F861}>:
  Cannot decode this: (#\HANGUL_SYLLABLE_U #\HANGUL_SYLLABLE_KEU
                       #\HANGUL_SYLLABLE_RA #\HANGUL_SYLLABLE_I
                       #\HANGUL_SYLLABLE_NA)
(WILBUR:COLLAPSE-WHITESPACE "우크라이나")

But everything works fine if the external format is not specified:

(defvar stream (open #P"Semantic_Web.rdf"
             :direction :input))
(setf wilbur:*db*
      (wilbur:parse-db-from-stream stream "http://dbpedia.org/page/Semantic_Web"))

Produces:

#<TEMPORARY-PARSER-DB size 157 #x1862A5C6>

The resulting database can then be queried successfully.
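
The two cases differ only in the :external-format argument, so they can be compared side by side. Below is a minimal sketch, not a tested fix: `try-parse` is a hypothetical helper (not part of Wilbur), and it assumes `wilbur:parse-db-from-stream` takes a stream and a base URI and returns the database, as in the calls above. It also uses `with-open-file` so the stream is closed after parsing.

```lisp
;; Hypothetical helper, not part of Wilbur: parse PATH with the given
;; external format and report any error instead of entering the debugger.
(defun try-parse (path base-uri &optional (external-format :default))
  (handler-case
      (with-open-file (stream path
                              :direction :input
                              :external-format external-format)
        ;; Assumed to return the parsed database, as in the report above.
        (wilbur:parse-db-from-stream stream base-uri))
    (error (condition)
      (format t "~&Parse failed with ~S: ~A~%" external-format condition)
      nil)))

;; Explicit UTF-8 (the failing case in this report):
(try-parse #P"Semantic_Web.rdf" "http://dbpedia.org/page/Semantic_Web" :utf-8)

;; Implementation default (the working case in this report):
(try-parse #P"Semantic_Web.rdf" "http://dbpedia.org/page/Semantic_Web")
```

Note that the default external format depends on the implementation and its configuration, so the second call may not decode the file the same way on CCL and SBCL.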

The problem is even more evident when using flexi-streams.
