px_reader can't handle semicolon weirdness in NOTE-field #3

@jampekka

File parsing fails with http://www.aluesarjat.fi/database/aluesarjat_kaupunkiverkko/vaesto_sal/vaestoennusteet_sal/b01esps_vaestoennuste.px

The problem is that the NOTE field contains several quoted strings, each terminated by its own semicolon, in this format:

NOTE="Some stuff";
"Some stuff more";
"and even more stuff";

From this the reader tries to create a new attribute named 'Some stuff more', but fails because of non-ASCII characters. Fixing this would require rewriting the whole metadata tokenizing code.

It seems that the separate "statements" are meant to split the note into paragraphs. I don't know whether the "spec" (if there is one) allows this ugliness, but at least PX-Web seems to accept it: http://www.aluesarjat.fi/graph/Footnote.aspx?File=B01ESPS_Vaestoennuste&path=..%2fDATABASE%2fALUESARJAT_KAUPUNKIVERKKO%2fVAESTO_SAL%2fVAESTOENNUSTEET_SAL%2f&ti=Espoon+v%C3%A4est%C3%B6+1.1.1999-2013+ja+v%C3%A4est%C3%B6ennuste+1.1.2014+-+2023&case=db&ssid=1403051945183&Gedit=false

Parsing fails with both ";-tokenizer" implementations.
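One possible direction for a fix, as a rough sketch only: split the metadata on semicolons that fall outside double quotes, and treat any statement that starts with a quote (so has no KEYWORD= part) as a continuation of the previous keyword. The function names split_statements and parse_metadata and the sample text are made up for illustration and are not px_reader's actual code; the sketch assumes Python 3, only covers the metadata part (not the DATA section), and keeps values as raw strings.

```python
def split_statements(text):
    """Split text on ';' characters that are outside double quotes."""
    statements, buf, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == ';' and not in_quotes:
            stmt = ''.join(buf).strip()
            if stmt:
                statements.append(stmt)
            buf = []
        else:
            buf.append(ch)
    tail = ''.join(buf).strip()
    if tail:
        statements.append(tail)
    return statements


def parse_metadata(text):
    """Map each keyword to a list of raw value strings.

    A statement that begins with a double quote has no KEYWORD= part,
    so it is appended to the value list of the previous keyword.
    """
    meta = {}
    last_key = None
    for stmt in split_statements(text):
        if stmt.startswith('"'):
            if last_key is None:
                raise ValueError('continuation without a keyword: %r' % stmt)
            meta[last_key].append(stmt)
        else:
            key, sep, value = stmt.partition('=')
            if not sep:
                raise ValueError('statement without "=": %r' % stmt)
            last_key = key.strip()
            meta[last_key] = [value.strip()]
    return meta


# Hypothetical sample mimicking the NOTE layout described above.
sample = (
    'CHARSET="ANSI";\n'
    'NOTE="Some stuff";\n'
    '"Some stuff more, väestö";\n'
    '"and even more; stuff";\n'
    'UNITS="persons";\n'
)
print(parse_metadata(sample)["NOTE"])
```

With this approach the three NOTE strings end up as three entries under the single NOTE keyword, which a caller could then join as paragraphs. Whether that matches the intended semantics of the format is an assumption.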
