Robuster parsing and allow reading from a file-object #2

Open
jampekka wants to merge 2 commits into statfi:master from jampekka:master
@jampekka

Works in quick tests. May be marginally slower than the previous implementation, but the metadata shouldn't be too heavy for this.

Split by ';' honoring quotes instead of relying on ';\n' pattern
@statfi
Owner

statfi commented Mar 2, 2014

Ah, thank you very much.

Could you perhaps attach a test file? Preferably a multilingual one, since that's what we have. I'd like to see how things differ between my old implementation and this one. Also, it has been some time since I had to look at the metadata parts...

@jampekka
Author

jampekka commented Mar 3, 2014

Seems to work with e.g. http://pxweb2.stat.fi/database/StatFin/asu/asas/010_asas_tau_101.px, using:

from px_reader import Px
from urllib2 import urlopen
px = Px(urlopen("http://pxweb2.stat.fi/database/StatFin/asu/asas/010_asas_tau_101.px"))
print(getattr(px, 'title[en]'))

Unless I have gravely misunderstood something, this patch shouldn't change the parsing logic. It just splits the statements on the ";" character instead of on ";\n". I suspect that the semicolon is the proper delimiter in .px files, although I haven't seen any specs. Also, splitting on ";\n" depends on universal-newline handling for DOS files, which can be difficult to get with some file objects (e.g. the ones returned by urlopen).
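To illustrate the idea of splitting on ";" while honoring quotes, here is a minimal sketch. It is not the code from this patch: the function name `split_statements` and the sample text are hypothetical, and it assumes values are quoted with plain double quotes with no escaping.

```python
def split_statements(text):
    """Split text on ';' characters that fall outside double-quoted strings.

    Hypothetical helper for illustration; not taken from px_reader.
    """
    statements = []
    current = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # Toggle quote state; the quote itself stays part of the statement.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == ';' and not in_quotes:
            # An unquoted semicolon ends the current statement.
            statement = ''.join(current).strip()
            if statement:
                statements.append(statement)
            current = []
        else:
            current.append(ch)
    # Keep any trailing text that was not terminated by a semicolon.
    tail = ''.join(current).strip()
    if tail:
        statements.append(tail)
    return statements


# Made-up sample: a quoted ';', a DOS line ending, and two statements on one line.
sample = 'TITLE[en]="Dwellings; by type";\r\nUNITS="number";DECIMALS=0;\n'
for statement in split_statements(sample):
    print(statement)
```

Because surrounding whitespace is stripped from each statement, the result does not depend on whether the line endings are "\n" or "\r\n".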

@statfi
Owner

statfi commented Mar 3, 2014

I was thinking more along the lines of what can break the old version. Though I'm not entirely sure anymore why I went with the line break. According to the PX specs, one should not use line breaks for syntax...

The closest thing to a spec I have seen is this:
http://www.stat.fi/tup/pcaxis/px-file_format_2008_1_2008-02-04_fi.pdf
Though that is geared more towards Statistics Finland's production systems, and it is in Finnish.

Hopefully I get a chance to run this through StatFin data at the end of the week. I'm in the middle of moving offices...

@jampekka
Author

jampekka commented Mar 3, 2014

A file with ";" not followed by a newline, or with ";\n" inside a quoted section, would break the old implementation, although I haven't seen such files. What prompted the changes was reading files with urlopen, for which ensuring universal newlines is a bit of a chore: the ';\n' implementation breaks on DOS files, because without universal-newline support the delimiters show up as ';\r\n'.
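The DOS line-ending problem can be reproduced in a few lines. This is a sketch under stated assumptions: the sample bytes are made up, `io.BytesIO` stands in for a binary file object such as one from urlopen, and the "latin-1" encoding is only a placeholder.

```python
import io

# Hypothetical DOS-style (CRLF) metadata, standing in for bytes read from a URL.
dos_bytes = b'CHARSET="ANSI";\r\nTITLE="Example";\r\n'

# Without newline translation, ';\n' never matches: every ';' is followed by '\r'.
raw_text = dos_bytes.decode("latin-1")
naive = raw_text.split(";\n")

# newline=None turns on universal-newline translation, so '\r\n' is read as '\n'.
wrapped = io.TextIOWrapper(io.BytesIO(dos_bytes), encoding="latin-1", newline=None)
translated = wrapped.read().split(";\n")

print(len(naive))       # number of pieces without newline translation
print(len(translated))  # number of pieces with newline translation
```

Whether such a wrapper can be applied directly to a given urlopen response depends on the Python version and the object returned, which is part of why splitting on ";" alone is the simpler route.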
