Robuster parsing and allow reading from a file-object #2
jampekka wants to merge 2 commits into statfi:master
Split by ';' honoring quotes instead of relying on ';\n' pattern
|
Ah, thank you very much. Could you perhaps attach a test file? Preferably a multilingual one, since that's what we have. I'd like to see how this implementation differs from my old one. Also, it has been some time since I had to look at the metadata parts... |
|
Seems to work with e.g. http://pxweb2.stat.fi/database/StatFin/asu/asas/010_asas_tau_101.px, using e.g.:

    from px_reader import Px
    from urllib2 import urlopen
    px = Px(urlopen("http://pxweb2.stat.fi/database/StatFin/asu/asas/010_asas_tau_101.px"))
    print(getattr(px, 'title[en]'))

If I haven't gravely misunderstood something, this patch shouldn't really change anything in the logic. It just splits the input on the ";" character instead of ";\n". I suspect that the semicolon is the proper delimiter of .px files, although I haven't seen any specs. Also, using ";\n" depends on universal newlines for DOS files, which may be difficult with some file-objects (e.g. ones from urlopen). |
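To illustrate the idea of splitting on ";" while honoring quotes, here is a minimal sketch. It is not the code from this patch: the function name `split_statements` and the sample text are made up for illustration, and it assumes that only double quotes delimit quoted sections.

```python
def split_statements(text):
    """Split text on ';' characters that are outside double quotes."""
    statements = []
    current = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # Toggle quote state; the quote itself stays in the statement.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == ';' and not in_quotes:
            # An unquoted semicolon ends the statement.
            statement = ''.join(current).strip()
            if statement:
                statements.append(statement)
            current = []
        else:
            current.append(ch)
    # Keep any trailing text that was not terminated by ';'.
    tail = ''.join(current).strip()
    if tail:
        statements.append(tail)
    return statements


# Hypothetical sample: DOS line endings, a quoted ';', and a ';' with no newline.
sample = 'TITLE="Dwellings; by type";\r\nLANGUAGE="fi";UNITS="number";\r\n'
for statement in split_statements(sample):
    print(statement)
```

Because surrounding whitespace is stripped from each statement, the result should not depend on whether the file uses "\n" or "\r\n" line endings.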
|
I was thinking more along the lines of what can break the old version. Though I'm not entirely sure anymore why I went with the line break. According to the PX specs one should not use line breaks for syntax... The closest I have seen to a spec is this: Hopefully I get a chance to run this through StatFin data at the end of the week. I'm in the middle of moving offices... |
|
A file with a ";" that is not followed by a newline, or with ";\n" inside a quoted section, would break the old implementation, although I haven't seen such files. What made me do the changes was reading the files with urlopen, for which ensuring universal newlines is a bit of a chore, and the ';\n' implementation breaks on DOS files, as the delimiter shows up as ';\r\n' without universal newlines support. |
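The failure cases described above can be reproduced with plain string splitting. This is only a sketch with made-up sample strings; it mimics the ';\n' approach with `str.split` and is not the old implementation itself.

```python
# Hypothetical metadata snippets, not taken from a real .px file.
dos_text = 'CHARSET="ANSI";\r\nLANGUAGE="fi";\r\n'
unix_text = dos_text.replace('\r\n', '\n')
quoted_text = 'NOTE="first;\nsecond";\n'

# Splitting on ';\n' works when line endings are already normalized.
naive_unix = [part for part in unix_text.split(';\n') if part]

# With DOS line endings the delimiter is ';\r\n', so nothing is split.
naive_dos = [part for part in dos_text.split(';\n') if part]

# A ';\n' inside quotes cuts the quoted value in two.
naive_quoted = [part for part in quoted_text.split(';\n') if part]

print(len(naive_unix))
print(len(naive_dos))
print(naive_quoted)
```

The first split yields two statements, the second leaves the whole text as a single piece, and the third breaks the quoted NOTE value apart.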
Works in quick tests. May be marginally slower than the previous implementation, but the metadata shouldn't be too heavy for this.