Open
Labels: bug (Something isn't working), requires changes in Readstat (waiting for changes in the C library Readstat)
Description
pyreadstat throws a UTF-8 decoding error when reading this file or this file. Haven is able to read both successfully, which makes me think the problem is in pyreadstat rather than in ReadStat itself.
It appears that the header contains Windows-1252 encoded "smart quotes" that pyreadstat tries to decode as UTF-8. Passing encoding="..." to read_dta has no effect.
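As a rough illustration of the suspected mismatch, the sketch below decodes a short byte string containing 0x93 first as UTF-8 and then with a Windows-1252 fallback. The helper `decode_with_fallback` and the sample bytes are hypothetical and written only for this example; this is not how pyreadstat or ReadStat handle strings internally.

```python
def decode_with_fallback(raw: bytes, fallback: str = "cp1252") -> str:
    """Try UTF-8 first; if that fails, retry with a legacy single-byte codec.

    Hypothetical helper for illustration only, not part of pyreadstat.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


# Sample bytes modelled on the file contents: 0x93 is the Windows-1252
# left double quotation mark and is not a valid UTF-8 start byte.
sample = b"(1993), \x93Quantity Discounts"

try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("UTF-8 decoding failed:", exc)

decoded = decode_with_fallback(sample)
print(decoded)
```

Note that Python's cp1252 codec leaves a few byte values undefined (for example 0x81), so a fallback like this can still raise on arbitrary input.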
To reproduce:
# get file (-O saves it as cocaine.dta instead of writing to stdout)
curl -O http://www.principlesofeconometrics.com/stata/cocaine.dta
# fails to read
python -c '
import pyreadstat
pyreadstat.read_dta("cocaine.dta")
pyreadstat.read_dta("cocaine.dta", encoding = "WINDOWS-1252")
pyreadstat.read_dta("cocaine.dta", encoding = "CP1252")
'
# successfully reads
python -c 'import pandas as pd; pd.read_stata("cocaine.dta")'
R -e 'haven::read_dta("cocaine.dta")'
Full stack trace:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import pyreadstat; pyreadstat.read_dta("cocaine.dta")
                       ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
  File "pyreadstat/pyreadstat.pyx", line 301, in pyreadstat.pyreadstat.read_dta
  File "pyreadstat/_readstat_parser.pyx", line 1176, in pyreadstat._readstat_parser.run_conversion
  File "pyreadstat/_readstat_parser.pyx", line 796, in pyreadstat._readstat_parser.handle_note
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 88: invalid start byte
The relevant part of cocaine.dta follows. The byte 0x93 is the Windows-1252 left double quotation mark (“), which is not valid UTF-8, yet pyreadstat tries to decode it as UTF-8.
00000370 20 43 61 75 6c 6b 69 6e 73 2c 20 4a 2e 50 2e 20 | Caulkins, J.P. |
00000380 61 6e 64 20 52 2e 20 50 61 64 6d 61 6e 20 28 31 |and R. Padman (1|
00000390 39 39 33 29 2c 20 93 51 75 61 6e 74 69 74 79 20 |993), .Quantity | <- BAD LINE
000003a0 44 69 73 63 6f 75 6e 74 73 20 61 6e 64 20 51 75 |Discounts and Qu|
000003b0 61 6c 69 74 79 20 50 72 65 6d 69 61 20 66 6f 72 |ality Premia for|
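The hexdump can be checked with a few lines of Python. This sketch copies the five rows shown above into a byte string, locates the 0x93 byte, and decodes the excerpt both ways. The variable names are illustrative, and the starting offset 0x370 is taken from the first row of the dump.

```python
# Byte values copied from the hexdump rows above (offsets 0x370 to 0x3bf).
HEXDUMP_START = 0x370
hex_rows = [
    "20 43 61 75 6c 6b 69 6e 73 2c 20 4a 2e 50 2e 20",
    "61 6e 64 20 52 2e 20 50 61 64 6d 61 6e 20 28 31",
    "39 39 33 29 2c 20 93 51 75 61 6e 74 69 74 79 20",
    "44 69 73 63 6f 75 6e 74 73 20 61 6e 64 20 51 75",
    "61 6c 69 74 79 20 50 72 65 6d 69 61 20 66 6f 72",
]
raw = bytes.fromhex(" ".join(hex_rows))

# Locate the offending byte within the excerpt and within the file.
bad_index = raw.index(0x93)
file_offset = HEXDUMP_START + bad_index
print(f"0x93 found at file offset {file_offset:#x}")

# Decoding the excerpt as UTF-8 fails at that same byte.
try:
    raw.decode("utf-8")
    utf8_error_pos = None
except UnicodeDecodeError as exc:
    utf8_error_pos = exc.start

# Decoding as Windows-1252 yields readable text with a curly quote.
text = raw.decode("cp1252")
print(text)
```

The position reported here is relative to the excerpt, so it differs from the "position 88" in the traceback, which is presumably relative to the start of the string being decoded.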
Environment: installed from pip in a virtualenv on a 64-bit Mac, Python 3.13, pyreadstat 1.30.