Skip to content

Always read python source using UTF-8 #59

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 4, 2021

Conversation

MrMino
Copy link
Collaborator

@MrMino MrMino commented Mar 18, 2021

By default, Python 3 reads source in UTF-8, and so should pip-check-reqs.

This is an issue on one of my systems, where sys.getfilesystemencoding() returns "ascii". While I'm not sure why it does (it's a debian-slim docker image) - reading Python source in an encoding other than UTF-8 doesn't seem to make much sense. It's a very niche usecase, and if someone marks the source with non-standard encoding, it wouldn't work anyway. Let's just hardcode it.

@adamtheturtle
Copy link
Owner

This is not my area of expertise and I'm wary of breaking existing setups.

Could we have a test which would fail if this change did not exist?
Does this have any interaction / overlap with #20 ?

@MrMino
Copy link
Collaborator Author

MrMino commented Apr 2, 2021

Hi @adamtheturtle, sorry for the late response

Yes, this could use a testcase, I just didn't have the time for it when I initially created this PR :(.

This is similar to #20, but the angle of attack is different. #20 tries to provide a configuration option. This PR forces utf-8 as a default. IMO there should be logic to autodetect it, but it's something rather complex to implement.

Python source is read in utf-8 by default. The way it's read is configurable by using the markers specified in PEP-263.

This is different to how the text files are read. The default encoding of open() in the text mode is set to whatever the locale.getpreferredencoding() returns (source).

If the encoding specified by locale is utf-8, and the source is not marked as using non-UTF-8 encoding - everything is ok.

Reading the code, it looks like the following situations can potentially make pip-check-reqs fail:

  • ❌ source is not marked (defaults to UTF-8), contains characters exclusive to UTF-8, but the preferred encoding is different than UTF-8 (let's say - in the case that prompted this PR - ASCII)
  • ❌ source uses non-UTF-8 encoding that is different than the preferred encoding
  • ❌ source is marked differently than UTF-8 and uses non-UTF-8 characters, preferred encoding is UTF-8

This PR solves the first case, which should be the most prolific source of errors:

  • ✔️ source is not marked (defaults to UTF-8), contains characters exclusive to UTF-8, but the preferred encoding is different than UTF-8
  • ❌ source uses non-UTF-8 encoding that is different than the preferred encoding
  • ❌ source is marked differently than UTF-8 and uses non-UTF-8 characters, preferred encoding is UTF-8

#20 makes it possible to workaround these situations, but I don't think it's the way to go. Codebases can contain files with different encodings, and in that case setting a config option for it will make pip-check-reqs fail on different files depending on the config options used.

@MrMino
Copy link
Collaborator Author

MrMino commented Apr 4, 2021

@adamtheturtle I added a test, albeit a shallow one. It's impossible for me to change the default encoding without subprocessing python, and even then I'm not sure how to do it properly. It's a very low level setting which is difficult to change without making all sort of other things go haywire.

@adamtheturtle adamtheturtle merged commit d17aa5e into adamtheturtle:master Apr 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants