Commit 6c59404

Author: slibs63
Merge pull request #114 from LuminosoInsight/more-py3.7-fixes
Fixes for Python 3.7.0
2 parents 8d73f59 + 78a12aa commit 6c59404

5 files changed: 29 additions and 10 deletions

docs/index.rst

Lines changed: 2 additions & 2 deletions
@@ -301,8 +301,8 @@ Variants of UTF-8

 *ftfy.chardata* and *ftfy.build_data*: trivia about characters
 --------------------------------------------------------------
-These files load information about the character properties in Unicode 9.0.
-Yes, even if your version of Python doesn't support Unicode 9.0. This ensures
+These files load information about the character properties in Unicode 11.0.
+Yes, even if your version of Python doesn't support Unicode 11.0. This ensures
 that ftfy's behavior is consistent across versions.

 .. automodule:: ftfy.chardata

ftfy/badness.py

Lines changed: 2 additions & 2 deletions
@@ -128,7 +128,7 @@ def _make_weirdness_regex():
     '[ÂÃĂ][\x80-\x9f€ƒ‚„†‡ˆ‰‹Œ“•˜œŸ¡¢£¤¥¦§¨ª«¬¯°±²³µ¶·¸¹º¼½¾¿ˇ˘˝]|'
     # Characters we have to be a little more cautious about if they're at
     # the end of a word, but totally okay to fix in the middle
-    '[ÂÃĂ][›»‘”©™]\w|'
+    r'[ÂÃĂ][›»‘”©™]\w|'
     # Similar mojibake of low-numbered characters in MacRoman. Leaving out
     # most mathy characters because of false positives, but cautiously catching
     # "√±" (mojibake for "ñ") and "√∂" (mojibake for "ö") in the middle of a
@@ -141,7 +141,7 @@ def _make_weirdness_regex():
     # Also left out eye-like letters, including accented o's, for when ¬ is
     # the nose of a kaomoji.
     '[¬√][ÄÅÇÉÑÖÜáàâäãåçéèêëíìîïñúùûü†¢£§¶ß®©™≠ÆØ¥ªæø≤≥]|'
-    '\w√[±∂]\w|'
+    r'\w√[±∂]\w|'
     # ISO-8859-1, ISO-8859-2, or Windows-1252 mojibake of characters U+10000
     # to U+1FFFF. (The Windows-1250 and Windows-1251 versions might be too
     # plausible.)
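
Note on the `r` prefix: it does not change the pattern text, since `'\w'` and `r'\w'` are the same two-character string. A plausible motivation (an assumption; the commit does not state it) is that Python 3.6 and later warn about unrecognized escape sequences such as `\w` in ordinary string literals. The sketch below uses a shortened stand-in pattern and a hypothetical helper, `compile_warnings`, to show the difference; neither is part of ftfy.

```python
import warnings


def compile_warnings(source):
    """Compile a source snippet and return the names of any warning categories raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [w.category.__name__ for w in caught]


# Source text as it would appear in a file: one plain literal, one raw literal.
# The pattern is a shortened stand-in, not the real ftfy regex.
plain_source = "pattern = '[ab]\\w|'"
raw_source = "pattern = r'[ab]\\w|'"

plain_warnings = compile_warnings(plain_source)  # invalid-escape warning expected
raw_warnings = compile_warnings(raw_source)      # no warning expected

# Both spellings denote the same string, so the regex itself is unchanged.
same_text = "[ab]\\w|" == r"[ab]\w|"
```

The warning category depends on the interpreter: older versions report a `DeprecationWarning`, newer ones a `SyntaxWarning`.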

ftfy/build_data.py

Lines changed: 6 additions & 6 deletions
@@ -5,7 +5,7 @@
 classes we care about change, or if a new version of Python supports a new
 Unicode standard and we want it to affect our string decoding.

-The file that we generate is based on Unicode 9.0, as supported by Python 3.6.
+The file that we generate is based on Unicode 11.0, as supported by Python 3.7.
 You can certainly use it in earlier versions. This simply makes sure that we
 get consistent results from running ftfy on different versions of Python.

@@ -39,16 +39,16 @@ def make_char_data_file(do_it_anyway=False):
     Build the compressed data file 'char_classes.dat' and write it to the
     current directory.

-    If you run this, run it in Python 3.6 or later. It will run in earlier
-    versions, but you won't get the Unicode 9 standard, leading to inconsistent
-    behavior.
+    If you run this, run it in Python 3.7.0 or later. It will run in earlier
+    versions, but you won't get the Unicode 11 standard, leading to inconsistent
+    behavior. Pre-releases of Python 3.7 won't work (Unicode 11 wasn't out yet).

     To protect against this, running this in the wrong version of Python will
     raise an error unless you pass `do_it_anyway=True`.
     """
-    if sys.hexversion < 0x03060000 and not do_it_anyway:
+    if sys.hexversion < 0x030700f0 and not do_it_anyway:
         raise RuntimeError(
-            "This function should be run in Python 3.6 or later."
+            "This function should be run in Python 3.7.0 or later."
         )

     cclasses = [None] * 0x110000
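
Note on the new threshold: `sys.hexversion` packs the version into one integer (major, minor, micro, release level, serial), so `0x030700f0` means 3.7.0 final, and any 3.7.0 alpha, beta, or release candidate compares lower. The sketch below unpacks that layout; `decode_hexversion` is a hypothetical helper written for illustration, not part of ftfy.

```python
import sys

# Release-level nibble values used by CPython's version encoding.
RELEASE_LEVELS = {0xA: "alpha", 0xB: "beta", 0xC: "candidate", 0xF: "final"}


def decode_hexversion(hexversion):
    """Split a sys.hexversion-style integer into its five fields."""
    major = (hexversion >> 24) & 0xFF
    minor = (hexversion >> 16) & 0xFF
    micro = (hexversion >> 8) & 0xFF
    level = RELEASE_LEVELS.get((hexversion >> 4) & 0xF, "unknown")
    serial = hexversion & 0xF
    return major, minor, micro, level, serial


threshold = decode_hexversion(0x030700F0)   # (3, 7, 0, "final", 0)
beta_five = 0x030700B5                      # encoding of 3.7.0 beta 5
beta_is_rejected = beta_five < 0x030700F0   # True: pre-releases fall below the check

# The running interpreter decodes to the same fields as sys.version_info.
current = decode_hexversion(sys.hexversion)
```

The old threshold, `0x03060000`, had a release-level nibble of 0, so it admitted every 3.6 build including pre-releases; the new value is deliberately stricter.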

ftfy/char_classes.dat

75 Bytes
Binary file not shown.

tests/test_futuristic_codepoints.py

Lines changed: 19 additions & 0 deletions
@@ -40,3 +40,22 @@ def test_unicode_10():
     # all versions for consistency.
     thalim = "\U00011A1A\U00011A2C\U00011A01\U00011A38"
     assert sequence_weirdness(thalim) == 0
+
+
+def test_unicode_11():
+    # Unicode 11 has implemented the mtavruli form of the Georgian script.
+    # They are analogous to capital letters in that they can be used to
+    # emphasize text or write a headline.
+    #
+    # Python will convert to that form when running .upper() on Georgian text,
+    # starting in version 3.7.0. We want to recognize the result as reasonable
+    # text on all versions.
+    #
+    # This text is the mtavruli form of "ქართული ენა", meaning "Georgian
+    # language".
+
+    georgian_mtavruli_text = 'ᲥᲐᲠᲗᲣᲚᲘ ᲔᲜᲐ'
+    assert sequence_weirdness(georgian_mtavruli_text) == 0
+
+    mojibake = georgian_mtavruli_text.encode('utf-8').decode('sloppy-windows-1252')
+    assert fix_encoding(mojibake) == georgian_mtavruli_text
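
Note on the mojibake step in this test: the text is encoded as UTF-8 and then decoded with the wrong single-byte codec, and `fix_encoding` is expected to reverse that. The standard-library sketch below illustrates the same round trip without ftfy. It substitutes `latin-1` for ftfy's `sloppy-windows-1252` codec (a stand-in, not the codec the test uses), and the helper names are hypothetical. It also shows why a "sloppy" codec is needed at all: the UTF-8 bytes of this text include `0x90`, which Python's strict `cp1252` codec leaves undefined.

```python
def make_mojibake(text, wrong_encoding="latin-1"):
    """Encode as UTF-8, then decode with the wrong single-byte codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def undo_mojibake(garbled, wrong_encoding="latin-1"):
    """Reverse make_mojibake by re-encoding and decoding as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


def strict_cp1252_fails(text):
    """Return True if the UTF-8 bytes of text cannot be decoded as strict cp1252."""
    try:
        text.encode("utf-8").decode("cp1252")
    except UnicodeDecodeError:
        return True
    return False


georgian_mtavruli_text = 'ᲥᲐᲠᲗᲣᲚᲘ ᲔᲜᲐ'

garbled = make_mojibake(georgian_mtavruli_text)
restored = undo_mojibake(garbled)

# The letter U+1C90 encodes to the bytes E1 B2 90; 0x90 has no strict cp1252 mapping.
needs_sloppy_codec = strict_cp1252_fails(georgian_mtavruli_text)
```

Unlike this sketch, `fix_encoding` is not told which wrong codec was used; it has to detect the damage itself, which is why the test also checks that the undamaged text scores zero weirdness.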
