1 | | -This page documents various differences between IronPython and CPython. Since IronPython is under active development, any of the differences described here may change or disappear in the future: |
2 | | - |
3 | | -- [Environment Variables](#environment-variables) |
4 | | -- [COM Interaction](#com-interaction) |
5 | | -- [Strings](#strings) |
6 | | -- [Interaction with the Operating System](#interaction-with-the-operating-system) |
7 | | -- [Codecs](#codecs) |
8 | | -- [Source File Encoding](#source-file-encoding) |
9 | | -- [Recursion](#recursion) |
10 | | - |
11 | | -# Environment Variables |
12 | | - |
13 | | -* `IRONPYTHONSTARTUP` is used instead of `PYTHONSTARTUP` |
14 | | - |
15 | | -* `IRONPYTHONPATH` is used instead of `PYTHONPATH` |
16 | | - |
17 | | -# COM Interaction |
18 | | - |
19 | | -* Interaction with COM objects is handled by the CLR rather than by a Python library binding to the native COM DLLs.
20 | | - |
21 | | -# Strings |
22 | | - |
23 | | -* `str` objects are represented in UTF-16 (like all .NET strings) rather than the UTF-32 used by CPython.
24 | | - |
25 | | -This has a few visible consequences if characters outside of the Basic Multilingual Plane (BMP) are used (that is, characters with Unicode code points above `U+FFFF`). The examples below illustrate the differences.
26 | | - |
27 | | -Let's take a Unicode character U+1F70B, '🜋'. In CPython, it is represented by a single character: |
28 | | - |
29 | | -_CPython_ |
30 | | -``` |
31 | | ->>> len('\U0001f70b') |
32 | | -1 |
33 | | ->>> str('\U0001f70b') |
34 | | -'🜋' |
35 | | -``` |
36 | | - |
37 | | -In IronPython, it is represented by a pair of surrogate characters U+D83D and U+DF0B: |
38 | | - |
39 | | -_IronPython_ |
40 | | -``` |
41 | | ->>> len('\U0001f70b') |
42 | | -2 |
43 | | ->>> str('\U0001f70b') |
44 | | -'\ud83d\udf0b' |
45 | | -``` |
46 | | - |
47 | | -In **both** cases, however, a string containing such a character is printed correctly, since `print` will transcode the string from its internal representation to whichever encoding is used by the console or file (usually UTF-8):
48 | | - |
49 | | -_CPython_ and _IronPython_ |
50 | | -``` |
51 | | ->>> print('\U0001f70b')
52 | | -🜋
53 | | -``` |
54 | | - |
55 | | -Any surrogate pair in IronPython strings represents one logical character. CPython, however, sees a surrogate pair as two invalid characters. |
56 | | - |
57 | | -_IronPython_ |
58 | | -``` |
59 | | ->>> '\ud83d\udf0b' |
60 | | -'\ud83d\udf0b' |
61 | | ->>> print('\ud83d\udf0b') |
62 | | -🜋 |
63 | | ->>> '\ud83d\udf0b'.encode('utf-8') |
64 | | -b'\xf0\x9f\x9c\x8b' |
65 | | ->>> '\U0001f70b'.encode('utf-8') |
66 | | -b'\xf0\x9f\x9c\x8b' |
67 | | -``` |
68 | | - |
69 | | -_CPython_ |
70 | | -``` |
71 | | ->>> '\ud83d\udf0b' |
72 | | -'\ud83d\udf0b' |
73 | | ->>> print('\ud83d\udf0b') |
74 | | -Traceback (most recent call last): |
75 | | - File "<stdin>", line 1, in <module> |
76 | | -UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed |
77 | | ->>> '\ud83d\udf0b'.encode('utf-8')
78 | | -Traceback (most recent call last): |
79 | | - File "<stdin>", line 1, in <module> |
80 | | -UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed |
81 | | -``` |
82 | | - |
83 | | -CPython requires the `'surrogatepass'` error handler to let those pairs through. Note, however, that they are still treated as two separate characters. IronPython encodes the pair as if it were one character.
84 | | - |
85 | | -_CPython_ |
86 | | -``` |
87 | | ->>> '\ud83d\udf0b'.encode('utf-8','surrogatepass') |
88 | | -b'\xed\xa0\xbd\xed\xbc\x8b' |
89 | | ->>> '\U0001f70b'.encode('utf-8') |
90 | | -b'\xf0\x9f\x9c\x8b' |
91 | | -``` |
92 | | - |
93 | | -The `'surrogatepass'` error handler is still needed in IronPython to handle surrogate characters that do not form a valid surrogate pair: |
94 | | - |
95 | | -_IronPython_ |
96 | | -``` |
97 | | ->>> print('\ud83d\udf0b')
98 | | -🜋 |
99 | | ->>> print('\ud83d\udf0b'[::-1]) |
100 | | -Traceback (most recent call last): |
101 | | - File "<stdin>", line 1, in <module> |
102 | | -UnicodeEncodeError: 'cp65001' codec can't encode character '\udf0b' in position 0: Unable to translate Unicode character \\uDF0B at index 0 to specified code page. |
103 | | ->>> print('\ud83d\udf0b'[::-1].encode('utf-8','surrogatepass')) |
104 | | -b'\xed\xbc\x8b\xed\xa0\xbd' |
105 | | -``` |
106 | | - |
107 | | -_CPython_ |
108 | | -``` |
109 | | ->>> print('\ud83d\udf0b') |
110 | | -Traceback (most recent call last): |
111 | | - File "<stdin>", line 1, in <module> |
112 | | -UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed |
113 | | ->>> print('\ud83d\udf0b'[::-1]) |
114 | | -Traceback (most recent call last): |
115 | | - File "<stdin>", line 1, in <module> |
116 | | -UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed |
117 | | ->>> print('\ud83d\udf0b'[::-1].encode('utf-8','surrogatepass')) |
118 | | -b'\xed\xbc\x8b\xed\xa0\xbd' |
119 | | -``` |
120 | | - |
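The surrogate arithmetic behind the U+1F70B example above can be sketched in a few lines. This is an illustrative sketch, not code from IronPython or CPython; the helper names `surrogate_pair`, `utf16_length`, and `code_point_length` are hypothetical. It assumes well-formed text, and the portability of `code_point_length` to IronPython is an assumption based on the encoding behavior described above.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def utf16_length(text):
    """Count the UTF-16 code units needed to store well-formed text."""
    return len(text.encode("utf-16-le")) // 2


def code_point_length(text):
    """Count the code points in well-formed text via a UTF-32 round trip."""
    return len(text.encode("utf-32-le")) // 4


high, low = surrogate_pair(0x1F70B)
print(hex(high), hex(low))               # 0xd83d 0xdf0b
print(utf16_length("\U0001f70b"))        # 2, the length a UTF-16 string reports
print(code_point_length("\U0001f70b"))   # 1, the number of logical characters
```

Code that must report the same length on both implementations may prefer a helper like `code_point_length` over `len`, at the cost of an extra encoding pass.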
121 | | -# Interaction with the Operating System |
122 | | - |
123 | | -* Environment variables are decoded using the `'replace'` error handler, rather than the `'surrogateescape'` error handler used by CPython. |
124 | | - |
125 | | -This is how .NET libraries handle encoding errors in the system. The difference is only visible on POSIX systems that have environment variables defined using a different encoding than the encoding used by the system (Windows environment variables are always in UTF-16, so no conversion takes place when they are accessed as Python `str` objects).
126 | | - |
127 | | -Assume that a Linux system is configured to use UTF-8. Under bash: |
128 | | - |
129 | | -``` |
130 | | -$ python -c 'f=open("test.sh","w",encoding="latin-1");print("NAME=\"André\"",file=f)' |
131 | | -$ source test.sh |
132 | | -$ export NAME |
133 | | -``` |
134 | | - |
135 | | -This creates an environment variable that is encoded using Latin-1 rather than the system encoding. CPython will escape the invalid byte 0xe9 (the letter 'é' in Latin-1) as a lone surrogate U+DCE9, which is still an invalid Unicode character.
136 | | - |
137 | | -_CPython_ |
138 | | -``` |
139 | | ->>> import os |
140 | | ->>> os.environ["NAME"] |
141 | | -'Andr\udce9' |
142 | | ->>> print(os.environ["NAME"]) |
143 | | -Traceback (most recent call last): |
144 | | - File "<stdin>", line 1, in <module> |
145 | | -UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 4: surrogates not allowed |
146 | | -``` |
147 | | - |
148 | | -IronPython will replace the invalid byte with U+FFFD, the Unicode replacement character, which is a valid and printable character. |
149 | | - |
150 | | -_IronPython_ |
151 | | -``` |
152 | | ->>> import os |
153 | | ->>> os.environ["NAME"] |
154 | | -'Andr�' |
155 | | ->>> print(os.environ["NAME"]) |
156 | | -Andr� |
157 | | ->>> hex(ord(os.environ["NAME"][-1])) |
158 | | -'0xfffd' |
159 | | -``` |
160 | | - |
161 | | -The CPython representation is not printable, but can be safely encoded back to the original form using `'surrogateescape'` (default when dealing with the OS environment): |
162 | | - |
163 | | -_CPython_ |
164 | | -``` |
165 | | ->>> os.environ["PATH"] = os.environ["PATH"] + ":/home/" + os.environ["NAME"] + "/bin" |
166 | | ->>> import posix |
167 | | ->>> posix.environ[b"PATH"] |
168 | | -b'/bin:/usr/bin:/usr/local/bin:/home/Andr\xe9/bin' |
169 | | ->>> os.environ["NAME"].encode("utf-8","surrogateescape") |
170 | | -b'Andr\xe9' |
171 | | -``` |
172 | | - |
173 | | -The IronPython representation is printable, but the original byte value is lost: |
174 | | - |
175 | | -_IronPython_ |
176 | | -``` |
177 | | ->>> os.environ["NAME"].encode("utf-8","surrogateescape") |
178 | | -b'Andr\xef\xbf\xbd' |
179 | | -``` |
180 | | - |
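The trade-off between the two error handlers can be reproduced without touching the environment by decoding the same Latin-1 bytes as UTF-8 directly. The sketch below only uses the standard `bytes.decode` and `str.encode` methods; the variable names are illustrative.

```python
raw = b"Andr\xe9"  # 'André' encoded as Latin-1, which is invalid as UTF-8

# 'surrogateescape' smuggles the bad byte through as a lone surrogate.
escaped = raw.decode("utf-8", "surrogateescape")
# 'replace' substitutes U+FFFD and discards the original byte value.
replaced = raw.decode("utf-8", "replace")

print(ascii(escaped))   # 'Andr\udce9'
print(ascii(replaced))  # 'Andr\ufffd'

# Only the escaped form can be turned back into the original bytes.
print(escaped.encode("utf-8", "surrogateescape") == raw)  # True
print(replaced.encode("utf-8"))                           # b'Andr\xef\xbf\xbd'
```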
181 | | -# Codecs |
182 | | - |
183 | | -* Some single-byte codecs may have unused positions in their codepage. There are differences between how CPython and IronPython (and .NET) handle such cases. |
184 | | - |
185 | | -A simple example is the Windows-1252 encoding. According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API `MultiByteToWideChar` maps these to the corresponding C1 control codes. The Unicode "best fit" mapping [documents this behavior](https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt). CPython will treat those bytes as invalid, while IronPython will map them to the "best fit" Unicode character:
186 | | - |
187 | | -_CPython_ |
188 | | -``` |
189 | | ->>> b'\x81'.decode('windows-1252') |
190 | | -Traceback (most recent call last): |
191 | | - File "<stdin>", line 1, in <module> |
192 | | - File "/opt/anaconda3/envs/py34/lib/python3.4/encodings/cp1252.py", line 15, in decode |
193 | | - return codecs.charmap_decode(input,errors,decoding_table) |
194 | | -UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined> |
195 | | ->>> b'\x81'.decode('windows-1252','surrogateescape') |
196 | | -'\udc81' |
197 | | -``` |
198 | | - |
199 | | -_IronPython_ |
200 | | -``` |
201 | | ->>> b'\x81'.decode('windows-1252') |
202 | | -'\x81' |
203 | | ->>> b'\x81'.decode('windows-1252','surrogateescape') |
204 | | -'\x81' |
205 | | -``` |
206 | | - |
207 | | -The same difference in behavior can be observed during encoding: |
208 | | - |
209 | | -_CPython_ |
210 | | -``` |
211 | | ->>> '\x81'.encode('windows-1252') |
212 | | -Traceback (most recent call last): |
213 | | - File "<stdin>", line 1, in <module> |
214 | | - File "/opt/anaconda3/envs/py34/lib/python3.4/encodings/cp1252.py", line 12, in encode |
215 | | - return codecs.charmap_encode(input,errors,encoding_table) |
216 | | -UnicodeEncodeError: 'charmap' codec can't encode character '\x81' in position 0: character maps to <undefined> |
217 | | -``` |
218 | | - |
219 | | -_IronPython_ |
220 | | -``` |
221 | | ->>> '\x81'.encode('windows-1252') |
222 | | -b'\x81' |
223 | | -``` |
224 | | - |
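Code that needs the same decoding result on both implementations could register a custom error handler that maps the unused Windows-1252 positions to the code points of the same value. This is only a sketch: the handler name `c1fallback` and the helper `decode_cp1252` are hypothetical, and the assumption that IronPython never invokes the handler for these bytes follows from the behavior shown above rather than from testing.

```python
import codecs


def c1_fallback(exc):
    """Map undecodable bytes to the code points with the same numeric value."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(b) for b in bad), exc.end


# Note: error handler registration is global to the interpreter process.
codecs.register_error("c1fallback", c1_fallback)


def decode_cp1252(data):
    """Decode Windows-1252, treating unused positions as C1 control codes."""
    return data.decode("windows-1252", "c1fallback")


print(ascii(decode_cp1252(b"\x81")))      # '\x81'
print(ascii(decode_cp1252(b"\x80a\x9d"))) # '\u20aca\x9d'
```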
225 | | -* When using the UTF-7 encoding, IronPython (and .NET) always terminates the modified Base64 encoded blocks with a '-' while CPython omits the '-' if allowed. |
226 | | - |
227 | | -The UTF-7 standard allows encoders some freedom of implementation. One such option is how to end a sequence encoded in the modified Base64 code. In principle, `+` marks the start of the sequence, and `-` is the terminator. However, the terminating `-` may be omitted if the next character unambiguously does not belong to the encoded Base64 block. CPython chooses to drop the terminating `-` in such cases, while IronPython will always terminate Base64-encoded blocks with a `-`:
228 | | - |
229 | | -_CPython_ |
230 | | -``` |
231 | | ->>> 'abc:~~:zyz'.encode('utf-7') |
232 | | -b'abc:+AH4Afg:zyz' |
233 | | -``` |
234 | | - |
235 | | -_IronPython_ |
236 | | -``` |
237 | | ->>> 'abc:~~:zyz'.encode('utf-7') |
238 | | -b'abc:+AH4Afg-:zyz' |
239 | | -``` |
240 | | - |
241 | | -Note that both forms are fully interchangeable; IronPython will correctly decode what CPython encoded and vice versa. |
242 | | - |
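The interchangeability of the two UTF-7 forms is easy to check by decoding both byte strings from the examples above and comparing the results. The variable names are illustrative.

```python
cpython_form = b"abc:+AH4Afg:zyz"      # terminator omitted
ironpython_form = b"abc:+AH4Afg-:zyz"  # explicit '-' terminator

decoded = [cpython_form.decode("utf-7"), ironpython_form.decode("utf-7")]
print(decoded)  # ['abc:~~:zyz', 'abc:~~:zyz']
```

Because the encoded bytes differ, tests that compare UTF-7 output byte for byte across implementations would need to compare the decoded text instead.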
243 | | -# Source File Encoding |
244 | | - |
245 | | -* Wide-character Unicode encodings (UTF-16 and UTF-32) are supported as source file encodings, in addition to the standard Python encodings.
246 | | - |
247 | | -The default source file encoding is UTF-8. This also applies to bytestrings used within the program (processed by `compile`, `eval`, or `exec`). The source file encoding can be explicitly specified, and possibly changed, in one of two ways:
248 | | - |
249 | | - 1. By declaring the encoding in a Python comment in one of the first two lines — in accordance with [PEP-263](https://www.python.org/dev/peps/pep-0263/). |
250 | | - 2. By a byte-order-mark (BOM) — only for Unicode encodings. |
251 | | - |
252 | | -CPython recognizes only the UTF-8 BOM. IronPython recognizes BOMs for UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
253 | | - |
254 | | -If both the BOM and PEP-263 methods are used in the same file, they should specify the same encoding. If the PEP-263 encoding does not match the BOM, then:
255 | | - |
256 | | -  * In the case of a UTF-8 BOM, an error will be reported (by both CPython and IronPython).
257 | | -  * In the case of other BOMs, the encoding specified in the PEP-263 comment is silently ignored.
258 | | - |
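BOM-based detection for the five encodings listed above can be sketched with the constants from the standard `codecs` module. This is an illustration of the general technique, not IronPython's actual implementation; the helper name `detect_bom` is hypothetical. The UTF-32 BOMs are checked first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```python
import codecs

# Longest BOMs first, so UTF-32LE is not mistaken for UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


source = codecs.BOM_UTF16_LE + "x = 1\n".encode("utf-16-le")
encoding, skip = detect_bom(source)
print(encoding)                         # utf-16-le
print(source[skip:].decode(encoding))   # x = 1
```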
259 | | -# Recursion |
260 | | - |
261 | | -By default, instead of raising a `RecursionError` when the maximum recursion depth is reached, IronPython will terminate with a `StackOverflowException`. You can enable the recursion limit in IronPython in a number of ways: |
262 | | - |
263 | | - 1. From the command line: `ipy -X MaxRecursion=100`. |
264 | | - 2. In hosted scenarios: `Python.CreateEngine(new Dictionary<string, object>() { { "RecursionLimit", 100 } });`. |
265 | | - 3. From Python: `sys.setrecursionlimit(100)`. |
266 | | - |
267 | | -*There is a significant performance cost when the recursion limit is enabled*. |
268 | | - |
269 | | -Note that IronPython 3.4 adopts the CPython 3.5 behavior and throws a `RecursionError` instead of a `RuntimeError`. |
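The third option, setting the limit from Python, can be sketched as follows. The helper names `recurse` and `hits_limit` are hypothetical, and the sketch was reasoned about against CPython semantics only; in particular, restoring the previous value through `sys.getrecursionlimit()` is assumed to behave sensibly in IronPython, which is not verified here.

```python
import sys


def recurse(n):
    """Recurse without a base case to provoke the recursion limit."""
    return recurse(n + 1)


def hits_limit(limit=500):
    """Return True if unbounded recursion raises RecursionError under `limit`."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        recurse(0)
    except RecursionError:
        return True
    finally:
        # Restore the previous setting so the rest of the program is unaffected.
        sys.setrecursionlimit(previous)
    return False


print(hits_limit())  # True
```

Given the performance cost mentioned above, enabling the limit only around code that may recurse deeply is one possible compromise.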
| 1 | +This document has been moved to the IronPython wiki: [Differences from CPython](https://github.com/IronLanguages/ironpython3/wiki/Differences-from-CPython) |