
Commit b975817

Add comment explaining mUTF-7 to mbfilter_utf7imap.c
1 parent 648c1cb commit b975817


1 file changed: +48 -0 lines changed


ext/mbstring/libmbfl/filters/mbfilter_utf7imap.c

Lines changed: 48 additions & 0 deletions
@@ -27,6 +27,54 @@
  *
  */
 
+/* Modified UTF-7 used for 'international mailbox names' in the IMAP protocol
+ * Also known as mUTF-7
+ * Defined in RFC 3501 5.1.3 (https://tools.ietf.org/html/rfc3501)
+ *
+ * Quoting from the RFC:
+ *
+ ***********************************************************************
+ * In modified UTF-7, printable US-ASCII characters, except for "&",
+ * represent themselves; that is, characters with octet values 0x20-0x25
+ * and 0x27-0x7e. The character "&" (0x26) is represented by the
+ * two-octet sequence "&-".
+ *
+ * All other characters (octet values 0x00-0x1f and 0x7f-0xff) are
+ * represented in modified BASE64, with a further modification from
+ * UTF-7 that "," is used instead of "/". Modified BASE64 MUST NOT be
+ * used to represent any printing US-ASCII character which can represent
+ * itself.
+ *
+ * "&" is used to shift to modified BASE64 and "-" to shift back to
+ * US-ASCII. There is no implicit shift from BASE64 to US-ASCII, and
+ * null shifts ("-&" while in BASE64; note that "&-" while in US-ASCII
+ * means "&") are not permitted. However, all names start in US-ASCII,
+ * and MUST end in US-ASCII; that is, a name that ends with a non-ASCII
+ * ISO-10646 character MUST end with a "-").
+ ***********************************************************************
+ *
+ * The purpose of all this is: 1) to keep all parts of IMAP messages 7-bit clean,
+ * 2) to avoid giving special treatment to +, /, \, and ~, since these are
+ * commonly used in mailbox names, and 3) to ensure there is only one
+ * representation of any mailbox name (vanilla UTF-7 does allow multiple
+ * representations of the same string, by Base64-encoding characters which
+ * could have been included as ASCII literals.)
+ *
+ * RFC 2152 also applies, since it defines vanilla UTF-7 (minus IMAP modifications)
+ * The following paragraph is notable:
+ *
+ ***********************************************************************
+ * Unicode is encoded using Modified Base64 by first converting Unicode
+ * 16-bit quantities to an octet stream (with the most significant octet first).
+ * Surrogate pairs (UTF-16) are converted by treating each half of the pair as
+ * a separate 16 bit quantity (i.e., no special treatment). Text with an odd
+ * number of octets is ill-formed. ISO 10646 characters outside the range
+ * addressable via surrogate pairs cannot be encoded.
+ ***********************************************************************
+ *
+ * So after reversing the modified Base64 encoding on an encoded section,
+ * the contents are interpreted as UTF-16BE. */
+
 #include "mbfilter.h"
 #include "mbfilter_utf7imap.h"
 
