|
| 1 | +# CWE-180: Incorrect Behavior Order: Validate Before Canonicalize |
| 2 | + |
| 3 | +Normalize or canonicalize strings before validating them so that risky strings such as `../../../../passwd` cannot slip past validation and enable directory traversal attacks, and to reduce the risk of `XSS` attacks. |
| 4 | + |
| 5 | +Supporting multiple languages requires an extended character encoding such as `UTF-8`, which can encode __1,112,064__ valid code points. |
| 6 | + |
| 7 | +Character encoding systems such as `ASCII`, `Windows-1252`, or `UTF-8` consist of an agreed mapping between byte values and human-readable characters, known as code points. Each code point ties together a fixed number such as "`\u002e`", its graphical representation "`.`", and its name "`FULL STOP`" [[Batchelder 2022]](https://www.youtube.com/watch?v=sgHbC6udIqc). Using the same encoding and normalization form ensures that equivalent strings have a unique binary representation, as described in Unicode Standard _Annex #15, Unicode Normalization Forms_ [[Davis 2008]](https://wiki.sei.cmu.edu/confluence/display/java/Rule+AA.+References#RuleAA.References-Davis08). Different or unexpected changes in encoding can allow attackers to work around validation or input sanitization efforts. |
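
The relation between a code point, its name, and its byte representation can be inspected with the standard `unicodedata` module. The following is a minimal sketch; the helper name `describe` is hypothetical and only used for illustration.

```python
import unicodedata


def describe(char: str) -> dict:
    """Hypothetical helper: code point, name, and UTF-8 bytes of one character."""
    return {
        "code_point": f"U+{ord(char):04X}",
        "name": unicodedata.name(char),
        "utf8_bytes": char.encode("utf-8").hex(),
    }


# Two characters that look alike but are different code points
full_stop = describe(".")
one_dot_leader = describe("\u2024")
print(full_stop)
print(one_dot_leader)
```

Both characters render as a dot, yet they differ in number, name, and bytes, which is why a byte-level or regex comparison treats them as unrelated.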
| 8 | + |
| 9 | +> [!WARNING] |
| 10 | +> Use allow lists to avoid having to maintain a deny list on a continuous basis (exclusion lists are a moving target), as per [CWE-184: Incomplete List of Disallowed Input - Development Environment](../../CWE-693/CWE-184/README.md). |
| 11 | + |
| 12 | +<table> |
| 13 | + <tr> |
| 14 | + <th colspan="3">NFKC normalized</th> |
| 15 | + <th colspan="3">UTF-16 (hex)</th> |
| 16 | + </tr> |
| 17 | + <tr> |
| 18 | + <th>Print</th> |
| 19 | + <th>Hex</th> |
| 20 | + <th>Name</th> |
| 21 | + <th>Print</th> |
| 22 | + <th>Hex</th> |
| 23 | + <th>Name</th> |
| 24 | + </tr> |
| 25 | + <tr> |
| 26 | + <td >.</td> |
| 27 | + <td>\u002e</td> |
| 28 | + <td>FULL STOP</td> |
| 29 | + <td>․</td> |
| 30 | + <td>\u2024</td> |
| 31 | + <td>ONE DOT LEADER</td> |
| 32 | + </tr> |
| 33 | + <tr> |
| 34 | + <td >..</td> |
| 35 | + <td>\u002e\u002e</td> |
| 36 | + <td>FULL STOP FULL STOP</td> |
| 37 | + <td>‥</td> |
| 38 | + <td>\u2025</td> |
| 39 | + <td>TWO DOT LEADER</td> |
| 40 | + </tr> |
| 41 | + <tr> |
| 42 | + <td >/</td> |
| 43 | + <td>\u002f</td> |
| 44 | + <td>SOLIDUS</td> |
| 45 | + <td>／</td> |
| 46 | + <td>\uff0f</td> |
| 47 | + <td>FULLWIDTH SOLIDUS</td> |
| 48 | + </tr> |
| 49 | +</table> |
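
The rows of the table above can be reproduced with `unicodedata`. This is a small sketch; the names `LOOKALIKES` and `nfkc_row` are hypothetical and chosen only for this illustration.

```python
import unicodedata

# The three look-alike characters from the table above
LOOKALIKES = ["\u2024", "\u2025", "\uff0f"]


def nfkc_row(char: str) -> tuple:
    """Return the character name and the code points of its NFKC form."""
    normalized = unicodedata.normalize("NFKC", char)
    return (unicodedata.name(char), [f"U+{ord(c):04X}" for c in normalized])


rows = [nfkc_row(char) for char in LOOKALIKES]
for name, code_points in rows:
    print(f"{name} -> {' '.join(code_points)}")
```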
| 50 | + |
| 51 | +The `NFKC` and `NFKD` compatibility modes cause a `ONE DOT LEADER` to become a `FULL STOP`, as demonstrated in `example01.py` [[python.org 2023]](https://docs.python.org/3/library/unicodedata.html). |
| 52 | + |
| 53 | +__[example01.py](example01.py):__ |
| 54 | + |
| 55 | +```py |
| 56 | +""" Code Example """ |
| 57 | + |
| 58 | +# SPDX-FileCopyrightText: OpenSSF project contributors |
| 59 | +# SPDX-License-Identifier: MIT |
| 60 | +import unicodedata |
| 61 | + |
| 62 | +print("\N{FULL STOP}" * 10) |
| 63 | +print("." == unicodedata.normalize("NFC", "\u2024") == "\N{FULL STOP}" == "\u002e") |
| 64 | +print("." == unicodedata.normalize("NFD", "\u2024") == "\N{FULL STOP}" == "\u002e") |
| 65 | +print("." == unicodedata.normalize("NFKC", "\u2024") == "\N{FULL STOP}" == "\u002e") |
| 66 | +print("." == unicodedata.normalize("NFKD", "\u2024") == "\N{FULL STOP}" == "\u002e") |
| 67 | +print("\N{FULL STOP}" * 10) |
| 68 | +``` |
| 69 | + |
| 70 | +In `example01.py`, the `NFC` and `NFD` comparisons print `False` because these forms do not apply compatibility mappings, while the `NFKC` and `NFKD` comparisons print `True`. The issue depends on whether normalization is used, its mode, and when it is applied. |
| 71 | + |
| 72 | +Using a compatibility mode (`NFKC` or `NFKD`) after validation can allow attackers to disguise malicious strings with characters beyond the `ASCII` range of `0-127`, for example by submitting a `ONE DOT LEADER` `\u2024` that later turns into a `FULL STOP` `\u002E`. |
| 73 | + |
| 74 | +Using the non-compatibility forms `NFC` and `NFD` and then stripping unsupported characters can turn a harmless string such as `<script生>` into `<script>`, as per _CWE-182: Collapse of Data into Unsafe Value (4.16)_ [[MITRE CWE-182 2024]](https://cwe.mitre.org/data/definitions/182.html). |
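
The collapse described above can be sketched as follows. Both functions are hypothetical illustrations, and the check for a literal `<script>` substring is a deliberately simplistic stand-in for real validation.

```python
def naive_filter(user_input: str) -> str:
    """Hypothetical filter that validates first and strips afterwards."""
    if "<script>" in user_input:
        raise ValueError("script tag rejected")
    # Stripping non-ASCII characters AFTER validation collapses the data
    # into a value that the check above would have rejected.
    return user_input.encode("ascii", "ignore").decode("ascii")


def safer_filter(user_input: str) -> str:
    """Hypothetical filter that strips first and validates the result."""
    stripped = user_input.encode("ascii", "ignore").decode("ascii")
    if "<script>" in stripped:
        raise ValueError("script tag rejected")
    return stripped


collapsed = naive_filter("<script生>")
print(collapsed)
```

Calling `safer_filter("<script生>")` raises `ValueError`, because the check runs on the string in its final form.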
| 75 | + |
| 76 | +## Non-Compliant Code Example - Compatibility mode |
| 77 | + |
| 78 | +Reducing the list of allowed characters or switching between different encodings can be required by design in order to stay compatible between different systems. |
| 79 | + |
| 80 | +The `noncompliant01.py` code attempts to detect a directory traversal attack but calls `unicodedata.normalize()` only for logging, after the check has already run. |
| 81 | + |
| 82 | +__[noncompliant01.py](noncompliant01.py):__ |
| 83 | + |
| 84 | +```python |
| 85 | +# SPDX-FileCopyrightText: OpenSSF project contributors |
| 86 | +# SPDX-License-Identifier: MIT |
| 87 | +"""Non-compliant Code Example""" |
| 88 | + |
| 89 | +import re |
| 90 | +import unicodedata |
| 91 | + |
| 92 | + |
| 93 | +def api_with_ids(suspicious_string: str): |
| 94 | +    """Fancy intrusion detection system(IDS)""" |
| 95 | +    if re.search(r"\./", suspicious_string): |
| 96 | +        normalized_string = unicodedata.normalize("NFKC", suspicious_string) |
| 97 | +        print(f"detected an attack sequence {normalized_string}") |
| 98 | +    else: |
| 99 | +        print("Nothing suspicious") |
| 100 | + |
| 101 | + |
| 102 | +##################### |
| 103 | +# attempting to exploit above code example |
| 104 | +##################### |
| 105 | +# The MALICIOUS_INPUT is using: |
| 106 | +# \u2024 or "ONE DOT LEADER" |
| 107 | +# \uFF0F or 'FULLWIDTH SOLIDUS' |
| 108 | +api_with_ids("\u2024\u2024\uff0f" * 10 + "passwd") |
| 109 | +``` |
| 110 | + |
| 111 | +The `re.search()` check cannot detect the "`ONE DOT LEADER`" or "`FULLWIDTH SOLIDUS`" characters because the string is normalized only after validation, which allows a directory traversal attack. |
| 112 | + |
| 113 | +## Compliant Solution - Compatibility mode |
| 114 | + |
| 115 | +This compliant solution normalizes the string before testing it. In line with _Annex #15_ [[Davis 2008]](https://wiki.sei.cmu.edu/confluence/display/java/Rule+AA.+References#RuleAA.References-Davis08) and [[Batchelder 2022]](https://www.youtube.com/watch?v=sgHbC6udIqc), the goal is to ensure that strings have a unique binary representation within our code. |
| 116 | + |
| 117 | +__[compliant01.py](compliant01.py):__ |
| 118 | + |
| 119 | +```python |
| 120 | +# SPDX-FileCopyrightText: OpenSSF project contributors |
| 121 | +# SPDX-License-Identifier: MIT |
| 122 | +"""Compliant Code Example""" |
| 123 | + |
| 124 | +import re |
| 125 | +import unicodedata |
| 126 | + |
| 127 | + |
| 128 | +def api_with_ids(suspicious_string: str): |
| 129 | +    """Fancy intrusion detection system(IDS)""" |
| 130 | +    normalized_string = unicodedata.normalize("NFKC", suspicious_string) |
| 131 | +    if re.search(r"\./", normalized_string): |
| 132 | +        print("detected an attack sequence with . or /") |
| 133 | +    else: |
| 134 | +        print("Nothing suspicious") |
| 135 | + |
| 136 | + |
| 137 | +##################### |
| 138 | +# attempting to exploit above code example |
| 139 | +##################### |
| 140 | +# The MALICIOUS_INPUT is using: |
| 141 | +# \u2024 or "ONE DOT LEADER" |
| 142 | +# \uFF0F or 'FULLWIDTH SOLIDUS' |
| 143 | +api_with_ids("\u2024\u2024\uff0f" * 10 + "passwd") |
| 144 | + |
| 145 | +``` |
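
Normalizing first also combines well with the allow-list advice from the warning near the top of this page. The following is a minimal sketch under stated assumptions: the pattern `ALLOWED_FILENAME` and the function `is_allowed_filename` are hypothetical, and the permitted characters would need to be adapted to the actual use case.

```python
import re
import unicodedata

# Assumed allow list for a simple file name: ASCII letters, digits,
# underscore, and hyphen, with single dots only between name parts.
ALLOWED_FILENAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*")


def is_allowed_filename(candidate: str) -> bool:
    """Normalize first, then accept only what the allow list permits."""
    normalized = unicodedata.normalize("NFKC", candidate)
    return ALLOWED_FILENAME.fullmatch(normalized) is not None


print(is_allowed_filename("report.txt"))
print(is_allowed_filename("\u2024\u2024\uff0f" * 10 + "passwd"))
```

Because the allow list describes what is permitted rather than what is forbidden, new look-alike characters do not require the pattern to be updated.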
| 146 | + |
| 147 | +Developers should be aware of the encoding of data printed to `HTML`. For example, the string `숍訊昱穿刷奄剔㏆穽侘㈊섞昌侄從쒜` was an `XSS` vulnerability in Chrome when the charset of the webpage was set to `ISO-2022-KR` or another unknown charset [[issues.chromium.org 2025]](https://issues.chromium.org/issues/40076480). |
| 148 | +At the time, `ISO-2022-KR` was unsupported, so Chrome attempted to fall back to the Windows OS default encoding `Windows-1252` and executed the code [[Taylor 2009]](https://zaynar.co.uk/posts/charset-encoding-xss/). |
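
The underlying effect, that identical bytes become different text depending on the charset assumed by the reader, can be sketched in a few lines. This is a general illustration and not a reproduction of the Chrome issue.

```python
# The same bytes, interpreted under two different character encodings
raw_bytes = "\u2024".encode("utf-8")  # ONE DOT LEADER as UTF-8 bytes

as_utf8 = raw_bytes.decode("utf-8")
as_cp1252 = raw_bytes.decode("cp1252")  # Windows-1252 fallback

print(raw_bytes.hex())
print(len(as_utf8), len(as_cp1252))
```

Under `UTF-8` the three bytes form one character, while under `Windows-1252` they form three unrelated characters, so validation performed under one assumption does not carry over to the other.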
| 149 | + |
| 150 | +Note that some operating systems (Windows, Mac) have system encodings for various characters that get executed on a webpage regardless of charset. These should be avoided because they can cause issues on devices that do not support that charset. Other character sets, such as `ASCII`, should be avoided too, because mobile phones and old SMS systems generally have a very limited charset and can behave unexpectedly. |
| 151 | + |
| 152 | +## Automated Detection |
| 153 | + |
| 154 | +None known |
| 155 | + |
| 156 | +|Tool|Version|Checker|Description| |
| 157 | +|:---|:---|:---|:---| |
| 158 | +||||| |
| 159 | + |
| 160 | +## Related Guidelines |
| 161 | + |
| 162 | +||| |
| 163 | +|:---|:---| |
| 164 | +|[ISO/IEC TR 24772:2013](https://wiki.sei.cmu.edu/confluence/display/java/Rule+AA.+References#RuleAA.References-ISO/IECTR24772-2013)|Cross-site Scripting \[XYT\] \[online\], available from: <https://wiki.sei.cmu.edu/confluence/display/java/Rule+AA.+References#RuleAA.References-ISO/IECTR24772-2013>, \[Accessed April 2025\]| |
| 165 | +|[MITRE CWE](http://cwe.mitre.org/)|Pillar CWE - CWE-707: Improper Neutralization \[online\], available from: <https://cwe.mitre.org/data/definitions/707.html> \[Accessed April 2025\]| |
| 166 | +|[MITRE CWE](http://cwe.mitre.org/)|Variant: CWE-180, Incorrect behavior order: Validate before canonicalize \[online\], available from: <http://cwe.mitre.org/data/definitions/180.html>| |
| 167 | +|[MITRE CWE](http://cwe.mitre.org/)|Base: CWE-182: Collapse of Data into Unsafe Value (4.16) \[online\], available from: <http://cwe.mitre.org/data/definitions/182.html>| |
| 168 | +|[MITRE CWE](http://cwe.mitre.org/)|Base: CWE-184: Incomplete List of Disallowed Input - Development Environment. \[online\], available from: <http://cwe.mitre.org/data/definitions/184.html>| |
| 169 | +|[SEI CERT Coding Standard for Java](https://wiki.sei.cmu.edu/confluence/display/java/SEI+CERT+Oracle+Coding+Standard+for+Java)|IDS01-J. Normalize strings before validating them \[online\], available from: <https://wiki.sei.cmu.edu/confluence/display/java/IDS01-J.+Normalize+strings+before+validating+them>| |
| 170 | + |
| 171 | +## Bibliography |
| 172 | + |
| 173 | +||| |
| 174 | +|:---|:---| |
| 175 | +|\[Davis 2008\]|Mark Davis and Ken Whistler, Unicode Standard Annex #15, Unicode Normalization Forms, 2008. \[online\], available from: <http://unicode.org/reports/tr15/> \[Accessed April 2025\]<br>Mark Davis and Michel Suignard, Unicode Technical Report #36, Unicode Security Considerations, 2008. \[online\], available from: <http://www.unicode.org/reports/tr36/> \[Accessed 4 April 2025\] | |
| 176 | +|\[Weber 2009\]|Unraveling Unicode: A Bag of Tricks for Bug Hunting \[online\], available from: <http://www.lookout.net/wp-content/uploads/2009/03/chris_weber_exploiting-unicode-enabled-software-v15.pdf> \[Accessed April 2025\]| |
| 177 | +|\[Batchelder 2022\]|Pragmatic Unicode, or, How do I stop the pain? - YouTube \[online\], available from: <https://www.youtube.com/watch?v=sgHbC6udIqc> \[Accessed April 2025\]| |
| 178 | +|\[Kuchling 2022\]|Unicode HOWTO \[online\], available from: <https://docs.python.org/3/howto/unicode.html> \[Accessed April 2025\]| |
| 179 | +|\[python.org 2023\]|unicodedata — Unicode Database — Python 3.12.0 documentation \[online\], available from: <https://docs.python.org/3/library/unicodedata.html> \[Accessed April 2025\]| |
| 180 | +|\[issues.chromium.org 2025\]|XSS issue due to the lack of support for ISO-2022-KR \[online\], available from: <https://issues.chromium.org/issues/40076480> \[Accessed April 2025\]| |
| 181 | +|\[Taylor 2009\]|XSS vulnerabilities with unusual character encodings \[online\], available from: <https://zaynar.co.uk/posts/charset-encoding-xss/> \[Accessed April 2025\]| |