ossf · myteron · Apr 23, 2025 · Apr 3, 2025 · Apr 3, 2025 · Apr 3, 2025
diff --git a/docs/Secure-Coding-Guide-for-Python/CWE-693/CWE-184/README.md b/docs/Secure-Coding-Guide-for-Python/CWE-693/CWE-184/README.md
@@ -53,7 +53,7 @@ for name in names:
 
 ## Compliant Solution
 
-The `compliant01.py` uses an allow list instead of a deny list and prevents the use of unwanted characters by raising an exception even without canonicalization. The missing canonicalization in `compliant01.py` according to [CWE-180: Incorrect Behavior Order: Validate Before Canonicalize](https://github.com/ossf/wg-best-practices-os-developers/tree/main/docs/Secure-Coding-Guide-for-Python/CWE-707/CWE-180) must be added in order to make logging or displaying them safe!
+The `compliant01.py` example uses an allow list instead of a deny list and rejects unwanted characters by raising an exception, even without canonicalization. The canonicalization described in [CWE-180: Incorrect Behavior Order: Validate Before Canonicalize](../../CWE-707/CWE-180) is still missing from `compliant01.py` and must be added to make logging or displaying the input safe.
 
 *[compliant01.py](compliant01.py):*
 
@@ -118,7 +118,7 @@ ValueError: Invalid input tag
 
 ```
 
-According to *Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b]*, `\uFFFD`  is usually unproblematic, as a replacement for unwanted or dangerous characters. That is, `\uFFFD` will typically just cause a failure in parsing. Where the output character set is not Unicode, though, this character may not be available.
+According to *Unicode Technical Report #36, Unicode Security Considerations* [[Davis 2008b]](https://www.unicode.org/reports/tr36/), `\uFFFD` is usually unproblematic as a replacement for unwanted or dangerous characters. That is, `\uFFFD` will typically just cause a failure in parsing. Where the output character set is not Unicode, though, this character may not be available.
 
 ## Automated Detection
 
@@ -139,6 +139,6 @@ According to *Unicode Technical Report #36, Unicode Security Considerations [Dav
 
 |||
 |:---|:---|
-|[Unicode 2024]|Unicode 16.0.0 [online]. Available from: [https://www.unicode.org/versions/Unicode16.0.0/](https://www.unicode.org/versions/Unicode16.0.0/) [accessed 20 March 2025] |
-|[Davis 2008b]|Unicode Technical Report #36, Unicode Security Considerations, Section 3.5 "Deletion of Code Points" [online]. Available from: [https://www.unicode.org/reports/tr36/](https://www.unicode.org/reports/tr36/) [accessed 20 March 2025] |
-|[Davis 2008b]|Unicode Technical Report #36, Unicode Security Considerations, Section 3.5 "Deletion of Code Points" [online]. Available from: [https://www.unicode.org/reports/tr36/](https://www.unicode.org/reports/tr36/) [accessed 20 March 2025] |
+|\[Unicode 2024\]|Unicode 16.0.0 \[online\]. Available from: [https://www.unicode.org/versions/Unicode16.0.0/](https://www.unicode.org/versions/Unicode16.0.0/) \[accessed 20 March 2025\] |
+|\[Davis 2008b\]|Unicode Technical Report #36, Unicode Security Considerations, Section 3.5 "Deletion of Code Points" \[online\]. Available from: [https://www.unicode.org/reports/tr36/](https://www.unicode.org/reports/tr36/) \[accessed 20 March 2025\] |
+|\[Davis 2008b\]|Unicode Technical Report #36, Unicode Security Considerations, Section 3.5 "Deletion of Code Points" \[online\]. Available from: [https://www.unicode.org/reports/tr36/](https://www.unicode.org/reports/tr36/) \[accessed 20 March 2025\] |
diff --git a/docs/Secure-Coding-Guide-for-Python/CWE-707/CWE-180/README.md b/docs/Secure-Coding-Guide-for-Python/CWE-707/CWE-180/README.md
@@ -0,0 +1,181 @@
+# CWE-180: Incorrect Behavior Order: Validate Before Canonicalize
+
+Normalize or canonicalize strings before validating them, so that risky strings such as `../../../../passwd` cannot slip past validation and enable directory traversal attacks, and to reduce the risk of `XSS` attacks.
+
+Supporting multiple languages requires an extended character encoding such as `UTF-8`, which can encode all __1,112,064__ valid Unicode code points.
+
+Character encoding systems such as `ASCII`, `Windows-1252`, or `UTF-8` define an agreed mapping between byte values and human-readable characters, known as code points. Each code point ties together a fixed number such as "`\u002e`", its graphical representation "`.`", and its name "`FULL STOP`" [[Batchelder 2022]](https://www.youtube.com/watch?v=sgHbC6udIqc). Using the same encoding and normalization form ensures that equivalent strings have a unique binary representation, as described in Unicode Standard _Annex #15, Unicode Normalization Forms_ [[Davis 2008]](https://wiki.sei.cmu.edu/confluence/display/java/Rule+AA.+References#RuleAA.References-Davis08). Different or unexpected changes in encoding can allow attackers to work around validation or input sanitization efforts.
+
+> [!WARNING]
+> Use allow lists rather than deny lists. Exclusion lists are a moving target that needs continuous maintenance, as per [CWE-184: Incomplete List of Disallowed Input - Development Environment](../../CWE-693/CWE-184/README.md).
+
+<table>
+    <tr>
+        <th colspan="3">NFKC normalized</th>
+        <th colspan="3">UTF-16 (hex)</th>
+    </tr>
+    <tr>
+        <th>Print</th>
+        <th>Hex</th>
+        <th>Name</th>
+        <th>Print</th>
+        <th>Hex</th>
+        <th>Name</th>
+    </tr>
+    <tr>
+        <td >.</td>
+        <td>\u002e</td>
+        <td>FULL STOP</td>
+        <td>․</td>
+        <td>\u2024</td>
+        <td>ONE DOT LEADER</td>
+    </tr>
+    <tr>
+        <td >..</td>
+        <td>\u002e\u002e</td>
+        <td>FULL STOP FULL STOP</td>
+        <td>‥</td>
+        <td>\u2025</td>
+        <td>TWO DOT LEADER</td>
+    </tr>
+    <tr>
+        <td >/</td>
+        <td>\u002f</td>
+        <td>SOLIDUS</td>
+        <td>／</td>
+        <td>\uff0f</td>
+        <td>FULLWIDTH SOLIDUS</td>
+    </tr>
+</table>
+
+The `NFKC` and `NFKD` compatibility modes cause a `ONE DOT LEADER` to become a `FULL STOP`, as demonstrated in `example01.py` [[python.org 2023]](https://docs.python.org/3/library/unicodedata.html).
+
+__[example01.py](example01.py):__
+
+```py
+""" Code Example """
+
+# SPDX-FileCopyrightText: OpenSSF project contributors
+# SPDX-License-Identifier: MIT
+import unicodedata
+
+print("\N{FULL STOP}" * 10)
+print("." == unicodedata.normalize("NFC", "\u2024") == "\N{FULL STOP}" == "\u002e")
+print("." == unicodedata.normalize("NFD", "\u2024") == "\N{FULL STOP}" == "\u002e")
+print("." == unicodedata.normalize("NFKC", "\u2024") == "\N{FULL STOP}" == "\u002e")
+print("." == unicodedata.normalize("NFKD", "\u2024") == "\N{FULL STOP}" == "\u002e")
+print("\N{FULL STOP}" * 10)
+```
+
+The first two comparisons in `example01.py` print `False` because `NFC` and `NFD` do not apply compatibility mappings, while the last two comparisons, using `NFKC` and `NFKD`, print `True`. The issue depends on whether normalization is used, its mode, and when it is applied.
+
+Applying a compatibility mode (`NFKC` or `NFKD`) after validation can allow attackers to disguise malicious strings with characters beyond the `ASCII` range of `0-127`, for example by turning a `ONE DOT LEADER` (`\u2024`) into a `FULL STOP` (`\u002E`).
+
+Using the non-compatibility modes `NFC` and `NFD`, or stripping characters, can cause a harmless string such as `<script生>` to turn into `<script>`, as per _CWE-182: Collapse of Data into Unsafe Value (4.16)_ [[MITRE CWE-182 2024]](https://cwe.mitre.org/data/definitions/182.html).
+
+## Non-Compliant Code Example - Compatibility mode
+
+Reducing the list of allowed characters or switching between different encodings can be required by design in order to stay compatible between different systems.
+
+The `noncompliant01.py` code attempts to detect a directory traversal attack but calls `unicodedata.normalize()` only for logging, after the check has already run.
+
+__[noncompliant01.py](noncompliant01.py):__
+
+```python
+# SPDX-FileCopyrightText: OpenSSF project contributors
+# SPDX-License-Identifier: MIT
+"""Non-compliant Code Example"""
+
+import re
+import unicodedata
+
+
+def api_with_ids(suspicious_string: str):
+    """Fancy intrusion detection system(IDS)"""
+    if re.search("./", suspicious_string):
+        normalized_string = unicodedata.normalize("NFKC", suspicious_string)
+        print(f"detected an attack sequence {normalized_string}")
+    else:
+        print("Nothing suspicious")
+
+
+#####################
+# attempting to exploit above code example
+#####################
+# The MALICIOUS_INPUT is using:
+# \u2024 or "ONE DOT LEADER"
+# \uFF0F or 'FULLWIDTH SOLIDUS'
+api_with_ids("\u2024\u2024\uff0f" * 10 + "passwd")
+```
+
+The `re.search("./", ...)` call cannot detect the `ONE DOT LEADER` or `FULLWIDTH SOLIDUS` characters because the string is not normalized before the check, which allows a directory traversal attack.
+
+## Compliant Solution - Compatibility mode
+
+This compliant solution normalizes the string before testing it. According to _Annex #15_ [[Davis 2008]](https://wiki.sei.cmu.edu/confluence/display/java/Rule+AA.+References#RuleAA.References-Davis08) and [[Batchelder 2022]](https://www.youtube.com/watch?v=sgHbC6udIqc), we want to ensure that strings have a unique binary representation within our code.
+
+__[compliant01.py](compliant01.py):__
+
+```python
+# SPDX-FileCopyrightText: OpenSSF project contributors
+# SPDX-License-Identifier: MIT
+"""Compliant Code Example"""
+
+import re
+import unicodedata
+
+
+def api_with_ids(suspicious_string: str):
+    """Fancy intrusion detection system(IDS)"""
+    normalized_string = unicodedata.normalize("NFKC", suspicious_string)
+    if re.search("./", normalized_string):
+        print("detected an attack sequence with . or /")
+    else:
+        print("Nothing suspicious")
+
+
+#####################
+# attempting to exploit above code example
+#####################
+# The MALICIOUS_INPUT is using:
+# \u2024 or "ONE DOT LEADER"
+# \uFF0F or 'FULLWIDTH SOLIDUS'
+api_with_ids("\u2024\u2024\uff0f" * 10 + "passwd")
+
+```
+
+Developers should be aware of the encoding of data printed to `HTML`. For example, the string `숍訊昱穿刷奄剔㏆穽侘㈊섞昌侄從쒜` triggered an `XSS` vulnerability in Chrome when the charset of the webpage was set to `ISO-2022-KR` or another unknown charset [[issues.chromium.org 2025]](https://issues.chromium.org/issues/40076480).
+At the time, this Korean charset was unsupported, so Chrome attempted to fall back to the Windows default encoding `Windows-1252` and executed the code [[Taylor 2009]](https://zaynar.co.uk/posts/charset-encoding-xss/).
+
+Note that some operating systems (Windows, Mac) have system encodings for various characters that get executed on a webpage regardless of charset. These should be avoided, as they can cause issues on devices that do not support that charset. Other limited character sets, such as `ASCII`, should be avoided too, because mobile phones and old SMS systems generally have a very limited charset and can behave unexpectedly.
+
+## Automated Detection
+
+None known
+
+|Tool|Version|Checker|Description|
+|:---|:---|:---|:---|
+|||||
+
+## Related Guidelines
+
+|||
+|:---|:---|
+|[ISO/IEC TR 24772:2013](https://wiki.sei.cmu.edu/confluence/display/java/Rule+AA.+References#RuleAA.References-ISO/IECTR24772-2013)|Cross-site Scripting \[XYT\] \[online\], available from: <https://wiki.sei.cmu.edu/confluence/display/java/Rule+AA.+References#RuleAA.References-ISO/IECTR24772-2013>, \[Accessed April 2025\]|
+|[MITRE CWE](http://cwe.mitre.org/)|Pillar CWE - CWE-707: Improper Neutralization \[online\], available from: <https://cwe.mitre.org/data/definitions/707.html> \[Accessed April 2025\]|
+|[MITRE CWE](http://cwe.mitre.org/)|Variant: CWE-180, Incorrect behavior order: Validate before canonicalize \[online\], available from: <http://cwe.mitre.org/data/definitions/180.html>|
+|[MITRE CWE](http://cwe.mitre.org/)|Base: CWE-182: Collapse of Data into Unsafe Value (4.16) \[online\], available from: <http://cwe.mitre.org/data/definitions/182.html>|
+|[MITRE CWE](http://cwe.mitre.org/)|Base: CWE-184: Incomplete List of Disallowed Input - Development Environment. \[online\], available from: <http://cwe.mitre.org/data/definitions/184.html>|
+|[SEI CERT Coding Standard for Java](https://wiki.sei.cmu.edu/confluence/display/java/SEI+CERT+Oracle+Coding+Standard+for+Java)|IDS01-J. Normalize strings before validating them \[online\], available from: <https://wiki.sei.cmu.edu/confluence/display/java/IDS01-J.+Normalize+strings+before+validating+them>|
+
+## Bibliography
+
+|||
+|:---|:---|
+|\[Davis 2008\]|Mark Davis and Ken Whistler, Unicode Standard Annex #15, Unicode Normalization Forms, 2008 \[online\], available from: <http://unicode.org/reports/tr15/> \[Accessed April 2025\]<br>Mark Davis and Michel Suignard, Unicode Technical Report #36, Unicode Security Considerations, 2008 \[online\], available from: <http://www.unicode.org/reports/tr36/> \[Accessed 4 April 2025\]|
+|\[Weber 2009\]|Unraveling Unicode: A Bag of Tricks for Bug Hunting \[online\], available from: <http://www.lookout.net/wp-content/uploads/2009/03/chris_weber_exploiting-unicode-enabled-software-v15.pdf> \[Accessed April 2025\]|
+|\[Batchelder 2022\]|Pragmatic Unicode, or, How do I stop the pain? - YouTube \[online\], available from: <https://www.youtube.com/watch?v=sgHbC6udIqc> \[Accessed April 2025\]|
+|\[Kuchling 2022\]|Unicode HOWTO \[online\], available from: <https://docs.python.org/3/howto/unicode.html> \[Accessed April 2025\]|
+|\[python.org 2023\]|unicodedata — Unicode Database — Python 3.12.0 documentation \[online\], available from: <https://docs.python.org/3/library/unicodedata.html> \[Accessed April 2025\]|
+|\[issues.chromium.org 2025\]|XSS issue due to the lack of support for ISO-2022-KR \[online\], available from: <https://issues.chromium.org/issues/40076480> \[Accessed April 2025\]|
+|\[Taylor 2009\]|XSS vulnerabilities with unusual character encodings \[online\], available from: <https://zaynar.co.uk/posts/charset-encoding-xss/> \[Accessed April 2025\]|
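
The CWE-180 README above mentions that stripping characters can collapse a harmless string such as `<script生>` into `<script>` (CWE-182). A minimal sketch of that effect, assuming the stripping is done with an ASCII encode that ignores errors (this is only one possible way an application might strip characters, not something taken from the example files):

```python
import unicodedata

# "<script生>" from the README; U+751F is the CJK character between "t" and ">".
payload = "<script\u751f>"

# Assumption for illustration: the application drops everything outside ASCII.
stripped = payload.encode("ascii", errors="ignore").decode("ascii")
print(stripped)  # the CJK character is gone, leaving a real tag

# NFKC does not change this payload, since U+751F has no compatibility mapping,
# so the collapse here comes from the stripping step alone.
normalized = unicodedata.normalize("NFKC", payload)
print(normalized == payload)
```

Any validation therefore has to run on the string after the last transformation, whether that transformation is normalization or stripping.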
diff --git a/docs/Secure-Coding-Guide-for-Python/CWE-707/CWE-180/compliant01.py b/docs/Secure-Coding-Guide-for-Python/CWE-707/CWE-180/compliant01.py
@@ -1,11 +1,12 @@
 # SPDX-FileCopyrightText: OpenSSF project contributors
 # SPDX-License-Identifier: MIT
-""" Compliant Code Example """
+"""Compliant Code Example"""
+
 import re
 import unicodedata
 
 
-def api_with_ids(suspicious_string):
+def api_with_ids(suspicious_string: str):
     """Fancy intrusion detection system(IDS)"""
     normalized_string = unicodedata.normalize("NFKC", suspicious_string)
     if re.search("./", normalized_string):
@@ -20,4 +21,4 @@ def api_with_ids(suspicious_string):
 # The MALICIOUS_INPUT is using:
 # \u2024 or "ONE DOT LEADER"
 # \uFF0F or 'FULLWIDTH SOLIDUS'
-api_with_ids("\u2024\u2024\uFF0F" * 10 + "passwd")
+api_with_ids("\u2024\u2024\uff0f" * 10 + "passwd")
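
The README in this change recommends allow lists over deny lists. As a hedged sketch of how `compliant01.py` could be combined with that advice, the following normalizes first and then checks against an allow list. The `canonical_name` function and the `ALLOWED_NAME` pattern are hypothetical and would need to match the real requirements of an application:

```python
import re
import unicodedata

# Hypothetical allow list for a single file name: ASCII letters, digits,
# "_" and "-", optionally separated by single dots.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_-]+(?:\.[A-Za-z0-9_-]+)*")


def canonical_name(untrusted: str) -> str:
    """Normalize first, then validate the normalized form against the allow list."""
    normalized = unicodedata.normalize("NFKC", untrusted)
    if ALLOWED_NAME.fullmatch(normalized) is None:
        raise ValueError("Invalid file name")
    # Return the normalized value so callers never use the raw input.
    return normalized


print(canonical_name("report.txt"))
try:
    canonical_name("\u2024\u2024\uff0f" * 10 + "passwd")
except ValueError as error:
    print(f"rejected: {error}")
```

Returning the normalized string, rather than a boolean, keeps the validated value and the value used afterwards identical.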
diff --git a/docs/Secure-Coding-Guide-for-Python/CWE-707/CWE-180/example01.py b/docs/Secure-Coding-Guide-for-Python/CWE-707/CWE-180/example01.py
@@ -3,8 +3,8 @@
 import unicodedata
 
 print("\N{FULL STOP}" * 10)
-print("." == unicodedata.normalize("NFC", "\u2024") == "\N{FULL STOP}" == "\u002E")
-print("." == unicodedata.normalize("NFD", "\u2024") == "\N{FULL STOP}" == "\u002E")
-print("." == unicodedata.normalize("NFKC", "\u2024") == "\N{FULL STOP}" == "\u002E")
-print("." == unicodedata.normalize("NFKD", "\u2024") == "\N{FULL STOP}" == "\u002E")
+print("." == unicodedata.normalize("NFC", "\u2024") == "\N{FULL STOP}" == "\u002e")
+print("." == unicodedata.normalize("NFD", "\u2024") == "\N{FULL STOP}" == "\u002e")
+print("." == unicodedata.normalize("NFKC", "\u2024") == "\N{FULL STOP}" == "\u002e")
+print("." == unicodedata.normalize("NFKD", "\u2024") == "\N{FULL STOP}" == "\u002e")
 print("\N{FULL STOP}" * 10)
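
The mapping table in the README can be cross-checked with `unicodedata` instead of being maintained by hand. A small sketch, where `describe_nfkc` is a hypothetical helper name introduced only for this illustration:

```python
import unicodedata


def describe_nfkc(char: str) -> str:
    """Return a one-line summary of what NFKC does to a single character."""
    normalized = unicodedata.normalize("NFKC", char)
    code_points = " ".join(f"U+{ord(c):04X}" for c in normalized)
    names = " + ".join(unicodedata.name(c) for c in normalized)
    return f"U+{ord(char):04X} {unicodedata.name(char)} -> {code_points} ({names})"


# The three characters used in the README table.
for sample in ("\u2024", "\u2025", "\uff0f"):
    print(describe_nfkc(sample))
```

Running it shows, among other things, that `FULLWIDTH SOLIDUS` normalizes to `SOLIDUS` at `U+002F`.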
diff --git a/docs/Secure-Coding-Guide-for-Python/CWE-707/CWE-180/noncompliant01.py b/docs/Secure-Coding-Guide-for-Python/CWE-707/CWE-180/noncompliant01.py
@@ -1,11 +1,12 @@
 # SPDX-FileCopyrightText: OpenSSF project contributors
 # SPDX-License-Identifier: MIT
-""" Non-compliant Code Example """
+"""Non-compliant Code Example"""
+
 import re
 import unicodedata
 
 
-def api_with_ids(suspicious_string):
+def api_with_ids(suspicious_string: str):
     """Fancy intrusion detection system(IDS)"""
     if re.search("./", suspicious_string):
         normalized_string = unicodedata.normalize("NFKC", suspicious_string)
@@ -20,4 +21,4 @@ def api_with_ids(suspicious_string):
 # The MALICIOUS_INPUT is using:
 # \u2024 or "ONE DOT LEADER"
 # \uFF0F or 'FULLWIDTH SOLIDUS'
-api_with_ids("\u2024\u2024\uFF0F" * 10 + "passwd")
+api_with_ids("\u2024\u2024\uff0f" * 10 + "passwd")
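
To make the ordering problem in `noncompliant01.py` easier to observe, the check can be written as two functions that return a boolean. This is a sketch under assumptions: the function names are hypothetical, and the pattern is changed to a literal `../`, because the unescaped `.` in the original `"./"` pattern is a regex wildcard that matches any character.

```python
import re
import unicodedata

# Literal "../" sequence; the dots are escaped so they match only FULL STOP.
TRAVERSAL = re.compile(r"\.\./")


def detect_before_normalizing(suspicious_string: str) -> bool:
    """Non-compliant order: the check runs on the raw input."""
    return TRAVERSAL.search(suspicious_string) is not None


def detect_after_normalizing(suspicious_string: str) -> bool:
    """Compliant order: normalize with NFKC, then check."""
    normalized_string = unicodedata.normalize("NFKC", suspicious_string)
    return TRAVERSAL.search(normalized_string) is not None


malicious_input = "\u2024\u2024\uff0f" * 10 + "passwd"
print(detect_before_normalizing(malicious_input))  # check misses the disguised sequence
print(detect_after_normalizing(malicious_input))   # check sees "../" after NFKC

# The original pattern "./" also matches harmless input such as "a/b".
print(re.search("./", "a/b") is not None)
```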
diff --git a/docs/Secure-Coding-Guide-for-Python/CWE-707/CWE-78/README.md b/docs/Secure-Coding-Guide-for-Python/CWE-707/CWE-78/README.md
@@ -75,7 +75,7 @@ The `FileOperations().list_dir()` method allows an attacker to add commands via
 
 The attack surface increases if a user is also allowed to upload or create files or folders.
 
-The `noncompliant02.py` example demonstrates the injection via file or folder name that is created prior to using the `list_dir()` method. We assume here that an untrusted user is allowed to create files or folders named `& calc.exe or ;ps aux` as part of another service such as upload area, submit form, or as a result of a zip-bomb as per *CWE-409: Improper Handling of Highly Compressed Data (Data Amplification)*. Encoding issues as described in *CWE-180: Incorrect Behavior Order: Validate Before Canonicalize* must also be considered.
+The `noncompliant02.py` example demonstrates the injection via file or folder name that is created prior to using the `list_dir()` method. We assume here that an untrusted user is allowed to create files or folders named `& calc.exe or ;ps aux` as part of another service such as upload area, submit form, or as a result of a zip-bomb as per *CWE-409: Improper Handling of Highly Compressed Data (Data Amplification)*. Encoding issues as described in *[CWE-180: Incorrect Behavior Order: Validate Before Canonicalize](../CWE-180/README.md)* must also be considered.
 
 The issue occurs when mixing shell commands with data from a lesser trusted source.
 

diff --git a/docs/Secure-Coding-Guide-for-Python/readme.md b/docs/Secure-Coding-Guide-for-Python/readme.md
@@ -97,7 +97,7 @@ It is __not production code__ and requires code-style or python best practices t
 |[CWE-89: Improper Neutralization of Special Elements used in an SQL Command ('SQL Injection')](CWE-707/CWE-89/README.md)|[CVE-2019-8600](https://www.cvedetails.com/cve/CVE-2019-8600/),<br/>CVSSv3.1: __9.8__,<br/>EPSS: __01.43__ (18.02.2024)|
 |[CWE-117: Improper Output Neutralization for Logs](CWE-707/CWE-117/.)||
 |[CWE-175: Improper Handling of Mixed Encoding](CWE-707/CWE-175/README.md)||
-|[CWE-180: Incorrect behavior order: Validate before Canonicalize](CWE-707/CWE-180/.)||
+|[CWE-180: Incorrect behavior order: Validate before Canonicalize](CWE-707/CWE-180/README.md)||
 
 |[CWE-710: Improper Adherence to Coding Standards](https://cwe.mitre.org/data/definitions/710.html)|Prominent CVE|
 |:----------------------------------------------------------------|:----|