For GSoC 2026: FLOSS: Extract Language Specific Strings (.NET, Swift, Zig, ...) #1222
utam-1 wants to merge 1 commit into mandiant:master
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances FLOSS's capability to analyze .NET executables by introducing a dedicated mechanism for extracting language-specific strings. It focuses on recovering user strings embedded within the CLI metadata's #US heap, which are often overlooked by standard string scanning techniques. This addition provides a more complete picture of string artifacts present in .NET binaries, improving the depth of analysis.
Code Review
This pull request introduces a significant new feature for extracting language-specific strings from .NET binaries by parsing the #US heap. However, several security vulnerabilities related to improper handling of user-supplied file paths and file contents have been identified, which could lead to information disclosure, arbitrary file write, and denial of service. Additionally, a potential bug in the string deduplication logic and some minor style issues require attention.
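For readers unfamiliar with the #US heap, the following is a minimal sketch of how its entries can be walked, based on the ECMA-335 layout: each entry starts with a compressed length prefix, followed by UTF-16LE data and one trailing flag byte. It is not the code from this pull request; the names `read_compressed_length` and `iter_us_heap_strings` are hypothetical, and the sketch assumes the heap bytes have already been located and sliced out of the metadata.

```python
import logging
from typing import Iterator, Tuple

logger = logging.getLogger(__name__)


def read_compressed_length(heap: bytes, pos: int) -> Tuple[int, int]:
    """Return (blob_length, prefix_size) for the entry starting at pos.

    Hypothetical helper: decodes the 1-, 2-, or 4-byte compressed
    unsigned integer that prefixes each heap entry.
    """
    b0 = heap[pos]
    if (b0 & 0x80) == 0x00:
        return b0, 1
    if (b0 & 0xC0) == 0x80:
        return ((b0 & 0x3F) << 8) | heap[pos + 1], 2
    if (b0 & 0xE0) == 0xC0:
        value = ((b0 & 0x1F) << 24) | (heap[pos + 1] << 16)
        value |= (heap[pos + 2] << 8) | heap[pos + 3]
        return value, 4
    raise ValueError("invalid compressed length prefix at 0x%x" % pos)


def iter_us_heap_strings(heap: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (entry_offset, string) for each non-empty entry in the heap bytes."""
    pos = 0
    while pos < len(heap):
        try:
            length, prefix_size = read_compressed_length(heap, pos)
        except (ValueError, IndexError):
            break  # malformed or truncated prefix: stop instead of guessing
        start = pos + prefix_size
        end = start + length
        if end > len(heap):
            break  # entry claims more bytes than the heap contains
        # The blob length includes one trailing flag byte, so the
        # UTF-16LE payload is everything except the last byte.
        raw = heap[start:end - 1] if length > 1 else b""
        if raw and len(raw) % 2 == 0:
            try:
                yield pos, raw.decode("utf-16-le")
            except UnicodeDecodeError:
                logger.debug("unicode decode error in entry at offset 0x%x", pos)
        pos = end
```

Zero-length entries, such as the leading empty entry and any trailing padding, are skipped, and the walk stops at the first entry that cannot be parsed.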
    with open(file_path, "r", encoding="utf-8") as f:
        d = json.load(f)
The generate_gp_db function processes a user-supplied path argument, iterating through all files within it and attempting to parse them as JSON. This design allows an attacker to specify sensitive directories or non-JSON files, leading to information disclosure of arbitrary file contents or denial of service through malformed input parsing. This is a critical security flaw as it allows reading of arbitrary files on the system.
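One way to narrow this, sketched under assumptions rather than taken from the pull request, is to consider only regular `*.json` files directly inside the given directory and to skip anything that does not parse. The helper name `iter_json_documents` is hypothetical.

```python
import json
import pathlib
from typing import Any, Iterator, Tuple


def iter_json_documents(root: str) -> Iterator[Tuple[pathlib.Path, Any]]:
    """Yield (path, parsed_document) for each parseable *.json file in root."""
    base = pathlib.Path(root).resolve()
    if not base.is_dir():
        raise NotADirectoryError(str(base))
    for path in sorted(base.glob("*.json")):
        # Skip symlinks and non-regular files so the walk stays inside root.
        if path.is_symlink() or not path.is_file():
            continue
        try:
            with open(path, "r", encoding="utf-8") as f:
                document = json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # not valid UTF-8 JSON: skip rather than abort
        yield path, document
```

This does not make the path argument trustworthy on its own; it only limits which files are opened and keeps one malformed file from stopping the run.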
        with gzip.open(args.outfile, "w") as f:
            for k, v in sorted(gp.metadata_by_string.items(), key=lambda x: x[1][0].global_count, reverse=True):
                for e in v:
                    f.write(msgspec.json.encode(e) + b"\n")
    else:
        with open(args.outfile, "w", encoding="utf-8") as f:
The script takes a user-supplied outfile argument to determine where the generated database will be written. This allows an attacker to specify arbitrary file paths, including sensitive system files, leading to arbitrary file write. This could result in denial of service by corrupting critical files or even privilege escalation if specific configuration files are overwritten.
    p = pathlib.Path(sample)
    buf = p.read_bytes()
The extract_dotnet_strings function directly reads the entire content of a user-supplied sample file into memory. This poses a significant risk of path traversal, allowing an attacker to read sensitive files from the system. Additionally, if a very large file is provided, it could lead to memory exhaustion and a denial of service. While the floss/main.py caller has a size check, this function itself lacks such a safeguard, making it vulnerable if called directly.
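A size guard inside the function itself could look roughly like the sketch below. The limit `MAX_SAMPLE_SIZE` and the helper name `read_sample_bounded` are assumptions made for illustration; they are not values or names from FLOSS.

```python
import pathlib

# Hypothetical limit chosen for this sketch; not a value taken from FLOSS.
MAX_SAMPLE_SIZE = 16 * 1024 * 1024


def read_sample_bounded(sample: str, max_size: int = MAX_SAMPLE_SIZE) -> bytes:
    """Read a sample file, refusing anything larger than max_size bytes."""
    p = pathlib.Path(sample)
    size = p.stat().st_size
    if size > max_size:
        raise ValueError("sample is %d bytes, limit is %d" % (size, max_size))
    with p.open("rb") as f:
        # Read at most one byte past the limit so growth after stat() is caught.
        buf = f.read(max_size + 1)
    if len(buf) > max_size:
        raise ValueError("sample grew past the %d byte limit while reading" % max_size)
    return buf
```

Checking the size before reading avoids loading an oversized file into memory even when the function is called directly rather than through a caller that has its own check.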
    key = (s, file_offset)
    if key in seen:
        continue
    seen.add(key)
This deduplication logic is a no-op. The file_offset is unique for each string yielded by iter_dotnet_user_strings, so the key (s, file_offset) will always be unique, and the if key in seen: condition will never be true.
If the intention is to report every occurrence of a string, this block (and the seen set initialization on line 316) should be removed for clarity. If the intention is to deduplicate by string value, the logic should be changed to use only s as the key.
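The difference between the two keys can be seen in a small standalone sketch. The function names and the sample entries are hypothetical; only the `(s, file_offset)` pairing comes from the snippet above.

```python
from typing import Iterable, Iterator, Set, Tuple

Entry = Tuple[str, int]  # (string value, file offset)


def dedupe_by_value_and_offset(entries: Iterable[Entry]) -> Iterator[Entry]:
    """Mirror the reviewed logic: the key includes the file offset."""
    seen: Set[Entry] = set()
    for s, file_offset in entries:
        key = (s, file_offset)
        if key in seen:
            continue
        seen.add(key)
        yield s, file_offset


def dedupe_by_value(entries: Iterable[Entry]) -> Iterator[Entry]:
    """Keep only the first occurrence of each string value."""
    seen: Set[str] = set()
    for s, file_offset in entries:
        if s in seen:
            continue
        seen.add(s)
        yield s, file_offset


entries = [("a", 0x10), ("b", 0x20), ("a", 0x30)]
print(list(dedupe_by_value_and_offset(entries)))  # all three entries survive
print(list(dedupe_by_value(entries)))  # the second "a" is dropped
```

When every offset is distinct, the first variant never filters anything, which is the no-op described above.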
    buf = p.read_bytes()

    try:
        pe = pefile.PE(data=buf, fast_load=True)
The extract_dotnet_strings function passes the raw byte content of a user-supplied file to the pefile.PE constructor. Parsing complex, untrusted file formats like PE files can expose the application to vulnerabilities within the parsing library. A specially crafted, malformed PE file could cause pefile to consume excessive CPU or memory, leading to resource exhaustion and a denial of service for the script. Although PEFormatError is caught, other internal issues or resource exhaustion might occur.
    try:
        s = raw.decode("utf-16-le")
    except UnicodeDecodeError:
        pass
Silently ignoring UnicodeDecodeError can hide issues with the string extraction logic or malformed input files. It is better to log these errors at a debug level to aid in future debugging.
    -    pass
    +    logger.debug("unicode decode error in string entry at offset 0x%x", entry_file_offset)
    +    pass
    def test_metadata_root_found():
        """The BSJB metadata root must be locatable in the test binary."""
        import pefile
The code presented here is specifically related to extracting strings from .NET binaries.
Issue: #718