
For GSoC 2026: FLOSS: Extract Language Specific Strings (.NET, Swift, Zig, ...) #1222

Draft
utam-1 wants to merge 1 commit into mandiant:master from utam-1:master

Conversation


@utam-1 utam-1 commented Mar 3, 2026

The code presented here specifically implements extraction of strings from .NET binaries.
Issue: #718


google-cla bot commented Mar 3, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances FLOSS's capability to analyze .NET executables by introducing a dedicated mechanism for extracting language-specific strings. It focuses on recovering user strings embedded within the CLI metadata's #US heap, which are often overlooked by standard string scanning techniques. This addition provides a more complete picture of string artifacts present in .NET binaries, improving the depth of analysis.

Highlights

  • .NET String Extraction: Implemented a new module to extract user strings from .NET PE binaries by parsing the CLI metadata #US heap, enabling recovery of strings not found by generic scanners.
  • FLOSS Integration: Integrated the new .NET string extraction logic into the main FLOSS analysis workflow, allowing FLOSS to process .NET executables more comprehensively.
  • Global Prevalence Database Utility: Added an example script to generate a global prevalence database from extracted string data, with options to filter by .NET or native string types.
  • Comprehensive Testing: Included new unit tests to validate the functionality and accuracy of the .NET string extractor, ensuring robust operation.


Changelog
  • example_script/generate_gp_db.py
    • Added a script to generate a global prevalence database from extracted string data, supporting filtering by .NET or native types.
  • floss/language/dotnet/extract.py
    • Added a new module for extracting user strings from .NET PE binaries by parsing the CLI metadata #US heap.
    • Implemented functions for reading compressed unsigned integers, locating the CLI metadata root, and parsing the #US heap structure.
    • Provided a public API extract_dotnet_strings to retrieve StaticString instances from a given .NET PE file.
  • floss/main.py
    • Updated to import and integrate the new .NET string extraction logic into the main analysis flow.
    • Modified the .NET language handling to call extract_dotnet_strings when static string analysis is enabled.
    • Disabled deobfuscation algorithms (stack, tight, decoded strings) for .NET binaries, noting that they are not meaningful for pure managed assemblies.
  • tests/test_language_extract_dotnet.py
    • Added new unit tests to validate the functionality of the .NET string extractor.
    • Verified the extraction of known strings, the location of the metadata root and #US heap, UTF-16LE encoding, minimum length filtering, and overall string count using a sample .NET binary.
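
The changelog above mentions reading compressed unsigned integers and parsing the #US heap. As a rough illustration of what that involves (a self-contained sketch following the ECMA-335 encoding, with hypothetical function names — not the PR's actual implementation):

```python
def read_compressed_uint(buf: bytes, offset: int) -> tuple[int, int]:
    """Decode an ECMA-335 compressed unsigned integer.

    Returns (value, number of bytes consumed): 1-byte form 0xxxxxxx,
    2-byte form 10xxxxxx xxxxxxxx, 4-byte form 110xxxxx + 3 bytes.
    """
    b0 = buf[offset]
    if b0 & 0x80 == 0:
        return b0, 1
    if b0 & 0xC0 == 0x80:
        return ((b0 & 0x3F) << 8) | buf[offset + 1], 2
    return (
        ((b0 & 0x1F) << 24)
        | (buf[offset + 1] << 16)
        | (buf[offset + 2] << 8)
        | buf[offset + 3]
    ), 4


def iter_us_heap_strings(heap: bytes, min_length: int = 4):
    """Yield UTF-16LE strings from a raw #US heap blob.

    Each entry is a compressed length prefix followed by UTF-16LE data
    plus one trailing flag byte; offset 0 holds a single null entry.
    """
    offset = 1  # skip the leading null entry
    while offset < len(heap):
        length, consumed = read_compressed_uint(heap, offset)
        data_start = offset + consumed
        if length == 0:
            offset = data_start
            continue
        raw = heap[data_start : data_start + length - 1]  # drop the flag byte
        try:
            s = raw.decode("utf-16-le")
            if len(s) >= min_length:
                yield s
        except UnicodeDecodeError:
            pass
        offset = data_start + length
```

For example, a heap consisting of the null entry, one "hello" entry (10 bytes of UTF-16LE plus the flag byte, so a length prefix of 0x0B), and its flag byte yields exactly `"hello"`.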
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP 8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature for extracting language-specific strings from .NET binaries by parsing the #US heap. However, several security vulnerabilities related to improper handling of user-supplied file paths and file contents have been identified, which could lead to information disclosure, arbitrary file write, and denial of service. Additionally, a potential bug in the string deduplication logic and some minor style issues require attention.

Comment on lines +46 to +47
with open(file_path, "r", encoding="utf-8") as f:
    d = json.load(f)

security-high

The generate_gp_db function processes a user-supplied path argument, iterating through all files within it and attempting to parse them as JSON. This design allows an attacker to specify sensitive directories or non-JSON files, leading to information disclosure of arbitrary file contents or denial of service through malformed input parsing. This is a critical security flaw as it allows reading of arbitrary files on the system.
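
One way to mitigate this (a defensive sketch, not the PR's code; the helper name is hypothetical) is to restrict traversal to `*.json` files and handle parse failures explicitly instead of crashing on arbitrary input:

```python
import json
import pathlib


def iter_json_files(root: str):
    """Yield parsed JSON documents from a directory tree.

    Defensive variant: only *.json files under the given root are
    considered, and malformed files are skipped with a message rather
    than aborting the whole run.
    """
    root_path = pathlib.Path(root).resolve()
    for file_path in sorted(root_path.rglob("*.json")):
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                yield json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError, OSError) as e:
            print(f"skipping {file_path}: {e}")
```

This does not by itself prevent a user from pointing the script at a sensitive directory, but it narrows what gets read and keeps malformed input from taking down the process.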

Comment on lines +107 to +112
    with gzip.open(args.outfile, "w") as f:
        for k, v in sorted(gp.metadata_by_string.items(), key=lambda x: x[1][0].global_count, reverse=True):
            for e in v:
                f.write(msgspec.json.encode(e) + b"\n")
else:
    with open(args.outfile, "w", encoding="utf-8") as f:

security-high

The script takes a user-supplied outfile argument to determine where the generated database will be written. This allows an attacker to specify arbitrary file paths, including sensitive system files, leading to arbitrary file write. This could result in denial of service by corrupting critical files or even privilege escalation if specific configuration files are overwritten.
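
A possible mitigation (a sketch under the assumption that outputs should stay inside one designated directory; the function name is hypothetical) is to resolve the output path and refuse anything that escapes an allowed base directory:

```python
import pathlib


def safe_output_path(outfile: str, allowed_dir: str) -> pathlib.Path:
    """Resolve outfile and refuse paths escaping allowed_dir.

    Both paths are fully resolved (symlinks and '..' components
    collapsed) before comparison, so '../../etc/passwd' tricks fail.
    """
    out = pathlib.Path(outfile).resolve()
    base = pathlib.Path(allowed_dir).resolve()
    if base not in out.parents:
        raise ValueError(f"refusing to write outside {base}: {out}")
    return out
```

Whether such a restriction fits a CLI tool (where users legitimately choose output paths) is a design decision for the maintainers.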

Comment on lines +292 to +293
p = pathlib.Path(sample)
buf = p.read_bytes()

security-high

The extract_dotnet_strings function directly reads the entire content of a user-supplied sample file into memory. This poses a significant risk of path traversal, allowing an attacker to read sensitive files from the system. Additionally, if a very large file is provided, it could lead to memory exhaustion and a denial of service. While the floss/main.py caller has a size check, this function itself lacks such a safeguard, making it vulnerable if called directly.
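
The size-check concern could be addressed with a guard like the following (a sketch; the 64 MiB cap and function name are assumptions, not values taken from FLOSS):

```python
import pathlib


def read_sample_bytes(sample: str, max_size: int = 64 * 1024 * 1024) -> bytes:
    """Read a sample file, refusing oversized inputs.

    Checking st_size before read_bytes() avoids loading a huge file
    into memory only to reject it afterwards.
    """
    p = pathlib.Path(sample)
    size = p.stat().st_size
    if size > max_size:
        raise ValueError(f"sample too large: {size} bytes (limit {max_size})")
    return p.read_bytes()
```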

Comment on lines +332 to +335
key = (s, file_offset)
if key in seen:
    continue
seen.add(key)

high

This deduplication logic is a no-op. The file_offset is unique for each string yielded by iter_dotnet_user_strings, so the key (s, file_offset) will always be unique, and the if key in seen: condition will never be true.

If the intention is to report every occurrence of a string, this block (and the seen set initialization on line 316) should be removed for clarity. If the intention is to deduplicate by string value, the logic should be changed to use only s as the key.
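
The deduplicate-by-value variant would look like this (a sketch with a hypothetical helper name, keying the seen-set on the string alone since each offset is already unique):

```python
def dedupe_by_value(strings):
    """Yield (string, file_offset) pairs, keeping only the first
    occurrence of each distinct string value."""
    seen = set()
    for s, file_offset in strings:
        if s in seen:
            continue
        seen.add(s)
        yield s, file_offset
```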

buf = p.read_bytes()

try:
    pe = pefile.PE(data=buf, fast_load=True)

security-medium

The extract_dotnet_strings function passes the raw byte content of a user-supplied file to the pefile.PE constructor. Parsing complex, untrusted file formats like PE files can expose the application to vulnerabilities within the parsing library. A specially crafted, malformed PE file could cause pefile to consume excessive CPU or memory, leading to resource exhaustion and a denial of service for the script. Although PEFormatError is caught, other internal issues or resource exhaustion might occur.
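
One defensive pattern (a generic sketch, not FLOSS code; the wrapper name and size cap are assumptions) is to gate the parser call behind a size limit and treat any parser failure as "unparseable" rather than letting it propagate:

```python
def parse_untrusted(buf: bytes, parser, max_size: int = 64 * 1024 * 1024):
    """Defensively invoke a parser for an untrusted format.

    Caps input size and converts any parser exception into None so a
    crafted input cannot crash the extraction pipeline.
    """
    if len(buf) > max_size:
        return None
    try:
        return parser(buf)
    except Exception:
        # pefile raises PEFormatError for invalid PEs, but a crafted
        # file could trigger other errors; treat any failure as a miss.
        return None
```

Usage would be along the lines of `pe = parse_untrusted(buf, lambda b: pefile.PE(data=b, fast_load=True))`; note that a size cap mitigates but does not eliminate CPU-exhaustion risk inside the parser.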

try:
    s = raw.decode("utf-16-le")
except UnicodeDecodeError:
    pass

medium

Silently ignoring UnicodeDecodeError can hide issues with the string extraction logic or malformed input files. It is better to log these errors at a debug level to aid in future debugging.

Suggested change
- pass
+ logger.debug("unicode decode error in string entry at offset 0x%x", entry_file_offset)
+ pass


def test_metadata_root_found():
    """The BSJB metadata root must be locatable in the test binary."""
    import pefile

medium

According to PEP 8, imports should be at the top of the file, not inside functions. This import pefile statement (and the one on line 105) should be moved to the top of the file with the other imports.

