Description
Describe the bug
opengrep scan --autofix --dryrun --json crashes with exit code 2 and empty stdout when a scanned repo contains files with non-UTF-8 bytes. The crash is in the bundled pysemgrep's rpc.py, where subprocess.Popen uses text=True, encoding="utf-8" (strict mode). When semgrep-core returns RPC responses containing non-UTF-8 source
bytes, io.read() in _really_read() raises UnicodeDecodeError before application code can handle it.
'utf-8' codec can't decode byte 0xf5 in position 5908: invalid start byte
Traceback (most recent call last):
File "/root/.cache/opengrep/v1.15.1/semgrep/commands/wrapper.py", line 37, in wrapper
File "/root/.cache/opengrep/v1.15.1/semgrep/commands/scan.py", line 855, in scan
File "/root/.cache/opengrep/v1.15.1/semgrep/run_scan.py", line 1011, in run_scan
File "/root/.cache/opengrep/v1.15.1/semgrep/autofix.py", line 60, in apply_fixes
File "/root/.cache/opengrep/v1.15.1/semgrep/rpc_call.py", line 28, in apply_fixes
File "/root/.cache/opengrep/v1.15.1/semgrep/rpc.py", line 136, in rpc_call
File "/root/.cache/opengrep/v1.15.1/semgrep/rpc.py", line 71, in _read_packet
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf5 in position 5908: invalid start byte
Rule matching completes successfully; the crash only happens in the autofix apply_fixes RPC path. Without --autofix the scan works fine. This means all scan results are lost (empty stdout), not just the fix suggestions.
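The failure mode can be shown without opengrep at all. The sketch below assumes that a pipe opened with text=True, encoding="utf-8" behaves like an io.TextIOWrapper using the default strict error handler; the payload and the helper name read_strict are made up for illustration and are not taken from rpc.py.

```python
import io

# Illustrative payload: valid ASCII with a few 0xf5 bytes, which can never
# start a UTF-8 sequence.
payload = b'x = "hello ' + bytes([0xF5]) * 3 + b' world"\n'


def read_strict(raw: bytes) -> str:
    """Decode a byte stream the way a strict UTF-8 text-mode pipe would."""
    # errors defaults to "strict" when only encoding is given.
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
    return stream.read()


try:
    read_strict(payload)
    strict_failed = False
except UnicodeDecodeError as exc:
    # The exception is raised inside read(), before the caller ever sees
    # any of the data, including the parts that were valid.
    strict_failed = True
    failure = exc
    print(failure)
```

Because the decode happens inside the read call, the caller has no chance to salvage the valid part of the response.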
To Reproduce
Scan any repo containing files with non-UTF-8 bytes (e.g. PHP files with ISO-8859-1 encoding) using a rule that has a
fix key:
mkdir /tmp/og-repro && cd /tmp/og-repro && git init
# Rule with a fix pattern
cat > rule.yaml << 'EOF'
rules:
  - id: test-rule
    pattern: $X = "..."
    message: test
    severity: WARNING
    languages: [python]
    fix: $X = "fixed"
EOF
# File with non-UTF-8 bytes
python3 -c "
with open('test.py', 'wb') as f:
    f.write(b'x = \"hello ' + bytes([0xf5]) * 6000 + b' world\"\n')
"
git add . && git commit -m init
opengrep scan --config=rule.yaml --autofix --dryrun --json .
Note: requires the opengrep_manylinux_x86 binary (which bundles pysemgrep). We hit this in production scanning ~48K-file PHP repos.
Expected behavior
Opengrep should handle non-UTF-8 source content gracefully and still produce valid JSON output. Fix suggestions for affected files can be skipped, but scan results should not be lost entirely.
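One way to picture this expected behavior is the sketch below. It is not opengrep's code: failing_apply_fixes is a hypothetical stand-in for the RPC call that currently crashes, render_results is a made-up name, and the output shape (a "results" list plus an "errors" list) is a simplified assumption.

```python
import json
from typing import Callable

Finding = dict  # hypothetical shape: one dict per finding


def failing_apply_fixes(findings: list[Finding]) -> list[Finding]:
    # Hypothetical stand-in for the autofix RPC call hitting non-UTF-8 bytes.
    raise UnicodeDecodeError("utf-8", b"\xf5", 0, 1, "invalid start byte")


def render_results(
    findings: list[Finding],
    apply_fixes: Callable[[list[Finding]], list[Finding]],
) -> str:
    """Emit JSON even if the fix step fails; findings are kept as-is."""
    errors: list[str] = []
    try:
        findings = apply_fixes(findings)
    except UnicodeDecodeError as exc:
        # Skip fix suggestions, but keep the scan results already computed.
        errors.append(f"autofix skipped: {exc}")
    return json.dumps({"results": findings, "errors": errors})


output = render_results(
    [{"check_id": "test-rule", "path": "test.py"}], failing_apply_fixes
)
print(output)
```

The point of the sketch is only that a decode failure in the fix step is recorded as an error while the findings still reach stdout.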
What is the priority of the bug to you?
- P1: important to fix or quite annoying
Environment
Official opengrep_manylinux_x86 binary, v1.15.1, Linux (GKE). The crash is in the bundled pysemgrep at ~/.cache/opengrep/v1.15.1/semgrep/rpc.py.
Use case
We run opengrep with --autofix --dryrun at scale to extract fix suggestions for security findings. When this crashes, we lose all scan results for the repo — not just fix suggestions — since stdout is empty.
Suggested fix
Upstream semgrep fixed this in commit semgrep/semgrep@3ab3b130c9 (semgrep v1.150.0, Dec 5 2025) by changing rpc.py to use text=False (binary mode) with explicit decode("utf-8", errors="replace"). Porting that change to opengrep's bundled pysemgrep should resolve this.
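A minimal sketch of that approach, under stated assumptions: the child process here is a small Python one-liner standing in for semgrep-core, and read_reply is a hypothetical helper, not the upstream patch. It only illustrates reading the pipe in binary mode and decoding with errors="replace".

```python
import subprocess
import sys

# Hypothetical stand-in for semgrep-core: a child process that writes one
# line containing a byte (0xf5) that is not valid UTF-8.
CHILD = r"import sys; sys.stdout.buffer.write(b'hello \xf5 world\n')"


def read_reply(cmd: list[str]) -> str:
    """Read one line from a child process without strict decoding."""
    # No text=True / encoding=...: the pipe yields raw bytes.
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        assert proc.stdout is not None
        raw = proc.stdout.readline()
    # Invalid bytes become U+FFFD instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace")


reply = read_reply([sys.executable, "-c", CHILD])
print(repr(reply))
```

With errors="replace" the undecodable byte is substituted rather than preserved, so any fix text computed from such a response would be lossy for that byte; the scan results themselves are no longer lost.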