UnicodeDecodeError in RPC layer when autofix encounters non-UTF-8 source files #560

@brendanator

Describe the bug

opengrep scan --autofix --dryrun --json crashes with exit code 2 and empty stdout when a scanned repo contains files with non-UTF-8 bytes. The crash is in the bundled pysemgrep's rpc.py, where subprocess.Popen is called with text=True, encoding="utf-8", which decodes with strict error handling. When semgrep-core returns RPC responses containing non-UTF-8 source bytes, io.read() in _really_read() raises UnicodeDecodeError before application code can handle it.

'utf-8' codec can't decode byte 0xf5 in position 5908: invalid start byte
Traceback (most recent call last):
    File "/root/.cache/opengrep/v1.15.1/semgrep/commands/wrapper.py", line 37, in wrapper
    File "/root/.cache/opengrep/v1.15.1/semgrep/commands/scan.py", line 855, in scan
    File "/root/.cache/opengrep/v1.15.1/semgrep/run_scan.py", line 1011, in run_scan
    File "/root/.cache/opengrep/v1.15.1/semgrep/autofix.py", line 60, in apply_fixes
    File "/root/.cache/opengrep/v1.15.1/semgrep/rpc_call.py", line 28, in apply_fixes
    File "/root/.cache/opengrep/v1.15.1/semgrep/rpc.py", line 136, in rpc_call
    File "/root/.cache/opengrep/v1.15.1/semgrep/rpc.py", line 71, in _read_packet
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf5 in position 5908: invalid start byte

Rule matching completes successfully; the crash only happens in the autofix apply_fixes RPC path. Without --autofix the scan works fine. Because the crash aborts the whole run, all scan results are lost (empty stdout), not just the fix suggestions.
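The failure mode can be reproduced outside opengrep. The following is a minimal sketch, not the actual rpc.py code: a child Python process stands in for semgrep-core and writes a single 0xf5 byte, and the parent reads its stdout through a pipe opened with text=True, encoding="utf-8". The names CHILD and strict_read_fails are made up for this illustration.

```python
import subprocess
import sys

# Hypothetical stand-in for semgrep-core: a child that writes a byte (0xf5)
# that can never appear in valid UTF-8, as if echoing raw source bytes.
CHILD = "import sys; sys.stdout.buffer.write(b'ok \\xf5 ok\\n')"


def strict_read_fails() -> bool:
    """Return True if reading the child's stdout in strict text mode raises."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        text=True,
        encoding="utf-8",  # errors is left at its default, i.e. strict
    )
    try:
        proc.stdout.read()  # decoding happens here, inside the io layer
        return False
    except UnicodeDecodeError:
        return True
    finally:
        proc.stdout.close()
        proc.wait()


if __name__ == "__main__":
    print("strict read raised UnicodeDecodeError:", strict_read_fails())
```

The exception is raised inside the read call itself, so the caller never sees the partial data.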

To Reproduce

Scan any repo containing files with non-UTF-8 bytes (e.g. PHP files with ISO-8859-1 encoding) using a rule that has a
fix key:

mkdir /tmp/og-repro && cd /tmp/og-repro && git init

# Rule with a fix pattern
cat > rule.yaml << 'EOF'
rules:
- id: test-rule
  pattern: $X = "..."
  message: test
  severity: WARNING
  languages: [python]
  fix: $X = "fixed"
EOF

# File with non-UTF-8 bytes
python3 -c "
with open('test.py', 'wb') as f:
  f.write(b'x = \"hello ' + bytes([0xf5]) * 6000 + b' world\"\n')
"

git add . && git commit -m init
opengrep scan --config=rule.yaml --autofix --dryrun --json .

Note: requires the opengrep_manylinux_x86 binary (which bundles pysemgrep). We hit this in production scanning ~48K-file PHP repos.

Expected behavior

Opengrep should handle non-UTF-8 source content gracefully and still produce valid JSON output. Fix suggestions for affected files can be skipped, but scan results should not be lost entirely.
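As a rough illustration of the behavior described above, the sketch below keeps the findings and records an error entry when the fix step fails on undecodable bytes. It assumes nothing about opengrep's internals: render_results, the apply_fixes callback, the "autofix_skipped" error type, and the output shape are all hypothetical.

```python
import json
from typing import Callable


def render_results(
    findings: list[dict],
    apply_fixes: Callable[[list[dict]], list[dict]],
) -> str:
    """Serialize findings, attaching fixes when possible (hypothetical helper)."""
    errors = []
    try:
        findings = apply_fixes(findings)
    except UnicodeDecodeError as exc:
        # Keep the findings; only the fix suggestions are dropped.
        errors.append({"type": "autofix_skipped", "message": str(exc)})
    return json.dumps({"results": findings, "errors": errors})
```

With a callback that raises UnicodeDecodeError, the returned JSON still contains every finding, plus one error entry explaining why fixes are missing.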

What is the priority of the bug to you?

  • P1: important to fix or quite annoying

Environment

Official opengrep_manylinux_x86 binary, v1.15.1, Linux (GKE). The crash is in the bundled pysemgrep at ~/.cache/opengrep/v1.15.1/semgrep/rpc.py.

Use case

We run opengrep with --autofix --dryrun at scale to extract fix suggestions for security findings. When this crashes, we lose all scan results for the repo — not just fix suggestions — since stdout is empty.

Suggested fix

Upstream semgrep fixed this in commit semgrep/semgrep@3ab3b130c9 (semgrep v1.150.0, Dec 5 2025) by changing rpc.py to use text=False (binary mode) with explicit decode("utf-8", errors="replace"). Porting that change to opengrep's bundled pysemgrep should resolve this.
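For reference, the general shape of that approach looks like the sketch below. This is not the upstream patch: it only shows a pipe opened in binary mode with the decode done explicitly using errors="replace". The child process and the name read_lenient are assumptions made for the example.

```python
import subprocess
import sys

# Hypothetical stand-in for semgrep-core, emitting one invalid UTF-8 byte.
CHILD = "import sys; sys.stdout.buffer.write(b'ok \\xf5 ok\\n')"


def read_lenient() -> str:
    """Read the child's stdout as bytes, then decode without raising."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        text=False,  # binary mode: the pipe yields bytes, no implicit decode
    )
    raw, _ = proc.communicate()
    # Invalid bytes become U+FFFD instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(repr(read_lenient()))
```

The trade-off is that the replaced bytes are lost, so any fix text built from the decoded string will not round-trip the original file exactly for the affected spans.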

Labels: bug
