UnicodeDecodeError in RPC layer when autofix encounters non-UTF-8 source files #560

**Describe the bug**

`opengrep scan --autofix --dryrun --json` crashes with exit code 2 and empty stdout when a scanned repo contains files with non-UTF-8 bytes. The crash is in the bundled pysemgrep's `rpc.py`, where `subprocess.Popen` is opened with `text=True, encoding="utf-8"` and no `errors` argument, so decoding is strict. When `semgrep-core` returns RPC responses containing non-UTF-8 source bytes, `io.read()` in `_really_read()` raises `UnicodeDecodeError` before application code can handle it.

```python
'utf-8' codec can't decode byte 0xf5 in position 5908: invalid start byte
Traceback (most recent call last):
    File "/root/.cache/opengrep/v1.15.1/semgrep/commands/wrapper.py", line 37, in wrapper
    File "/root/.cache/opengrep/v1.15.1/semgrep/commands/scan.py", line 855, in scan
    File "/root/.cache/opengrep/v1.15.1/semgrep/run_scan.py", line 1011, in run_scan
    File "/root/.cache/opengrep/v1.15.1/semgrep/autofix.py", line 60, in apply_fixes
    File "/root/.cache/opengrep/v1.15.1/semgrep/rpc_call.py", line 28, in apply_fixes
    File "/root/.cache/opengrep/v1.15.1/semgrep/rpc.py", line 136, in rpc_call
    File "/root/.cache/opengrep/v1.15.1/semgrep/rpc.py", line 71, in _read_packet
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf5 in position 5908: invalid start byte
```

Rule matching completes successfully; the crash only happens in the autofix `apply_fixes` RPC path, and without `--autofix` the scan works fine. Because the process exits with empty stdout, all scan results are lost, not just the fix suggestions.
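
The failure mode can be reproduced in isolation, without opengrep. The following is a minimal sketch using only the standard library; `CHILD` and `strict_read_fails` are hypothetical names for illustration, and the child process merely stands in for `semgrep-core` echoing non-UTF-8 source bytes. It is not the actual `rpc.py` code.

```python
import subprocess
import sys

# Hypothetical stand-in for semgrep-core: writes a raw 0xf5 byte to stdout.
# 0xf5 can never appear in valid UTF-8.
CHILD = "import sys; sys.stdout.buffer.write(b'ok ' + bytes([0xf5]) + b'\\n')"


def strict_read_fails() -> bool:
    """Return True if reading the child's stdout in strict text mode raises."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        text=True,
        encoding="utf-8",  # no errors= argument, so decoding is strict
    )
    try:
        proc.stdout.read()  # decoding happens here, inside the io layer
        return False
    except UnicodeDecodeError:
        return True
    finally:
        proc.stdout.close()
        proc.wait()


if __name__ == "__main__":
    print("strict text-mode read raised:", strict_read_fails())
```

The exception is raised inside the pipe's text wrapper, so the caller never gets a chance to inspect the payload, which matches the traceback above.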

**To Reproduce**

Scan any repo containing files with non-UTF-8 bytes (e.g. PHP files with ISO-8859-1 encoding) using a rule that has a
`fix` key:

```bash
mkdir /tmp/og-repro && cd /tmp/og-repro && git init

# Rule with a fix pattern
cat > rule.yaml << 'EOF'
rules:
- id: test-rule
  pattern: $X = "..."
  message: test
  severity: WARNING
  languages: [python]
  fix: $X = "fixed"
EOF

# File with non-UTF-8 bytes
python3 -c "
with open('test.py', 'wb') as f:
  f.write(b'x = \"hello ' + bytes([0xf5]) * 6000 + b' world\"\n')
"

git add . && git commit -m init
opengrep scan --config=rule.yaml --autofix --dryrun --json .
```

Note: requires the `opengrep_manylinux_x86` binary (which bundles pysemgrep). We hit this in production scanning ~48K-file PHP repos.
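
For triage on a large repo, it can help to list which files contain non-UTF-8 bytes before scanning. This is a rough sketch, not part of opengrep; `find_non_utf8_files` is a hypothetical helper, and it reads each file fully into memory, which is assumed to be acceptable for source files.

```python
from pathlib import Path


def find_non_utf8_files(root: str) -> list:
    """Return (path, offset of first invalid byte) for files that are not valid UTF-8."""
    hits = []
    base = Path(root)
    for path in sorted(base.rglob("*")):
        # Skip directories and anything inside the .git directory.
        if not path.is_file() or ".git" in path.relative_to(base).parts:
            continue
        try:
            path.read_bytes().decode("utf-8")  # strict decode, same as the failing path
        except UnicodeDecodeError as exc:
            hits.append((str(path), exc.start))
    return hits


if __name__ == "__main__":
    for name, offset in find_non_utf8_files("."):
        print(f"{name}: first invalid byte at offset {offset}")
```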

**Expected behavior**

Opengrep should handle non-UTF-8 source content gracefully and still produce valid JSON output. Fix suggestions for affected files can be skipped, but scan results should not be lost entirely.

**What is the priority of the bug to you?**

- P1: important to fix or quite annoying

**Environment**

Official `opengrep_manylinux_x86` binary, v1.15.1, Linux (GKE). The crash is in the bundled pysemgrep at `~/.cache/opengrep/v1.15.1/semgrep/rpc.py`.

**Use case**

We run opengrep with `--autofix --dryrun` at scale to extract fix suggestions for security findings. When this crashes, we lose all scan results for the repo, not just the fix suggestions, since stdout is empty.

**Suggested fix**

Upstream semgrep fixed this in commit https://github.com/semgrep/semgrep/commit/3ab3b130c9 (semgrep v1.150.0, Dec 5 2025) by changing `rpc.py` to use `text=False` (binary mode) with an explicit `decode("utf-8", errors="replace")`. Porting that change to opengrep's bundled pysemgrep should resolve this.
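
The general shape of that approach is sketched below. This is an illustration of the technique only, not the upstream patch: `read_lenient` and `CHILD` are hypothetical names, and the child process again stands in for `semgrep-core`.

```python
import subprocess
import sys

# Hypothetical stand-in for semgrep-core: writes a raw 0xf5 byte to stdout.
CHILD = "import sys; sys.stdout.buffer.write(b'ok ' + bytes([0xf5]) + b'\\n')"


def read_lenient(cmd: list) -> str:
    """Read a child's stdout as raw bytes, then decode without raising."""
    # Without text=True or encoding=, the pipe stays in binary mode.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    raw, _ = proc.communicate()
    # Invalid bytes become U+FFFD instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(repr(read_lenient([sys.executable, "-c", CHILD])))
```

Decoding with `errors="replace"` is lossy: each invalid byte is replaced with U+FFFD, so text from affected files will not round-trip, but the read itself no longer aborts the scan.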
