
Major CC Updates and Fixes #569

Merged
jgarzik merged 99 commits into main from updates
Mar 22, 2026

Conversation

@jgarzik
Contributor

@jgarzik jgarzik commented Mar 22, 2026

No description provided.

jgarzik and others added 30 commits March 15, 2026 02:19
Fix bugs found while building CPython with pcc:

- Bug A: Preprocessor drops line after #define starting with /* comment */
- Bug B: Inline asm +r operand numbering and constraint handling (6 sub-fixes)
- Bug C: Float constant expressions in global initializers
- Bug D: Arrow expressions in eval_static_address
- Bug E: Inliner drops second half of 16-byte struct returns (two-reg ABI)
- Bug 5: u32 overflow in size_bits for large types
- Bug 6: Directive::Zero(u32) → Zero(usize) for large zero-fills

CPython now compiles fully and _freeze_module works at -O0/-O1. A separate
inliner bug (Bug F) remains at -O2+ when functions with 10+ IR instructions
are inlined.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… trigger

Bisection reveals the CPython _freeze_module -O2 crash is caused specifically
by inlining PyTuple_SET_ITEM (size 19, void, non-inline-hinted). Skipping its
inlining produces a working binary. Standalone tests with the same pattern
pass — the bug only manifests in complex CPython contexts with many other
inlined functions in the same caller.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…TEM inlining

Key finding: inlining PyTuple_SET_ITEM into any single caller works fine.
The crash only occurs when it's inlined into multiple callers across many
files simultaneously (35 callers total). This points to an interaction
between the module-level inlining pass and subsequent codegen phases,
not a single-function inlining correctness issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t -O2

The ALWAYS_INLINE_SIZE * 2 rule at -O2 inlined non-inline-hinted functions
of size 11-20 with multiple call sites. This triggered a latent bug causing
uninitialized stack values in large callers (CPython's _PySys_InitCore
segfaulted via PyTuple_SET_ITEM inlining).

Fix: Remove the multi-call-site rule. Single-call-site and inline-hinted
functions are still aggressively inlined at O2. The underlying inliner
defect (likely in block reordering or register allocation liveness after
complex inlining) remains for future investigation.

Verified: CPython _freeze_module passes at -O0 through -O3 with 0 valgrind
errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…initialized

Root cause: emit_copy_with_type() stored 32-bit values to stack slots with
movl (4 bytes), leaving the upper 4 bytes of the 8-byte slot uninitialized.
When a subsequent 64-bit load read the full 8 bytes, it picked up garbage
in the upper half.

This manifested when a C `int` value (e.g. `int pos` in make_version_info)
was inlined into a function that passes it as `Py_ssize_t` (64-bit) to
PyTuple_SET_ITEM. The inlined store.64 to the index local read back
garbage from the uninitialized upper bytes.

Fix: In emit_copy_with_type(), store 32-bit values to stack using 64-bit
width. On x86-64, movl to a 32-bit register zero-extends to 64-bit, so
the subsequent movq to stack writes all 8 bytes correctly.

Also removes the workaround that disabled multi-call-site inlining at -O2
(that rule now works correctly) and cleans up dead debug code.

Verified: CPython _freeze_module passes at -O0 through -O3 with 0 valgrind
errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug F: The linearizer didn't insert implicit integer widening conversions
when passing a narrower argument (e.g., int) to a wider parameter (e.g.,
Py_ssize_t). After inlining, the 32-bit pseudo was used in 64-bit
contexts with garbage upper bits. Fix: insert sign/zero extension at
call sites when actual argument type is narrower than formal parameter.

Bug G: emit_assign() only used block-copy for structs > 64 bits. Structs
of exactly 64 bits (like PyCompilerFlags: 2 ints) fell through to
linearize_expr() which returns the struct's ADDRESS, not its value. The
address was stored at the target instead of the dereferenced struct data.
Fix: use block-copy for ALL struct sizes (target_size > 0).

Also reverts the incorrect Bug F workaround (64-bit stack stores for
32-bit copies) which clobbered adjacent stack data.

Verified: _freeze_module output now byte-identical to gcc at -O3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… slot stale bytes

Bug H (Zext): emit_extend() for Opcode::Zext loaded 8-bit values at 32-bit
width via .max(32), causing sign-extended char values (e.g., 0xF3 → -13)
to pass through unmasked. Fix: AND-mask to source width after loading.
This fixes (unsigned char) casts and CPython's marshal type code parsing.

Bug I (stack stores): Created store_to_stack_slot() that always writes
64-bit to regalloc stack slots (which are 8 bytes each). Updated
emit_move_to_loc() and emit_store() offset-0 local stores to use it.
Prevents stale upper bytes from subsequent 64-bit loads.

Also adds implicit integer widening at call sites (Bug F) when actual
argument type is narrower than formal parameter type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ring config init

After fixing Bug H (Zext), _bootstrap_python gets past marshal parsing
into importlib execution. New crash: NULL pointer dereference in
_Py_Instrument called from config_init_stdio_encoding. Likely another
codegen correctness issue exposed by getting further in CPython init.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Progress: _bootstrap_python now loads frozen importlib and begins executing
bytecode. Crashes in _PyDict_GetItemWithError from LOAD_BUILD_CLASS
instruction — dict or builtins pointer corruption from pcc codegen.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pcc ignored __attribute__((packed)) on struct definitions, causing packed
structs to have wrong sizes. CPython's tracemalloc_frame (packed: 12 bytes)
was laid out as 16 bytes, shifting all fields after it in _PyRuntimeState
by 8 bytes — corrupting every runtime data access.

Fix: Parse packed attribute at all three positions on struct definitions
(before tag, between tag and {, after }). Pass packed flag to
compute_struct_layout() which suppresses alignment padding between members
and in trailing padding.

Also includes store_to_stack_slot() function and emit_store offset-0
widening for Bug I.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ilure

The _bootstrap_python crashes in the opcode dispatch switch of ceval.c.
pcc defines __GNUC__=4 which makes Py_UNREACHABLE() expand to
__builtin_unreachable() instead of Py_FatalError(). The switch on
uint8_t opcode doesn't match valid opcodes — likely a switch statement
codegen bug for large case counts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switch codegen used insn.size.max(32) to load the switch variable,
upgrading uint8_t (8-bit) to 32-bit. emit_move with size=32 did a movl
from a stack slot that was stored as 8-bit, reading 3 garbage bytes.
This caused CPython's opcode dispatch (switch on uint8_t opcode) to read
wrong opcodes and hit __builtin_unreachable().

Fix: load at insn.size (actual type width), then CMP at B32. For sub-32
bit values, emit_move uses Movzx (zero-extending load) which correctly
reads only the stored bytes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All function prologues now zero the stack frame using rep stosq after
allocating stack space but before storing arguments. This ensures all
8-byte regalloc stack slots start as zero, so narrow writes (8/16/32-bit)
leave zero in the unwritten upper bytes instead of garbage from prior
stack usage.

Implementation: save RDI/RCX to R10/R11 scratch registers, set up rep
stosq (RDI=RSP, RCX=qwords, RAX=0), zero the frame, restore RDI/RCX.
Added RepStosq LIR instruction type.

This is a comprehensive fix for the class of bugs where pcc codegen
writes narrow values to 8-byte stack slots and later loads them at wider
widths. Verified: _freeze_module produces 0 valgrind errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…f value

linearize_expr for Deref of struct/union ALWAYS returned the pointer
address, even for small types (<= 64 bits). Callers that used the result
as a value (e.g., _Py_CODEUNIT word = *next_instr) stored the 64-bit
pointer where a 16-bit value was expected, causing total data corruption.

Fix: only return the address for large structs (> 64 bits). For small
structs/unions, emit a Load instruction to read the value through the
pointer, matching how scalars are handled.

This fixes CPython's bytecode opcode dispatch — the switch now receives
correct opcode values and no longer hits __builtin_unreachable().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Struct types must return addresses from Deref (needed for member-offset
access like s->field). Only union types <= 64 bits get the value-load
treatment, since unions are accessed as whole values.

Also updates bugs.md with Bug M analysis: 4 specific CPython files
(pylifecycle, import, ceval, sysmodule) break _bootstrap_python when
pcc-compiled, while other files work fine.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pcc emitted compilation errors during linearization (e.g., "unsupported
expression in global initializer") but exited with code 0, producing
corrupt .o files with missing/zero data. Build systems like CPython's
Makefile don't detect the error and link the corrupt objects, causing
mysterious runtime crashes.

Fix: add diag::has_error() check after linearization in process_file().
pcc now correctly fails compilation when linearization produces errors.

The "unsupported expression" error itself (in pycore_pyerrors.h's
Py_CLEAR macro) still needs to be fixed separately — it prevents
compilation of pylifecycle.c, import.c, ceval.c, sysmodule.c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heck

Three fixes for global initializer handling:

1. Added diag::has_error() check after linearization in main.rs. Previously
   pcc emitted errors during linearization but exited 0, producing corrupt
   .o files that caused mysterious runtime crashes.

2. Added eval_static_address() as fallback in ast_init_to_ir() catch-all.
   Handles complex &global.field->subfield address chains used in CPython's
   _PyRuntimeState_INIT macro (PYDBGRAW_ALLOC etc.).

3. Added Conditional (ternary) expression handling in ast_init_to_ir().
   CPython's _Py_LATIN1_CHR() macro uses compile-time ternaries in static
   initializers: 'r' < 128 ? &ascii['r'] : &latin1['r'-128].

All CPython .c files now compile without errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ks _bootstrap_python

From a clean gcc baseline, swapping ANY single CPython file to pcc-compiled
causes _bootstrap_python to crash. This is NOT stale stack bytes (zeroing
doesn't fix), NOT struct layout (all match), NOT the linker, NOT pyconfig.h.
The issue is a fundamental codegen pattern that affects virtually every file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ROOT CAUSE CONFIRMED: pcc passes struct parameters > 16 bytes as hidden
pointers (sret-style, pointer in RDI), while the SysV AMD64 ABI requires
them to be passed BY VALUE on the stack (MEMORY class).

Evidence: PyStatus_Exception(PyStatus status) — GCC reads status from
16(%rbp) (stack argument), PCC reads from (%rdi) (pointer dereference).
PyStatus is 32 bytes, which triggers MEMORY classification in the ABI.

This is the root cause of the systematic crash where ANY pcc-compiled
file breaks _bootstrap_python — every file uses PyStatus extensively.

Fix requires changes to: ABI classification (classify_param), caller
codegen (push struct bytes to stack), callee codegen (receive via
IncomingArg), and callee linearizer (remove pointer conversion).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes:

1. Large struct params (> 16 bytes) now passed by value on the stack
   per SysV AMD64 ABI MEMORY class, instead of incorrectly as hidden
   pointers. Caller pushes all qwords; callee reads via IncomingArg
   with SymAddr. Fixes PyStatus (32 bytes) ABI mismatch that broke
   every CPython init function.

2. Remove incorrect 32→64 bit store widening at offset 0 in emit_store.
   A 32-bit store to a struct's first field was widened to 64-bit,
   clobbering the adjacent field at offset 4. Fixes compound literal
   + designated initializer failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Store-widening (32→64 bit at offset 0) now only applies to scalar
locals (type size ≤ 32 bits). For structs, the first field at offset 0
gets exact-size stores to avoid clobbering the adjacent field.

Uses sym_type_sizes map populated during function codegen to distinguish
scalar locals from struct fields.

Also: removed dead regalloc post-pass code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-existing bug: stale caller-saved register across call in
goto-dispatch loops. Blocks CPython make test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace implicit phi operand references with explicit PhiSource
instructions in predecessor blocks, following sparse's OP_PHISOURCE
design. This makes phi data flow visible to all passes without
special-casing phi_list semantics.

Key changes:
- Add Opcode::PhiSource to IR with back-pointer to owning phi
- SSA pass (fill_phi_operands) emits PhiSource in predecessors
- Linearizer emits PhiSource for ternary/logical-or/logical-and phis
- Phi elimination converts PhiSource→Copy with proper sequentialization
- Inliner clone_instruction handles PhiSource remapping
- DCE excludes PhiSource back-pointer from use tracking
- Regalloc removes dead phi interval extension code (PhiSource handles it)
- Remove decl_block dominance filter from SSA phi insertion
- Fix Function::dominates warning (#[cfg(test)])

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug P — integer argument sign-extension: when passing a 32-bit int
literal (e.g. -1) to a 64-bit long parameter, the linearizer emitted
a widening conversion but kept the original 32-bit type in arg_types_vec,
causing codegen to emit movl (zero-extending) instead of movq.
Fix: update arg_type to formal parameter type after widening.

Bug Q — function pointer dereference as no-op: *func_ptr must be a
no-op in C (6.5.3.2). Two fixes: (1) linearizer Deref handler now
returns src for TypeKind::Function, (2) parser Deref type computation
keeps function type instead of computing base_type (return type).

-O flag handling: pcc kept first -O flag, discarding subsequent ones.
GCC convention is last wins. CPython passes -O3 then -O0; pcc was
compiling at -O3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In `cond ? long_count : -1`, the int literal -1 was not promoted
to the 64-bit result type. pcc generated `movl $1; negl` (32-bit)
producing 0x00000000FFFFFFFF instead of 0xFFFFFFFFFFFFFFFF.

This broke CPython's string search (fastsearch default_find returns
`mode == 2 ? count : -1`) causing str.find() to return 4294967295
instead of -1 for multi-char not-found, which broke Python's regex
module and blocked the deepfreeze build step.

Fix: insert emit_convert() for narrower operands in both the Select
(pure/cmov) and control-flow (impure/phi) paths of Conditional
expression linearization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cbr instruction never captured the condition value's size, so
emit_cbr defaulted to testl (32-bit test) via insn.size.max(32).
When the condition is a 64-bit AND result (e.g., value & 0x8080...80),
and only the upper 32 bits are non-zero, testl sees zero and skips
the branch.

This broke CPython's UTF-8 decoder (ascii_decode fast path) which
checks 8-byte chunks via `if (value & ASCII_CHAR_MASK)`. Non-ASCII
bytes in the upper half of a chunk were missed, causing the decoder
to advance past multi-byte UTF-8 lead bytes.

Fix: when insn.size is 0 (unset), use 64-bit testq instead of 32-bit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
R1: ternary int-to-long promotion (str.find/regex)
R2: 64-bit cbr conditional (UTF-8 decoder)
R3: finalization crash in _PyArg_Fini (still open)
R4: deepfreeze struct init produces wrong opcodes (workaround: disable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extended store-widening: ALL 32-bit stores at offset 0 to local
variables are now widened to 64-bit, except for struct/union types
where the first field must keep exact size.

Previously, only scalar locals with type <= 32 bits were widened.
This missed cases where a 32-bit intermediate (e.g., from int-to-
pointer conversion) was stored into a 64-bit local (long, pointer),
truncating the upper 32 bits. This caused the CPython finalization
crash: signal handler PyObject* pointers were stored as 32-bit,
losing the upper address bits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CPython 3.12.9 compiled entirely by pcc at -O0:
- Compiles, links, starts, runs Python code correctly
- Imports standard library modules (os, json partial, re)
- Crashes during Py_FinalizeEx in _PySignal_Fini (pointer truncation)
- Blocks make test due to exit code 139 on every process

Bugs fixed this session: P (arg sign-ext), Q (funcptr deref),
R1 (ternary promotion), R2 (64-bit cbr), R3 partial (store-widening),
-O flag handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jgarzik and others added 23 commits March 21, 2026 04:22
Update TODO.md status tables for atomics, alignment, TLS, and other
C11 features to reflect current implementation state. Update BUILTIN.md
to document all implemented builtins (memory, FP math, FP constants,
stack introspection, C11 atomics). Trim README.md not-yet-implemented
list to match reality.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire __attribute__((aligned(N))) into the explicit_align infrastructure
alongside _Alignas. Add stdalign.h and __has_feature(c_alignas).

Parser:
- AttributeList::get_alignment() extracts aligned/__aligned__ attrs
- skip_extensions() now wires aligned attrs into pending_alignas
- Struct-level aligned attr tracked across all 4 attribute positions
- Typedef alignment: add explicit_align field to Type, apply at all
  typedef binding sites, skip type deduplication for aligned types
- validated_explicit_align() propagates typedef alignment to variables
- pending_alignas cleared at parse_external_decl entry to prevent leakage

Type system:
- Type::explicit_align overrides natural alignment in TypeTable::alignment()
- Type constructors cleaned up to use ..Default::default()

x86-64 backend:
- Pad callee_saved_offset to multiple of 16 for correct local alignment
- Track max_local_align in regalloc; round stack_size to max alignment
- Dynamic stack alignment (>16): emit andq $-N,%rsp; address locals
  via RSP-relative offsets. Refactor all 24 callee_saved_offset sites
  to use stack_mem() helper for correct RSP/RBP switching.

aarch64 backend:
- Track max_local_align in regalloc; over-allocate frame for >16
- Compute aligned base register (x19) in prologue for over-aligned locals
- Wire stack_base_reg()/stack_mem()/stack_mem_plus() into all 16 codegen
  sites that access local variables

Tests:
- 20-section integration mega-test: globals, locals, struct members,
  struct tags, typedef propagation, stdalign.h, callee-saved pressure,
  dynamic alignment (32/64), combined _Alignas+attr, typedef-as-member,
  non-power-of-2 ignored, trailing struct attr
- Optimized (-O1) integration test
- cfg-guarded aarch64 over-aligned locals test (x19 path)
- 13 parser unit tests covering all alignment mechanisms

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The stack_offset() formula for use_aligned_base mode was wrong:
stack_alloc_size + offset produced offsets that didn't preserve
alignment (e.g., x19+31 for a 32-byte aligned local, which is
NOT 32-aligned).

Fix: compute base_rounded = stack_alloc_size - (max_align - 1),
then use base_rounded + offset. Since base_rounded is a multiple
of max_align (from regalloc rounding) and regalloc aligns each
local's position, the result preserves alignment.

Also fix regalloc stack_size() to round the base to max_align
(not just 16) before adding over-allocation padding, ensuring
base_rounded is correctly aligned.

Before: str w9, [x19, #31]  (31 is NOT 32-aligned)
After:  str w9, [x19]       (0 is 32-aligned)

Verified: all three CI-failing aarch64 tests now generate valid
assembly that passes aarch64-linux-gnu-as.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Multi-char constants in #if: pack all chars big-endian (GCC-compatible)
  instead of using only the first character
- #line directive: macro-expand tokens before parsing (C99 6.10.4)
- #pragma STDC: recognize FP_CONTRACT/FENV_ACCESS/CX_LIMITED_RANGE with
  ON/OFF/DEFAULT arguments; warn on invalid combinations
- Stringification (#): new stringify_arg() escapes " and \ in string/char
  literals per C99 6.10.3.2p2, replacing tokens_to_text at stringify sites
- _Pragma: document C99 6.10.9p1 destringification requirement; pragmas
  are no-ops so the result is correctly discarded
- Lexer: add C99 6.4.6 digraph support (<: :> <% %> %: %:%:)
- #line directive: implement line_offset/line_file_override for __LINE__
  and __FILE__ builtin macros

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Four root causes fixed:

1. NaN comparison: aarch64 fcmp uses Slt/Sle condition codes which
   return true for NaN (N≠V when V=1). Changed to Ult/Ule (lo/ls)
   which are NaN-safe (C=1 for NaN → false).

2. FP ternary select: emit_select only handled integers via CSEL.
   Added emit_select_fp using BCond/B branches to load FP values,
   mirroring x86_64's branch-based approach.

3. HFA struct passing: {double,double} structs classified as
   ArgClass::Hfa on aarch64 weren't recognized by is_two_sse checks.
   Extended linearizer (2 locations), store_args_to_stack prologue,
   and setup_register_args to handle HFA count=2 alongside complex.

4. macOS compatibility:
   - Added _Nonnull/_Nullable qualifier parsing (consumed, no effect)
   - Added __builtin_object_size (returns (size_t)-1) and 13 fortified
     __builtin___*_chk builtins (strip prefix, call libc __*_chk)
   - Made inline asm test arch-portable with #[cfg(target_arch)]
   - Added StringTable::lookup() for immutable name resolution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- HFA-2 struct return: emit_ret now detects Hfa{count:2} returns via
  ABI classification. Two-source path moves GP values to V0/V1 via
  FmovFromGp; single-source path loads from struct address. Fixes
  codegen_two_sse_struct_abi exit 1.

- Nullability qualifiers: added _Nonnull/_Nullable/_Null_unspecified
  to 3 inline pointer-qualifier parsers in parser.rs (parse_declarator
  and parse_function_def) that were separate from consume_type_qualifiers.
  Fixes codegen_ternary_fptr_return_type parse error on macOS stdlib.h.

- New test: codegen_nullability_qualifiers — portable integration test
  covering nullability on pointer declarators, function params, and
  function pointer typedefs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Nullability qualifiers: extract single-source-of-truth function
  is_nullability_qualifier() in parse/mod.rs. All 5 call sites now
  reference it. Also handle qualifiers in skip_extensions() so they're
  consumed after function declarator param lists (fixes stdlib.h parse
  error at line 540).

- Inliner: clone_instruction now remaps pseudos inside asm_data
  (outputs[].pseudo and inputs[].pseudo). Previously inlined inline
  asm kept stale callee pseudo IDs, causing wrong register allocation
  and 32-bit truncation on aarch64.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added skip_extensions() before expect_special(b';') at two more
declaration paths in parse_external_decl (non-function declarations
and declarations with initializers). These paths lacked attribute/
nullability consumption, causing "expected ';'" on macOS headers.

Also improved expect_special error diagnostic to show the actual
token found, aiding future debugging of header parse failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds codegen_ternary_fptr_diag test that preprocesses #include <stdlib.h>
with -E and prints lines around 540 plus any lines containing __v.
Always passes — diagnostic output will appear in CI stderr to identify
the exact macOS header construct causing "expected ';', found '__v'".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous diag test used eprintln which CI swallowed. Now builds a
diagnostic string and panics with it on macOS, ensuring the output
appears in CI test failure logs. Shows lines 530-555 of preprocessed
stdlib.h plus all lines containing __v.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add complete software-emulated 128-bit integer support across all
compiler layers. Int128 values live on the stack (16-byte aligned,
never in GP registers); operations load lo/hi 64-bit halves into
scratch registers, operate, and store back.

Type system: ExprKind::Int128Lit(i128), PseudoKind::Val128(i128),
Initializer::Int128(i128), Loc::Imm128(i128) on both backends.

ABI: Int128 classified as Direct{[Integer, Integer], 128} on both
SysV AMD64 and AAPCS64 (two GP registers for param/return).

Runtime library: int128_divmod() for __divti3/__modti3/__udivti3/
__umodti3, int128_convert() for 20 float<->int128 conversion
functions. Division/modulo and float conversions dispatch to rtlib
calls from the linearizer.

x86-64 backend: add/adc, sub/sbb, mulq+imulq cross products,
shld/shrd shifts (constant and variable with >=64 branching),
eq/ne via xor+or, ordered comparisons via hi-first branching,
neg (not+add+adc), not, zext/sext/trunc. New LIR instructions:
Adc, Sbb, Mul1, Shld, Shrd. Register allocator detects int128
pseudos by TypeKind (excluding comparison results and Load
addresses) and forces 16-byte aligned stack slots. Mul constraint
registers RAX/RDX clobber.

aarch64 backend: adds/adc, subs/sbc, mul/umulh/madd, branch-based
ordered comparisons (replacing incorrect ccmp approach), eq/ne via
eor+orr, negs/ngc, mvn, shifts with zero-amount short-circuit.
New LIR instructions: Adds, Adc, Subs, Sbc, Umulh, MAdd, Negs,
Ngc. CBR handles 128-bit stack values by ORing both halves.

Constant evaluation: eval_const_expr returns Option<i128> (was i64),
enabling full-precision compile-time arithmetic for __int128 types.
Parser folds (__int128)constant → Int128Lit at the AST level.

InstCombine: full 128-bit support via FoldToConst128(i128) variant,
const_val128(), and create_const128_pseudo(). All algebraic
identities (x-x, x*0, x+0, x*1, etc.) and constant folding work
correctly for 128-bit values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BlockItem::Statement(Stmt) was 256 bytes vs Declaration at 24 bytes.
Box the Stmt to reduce enum size from 256 to 32 bytes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The aarch64 regalloc hardcoded alignment=8 for all local variables,
violating __int128's natural 16-byte alignment. LDP/STP on macOS
aarch64 enforces strict alignment, causing SIGBUS crashes.

Fix: use types.alignment() for Sym locals (returns 16 for Int128),
and force int128 pseudos to 16-byte aligned stack slots before the
main linear scan (matching the x86-64 pattern).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Four bugs found by auditing cc/arch/ for ignored type system attributes:

1. x86-64 Sym alignment: replace size>=16?16:8 heuristic with
   types.alignment(typ), matching the aarch64 fix. Correctly handles
   __attribute__((aligned(N))) on typedefs.

2. store_args_to_stack: __int128 params use two GP registers per ABI
   but were treated as single-register. Now stores both halves and
   increments int_arg_idx by 2. Fixed in both x86-64 and aarch64.

3. emit_ret: __int128 return values now load both halves into
   RAX+RDX (x86-64). The linearizer didn't set two_reg_return for
   Int128, so only lo half was returned.

4. aarch64 multi-reg return stack allocation: replaced open-coded
   stack_offset manipulation with alloc_stack_slot using
   types.alignment(typ) and types.size_bits(typ).

Tests: codegen_int128_param_return and _optimized cover int128
function params (including mixed with regular args) and return values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three aarch64-specific fixes for macOS CI failures:

1. Variable shift amount: load_int128() was called for the shift
   amount (a regular int), causing LDP from an 8-byte slot. Now
   uses emit_move() for 64-bit load of shift operand.

2. allocate_arguments: Int128 params consume two GP registers per
   AAPCS64 but were assigned a single register. Now allocates a
   16-byte aligned stack slot and reserves both GP registers.

3. emit_ret: Int128 return values need lo→X0, hi→X1 via LDP.
   Added Int128 branch before the generic integer return path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a compile_and_run test fails, re-compile with -S and print
the generated assembly to stderr. This makes aarch64-only failures
debuggable from CI logs without needing access to the runner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comparison results (cset → 0 or 1) were stored as 32-bit (str w__)
but CBR loads them as 64-bit (ldr x__). The upper 32 bits contained
stack garbage, causing wrong branch decisions when reused stack
slots had non-zero upper halves.

Fix: store comparison results as 64-bit. cset already zero-extends
into the full 64-bit register, so str x__ is correct and matches
what CBR expects.

This was a pre-existing bug exposed by int128 tests which generate
more stack-spilled comparison results. Affects both regular and
int128 comparison paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
setup_register_args in call.rs treated __int128 args as single-register,
loading only the lo half and advancing int_arg_idx by 1. This shifted
all subsequent arguments to wrong registers.

Fix: detect Int128 arg type, load both halves via LDP into two
consecutive GP registers, advance int_arg_idx by 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
store_args_to_stack stored X1:X2 to the local variable's stack slot,
but the IR then generates a Copy from the arg pseudo (at a different
slot) to the local, overwriting the correct value with uninitialized
data. Fix: store to the arg pseudo's allocated location so the IR's
Copy correctly transfers the value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix four int128 codegen bugs and add comprehensive test coverage:

Bug 1: Ternary ?: with int128 operands used Select instruction which
caps at 64 bits. Add size<=64 guard to force branch+phi path.

Bug 2: x86_64 stack-spilled int128 call args pushed only 8 bytes
instead of 16. Add int128 branch in push_stack_args, fix stack qword
counting in classify_call_args, and copy incoming stack int128 params
to local slots in the callee prologue (store_args_to_stack).

Bug 3: aarch64 stack-spilled int128 call args allocated 8 bytes
instead of 16. Fix allocation calculation and add LDP+STR pair for
16-byte stack arg stores.

Bug 4: Global int128 loads on x86_64 double-dereferenced the GOT
address (treating the int128 value as a pointer). Add Loc::Global
branch in emit_load that loads the address without dereferencing.

Add 11 new tests: ternary, many_args (stack spill), divmod, globals,
ptr_deref, compound_assign, shift_boundaries, float_convert,
struct_array, inc_dec, optimized_mega.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
emit_struct_store used ADD/LEA to compute the destination address from
a Stack location, always treating it as a direct local variable address.
When the destination is a spilled pointer (e.g., `*p = val` where p is
on the stack), the code must load the pointer value from the slot first,
not take the address of the slot itself.

Add is_symbol check (matching compute_mem_addr) to distinguish local
variables from spilled pointers. For spilled pointers, emit LDR/MOV
to load the pointer value before using it as the store destination.

Fixes codegen_int128_ptr_deref and codegen_int128_struct_array on
aarch64-apple-darwin.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jgarzik jgarzik self-assigned this Mar 22, 2026
@jgarzik jgarzik added the bug, documentation, and enhancement labels Mar 22, 2026
@jgarzik jgarzik merged commit fd947ba into main Mar 22, 2026
6 checks passed
@jgarzik jgarzik deleted the updates branch March 22, 2026 21:20