Support parsing bitcode produced by Zig #302

@RyanGlScott

The Zig compiler produces somewhat unusually shaped LLVM bitcode as compared to Clang. This issue aims to document what steps we would need to perform in order to support Zig-generated bitcode properly.

Throughout this issue, I will be using the bitcode that Zig generated from this program:

// test.zig
export fn add(a: i32, b: i32) i32 {
    return a + b;
}

Compiled like so:

$ zig version
0.16.0-dev.27+83f773fc6
$ zig build-lib -femit-llvm-bc -OReleaseFast test.zig

This produces a test.bc bitcode file. Here are the issues (in order) that I encountered when loading this bitcode file into llvm-pretty-bc-parser:

match failed [...] TYPE_BLOCK

The first issue I ran into is:

> parseBitCodeFromFileWithWarnings "test.bc" >>= \x -> case x of Left err -> putStrLn (formatError err); Right _ -> pure ()
match failed
from:
	TYPE_BLOCK
	type symbol table
	MODULE_BLOCK
	Bitstream

This ultimately arises from how the type table is parsed here:

-- drop everything until we hit TYPE_CODE_NUMENTRY
(r,ents) <- match (dropUntil numEntry) es

Where numEntry is defined here:

-- | Pattern match the TYPE_CODE_NUMENTRY unabbreviated record.
numEntry :: Match Entry Record
numEntry = hasRecordCode 1 <=< fromUnabbrev <=< unabbrev

llvm-pretty-bc-parser expects TYPE_CODE_NUMENTRY to live in an unabbreviated record, but Zig's compiler happens to put TYPE_CODE_NUMENTRY in an abbreviated record instead. Fair enough, I suppose—I'm not sure why llvm-pretty-bc-parser is so picky here. The following (untested) patch appears to fix that issue:

diff --git a/src/Data/LLVM/BitCode/IR/Types.hs b/src/Data/LLVM/BitCode/IR/Types.hs
index ef564bd..a5be591 100644
--- a/src/Data/LLVM/BitCode/IR/Types.hs
+++ b/src/Data/LLVM/BitCode/IR/Types.hs
@@ -24,7 +24,7 @@ import           Data.Ord (comparing)

 -- | Pattern match the TYPE_CODE_NUMENTRY unabbreviated record.
 numEntry :: Match Entry Record
-numEntry  = hasRecordCode 1 <=< fromUnabbrev <=< unabbrev
+numEntry  = hasRecordCode 1 <=< fromEntry

 resolveTypeDecls :: Parse [TypeDecl]
 resolveTypeDecls  = do
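
To illustrate why the one-line change matters, here is a minimal, self-contained sketch. The `Entry`, `Record`, and `Match` definitions below are simplified stand-ins invented for this example, not the library's real types, and `fromUnabbrevOnly` and `fromAnyRecord` are hypothetical names that only approximate what `fromUnabbrev <=< unabbrev` and `fromEntry` do.

```haskell
-- Toy model only: these types are simplified stand-ins, NOT the real
-- llvm-pretty-bc-parser definitions.
import Control.Monad ((<=<))

data Record = Record { recordCode :: Int, recordFields :: [Int] }
  deriving (Eq, Show)

-- An entry that carries a record is either unabbreviated or abbreviated.
data Entry
  = EntryUnabbrev Record
  | EntryAbbrev Record
  | EntryOther
  deriving (Eq, Show)

-- Assumed shape of a matcher: a partial function expressed with Maybe.
type Match a b = a -> Maybe b

-- Strict matcher: accepts unabbreviated records only (pre-patch behaviour).
fromUnabbrevOnly :: Match Entry Record
fromUnabbrevOnly (EntryUnabbrev r) = Just r
fromUnabbrevOnly _                 = Nothing

-- Lenient matcher: accepts either encoding (post-patch behaviour).
fromAnyRecord :: Match Entry Record
fromAnyRecord (EntryUnabbrev r) = Just r
fromAnyRecord (EntryAbbrev r)   = Just r
fromAnyRecord EntryOther        = Nothing

-- Succeeds only when the record carries the expected code.
hasCode :: Int -> Match Record Record
hasCode c r
  | recordCode r == c = Just r
  | otherwise         = Nothing

numEntryStrict, numEntryLenient :: Match Entry Record
numEntryStrict  = hasCode 1 <=< fromUnabbrevOnly
numEntryLenient = hasCode 1 <=< fromAnyRecord
```

In this model, an abbreviated record with code 1 (the Zig case) is rejected by `numEntryStrict` and accepted by `numEntryLenient`, while an unabbreviated one (the Clang case) is accepted by both.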

Unimplemented types

After applying the patch above, the next stumbling point is:

> parseBitCodeFromFileWithWarnings "test.bc" >>= \x -> case x of Left err -> putStrLn (formatError err); Right _ -> pure ()
not implemented
from:
	TYPE_CODE_BFLOAT
	TYPE_BLOCK
	type symbol table
	MODULE_BLOCK
	Bitstream

This happens because the bitcode file's type table contains an entry for bfloats, even though the program itself never uses bfloats directly. Quite odd.

In any case, this has been reported previously as #214. Fixing that issue properly would require some API changes downstream in llvm-pretty first. In the pursuit of making progress, I applied a quick hack here:

@@ -194,7 +194,7 @@ parseTypeBlockEntry (fromEntry -> Just r) = case recordCode r of
     notImplemented

   23 -> label "TYPE_CODE_BFLOAT" $ do
-    notImplemented
+    noType

   24 -> label "TYPE_CODE_X86_AMX" $ do
     notImplemented

I also had to apply similar hacks to work around other unimplemented types, which have been reported in #213 and #215:

@@ -191,13 +191,13 @@ parseTypeBlockEntry (fromEntry -> Just r) = case recordCode r of
       []       -> fail "function expects a return type"

   22 -> label "TYPE_CODE_TOKEN" $ do
-    notImplemented
+    noType

   23 -> label "TYPE_CODE_BFLOAT" $ do
-    notImplemented
+    noType

   24 -> label "TYPE_CODE_X86_AMX" $ do
-    notImplemented
+    noType

   25 -> label "TYPE_CODE_OPAQUE_POINTER" $ do
     let field = parseField r

parseField: unable to parse record field 1 of record [...] (TYPE_CODE_FUNCTION)

The next stumbling block is:

> parseBitCodeFromFileWithWarnings "test.bc" >>= \x -> case x of Left err -> putStrLn (formatError err); Right _ -> pure ()
parseField: unable to parse record field 1 of record Record {recordCode = 21, recordFields = [FieldFixed (BitString {bsLength = NumBits 1, bsData = 0}),FieldFixed (BitString {bsLength = NumBits 5, bsData = 17}),FieldArray [FieldFixed (BitString {bsLength = NumBits 5, bsData = 17}),FieldFixed (BitString {bsLength = NumBits 5, bsData = 17})]]}
from:
	parameters
	TYPE_CODE_FUNCTION
	TYPE_BLOCK
	type symbol table
	MODULE_BLOCK
	Bitstream

What is going on here? This ultimately arises from how llvm-pretty-bc-parser parses TYPE_CODE_FUNCTION records (i.e., function types):

  -- [vararg, [retty, paramty x N]]
  21 -> label "TYPE_CODE_FUNCTION" $ do
    let field = parseField r
    vararg <- label "vararg"     (field 0 boolean)
    tys    <- label "parameters" (field 1 (fieldArray typeRef))
    case tys of
      rty:ptys -> addType (FunTy rty ptys vararg)
      []       -> fail "function expects a return type"

Specifically, llvm-pretty-bc-parser expects the convention that the record will have two fields:

  • A FieldFixed at index 0 containing the vararg information
  • A FieldArray at index 1 containing the function's result and argument types (rty and ptys, respectively)

Zig, on the other hand, does it slightly differently. It has a TYPE_CODE_FUNCTION record with the following fields:

  • A FieldFixed at index 0 containing the vararg information
  • A FieldFixed at index 1 containing the return type (what is called rty in the code above)
  • A FieldArray at index 2 containing the argument types (what is called ptys in the code above)

This is just different enough to confuse llvm-pretty-bc-parser. Interestingly, the official LLVM bitcode specification's documentation for TYPE_CODE_FUNCTION suggests that the latter convention is closer to how it is supposed to work, although in practice LLVM appears to accept either convention. (For whatever reason, Clang itself always uses the former convention, which is most likely why llvm-pretty-bc-parser's code was designed with the former convention in mind.)

Given that LLVM accepts either convention, we should make llvm-pretty-bc-parser follow suit. I am unclear how much work this would require, however. Here is a very quick-and-dirty hack to make things work with Zig-generated bitcode (but not with Clang-generated bitcode):

@@ -185,19 +185,18 @@ parseTypeBlockEntry (fromEntry -> Just r) = case recordCode r of
   21 -> label "TYPE_CODE_FUNCTION" $ do
     let field = parseField r
     vararg <- label "vararg"     (field 0 boolean)
-    tys    <- label "parameters" (field 1 (fieldArray typeRef))
-    case tys of
-      rty:ptys -> addType (FunTy rty ptys vararg)
-      []       -> fail "function expects a return type"
+    rty    <- label "result"     (field 1 typeRef)
+    ptys   <- label "parameters" (field 2 (fieldArray typeRef))
+    addType (FunTy rty ptys vararg)

   22 -> label "TYPE_CODE_TOKEN" $ do

(I have also opened #304 about the subject of how best to handle FieldArrays.)
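
One possible way to accept both conventions at once is to flatten the record's fields into a plain list of operands before interpreting them, so that it no longer matters where the abbreviation placed the array boundary. The sketch below is a self-contained toy model: `Field`, `FunTy`, `flatten`, and `parseFunTy` are hypothetical definitions written for this example, and type references are modelled as plain `Int`s rather than the library's real types.

```haskell
-- Toy model only: NOT the real llvm-pretty-bc-parser types or functions.
data Field
  = FieldFixed Int
  | FieldArray [Field]
  deriving (Eq, Show)

-- Return type, parameter types, and the vararg flag (type refs as Ints).
data FunTy = FunTy Int [Int] Bool
  deriving (Eq, Show)

-- Splice every array into the surrounding operand list, preserving order.
flatten :: [Field] -> [Int]
flatten = concatMap go
  where
    go (FieldFixed n)  = [n]
    go (FieldArray fs) = flatten fs

-- Interpret the flattened operands as [vararg, retty, paramty x N].
parseFunTy :: [Field] -> Either String FunTy
parseFunTy fields =
  case flatten fields of
    vararg : rty : ptys -> Right (FunTy rty ptys (vararg /= 0))
    _                   -> Left "function expects a vararg flag and a return type"

-- Clang-style layout: return and parameter types share one array.
clangStyle :: [Field]
clangStyle = [FieldFixed 0, FieldArray [FieldFixed 17, FieldFixed 17, FieldFixed 17]]

-- Zig-style layout: the return type is its own field before the array.
zigStyle :: [Field]
zigStyle = [FieldFixed 0, FieldFixed 17, FieldArray [FieldFixed 17, FieldFixed 17]]
```

Both `parseFunTy clangStyle` and `parseFunTy zigStyle` evaluate to `Right (FunTy 17 [17,17] False)` in this model. Whether flattening is the right design for the real parser is exactly the open question, since other record kinds may rely on the array boundaries.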

parseSlice: unable to parse record field 1 of record [...] (value symbol table)

The next stumbling block is:

parseSlice: unable to parse record field 1 of record Record {recordCode = 2, recordFields = [FieldArray [FieldFixed (BitString {bsLength = NumBits 8, bsData = 120}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 56}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 54}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 95}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 54}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 52}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 45}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 117}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 110}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 107}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 110}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 111}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 119}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 110}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 45}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 108}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 105}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 110}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 117}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 120}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 54}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 46}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 56}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 46}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 48}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 45}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 103}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 110}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 117}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 50}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 46}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 51}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 57}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 46}),FieldFixed (BitString {bsLength = NumBits 8, bsData = 48})]]}
from:
	value symbol table
	MODULE_BLOCK
	Bitstream

I haven't gotten to the bottom of this yet, but I wonder if this is due to yet another place in the code that assumes that FieldArrays can't happen in places where they actually can.
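
As a possible aid to debugging, the 8-bit values in the failing record can be decoded as characters. The record holds a single FieldArray at index 0, so there is no field at index 1 for parseSlice to read unless the array is flattened first. The snippet below is a standalone sketch: `payload` is copied by hand from the error message above, and `decodeChars` is a hypothetical helper, not part of llvm-pretty-bc-parser.

```haskell
import Data.Char (chr)

-- The bsData values from the failing record, in order of appearance.
payload :: [Int]
payload =
  [ 120, 56, 54, 95, 54, 52, 45, 117, 110, 107, 110, 111, 119, 110, 45
  , 108, 105, 110, 117, 120, 54, 46, 56, 46, 48, 45, 103, 110, 117, 50
  , 46, 51, 57, 46, 48
  ]

-- Interpret each value as a character code.
decodeChars :: [Int] -> String
decodeChars = map chr

-- decodeChars payload == "x86_64-unknown-linux6.8.0-gnu2.39.0"
```

The decoded string looks like a target triple rather than a symbol name, which may help locate the code path that rejects this record. That reading is unconfirmed.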


For the sake of completeness, here is a full diff for all the hacks that I have used up to this point:

diff --git a/src/Data/LLVM/BitCode/IR/Types.hs b/src/Data/LLVM/BitCode/IR/Types.hs
index ef564bd..916b180 100644
--- a/src/Data/LLVM/BitCode/IR/Types.hs
+++ b/src/Data/LLVM/BitCode/IR/Types.hs
@@ -24,7 +24,7 @@ import           Data.Ord (comparing)

 -- | Pattern match the TYPE_CODE_NUMENTRY unabbreviated record.
 numEntry :: Match Entry Record
-numEntry  = hasRecordCode 1 <=< fromUnabbrev <=< unabbrev
+numEntry  = hasRecordCode 1 <=< fromEntry

 resolveTypeDecls :: Parse [TypeDecl]
 resolveTypeDecls  = do
@@ -185,19 +185,18 @@ parseTypeBlockEntry (fromEntry -> Just r) = case recordCode r of
   21 -> label "TYPE_CODE_FUNCTION" $ do
     let field = parseField r
     vararg <- label "vararg"     (field 0 boolean)
-    tys    <- label "parameters" (field 1 (fieldArray typeRef))
-    case tys of
-      rty:ptys -> addType (FunTy rty ptys vararg)
-      []       -> fail "function expects a return type"
+    rty    <- label "result"     (field 1 typeRef)
+    ptys   <- label "parameters" (field 2 (fieldArray typeRef))
+    addType (FunTy rty ptys vararg)

   22 -> label "TYPE_CODE_TOKEN" $ do
-    notImplemented
+    noType

   23 -> label "TYPE_CODE_BFLOAT" $ do
-    notImplemented
+    noType

   24 -> label "TYPE_CODE_X86_AMX" $ do
-    notImplemented
+    noType

   25 -> label "TYPE_CODE_OPAQUE_POINTER" $ do
     let field = parseField r
