| 1 | +Date: Wed, 16 Oct 2013 04:34:01 -0400 |
| 2 | +From: Jeff King < [email protected]> |
| 3 | +Subject: pack corruption post-mortem |
| 4 | +Abstract: Recovering a corrupted object when no good copy is available. |
| 5 | +Content-type: text/asciidoc |
| 6 | + |
| 7 | +How to recover an object from scratch |
| 8 | +===================================== |
| 9 | + |
| 10 | +I was recently presented with a repository with a corrupted packfile, |
| 11 | +and was asked if the data was recoverable. This post-mortem describes |
| 12 | +the steps I took to investigate and fix the problem. I thought others |
| 13 | +might find the process interesting, and it might help somebody in the |
| 14 | +same situation. |
| 15 | + |
| 16 | +******************************** |
| 17 | +Note: In this case, no good copy of the repository was available. For |
| 18 | +the much easier case where you can get the corrupted object from |
| 19 | +elsewhere, see link:recover-corrupted-blob-object.html[this howto]. |
| 20 | +******************************** |
| 21 | + |
| 22 | +I started with an fsck, which found a problem with exactly one object |
| 23 | +(I've used $pack and $obj below to keep the output readable, and also |
| 24 | +because I'll refer to them later): |
| 25 | + |
| 26 | +----------- |
| 27 | + $ git fsck |
| 28 | + error: $pack SHA1 checksum mismatch |
| 29 | + error: index CRC mismatch for object $obj from $pack at offset 51653873 |
| 30 | + error: inflate: data stream error (incorrect data check) |
| 31 | + error: cannot unpack $obj from $pack at offset 51653873 |
| 32 | +----------- |
| 33 | + |
| 34 | +The pack checksum failing means a byte is munged somewhere, and it is |
| 35 | +presumably in the object mentioned (since both the index checksum and |
| 36 | +zlib were failing). |
| 37 | + |
| 38 | +Reading the zlib source code, I found that "incorrect data check" means |
| 39 | +that the adler-32 checksum at the end of the zlib data did not match the |
| 40 | +inflated data. So stepping the data through zlib would not help, as it |
| 41 | +did not fail until the very end, when we realize the checksum does not match. |
| 42 | +The problematic bytes could be anywhere in the object data. |
| 43 | + |
| 44 | +The first thing I did was pull the broken data out of the packfile. I |
| 45 | +needed to know how big the object was, which I found out with: |
| 46 | + |
| 47 | +------------ |
| 48 | + $ git show-index <$idx | cut -d' ' -f1 | sort -n | grep -A1 51653873 |
| 49 | + 51653873 |
| 50 | + 51664736 |
| 51 | +------------ |
| 52 | + |
| 53 | +Show-index gives us the list of objects and their offsets. We throw away |
| 54 | +everything but the offsets, and then sort them so that our interesting |
| 55 | +offset (which we got from the fsck output above) is followed immediately |
| 56 | +by the offset of the next object. Now we know that the object data is |
| 57 | +10863 bytes long, and we can grab it with: |
| 58 | + |
| 59 | +------------ |
| 60 | + dd if=$pack of=object bs=1 skip=51653873 count=10863 |
| 61 | +------------ |
| 62 | + |
| 63 | +I inspected a hexdump of the data, looking for any obvious bogosity |
| 64 | +(e.g., a 4K run of zeroes would be a good sign of filesystem |
| 65 | +corruption). But everything looked pretty reasonable. |
| 66 | + |
| 67 | +Note that the "object" file isn't fit for feeding straight to zlib; it |
| 68 | +has the git packed object header, which is variable-length. We want to |
| 69 | +strip that off so we can start playing with the zlib data directly. You |
| 70 | +can either work your way through it manually (the format is described in |
| 71 | +link:../technical/pack-format.html[Documentation/technical/pack-format.txt]), |
| 72 | +or you can walk through it in a debugger. I did the latter, creating a |
| 73 | +valid pack like: |
| 74 | + |
| 75 | +------------ |
| 76 | + # pack magic and version |
| 77 | + printf 'PACK\0\0\0\2' >tmp.pack |
| 78 | + # pack has one object |
| 79 | + printf '\0\0\0\1' >>tmp.pack |
| 80 | + # now add our object data |
| 81 | + cat object >>tmp.pack |
| 82 | + # and then append the pack trailer |
| 83 | + /path/to/git.git/test-sha1 -b <tmp.pack >trailer |
| 84 | + cat trailer >>tmp.pack |
| 85 | +------------ |
| 86 | + |
| 87 | +and then running "git index-pack tmp.pack" in the debugger (stop at |
| 88 | +unpack_raw_entry). Doing this, I found that there were 3 bytes of header |
| 89 | +(and the header itself had a sane type and size). So I stripped those |
| 90 | +off with: |
| 91 | + |
| 92 | +------------ |
| 93 | + dd if=object of=zlib bs=1 skip=3 |
| 94 | +------------ |
| 95 | + |
| 96 | +I ran the result through zlib's inflate using a custom C program. And |
| 97 | +while it did report the error, I did get the right number of output |
| 98 | +bytes (i.e., it matched git's size header that we decoded above). But |
| 99 | +feeding the result back to "git hash-object" didn't produce the same |
| 100 | +sha1. So there were some wrong bytes, but I didn't know which. The file |
| 101 | +happened to be C source code, so I hoped I could notice something |
| 102 | +obviously wrong with it, but I didn't. I even got it to compile! |
| 103 | + |
| 104 | +I also tried comparing it to other versions of the same path in the |
| 105 | +repository, hoping that there would be some part of the diff that didn't |
| 106 | +make sense. Unfortunately, this happened to be the only revision of this |
| 107 | +particular file in the repository, so I had nothing to compare against. |
| 108 | + |
| 109 | +So I took a different approach. Working under the guess that the |
| 110 | +corruption was limited to a single byte, I wrote a program to munge each |
| 111 | +byte individually, and try inflating the result. Since the object was |
| 112 | +only 10K compressed, that worked out to about 2.8M attempts, which took |
| 113 | +a few minutes. |
| 114 | + |
| 115 | +The program I used is here: |
| 116 | + |
| 117 | +---------------------------------------------- |
| 118 | +#include <stdio.h> |
| 119 | +#include <unistd.h> |
| 120 | +#include <string.h> |
| 121 | +#include <signal.h> |
| 122 | +#include <zlib.h> |
| 123 | + |
| 124 | +static int try_zlib(unsigned char *buf, int len) |
| 125 | +{ |
| 126 | + /* make this absurdly large so we don't have to loop */ |
| 127 | + static unsigned char out[1024*1024]; |
| 128 | + z_stream z; |
| 129 | + int ret; |
| 130 | + |
| 131 | + memset(&z, 0, sizeof(z)); |
| 132 | + inflateInit(&z); |
| 133 | + |
| 134 | + z.next_in = buf; |
| 135 | + z.avail_in = len; |
| 136 | + z.next_out = out; |
| 137 | + z.avail_out = sizeof(out); |
| 138 | + |
| 139 | + ret = inflate(&z, Z_NO_FLUSH); |
| 140 | + inflateEnd(&z); |
| 141 | + return ret >= 0; |
| 142 | +} |
| 143 | + |
| 144 | +/* eye candy */ |
| 145 | +static int counter = 0; |
| 146 | +static void progress(int sig) |
| 147 | +{ |
| 148 | + fprintf(stderr, "\r%d", counter); |
| 149 | + alarm(1); |
| 150 | +} |
| 151 | + |
| 152 | +int main(void) |
| 153 | +{ |
| 154 | + /* oversized so we can read the whole buffer in */ |
| 155 | + unsigned char buf[1024*1024]; |
| 156 | + int len; |
| 157 | + unsigned i, j; |
| 158 | + |
| 159 | + signal(SIGALRM, progress); |
| 160 | + alarm(1); |
| 161 | + |
| 162 | + len = read(0, buf, sizeof(buf)); |
| 163 | + for (i = 0; i < len; i++) { |
| 164 | + unsigned char c = buf[i]; |
| 165 | + for (j = 0; j <= 0xff; j++) { |
| 166 | + buf[i] = j; |
| 167 | + |
| 168 | + counter++; |
| 169 | + if (try_zlib(buf, len)) |
| 170 | + printf("i=%d, j=%x\n", i, j); |
| 171 | + } |
| 172 | + buf[i] = c; |
| 173 | + } |
| 174 | + |
| 175 | + alarm(0); |
| 176 | + fprintf(stderr, "\n"); |
| 177 | + return 0; |
| 178 | +} |
| 179 | +---------------------------------------------- |
| 180 | + |
| 181 | +I compiled and ran with: |
| 182 | + |
| 183 | +------- |
| 184 | + gcc -Wall -Werror -O3 munge.c -o munge -lz |
| 185 | + ./munge <zlib |
| 186 | +------- |
| 187 | + |
| 188 | + |
| 189 | +There were a few false positives early on (if you write "no data" in the |
| 190 | +zlib header, zlib thinks it's just fine :) ). But I got a hit about |
| 191 | +halfway through: |
| 192 | + |
| 193 | +------- |
| 194 | + i=5642, j=c7 |
| 195 | +------- |
| 196 | + |
| 197 | +I let it run to completion, and got a few more hits at the end (where it |
| 198 | +was munging the adler-32 checksum to match our broken data). So there was a good |
| 199 | +chance this middle hit was the source of the problem. |
| 200 | + |
| 201 | +I confirmed by tweaking the byte in a hex editor, zlib inflating the |
| 202 | +result (no errors!), and then piping the output into "git hash-object", |
| 203 | +which reported the sha1 of the broken object. Success! |
| 204 | + |
| 205 | +I fixed the packfile itself with: |
| 206 | + |
| 207 | +------- |
| 208 | + chmod +w $pack |
| 209 | + printf '\xc7' | dd of=$pack bs=1 seek=51659518 conv=notrunc |
| 210 | + chmod -w $pack |
| 211 | +------- |
| 212 | + |
| 213 | +The `\xc7` comes from the replacement byte our "munge" program found. |
| 214 | +The offset 51659518 is derived by taking the original object offset |
| 215 | +(51653873), adding the replacement offset found by "munge" (5642), and |
| 216 | +then adding back in the 3 bytes of git header we stripped. |
| 217 | + |
| 218 | +After that, "git fsck" ran clean. |
| 219 | + |
| 220 | +As for the corruption itself, I was lucky that it was indeed a single |
| 221 | +byte. In fact, it turned out to be a single bit. The byte 0xc7 was |
| 222 | +corrupted to 0xc5. So presumably it was caused by faulty hardware, or a |
| 223 | +cosmic ray. |
| 224 | + |
| 225 | +And the aborted attempt to look at the inflated output to see what was |
| 226 | +wrong? I could have looked forever and never found it. Here's the diff |
| 227 | +between what the corrupted data inflates to, versus the real data: |
| 228 | + |
| 229 | +-------------- |
| 230 | + - cp = strtok (arg, "+"); |
| 231 | + + cp = strtok (arg, "."); |
| 232 | +-------------- |
| 233 | + |
| 234 | +It tweaked one byte and still ended up as valid, readable C that just |
| 235 | +happened to do something totally different! One takeaway is that on a |
| 236 | +less unlucky day, looking at the zlib output might have actually been |
| 237 | +helpful, as most random changes would actually break the C code. |
| 238 | + |
| 239 | +But more importantly, git's hashing and checksumming noticed a problem |
| 240 | +that easily could have gone undetected in another system. The result |
| 241 | +still compiled, but would have caused an interesting bug (that would |
| 242 | +have been blamed on some random commit). |