Commit 349928d
Add arith+bzip2 support to name tokeniser
Also optimises the method used per data type. For example, CHAR and ALPHA
are letters/symbols and don't benefit from STRIPE mode. Conversely,
DUP, DIFF, DIGITS and DIGITS0 are always guaranteed to be 32-bit ints,
so they do benefit greatly.

This allows us to cut out a lot of the brute-force work, offering
faster encoding.
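
To illustrate why fixed-width 32-bit fields gain from striping, here is a minimal Python sketch. It is not the tokeniser's code: `stripe32`, `unstripe32` and the sample values are hypothetical, and Python's `bz2` module stands in for the real codecs. The assumption is that STRIPE groups byte k of every integer into its own stream.

```python
import bz2
import struct

def stripe32(values):
    """Pack values as little-endian uint32 and group bytes by position."""
    raw = struct.pack(f"<{len(values)}I", *values)
    # Stream k holds byte k of every integer, so the mostly-zero
    # high bytes end up next to each other.
    return b"".join(raw[k::4] for k in range(4))

def unstripe32(data):
    """Invert stripe32, returning the original list of integers."""
    n = len(data) // 4
    streams = [data[k * n:(k + 1) * n] for k in range(4)]
    raw = bytes(b for group in zip(*streams) for b in group)
    return list(struct.unpack(f"<{n}I", raw))

# Hypothetical DIGITS-like field: slowly increasing small integers.
values = [1000 + 3 * i for i in range(5000)]
plain = struct.pack(f"<{len(values)}I", *values)
striped = stripe32(values)

print("plain  :", len(bz2.compress(plain)))
print("striped:", len(bz2.compress(striped)))
```

Text-like fields such as CHAR and ALPHA have no fixed 4-byte layout to exploit, which fits the note above that they don't benefit from STRIPE mode.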
Benchmarks vs develop:

Level  Old size  Old time  New size  New time
    1   6052123  0m1.658s   4909581  0m1.567s
    3   4924296  0m1.755s   4808368  0m1.690s  (default cram3.1)
    5   4780927  0m2.859s   4768044  0m2.099s
    7   4754297  0m4.028s   4758883  0m2.279s
    9   4753731  0m4.353s   4753732  0m3.174s
   11   5998241  0m2.661s   4787219  0m2.483s
   13   4896975  0m2.994s   4703920  0m3.052s
   15   4677274  0m4.809s   4656469  0m3.915s
   17   4620982  0m7.845s   4629877  0m4.097s
   19   4620851  0m9.694s   4621212  0m6.346s

The speed-up is largest at the higher compression levels, and the
output is noticeably smaller at level 1, as we only try one method but
it's the best for that type of data.
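
The idea of trying one well-chosen method at low levels, and more candidates at higher levels, can be sketched as follows. This is only an illustration under stated assumptions: the `CANDIDATES` table, the level threshold and `compress_field` are hypothetical, and `zlib`/`bz2` stand in for the codecs the tokeniser actually uses.

```python
import bz2
import zlib

# Hypothetical candidate lists, best guess first. The real tokeniser's
# tables and codecs differ; zlib and bz2 stand in for them here.
CANDIDATES = {
    "ALPHA":  ["zlib", "bz2"],
    "DIGITS": ["bz2", "zlib"],
}
CODECS = {"zlib": zlib.compress, "bz2": bz2.compress}

def compress_field(token_type, data, level):
    """Try the first few candidates for this type and keep the smallest."""
    # Low levels trust the table and try one method; higher levels
    # try every candidate in the list.
    n_try = 1 if level <= 1 else len(CANDIDATES[token_type])
    best = None
    for name in CANDIDATES[token_type][:n_try]:
        out = CODECS[name](data)
        if best is None or len(out) < len(best[1]):
            best = (name, out)
    return best
```

For example, `compress_field("ALPHA", data, 1)` runs a single codec, while a higher level pays for extra attempts in exchange for picking whichever result is smaller.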
Tested on a broad mix of read names from multiple platforms and in
pos-sorted and name-sorted order.

Parent: 0fa0c9c
1 file changed, 128 additions and 33 deletions, in two hunks (old lines 1241-1300 and 1446-1453).