Commit 3d36c69

feat: add write mode
1 parent b1241c4 commit 3d36c69

28 files changed: +2704 -320 lines changed

.pylintrc

Lines changed: 4 additions & 0 deletions
@@ -6,4 +6,8 @@ disable =
     too-few-public-methods,
     too-many-arguments,
     too-many-branches,
+    too-many-instance-attributes,
     too-many-locals,
+
+[SIMILARITIES]
+ignore-imports=yes

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ adheres to [Semantic Versioning](https://semver.org/).

 ### :rocket: Added

+- Write modes (`w`, `x`, `r+`, `w+`, `x+`) :tada:
 - Allow to `seek` past the end of the fileobj
 - Calling `len` on a fileobj gives its length, and `bool` tells if it is empty
 - Export useful constants and functions from `lzma` for easy access: checks, filters,

README.md

Lines changed: 62 additions & 28 deletions
@@ -4,6 +4,8 @@

 Pure Python implementation of the XZ file format with random access support

+_Leveraging the lzma module for fast (de)compression_
+
 [![GitHub build status](https://img.shields.io/github/workflow/status/rogdham/python-xz/build/master)](https://github.com/rogdham/python-xz/actions?query=branch:master) [![Release on PyPI](https://img.shields.io/pypi/v/python-xz)](https://pypi.org/project/python-xz/) [![Code coverage](https://img.shields.io/badge/coverage-100%25-brightgreen)](https://github.com/rogdham/python-xz/search?q=fail+under&type=Code) [![MIT License](https://img.shields.io/pypi/l/python-xz)](https://github.com/Rogdham/python-xz/blob/master/LICENSE.txt)

 ---
@@ -14,40 +16,42 @@ Pure Python implementation of the XZ file format with random access support

 ---

-A XZ file can be composed of several streams and blocks. This allows for random access
-when reading, but this is not supported by Python's builtin `lzma` module, which would
-read all previous blocks for nothing.
+A XZ file can be composed of several streams and blocks. This allows for fast random
+access when reading, but this is not supported by Python's builtin `lzma` module (which
+would read all previous blocks for nothing).

 <div align="center">

-| | [lzma] | [lzmaffi] | python-xz |
-| :-------------: | :---------------: | :------------------: | :------------------: |
-| module type | builtin | cffi (C extension) | pure Python |
-| 📄 **read** | | | |
-| random access | ❌ no<sup>1</sup> | ✔️ yes<sup>2</sup> | ✔️ yes<sup>2</sup> |
-| several blocks | ✔️ yes | ✔️✔️ yes<sup>3</sup> | ✔️✔️ yes<sup>3</sup> |
-| several streams | ✔️ yes | ✔️ yes | ✔️✔️ yes<sup>4</sup> |
-| stream padding | ❌ no | ✔️ yes | ✔️ yes |
-| 📝 **write** | | | |
-| `w` mode | ✔️ yes | ✔️ yes | ⏳ planned |
-| `x` mode | ✔️ yes | ❌ no | ⏳ planned |
-| `a` mode | ✔️ new stream | ✔️ new stream | ⏳ planned |
-| `r+w` mode | ❌ no | ❌ no | ⏳ planned |
-| several blocks | ❌ no | ❌ no | ⏳ planned |
-| several streams | ❌ no<sup>5</sup> | ❌ no<sup>5</sup> | ⏳ planned |
-| stream padding | ❌ no<sup>6</sup> | ✔️ yes | ⏳ planned |
+| | [lzma] | [lzmaffi] | python-xz |
+| :---------------: | :---------------: | :------------------: | :------------------: |
+| module type | builtin | cffi (C extension) | pure Python |
+| 📄 **read** | | | |
+| random access | ❌ no<sup>1</sup> | ✔️ yes<sup>2</sup> | ✔️ yes<sup>2</sup> |
+| several blocks | ✔️ yes | ✔️✔️ yes<sup>3</sup> | ✔️✔️ yes<sup>3</sup> |
+| several streams | ✔️ yes | ✔️ yes | ✔️✔️ yes<sup>4</sup> |
+| stream padding | ❌ no<sup>5</sup> | ✔️ yes | ✔️ yes |
+| 📝 **write** | | | |
+| `w` mode | ✔️ yes | ✔️ yes | ✔️ yes |
+| `x` mode | ✔️ yes | ❌ no | ✔️ yes |
+| `a` mode | ✔️ new stream | ✔️ new stream | ⏳ planned |
+| `r+`/`w+`/… modes | ❌ no | ❌ no | ✔️ yes |
+| several blocks | ❌ no | ❌ no | ✔️ yes |
+| several streams | ❌ no<sup>6</sup> | ❌ no<sup>6</sup> | ✔️ yes |
+| stream padding | ❌ no | ❌ no | ⏳ planned |

 </div>
-<sub>
+
+<details>
+<summary>Notes</summary>

 1. Reading from a position will read the file from the very beginning
 2. Reading from a position will read the file from the beginning of the block
 3. Block positions available with the `block_boundaries` attribute
 4. Stream positions available with the `stream_boundaries` attribute
-5. Possible by manually closing and re-opening in append mode
-6. Related [issue](https://bugs.python.org/issue44134)
+5. Related [issue](https://bugs.python.org/issue44134)
+6. Possible by manually closing and re-opening in append mode

-</sub>
+</details>

 [lzma]: https://docs.python.org/3/library/lzma.html
 [lzmaffi]: https://github.com/r3m0t/backports.lzma
@@ -56,10 +60,10 @@ read all previous blocks for nothing.

 ## Usage

-### Read mode
-
 The API is similar to [lzma]: you can use either `xz.open` or `xz.XZFile`.

+### Read mode
+
 ```python
 >>> with xz.open('example.xz') as fin:
 ...     fin.read(18)
@@ -95,7 +99,32 @@ are still in bytes (just like with `lzma.open`).

 ### Write mode

-_This mode is not available yet._
+Writing is only supported from the end of file. It is however possible to truncate the
+file first. Note that truncating is only supported on block boundaries.
+
+```python
+>>> with xz.open('test.xz', 'w') as fout:
+...     fout.write(b'Hello, world!\n')
+...     fout.write(b'This sentence is still in the previous block\n')
+...     fout.change_block()
+...     fout.write(b'But this one is in its own!\n')
+...
+14
+45
+28
+```
+
+Advanced usage:
+
+- Modes like `r+`/`w+`/`x+` allow to open for both read and write at the same time;
+  however in the current implementation, a block with writing in progress is
+  automatically closed when reading data from it.
+- The `check`, `preset` and `filters` arguments to `xz.open` and `xz.XZFile` allow to
+  configure the default values for new streams and blocks.
+- Change block with the `change_block` method (the `preset` and `filters` attributes can
+  be changed beforehand to apply to the new block).
+- Change stream with the `change_stream` method (the `check` attribute can be changed
+  beforehand to apply to the new stream).

 ---

@@ -121,15 +150,20 @@ compression ratio.

 ### How can I create XZ files optimized for random-access?

-[XZ Utils](https://tukaani.org/xz/) can create XZ files with several blocks:
+You can open the file for writing and use the `change_block` method to create several
+blocks.
+
+Other tools allow to create XZ files with several blocks as well:
+
+- [XZ Utils](https://tukaani.org/xz/) needs to be called with flags:

 ```sh
 $ xz -T0 file # threading mode
 $ xz --block-size 16M file # same size for all blocks
 $ xz --block-list 16M,32M,8M,42M file # specific size for each block
 ```

-[PIXZ](https://github.com/vasi/pixz) creates files with several blocks by default:
+- [PIXZ](https://github.com/vasi/pixz) creates files with several blocks by default:

 ```sh
 $ pixz file
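
Note on the README's random-access claim: the sketch below is not python-xz code. It uses only the standard `lzma` module to illustrate why knowing where a stream starts matters. It assumes two small payloads compressed as separate streams and concatenated; the variable names are illustrative only.

```python
import lzma

# Two independent XZ streams, concatenated as they would be in one file.
part_one = lzma.compress(b"Hello, world!\n")
part_two = lzma.compress(b"But this one is in its own stream!\n")
payload = part_one + part_two

# lzma.decompress walks through every stream and joins the results.
combined = lzma.decompress(payload)

# A single LZMADecompressor stops at the end of the first stream; whatever
# follows is exposed as unused_data, which tells us where stream two begins.
decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
first = decompressor.decompress(payload)
second_offset = len(payload) - len(decompressor.unused_data)

# With the boundary known, the second stream can be read on its own,
# without decompressing anything that comes before it.
second = lzma.decompress(payload[second_offset:])
```

Here the boundary is found by decompressing the first stream. A reader with random access would presumably take such offsets from the stream footers and indexes instead, which is what makes skipping earlier data cheap.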

src/xz/block.py

Lines changed: 139 additions & 36 deletions
@@ -1,66 +1,169 @@
 from io import DEFAULT_BUFFER_SIZE, SEEK_SET
-from lzma import FORMAT_XZ, LZMADecompressor, LZMAError
+from lzma import FORMAT_XZ, LZMACompressor, LZMADecompressor, LZMAError

-from xz.common import XZError, create_xz_header, create_xz_index_footer
+from xz.common import (
+    XZError,
+    create_xz_header,
+    create_xz_index_footer,
+    parse_xz_footer,
+    parse_xz_index,
+)
 from xz.io import IOAbstract, IOCombiner, IOStatic


-class XZBlock(IOAbstract):
-    compressed_read_size = DEFAULT_BUFFER_SIZE
+class BlockRead:
+    read_size = DEFAULT_BUFFER_SIZE

     def __init__(self, fileobj, check, unpadded_size, uncompressed_size):
-        super().__init__(uncompressed_size)
-        self.compressed_fileobj = IOCombiner(
+        self.length = uncompressed_size
+        self.fileobj = IOCombiner(
             IOStatic(create_xz_header(check)),
             fileobj,
             IOStatic(
                 create_xz_index_footer(check, [(unpadded_size, uncompressed_size)])
             ),
         )
-        self._decompressor_reset()
+        self.reset()

-    def _decompressor_reset(self):
-        self.compressed_fileobj.seek(0, SEEK_SET)
+    def reset(self):
+        self.fileobj.seek(0, SEEK_SET)
+        self.pos = 0
         self.decompressor = LZMADecompressor(format=FORMAT_XZ)

-    def _decompressor_read(self, size):
+    def decompress(self, pos, size):
+        if pos < self.pos:
+            self.reset()
+
+        skip_before = pos - self.pos
+
         # pylint: disable=using-constant-test
         if self.decompressor.eof:
             raise XZError("block: decompressor eof")
+
         if self.decompressor.needs_input:
-            data_input = self.compressed_fileobj.read(self.compressed_read_size)
+            data_input = self.fileobj.read(self.read_size)
             if not data_input:
                 raise XZError("block: data eof")
         else:
             data_input = b""
-        return self.decompressor.decompress(data_input, size)
-
-    def seek(self, *args):
-        old_pos = self._pos
-        super().seek(*args)
-        pos_diff = self._pos - old_pos
-        if pos_diff < 0:
-            self._decompressor_reset()
-            old_pos = 0
-            pos_diff = self._pos
-        if pos_diff > 0:
-            self._pos = old_pos
-            self.read(pos_diff)

-    def _read(self, size):
-        try:
-            data_output = self._decompressor_read(size)
+        data_output = self.decompressor.decompress(data_input, skip_before + size)
+        self.pos += len(data_output)
+
+        if self.pos == self.length:
+            # we reached the end of the block
+            # according to the XZ specification, we must check the
+            # remaining bytes of the block; this is mainly performed by the
+            # decompressor itself when we consume it
+            while not self.decompressor.eof:
+                if self.decompress(self.pos, 1):
+                    raise LZMAError("Corrupt input data")
+
+        return data_output[skip_before:]
+
+
+class BlockWrite:
+    def __init__(self, fileobj, check, preset, filters):
+        self.fileobj = fileobj
+        self.check = check
+        self.compressor = LZMACompressor(FORMAT_XZ, check, preset, filters)
+        self.pos = 0
+        if self.compressor.compress(b"") != create_xz_header(check):
+            raise XZError("block: compressor header")
+
+    def _write(self, data):
+        if data:
+            self.fileobj.seek(self.pos)
+            self.fileobj.write(data)
+            self.pos += len(data)
+
+    def compress(self, data):
+        self._write(self.compressor.compress(data))
+
+    def finish(self):
+        data = self.compressor.flush()
+
+        # footer
+        check, backward_size = parse_xz_footer(data[-12:])
+        if check != self.check:
+            raise XZError("block: compressor footer check")

-            if self._pos + len(data_output) == self._length:
-                # we reached the end of the block
-                # according to the XZ specification, we must check the
-                # remaining bytes of the block; this is mainly performed by the
-                # decompressor itself when we consume it
-                while not self.decompressor.eof:
-                    if self._decompressor_read(1):
-                        raise LZMAError("Corrupt input data")
+        # index
+        records = parse_xz_index(data[-12 - backward_size : -12])
+        if len(records) != 1:
+            raise XZError("block: compressor index records length")

-            return data_output
+        # remaining block data
+        self._write(data[: -12 - backward_size])

+        return records[0]  # (unpadded_size, uncompressed_size)
+
+
+class XZBlock(IOAbstract):
+    def __init__(
+        self,
+        fileobj,
+        check,
+        unpadded_size,
+        uncompressed_size,
+        preset=None,
+        filters=None,
+    ):
+        super().__init__(uncompressed_size)
+        self.fileobj = fileobj
+        self.check = check
+        self.preset = preset
+        self.filters = filters
+        self.unpadded_size = unpadded_size
+        self.operation = None
+
+    @property
+    def uncompressed_size(self):
+        return self._length
+
+    def _read(self, size):
+        # enforce read mode
+        if not isinstance(self.operation, BlockRead):
+            self._write_end()
+            self.operation = BlockRead(
+                self.fileobj,
+                self.check,
+                self.unpadded_size,
+                self.uncompressed_size,
+            )
+
+        # read data
+        try:
+            return self.operation.decompress(self._pos, size)
         except LZMAError as ex:
             raise XZError(f"block: error while decompressing: {ex}") from ex
+
+    def writable(self):
+        return isinstance(self.operation, BlockWrite) or not self._length
+
+    def _write(self, data):
+        # enforce write mode
+        if not isinstance(self.operation, BlockWrite):
+            self.operation = BlockWrite(
+                self.fileobj,
+                self.check,
+                self.preset,
+                self.filters,
+            )
+
+        # write data
+        self.operation.compress(data)
+        return len(data)
+
+    def _write_after(self):
+        if isinstance(self.operation, BlockWrite):
+            self.unpadded_size, uncompressed_size = self.operation.finish()
+            if uncompressed_size != self.uncompressed_size:
+                raise XZError("block: compressor uncompressed size")
+            self.operation = None
+
+    def _truncate(self, size):
+        # thanks to the writable method, we are sure that length is zero
+        # so we don't need to handle the case of truncating in middle of the block
+        self.seek(size)
+        self.write(b"")
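
Note on `BlockWrite`: the class appears to rely on `LZMACompressor` emitting a complete single-block stream, then peeling off the 12-byte stream header at the start and the index plus 12-byte footer at the end, so that only raw block data reaches the file. The sketch below reproduces that slicing with the standard `lzma` module alone. `decode_mbi` and the inline footer parsing are hypothetical stand-ins for the `parse_xz_index` and `parse_xz_footer` helpers, whose bodies are not shown in this diff; the field layout follows my reading of the XZ format.

```python
import lzma
from struct import unpack

HEADER_MAGIC = b"\xfd7zXZ\x00"
FOOTER_MAGIC = b"YZ"


def decode_mbi(data):
    """Decode one XZ multibyte integer; return (value, bytes consumed)."""
    value = 0
    for size, byte in enumerate(data):
        value |= (byte & 0x7F) << (size * 7)
        if not byte & 0x80:
            return value, size + 1
    raise ValueError("truncated multibyte integer")


payload = b"Hello, world!\n"
compressor = lzma.LZMACompressor(lzma.FORMAT_XZ)

# First call with no input: the stream header comes out on its own.
header = compressor.compress(b"")
body = compressor.compress(payload)
tail = compressor.flush()
stream = header + body + tail

# Stream footer: CRC32 (4 bytes), stored backward size (4), flags (2), magic (2).
footer = stream[-12:]
backward_size = (unpack("<I", footer[4:8])[0] + 1) * 4

# Index: 0x00 indicator, record count, then (unpadded, uncompressed) per block.
index = stream[-12 - backward_size : -12]
offset = 1
record_count, used = decode_mbi(index[offset:])
offset += used
unpadded_size, used = decode_mbi(index[offset:])
offset += used
uncompressed_size, used = decode_mbi(index[offset:])

# What remains between header and index is the block itself.
block_data = stream[len(header) : -12 - backward_size]
```

The single index record gives the `(unpadded_size, uncompressed_size)` pair that `finish` returns, which is presumably what the enclosing stream needs to build its own index later.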

src/xz/common.py

Lines changed: 8 additions & 1 deletion
@@ -1,4 +1,5 @@
 from binascii import crc32 as crc32int
+import lzma
 from struct import pack, unpack

 HEADER_MAGIC = b"\xfd7zXZ\x00"
@@ -12,7 +13,7 @@ class XZError(Exception):
 def encode_mbi(value):
     data = bytearray()
     while value >= 0x80:
-        data.append((value | 0x80) & 0xFF)
+        data.append((value & 0x7F) | 0x80)
         value >>= 7
     data.append(value)
     return data
@@ -57,6 +58,8 @@ def create_xz_index_footer(check, records):
     index = b"\x00"
     index += encode_mbi(len(records))
     for unpadded_size, uncompressed_size in records:
+        if not unpadded_size:
+            raise XZError("index record unpadded size")
         index += encode_mbi(unpadded_size)
         index += encode_mbi(uncompressed_size)
     index += pad(len(index))
@@ -124,3 +127,7 @@ def parse_xz_footer(footer):
     if flag_first_byte or not 0 <= check <= 0xF:
         raise XZError("footer flags")
     return (check, backward_size)
+
+
+# find default value for check implicitely used by lzma
+DEFAULT_CHECK = parse_xz_header(lzma.compress(b"")[:12])
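
Note on the `encode_mbi` change: as far as I can tell, the old and new expressions produce the same byte for any non-negative integer (both keep the low seven bits and set the high bit), so the change looks like a readability fix rather than a behavior change. The sketch below checks that on a few sample values. The `_old`/`_new` function names are mine. Because `parse_xz_header` is not shown in this diff, the last line reads the check id directly from header byte 7, assuming the XZ stream flags layout.

```python
import lzma


def encode_mbi_old(value):
    """Multibyte integer encoding as written before this commit."""
    data = bytearray()
    while value >= 0x80:
        data.append((value | 0x80) & 0xFF)
        value >>= 7
    data.append(value)
    return bytes(data)


def encode_mbi_new(value):
    """Multibyte integer encoding as written after this commit."""
    data = bytearray()
    while value >= 0x80:
        data.append((value & 0x7F) | 0x80)
        value >>= 7
    data.append(value)
    return bytes(data)


samples = [0, 1, 0x7F, 0x80, 300, 0x3FFF, 0x4000, 2**32, 2**63 - 1]
mismatches = [v for v in samples if encode_mbi_old(v) != encode_mbi_new(v)]

# Check id used by lzma when none is given: low nibble of header byte 7.
default_check = lzma.compress(b"")[7] & 0x0F
```

On a standard CPython build, `default_check` should equal `lzma.CHECK_CRC64`, which matches the documented default for the XZ format.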
