Commit e0865e9

Merge pull request #564 from onekey-sec/multi-volume
feat(processing): add multi-file handler support
2 parents ec5ca90 + 3987cd5 commit e0865e9

File tree

27 files changed

+1181
-173
lines changed

docs/api.md

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,16 @@ hide:
2323
show_root_heading: true
2424
show_source: true
2525

26+
::: unblob.models.DirectoryHandler
27+
handler: python
28+
options:
29+
members:
30+
- get_dependencies
31+
- calculate_multifile
32+
- extract
33+
show_root_heading: true
34+
show_source: true
35+
2636
::: unblob.models.Extractor
2737
handler: python
2838
options:
@@ -31,3 +41,12 @@ hide:
3141
- extract
3242
show_root_heading: true
3343
show_source: true
44+
45+
::: unblob.models.DirectoryExtractor
46+
handler: python
47+
options:
48+
members:
49+
- get_dependencies
50+
- extract
51+
show_root_heading: true
52+
show_source: true

docs/development.md

Lines changed: 122 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -196,6 +196,58 @@ If you need to parse structure using different endianness, the class exposes two
196196
If your format allows it, we strongly recommend you to inherit from the
197197
StructHandler given that it will be strongly typed and less prone to errors.
198198

199+
### DirectoryHandler class
200+
201+
`DirectoryHandler` is a specialized handler responsible for identifying multi-file formats
202+
located in a directory or in a subtree. The abstract class is located in
203+
[unblob/models.py](https://github.com/onekey-sec/unblob/blob/main/unblob/models.py):
204+
205+
```python
class DirectoryHandler(abc.ABC):
    """A directory type handler is responsible for searching, validating and "unblobbing" files from multiple files in a directory."""

    NAME: str

    EXTRACTOR: DirectoryExtractor

    PATTERN: DirectoryPattern

    @classmethod
    def get_dependencies(cls):
        """Return external command dependencies needed for this handler to work."""
        if cls.EXTRACTOR:
            return cls.EXTRACTOR.get_dependencies()
        return []

    @abc.abstractmethod
    def calculate_multifile(self, file: Path) -> Optional[MultiFile]:
        """Calculate the MultiFile in a directory, using a file matched by the pattern as a starting point."""

    def extract(self, paths: List[Path], outdir: Path):
        if self.EXTRACTOR is None:
            logger.debug("Skipping file: no extractor.", paths=paths)
            raise ExtractError

        # We only extract every blob once, it's a mistake to extract the same blob again
        outdir.mkdir(parents=True, exist_ok=False)

        self.EXTRACTOR.extract(paths, outdir)
```
236+
237+
- `NAME`: a unique name for this handler
238+
- `PATTERN`: a `DirectoryPattern` used to identify the starting/main file of the given format.
239+
- `EXTRACTOR`: a [DirectoryExtractor](extractors.md).
240+
- `get_dependencies()`: returns the extractor dependencies. This helps unblob keep
241+
track of [third party dependencies](extractors.md).
242+
- `calculate_multifile()`: this is the method that needs to be overridden in your
243+
handler. It receives a `file` Path object identified by the `PATTERN` in the directory.
244+
This is where you implement the logic to compute and return the `MultiFile` file set.
245+
246+
Any file that is processed as part of a `MultiFile` set is skipped during `Chunk`
247+
detection.
248+
249+
A file that belongs to more than one `MultiFile` set is a collision and results in a processing error.
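
As a rough sketch of the kind of logic `calculate_multifile()` might contain for a multi-volume archive, the standalone function below collects the sibling volumes of a first volume. It does not import unblob: `find_volumes`, `VOLUME_SUFFIX` and the three-digit suffix convention are assumptions made for illustration, and a real handler would wrap the resulting paths in a `MultiFile`.

```python
import re
from pathlib import Path
from typing import List, Optional

# Hypothetical convention, not taken from unblob: volumes end in ".001", ".002", ...
VOLUME_SUFFIX = re.compile(r"\.(\d{3})$")


def find_volumes(first_volume: Path) -> Optional[List[Path]]:
    """Return every volume of the set starting at first_volume, in order.

    Returns None when first_volume is not a ".001" file or when the
    numbering has gaps, so the caller can reject the set.
    """
    match = VOLUME_SUFFIX.search(first_volume.name)
    if match is None or match.group(1) != "001":
        return None

    base = first_volume.name[: match.start()]  # e.g. "archive.7z"
    volumes = sorted(
        path
        for path in first_volume.parent.iterdir()
        if path.is_file()
        and path.name.startswith(base)
        and VOLUME_SUFFIX.fullmatch(path.name[len(base):])
    )

    # Reject sets with missing volumes (e.g. .001 and .003 without .002).
    expected = [f"{base}.{index:03d}" for index in range(1, len(volumes) + 1)]
    if [path.name for path in volumes] != expected:
        return None
    return volumes
```

Returning `None` for an incomplete set mirrors the `Optional[MultiFile]` return type shown above: the handler can decline a match instead of handing a broken set to the extractor.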
250+
199251
### Example Handler implementation
200252

201253
Let's imagine that we have a custom file format that always starts with the
@@ -367,6 +419,44 @@ PATTERNS = [
367419
]
368420
```
369421

422+
### DirectoryPatterns
423+
424+
The `DirectoryHandler` uses these patterns to identify the starting/main file of a given
425+
multi-file format. There are currently two main types: `Glob` and `SingleFile`.
426+
427+
#### Glob
428+
429+
The `Glob` object uses traditional globbing to detect files in a directory. Use it when
430+
the file name has a varying part. Multiple multi-file sets can be present in a single
431+
directory. The job of the `DirectoryPattern` is to recognize the main file for each set.
432+
433+
Here is an example of `Glob`:
434+
435+
```python
436+
PATTERN = Glob("*.7z.001")
437+
```
438+
439+
This example identifies the first volume of a multi-volume 7-Zip archive. Note that this could pick
440+
up all first volumes in a given directory. (NB: Detecting the other volumes of a given set is the
441+
responsibility of the `DirectoryHandler.calculate_multifile` function. Do not write a `Glob` which picks
442+
up all the files of a multi-file set as that would result in errors.)
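
To illustrate why the pattern should match only the main file, here is a small sketch using the standard library's `fnmatch` module. It assumes that `Glob` behaves like ordinary shell-style globbing for these patterns; the file names and the `matching` helper are made up for the example.

```python
from fnmatch import fnmatchcase
from typing import List

# Hypothetical directory listing containing two multi-volume sets.
names = [
    "backup.7z.001", "backup.7z.002",
    "photos.7z.001", "photos.7z.002", "photos.7z.003",
]


def matching(pattern: str) -> List[str]:
    """Return the names matched by a shell-style glob pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# One match per set: each first volume can seed its own multi-file set.
first_volumes = matching("*.7z.001")

# Too broad: every volume matches, so each set would be started several times.
all_volumes = matching("*.7z.*")
```

With the narrow pattern, `first_volumes` contains one entry per set, while the broad pattern returns all five files, which is the situation the note above warns against.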
443+
444+
445+
#### SingleFile
446+
447+
The `SingleFile` object can be used to identify a single file with a known name. (Obviously only use this if the
448+
main file name is well-known and does not have a varying part. It also means that only a single multi-file set
449+
can be detected in a given directory.)
450+
451+
Here is an example of `SingleFile`:
452+
453+
```python
454+
PATTERN = SingleFile("meta-data.json")
455+
```
456+
457+
This would pick up the file `meta-data.json` and pass it to the `DirectoryHandler`. The handler still has to
458+
verify the file and has to find the additional files.
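
A minimal sketch of that verification step is shown below, assuming a purely hypothetical format in which `meta-data.json` holds a `"files"` list naming the other members of the set. Neither the JSON layout nor the `collect_file_set` helper comes from unblob; a real handler would apply its format's own rules and return a `MultiFile`.

```python
import json
from pathlib import Path
from typing import List, Optional


def collect_file_set(meta_file: Path) -> Optional[List[Path]]:
    """Validate a hypothetical meta-data.json and return the files it names."""
    try:
        meta = json.loads(meta_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None  # unreadable or not JSON: not our format

    names = meta.get("files") if isinstance(meta, dict) else None
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        return None

    # Only accept plain file names, so the set stays inside this directory.
    if any(Path(name).name != name for name in names):
        return None

    paths = [meta_file.parent / name for name in names]
    if not all(path.is_file() for path in paths):
        return None  # incomplete set
    return [meta_file, *paths]
```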
459+
370460
## Writing extractors
371461

372462
!!! Recommendation
@@ -412,6 +502,32 @@ Two methods are exposed by this class:
412502
- `extract()`: you must override this function. This is where you'll perform the
413503
extraction of `inpath` content into `outdir` extraction directory
414504

505+
### DirectoryExtractor class
506+
507+
The `DirectoryExtractor` interface is defined in
508+
[unblob/models.py](https://github.com/onekey-sec/unblob/blob/main/unblob/models.py):
509+
510+
```python
class DirectoryExtractor(abc.ABC):
    def get_dependencies(self) -> List[str]:
        """Return the external command dependencies."""
        return []

    @abc.abstractmethod
    def extract(self, paths: List[Path], outdir: Path):
        """Extract from a multi file path list.

        Raises ExtractError on failure.
        """
```
523+
524+
Two methods are exposed by this class:
525+
526+
- `get_dependencies()`: you should override it if your custom extractor relies on
527+
external dependencies such as command line tools
528+
- `extract()`: you must override this function. This is where you'll perform the
529+
extraction of the `paths` files into the `outdir` extraction directory
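
The following sketch shows what a very simple implementation of this interface could look like. To stay self-contained it defines a local stand-in for `DirectoryExtractor` that mirrors the interface above instead of importing it from unblob; `ConcatenatingExtractor` and the `joined.bin` output name are hypothetical. It assumes `outdir` already exists, as `DirectoryHandler.extract()` creates it before calling the extractor.

```python
import abc
import shutil
from pathlib import Path
from typing import List


class DirectoryExtractor(abc.ABC):
    """Local stand-in mirroring the interface above (not imported from unblob)."""

    def get_dependencies(self) -> List[str]:
        return []

    @abc.abstractmethod
    def extract(self, paths: List[Path], outdir: Path):
        ...


class ConcatenatingExtractor(DirectoryExtractor):
    """Hypothetical extractor: joins the parts, in order, into one output file."""

    OUTPUT_NAME = "joined.bin"  # assumed name, for illustration only

    def extract(self, paths: List[Path], outdir: Path):
        # outdir is expected to exist already; the handler creates it.
        with (outdir / self.OUTPUT_NAME).open("wb") as output:
            for path in paths:
                with path.open("rb") as part:
                    shutil.copyfileobj(part, output)
```

Because this extractor uses only the standard library, it keeps the default `get_dependencies()`, which returns an empty list. An extractor that shells out to a command line tool would override it to name that tool.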
530+
415531
### Example Extractor
416532

417533
Extractors are quite complex beasts, so rather than trying to come up with a
@@ -451,3 +567,9 @@ Learn from us so you can avoid them in the future 🙂
451567
back.
452568
- Watch out for [negative seeking](https://github.com/onekey-sec/unblob/pull/280)
453569
- Make sure you get your types right! signedness can [get in the way](https://github.com/onekey-sec/unblob/pull/130).
570+
- Try to use patterns that are as specific as possible to identify data in Handlers, to avoid false-positive matches
571+
and extra processing in the Handler.
572+
- Try to avoid using overlapping patterns, as patterns that match on the same data could easily collide. Hyperscan
573+
does not guarantee priority between patterns matching on the same data. (Hyperscan reports matches ordered by the
574+
pattern match end offset. If multiple patterns match at the same end offset, the matching order depends on the
575+
pattern registration order which is undefined in unblob.)

docs/glossary.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -38,3 +38,8 @@ the recursion depth is reached. Beyond that level, no further extraction will
3838
happen.
3939
For example, if a `tar.gz` contains a `zip` and a text file, the
4040
recursion depth will be **3**: 1. gzip layer, 2. tar, 3. zip and text file.
41+
42+
#### MultiFile
43+
44+
A set of files that were identified by a `DirectoryHandler` representing a format
45+
which consists of multiple files. A `MultiFile` is extracted using a `DirectoryExtractor`.

docs/index.md

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -33,7 +33,8 @@ extraction of arbitrary firmware.
3333
Specialized tools that can extract information from those firmware images already
3434
exist, but we were craving something smarter that could identify both
3535
**start-offset** and **end-offset** of a specific chunk
36-
(e.g. filesystem, compression stream, archive, ...).
36+
(e.g. filesystem, compression stream, archive, ...) as well as handle formats
37+
split across multiple files.
3738

3839
We **stick to the format standard** as much as possible when deriving these
3940
offsets, and we clearly define what we want out of identified chunks (e.g., not
@@ -98,6 +99,15 @@ unblob identifies known and unknown chunks of data within a file:
9899

99100
![unblob_architecture.webp](unblob_architecture.webp)
100101

102+
unblob also supports special formats where data is split across multiple files
103+
like multi-volume archives or data & meta-data formats:
104+
105+
- A special **DirectoryHandler** is responsible for identifying the files that make up
106+
a multi-file set.
107+
108+
- Identified MultiFile sets are not carved, but rather directly extracted using
109+
a special **DirectoryExtractor**.
110+
101111
## Used technologies
102112

103113
- unblob is written in [Python](https://www.python.org/).

tests/conftest.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,5 +11,5 @@
1111

1212
@pytest.fixture
1313
def task_result():
14-
task = Task(path=Path("/nonexistent"), depth=0, chunk_id="")
14+
task = Task(path=Path("/nonexistent"), depth=0, blob_id="")
1515
return TaskResult(task)
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
version https://git-lfs.github.com/spec/v1
2+
oid sha256:c17f34d380545b1f139b52486f48b1852a9e74c2079a8e0338b0b7600a720fd6
3+
size 10240
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
version https://git-lfs.github.com/spec/v1
2+
oid sha256:a89ae76a6a3624af5eef2bbebe3ac0ac9916d66f3a65b055a4931f28065fd55e
3+
size 100
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
version https://git-lfs.github.com/spec/v1
2+
oid sha256:f3af8f79806dc5059ce0c746906f6d4c1f4e0206abd5e1742f8a8215bf6ebae0
3+
size 81
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
version https://git-lfs.github.com/spec/v1
2+
oid sha256:47d741b6059c6d7e99be25ce46fb9ba099cfd6515de1ef7681f93479d25996a4
3+
size 9
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
version https://git-lfs.github.com/spec/v1
2+
oid sha256:e0763097d2327a89fb7fc6a1fad40f87d2261dcdd6c09e65ee00b200a0128e1c
3+
size 9
