@@ -196,6 +196,58 @@ If you need to parse structure using different endianness, the class exposes two
196196 If your format allows it, we strongly recommend inheriting from
197197 `StructHandler`, since it is strongly typed and less prone to errors.
198198
199+ ### DirectoryHandler class
200+
201+ `DirectoryHandler` is a specialized handler responsible for identifying multi-file formats
202+ located in a directory or in a subtree. The abstract class is located in
203+ [unblob/models.py](https://github.com/onekey-sec/unblob/blob/main/unblob/models.py):
204+
205+ ```python
206+ class DirectoryHandler(abc.ABC):
207+     """A directory type handler is responsible for searching, validating and "unblobbing" files from multiple files in a directory."""
208+
209+     NAME: str
210+
211+     EXTRACTOR: DirectoryExtractor
212+
213+     PATTERN: DirectoryPattern
214+
215+     @classmethod
216+     def get_dependencies(cls):
217+         """Return external command dependencies needed for this handler to work."""
218+         if cls.EXTRACTOR:
219+             return cls.EXTRACTOR.get_dependencies()
220+         return []
221+
222+     @abc.abstractmethod
223+     def calculate_multifile(self, file: Path) -> Optional[MultiFile]:
224+         """Calculate the MultiFile in a directory, using a file matched by the pattern as a starting point."""
225+
226+     def extract(self, paths: List[Path], outdir: Path):
227+         if self.EXTRACTOR is None:
228+             logger.debug("Skipping file: no extractor.", paths=paths)
229+             raise ExtractError
230+
231+         # We only extract every blob once, it's a mistake to extract the same blob again
232+         outdir.mkdir(parents=True, exist_ok=False)
233+
234+         self.EXTRACTOR.extract(paths, outdir)
235+ ```
236+
237+ - `NAME`: a unique name for this handler.
238+ - `PATTERN`: a `DirectoryPattern` used to identify the starting/main file of the given format.
239+ - `EXTRACTOR`: a [DirectoryExtractor](extractors.md).
240+ - `get_dependencies()`: returns the extractor dependencies. This helps unblob keep
241+ track of [third party dependencies](extractors.md).
242+ - `calculate_multifile()`: this is the method that needs to be overridden in your
243+ handler. It receives a `file` Path object identified by the `PATTERN` in the directory.
244+ This is where you implement the logic to compute and return the `MultiFile` file set.
245+
246+ Any file that is processed as part of a `MultiFile` set is skipped during `Chunk`
247+ detection.
248+
249+ A file that belongs to more than one `MultiFile` set is a collision and results in a processing error.
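
As a rough illustration of the file-set logic a `calculate_multifile()` override might contain, the sketch below collects the sibling volumes of a matched first volume. It uses only the standard library; `find_volumes` and the three-digit volume suffix convention are assumptions made for this example, not part of unblob. A real handler would wrap the resulting paths in a `MultiFile` and validate them first.

```python
import re
from pathlib import Path
from typing import List

# Assumption for this sketch: volumes are numbered with a three-digit suffix,
# e.g. "archive.7z.001", "archive.7z.002".
VOLUME_SUFFIX = re.compile(r"^\.\d{3}$")


def find_volumes(first_volume: Path) -> List[Path]:
    """Return all volumes that share the base name of `first_volume`, in order."""
    base = first_volume.stem  # "archive.7z" for "archive.7z.001"
    volumes = [
        candidate
        for candidate in first_volume.parent.iterdir()
        if candidate.is_file()
        and candidate.stem == base
        and VOLUME_SUFFIX.match(candidate.suffix)
    ]
    # Zero-padded suffixes sort correctly as plain strings.
    return sorted(volumes)
```

Because the lookup is keyed on the base name of the matched file, two archives in the same directory produce two disjoint file sets, which avoids the collision described above.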
250+
199251### Example Handler implementation
200252
201253Let's imagine that we have a custom file format that always starts with the
@@ -367,6 +419,44 @@ PATTERNS = [
367419]
368420```
369421
422+ ### DirectoryPatterns
423+
424+ The `DirectoryHandler` uses these patterns to identify the starting/main file of a given
425+ multi-file format. There are currently two main types: `Glob` and `SingleFile`.
426+
427+ #### Glob
428+
429+ The `Glob` object uses traditional globbing to detect files in a directory. Use it when
430+ the file name has a varying part. A single directory can contain multiple multi-file sets;
431+ the job of the `DirectoryPattern` is to recognize the main file of each set.
432+
433+ Here is an example of `Glob`:
434+
435+ ```python
436+ PATTERN = Glob("*.7z.001")
437+ ```
438+
439+ This example identifies the first volume of a multi-volume 7-Zip archive. Note that it picks
440+ up every first volume in a given directory. (NB: Detecting the other volumes of a given set is the
441+ responsibility of the `DirectoryHandler.calculate_multifile` function. Do not write a `Glob` that picks
442+ up all the files of a multi-file set, as that would result in errors.)
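
To see why the glob should match only the main file, the following sketch compares a narrow and an overly broad pattern. It uses `pathlib.Path.glob` as a stand-in for shell-style matching; how unblob's `Glob` evaluates patterns internally is not shown here, and `match_names` is a hypothetical helper for this example.

```python
from pathlib import Path
from typing import List


def match_names(directory: Path, pattern: str) -> List[str]:
    """Return the sorted names of entries in `directory` matching `pattern`."""
    return sorted(path.name for path in directory.glob(pattern))


# Given a directory holding a.7z.001, a.7z.002, b.7z.001 and b.7z.002:
#   match_names(directory, "*.7z.001") selects one main file per archive,
#   match_names(directory, "*.7z.*") selects every volume, so each volume
#   would be treated as the start of its own multi-file set.
```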
443+
444+
445+ #### SingleFile
446+
447+ The `SingleFile` object identifies a single file with a known name. Only use it if the
448+ main file name is well known and has no varying part. It also means that only a single multi-file set
449+ can be detected in a given directory.
450+
451+ Here is an example of `SingleFile`:
452+
453+ ```python
454+ PATTERN = SingleFile("meta-data.json")
455+ ```
456+
457+ This picks up the file `meta-data.json` and passes it to the `DirectoryHandler`. The handler still has to
458+ verify the file and find the additional files.
459+
370460## Writing extractors
371461
372462!!! Recommendation
@@ -412,6 +502,32 @@ Two methods are exposed by this class:
412502- `extract()`: you must override this function. This is where you'll perform the
413503 extraction of `inpath` content into the `outdir` extraction directory.
414504
505+ ### DirectoryExtractor class
506+
507+ The `DirectoryExtractor` interface is defined in
508+ [unblob/models.py](https://github.com/onekey-sec/unblob/blob/main/unblob/models.py):
509+
510+ ```python
511+ class DirectoryExtractor(abc.ABC):
512+     def get_dependencies(self) -> List[str]:
513+         """Return the external command dependencies."""
514+         return []
515+
516+     @abc.abstractmethod
517+     def extract(self, paths: List[Path], outdir: Path):
518+         """Extract from a multi file path list.
519+
520+         Raises ExtractError on failure.
521+         """
522+ ```
523+
524+ Two methods are exposed by this class:
525+
526+ - `get_dependencies()`: you should override it if your custom extractor relies on
527+ external dependencies such as command line tools.
528+ - `extract()`: you must override this function. This is where you'll perform the
529+ extraction of the `paths` files into the `outdir` extraction directory.
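
A minimal sketch of a custom extractor, assuming a hypothetical format whose parts only need to be concatenated in name order. To stay self-contained, the block declares stand-ins for `DirectoryExtractor` and `ExtractError` that mirror the interface shown above; in a real handler you would import them from unblob rather than redefine them. `JoinPartsExtractor` and the `joined.bin` output name are inventions for this example.

```python
import abc
import shutil
from pathlib import Path
from typing import List


class ExtractError(Exception):
    """Stand-in for unblob's ExtractError (assumption for this sketch)."""


class DirectoryExtractor(abc.ABC):
    """Stand-in mirroring the interface documented above."""

    def get_dependencies(self) -> List[str]:
        return []

    @abc.abstractmethod
    def extract(self, paths: List[Path], outdir: Path):
        """Extract from a multi file path list."""


class JoinPartsExtractor(DirectoryExtractor):
    """Hypothetical extractor: concatenate all parts into one output file."""

    def extract(self, paths: List[Path], outdir: Path):
        if not paths:
            raise ExtractError("no input files to join")
        target = outdir / "joined.bin"
        with target.open("wb") as out:
            # Sort so the parts are joined in name order, whatever order they arrive in.
            for part in sorted(paths):
                with part.open("rb") as src:
                    shutil.copyfileobj(src, out)
```

The extractor assumes `outdir` already exists, since `DirectoryHandler.extract()` creates it before delegating. It needs no external tools, so the default `get_dependencies()` is left untouched.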
530+
415531### Example Extractor
416532
417533Extractors are quite complex beasts, so rather than trying to come up with a
@@ -451,3 +567,9 @@ Learn from us so you can avoid them in the future 🙂
451567 back.
452568- Watch out for [negative seeking](https://github.com/onekey-sec/unblob/pull/280).
453569- Make sure you get your types right! Signedness can [get in the way](https://github.com/onekey-sec/unblob/pull/130).
570+ - Use patterns that are as specific as possible to identify data in Handlers; this avoids false-positive matches
571+ and extra processing in the Handler.
572+ - Avoid overlapping patterns, as patterns that match the same data can easily collide. Hyperscan
573+ does not guarantee priority between patterns matching the same data. (Hyperscan reports matches ordered by the
574+ pattern match end offset. When multiple patterns match at the same end offset, the matching order depends on the
575+ pattern registration order, which is undefined in unblob.)