Commit 51ec5af

Merge pull request #1 from andhus/dirhash_standard
Implementation based on the Dirhash Standard
2 parents c3362a7 + aa4cd7f commit 51ec5af

File tree

10 files changed

+1630
-865
lines changed

CHANGELOG.md

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
# Changelog
2+
3+
All notable changes to this project will be documented in this file.
4+
5+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7+
8+
## [Unreleased]
9+
NIL
10+
11+
## [0.2.0] - 2019-04-20
12+
Complies with [Dirhash Standard](https://github.com/andhus/dirhash) Version [0.1.0](https://github.com/andhus/dirhash/releases/v0.1.0)
13+
14+
### Added
15+
- A first implementation based on the formalized [Dirhash Standard](https://github.com/andhus/dirhash).
16+
- This changelog.
17+
- Results from a new benchmark run after the changes. `benchmark/run.py` now writes results files whose names include `dirhash.__version__`.
18+
19+
### Changed
20+
- **Significant breaking changes** from version 0.1.1 - both regarding API and the
21+
underlying method/protocol for computing the hash. This means that **hashes
22+
computed with this version will differ from hashes computed with version < 0.2.0 for
23+
same directory**.
24+
- This dirhash python implementation has moved to here
25+
[github.com/andhus/dirhash-python](https://github.com/andhus/dirhash-python) from
26+
the previous repository
27+
[github.com/andhus/dirhash](https://github.com/andhus/dirhash)
28+
which now contains the formal description of the Dirhash Standard.
29+
30+
### Removed
31+
- All support for the `.dirhashignore` file. This seemed superfluous; please file an
32+
issue if you need this feature.

README.md

Lines changed: 17 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -1,22 +1,23 @@
1-
[![Build Status](https://travis-ci.com/andhus/dirhash.svg?branch=master)](https://travis-ci.com/andhus/dirhash)
2-
[![codecov](https://codecov.io/gh/andhus/dirhash/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/dirhash)
1+
[![Build Status](https://travis-ci.com/andhus/dirhash-python.svg?branch=master)](https://travis-ci.com/andhus/dirhash-python)
2+
[![codecov](https://codecov.io/gh/andhus/dirhash-python/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/dirhash-python)
33

44
# dirhash
5-
A lightweight python module and tool for computing the hash of any
5+
A lightweight python module and CLI for computing the hash of any
66
directory based on its files' structure and content.
7-
- Supports any hashing algorithm of Python's built-in `hashlib` module
8-
- `.gitignore` style "wildmatch" patterns for expressive filtering of files to
9-
include/exclude.
7+
- Supports all hashing algorithms of Python's built-in `hashlib` module.
8+
- Glob/wildcard (".gitignore style") path matching for expressive filtering of files to include/exclude.
109
- Multiprocessing for up to [6x speed-up](#performance)
1110

11+
The hash is computed according to the [Dirhash Standard](https://github.com/andhus/dirhash), which is designed to allow for consistent and collision resistant generation/verification of directory hashes across implementations.
12+
1213
## Installation
1314
From PyPI:
1415
```commandline
1516
pip install dirhash
1617
```
1718
Or directly from source:
1819
```commandline
19-
git clone git@github.com:andhus/dirhash.git
20+
git clone git@github.com:andhus/dirhash-python.git
2021
pip install dirhash-python/
2122
```
2223

@@ -25,16 +26,16 @@ Python module:
2526
```python
2627
from dirhash import dirhash
2728

28-
dirpath = 'path/to/directory'
29-
dir_md5 = dirhash(dirpath, 'md5')
30-
filtered_sha1 = dirhash(dirpath, 'sha1', ignore=['.*', '.*/', '*.pyc'])
31-
pyfiles_sha3_512 = dirhash(dirpath, 'sha3_512', match=['*.py'])
29+
dirpath = "path/to/directory"
30+
dir_md5 = dirhash(dirpath, "md5")
31+
pyfiles_md5 = dirhash(dirpath, "md5", match=["*.py"])
32+
no_hidden_sha1 = dirhash(dirpath, "sha1", ignore=[".*", ".*/"])
3233
```
3334
CLI:
3435
```commandline
3536
dirhash path/to/directory -a md5
36-
dirhash path/to/directory -a sha1 -i ".* .*/ *.pyc"
37-
dirhash path/to/directory -a sha3_512 -m "*.py"
37+
dirhash path/to/directory -a md5 --match "*.py"
38+
dirhash path/to/directory -a sha1 --ignore ".*" ".*/"
3839
```
3940

4041
## Why?
@@ -66,7 +67,7 @@ and executing `hashlib` code.
6667
The main effort to boost performance is support for multiprocessing, where the
6768
reading and hashing is parallelized over individual files.
6869

69-
As a reference, let's compare the performance of the `dirhash` [CLI](https://github.com/andhus/dirhash/blob/master/dirhash/cli.py)
70+
As a reference, let's compare the performance of the `dirhash` [CLI](https://github.com/andhus/dirhash-python/cli.py)
7071
with the shell command:
7172

7273
`find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5`
@@ -87,7 +88,7 @@ shell reference | nested_32k_32kB | 6.82 | -> 1.0
8788
`dirhash` | nested_32k_32kB | 3.43 | 2.00
8889
`dirhash`(8 workers)| nested_32k_32kB | 1.14 | **6.00**
8990

90-
The benchmark was run a MacBook Pro (2018), further details and source code [here](https://github.com/andhus/dirhash/tree/master/benchmark).
91+
The benchmark was run on a MacBook Pro (2018); further details and source code are [here](https://github.com/andhus/dirhash-python/benchmark).
9192

9293
## Documentation
93-
Please refer to `dirhash -h` and the python [source code](https://github.com/andhus/dirhash/blob/master/dirhash/__init__.py).
94+
Please refer to `dirhash -h`, the python [source code](https://github.com/andhus/dirhash/dirhash-python/__init__.py) and the [Dirhash Standard](https://github.com/andhus/dirhash).
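
The README states that the main performance effort is parallelizing reading and hashing over individual files. The following is a minimal sketch of that general idea only. It does not implement the Dirhash Standard and will not reproduce `dirhash` output; the function names `file_digest` and `naive_dir_digest` and the `path:digest` combination scheme are hypothetical. It also uses a thread pool instead of the multiprocessing the README describes, to keep the example short and portable.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def file_digest(path, algorithm="md5", chunk_size=2**20):
    """Hash one file in chunks so large files are never fully loaded."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def naive_dir_digest(root, algorithm="md5", workers=4):
    """Illustrative directory digest (NOT the Dirhash Standard).

    Per-file digests are computed concurrently, then combined in sorted
    path order so the result does not depend on scheduling.
    """
    paths = sorted(
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
    )
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so digests line up with paths.
        digests = list(pool.map(lambda p: file_digest(p, algorithm), paths))

    combined = hashlib.new(algorithm)
    for path, digest in zip(paths, digests):
        # Use '/'-separated relative paths so the digest is location independent.
        rel = os.path.relpath(path, root).replace(os.sep, "/")
        combined.update(f"{rel}:{digest}\n".encode("utf-8"))
    return combined.hexdigest()
```

Because the per-file results are combined in a fixed order, the digest is the same for any number of workers; only the wall-clock time changes.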

benchmark/results_v0.2.0.csv

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
,test_case,implementation,algorithm,workers,t_best,t_median,speed-up (median)
2+
0,flat_8_128MB,shell reference,md5,1,2.079,2.083,1.0
3+
1,flat_8_128MB,dirhash_impl,md5,1,1.734,1.945,1.0709511568123393
4+
2,flat_8_128MB,dirhash_impl,md5,2,0.999,1.183,1.760777683854607
5+
3,flat_8_128MB,dirhash_impl,md5,4,0.711,0.728,2.8612637362637368
6+
4,flat_8_128MB,dirhash_impl,md5,8,0.504,0.518,4.021235521235521
7+
5,flat_1k_1MB,shell reference,md5,1,3.383,3.679,1.0
8+
6,flat_1k_1MB,dirhash_impl,md5,1,1.846,1.921,1.9151483602290473
9+
7,flat_1k_1MB,dirhash_impl,md5,2,1.137,1.158,3.1770293609671847
10+
8,flat_1k_1MB,dirhash_impl,md5,4,0.74,0.749,4.911882510013351
11+
9,flat_1k_1MB,dirhash_impl,md5,8,0.53,0.534,6.889513108614231
12+
10,flat_32k_32kB,shell reference,md5,1,13.827,18.213,1.0
13+
11,flat_32k_32kB,dirhash_impl,md5,1,13.655,13.808,1.3190179606025494
14+
12,flat_32k_32kB,dirhash_impl,md5,2,3.276,3.33,5.469369369369369
15+
13,flat_32k_32kB,dirhash_impl,md5,4,2.409,2.421,7.522924411400249
16+
14,flat_32k_32kB,dirhash_impl,md5,8,2.045,2.086,8.731064237775648
17+
15,nested_1k_1MB,shell reference,md5,1,3.284,3.332,1.0
18+
16,nested_1k_1MB,dirhash_impl,md5,1,1.717,1.725,1.9315942028985504
19+
17,nested_1k_1MB,dirhash_impl,md5,2,1.026,1.034,3.222437137330754
20+
18,nested_1k_1MB,dirhash_impl,md5,4,0.622,0.633,5.263823064770932
21+
19,nested_1k_1MB,dirhash_impl,md5,8,0.522,0.529,6.29867674858223
22+
20,nested_32k_32kB,shell reference,md5,1,11.898,12.125,1.0
23+
21,nested_32k_32kB,dirhash_impl,md5,1,13.858,14.146,0.8571327583769263
24+
22,nested_32k_32kB,dirhash_impl,md5,2,2.781,2.987,4.059256779377302
25+
23,nested_32k_32kB,dirhash_impl,md5,4,1.894,1.92,6.315104166666667
26+
24,nested_32k_32kB,dirhash_impl,md5,8,1.55,1.568,7.732780612244897
27+
25,flat_8_128MB,shell reference,sha1,1,2.042,2.05,1.0
28+
26,flat_8_128MB,dirhash_impl,sha1,1,1.338,1.354,1.5140324963072376
29+
27,flat_8_128MB,dirhash_impl,sha1,2,0.79,0.794,2.5818639798488663
30+
28,flat_8_128MB,dirhash_impl,sha1,4,0.583,0.593,3.456998313659359
31+
29,flat_8_128MB,dirhash_impl,sha1,8,0.483,0.487,4.209445585215605
32+
30,flat_1k_1MB,shell reference,sha1,1,2.118,2.129,1.0
33+
31,flat_1k_1MB,dirhash_impl,sha1,1,1.39,1.531,1.3905943827563685
34+
32,flat_1k_1MB,dirhash_impl,sha1,2,0.925,0.932,2.2843347639484977
35+
33,flat_1k_1MB,dirhash_impl,sha1,4,0.614,0.629,3.384737678855326
36+
34,flat_1k_1MB,dirhash_impl,sha1,8,0.511,0.52,4.094230769230769
37+
35,flat_32k_32kB,shell reference,sha1,1,10.551,10.97,1.0
38+
36,flat_32k_32kB,dirhash_impl,sha1,1,4.663,4.76,2.304621848739496
39+
37,flat_32k_32kB,dirhash_impl,sha1,2,3.108,3.235,3.3910355486862445
40+
38,flat_32k_32kB,dirhash_impl,sha1,4,2.342,2.361,4.6463362981787375
41+
39,flat_32k_32kB,dirhash_impl,sha1,8,2.071,2.094,5.2387774594078325
42+
40,nested_1k_1MB,shell reference,sha1,1,2.11,2.159,1.0
43+
41,nested_1k_1MB,dirhash_impl,sha1,1,1.436,1.47,1.4687074829931972
44+
42,nested_1k_1MB,dirhash_impl,sha1,2,0.925,0.937,2.3041622198505864
45+
43,nested_1k_1MB,dirhash_impl,sha1,4,0.627,0.643,3.357698289269051
46+
44,nested_1k_1MB,dirhash_impl,sha1,8,0.516,0.527,4.096774193548386
47+
45,nested_32k_32kB,shell reference,sha1,1,3.982,7.147,1.0
48+
46,nested_32k_32kB,dirhash_impl,sha1,1,4.114,4.156,1.7196823869104911
49+
47,nested_32k_32kB,dirhash_impl,sha1,2,2.598,2.616,2.7320336391437308
50+
48,nested_32k_32kB,dirhash_impl,sha1,4,1.809,1.831,3.9033315128345167
51+
49,nested_32k_32kB,dirhash_impl,sha1,8,1.552,1.58,4.523417721518987
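
Judging from the numbers, the `speed-up (median)` column appears to be the shell reference's `t_median` divided by each row's `t_median` for the same test case and algorithm. That reading is inferred from the data, not taken from `benchmark/run.py`. A small sketch that recomputes the column for a few rows copied from the file above (the helper name `recompute_speedups` is hypothetical):

```python
import csv
import io

# Rows copied verbatim from benchmark/results_v0.2.0.csv (md5, flat_8_128MB).
SAMPLE = """\
,test_case,implementation,algorithm,workers,t_best,t_median,speed-up (median)
0,flat_8_128MB,shell reference,md5,1,2.079,2.083,1.0
1,flat_8_128MB,dirhash_impl,md5,1,1.734,1.945,1.0709511568123393
4,flat_8_128MB,dirhash_impl,md5,8,0.504,0.518,4.021235521235521
"""


def recompute_speedups(csv_text):
    """Return (test_case, implementation, workers, speed_up) per row.

    Assumes speed_up = reference t_median / row t_median, where the
    reference is the 'shell reference' row with the same test case
    and algorithm.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    reference = {
        (r["test_case"], r["algorithm"]): float(r["t_median"])
        for r in rows
        if r["implementation"] == "shell reference"
    }
    result = []
    for r in rows:
        ref = reference[(r["test_case"], r["algorithm"])]
        speed_up = ref / float(r["t_median"])
        result.append(
            (r["test_case"], r["implementation"], int(r["workers"]), speed_up)
        )
    return result


for row in recompute_speedups(SAMPLE):
    print(row)
```

For these rows the recomputed values agree with the stored column to within floating-point rounding.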
