Commit 9254492

WeichenXu123 authored and mengxr committed
[SPARK-22666][ML][SQL] Spark datasource for image format
## What changes were proposed in this pull request?

Implement an image schema datasource.

This image datasource supports:
- partition discovery (loading partitioned images)
- dropImageFailures (the same behavior as `ImageSchema.readImage`)
- path wildcard matching (the same behavior as `ImageSchema.readImage`)
- loading recursively from a directory (different from `ImageSchema.readImage`; use a path such as `/path/to/dir/**`)

This datasource does **NOT** support:
- specifying `numPartitions` (it is determined by the datasource automatically)
- sampling (you can use `df.sample` later, but the sampling operator won't be pushed down to the datasource)

## How was this patch tested?

Unit tests.

## Benchmark

I benchmarked the old `ImageSchema.read` API against my image datasource and compared the time cost.

**cluster**: 4 nodes, each with 64GB memory and an 8-core CPU

**test dataset**: Flickr8k_Dataset (about 8091 images)

**time cost**:
- My image datasource (automatically generates 258 partitions): 38.04s
- `ImageSchema.read` (set to 16 partitions): 68.4s
- `ImageSchema.read` (set to 258 partitions): 90.6s

**time cost when doubling the number of images (clone Flickr8k_Dataset and load twice as many images)**:
- My image datasource (automatically generates 515 partitions): 95.4s
- `ImageSchema.read` (set to 32 partitions): 109s
- `ImageSchema.read` (set to 515 partitions): 105s

So we can see that my image datasource implementation (this PR) brings some performance improvement over the old `ImageSchema.read` API.

Closes apache#22328 from WeichenXu123/image_datasource.

Authored-by: WeichenXu <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>
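As a rough illustration of the features listed in the commit message, loading images through the new datasource might look like the sketch below. It assumes an existing PySpark `SparkSession` is passed in as `spark`. The helper name `load_images` is hypothetical, and the option name `dropImageFailures` is taken from the commit message rather than checked against any released Spark version.

```python
def load_images(spark, path, drop_image_failures=True):
    """Load images via the "image" datasource described in this commit.

    Assumptions (not confirmed by the commit page itself):
    - `spark` is an existing pyspark.sql.SparkSession.
    - the datasource is registered under the short name "image".
    - the option is spelled "dropImageFailures", as in the commit message.
    """
    return (
        spark.read.format("image")
        .option("dropImageFailures", drop_image_failures)
        .load(path)
    )


# Possible usage, given a running SparkSession named `spark`:
#   df = load_images(spark, "/path/to/dir/**")   # recursive load via "**"
#   sampled = df.sample(fraction=0.1)            # sampling runs after the read;
#                                                # it is not pushed down
```

There is no `numPartitions` argument because, per the commit message, the datasource chooses the partition count itself.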
1 parent c66eef8 commit 9254492

File tree

27 files changed

+323 -4 lines changed

data/mllib/images/origin/license.txt

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+The images in the folder "kittens" are under the creative commons CC0 license, or no rights reserved:
+https://creativecommons.org/share-your-work/public-domain/cc0/
+The images are taken from:
+https://ccsearch.creativecommons.org/image/detail/WZnbJSJ2-dzIDiuUUdto3Q==
+https://ccsearch.creativecommons.org/image/detail/_TlKu_rm_QrWlR0zthQTXA==
+https://ccsearch.creativecommons.org/image/detail/OPNnHJb6q37rSZ5o_L5JHQ==
+https://ccsearch.creativecommons.org/image/detail/B2CVP_j5KjwZm7UAVJ3Hvw==
+
+The chr30.4.184.jpg and grayscale.jpg images are also under the CC0 license, taken from:
+https://ccsearch.creativecommons.org/image/detail/8eO_qqotBfEm2UYxirLntw==
+
+The image under "multi-channel" directory is under the CC BY-SA 4.0 license cropped from:
+https://en.wikipedia.org/wiki/Alpha_compositing#/media/File:Hue_alpha_falloff.png
