Commit feab2fb

Improvements from @astrofrog
1 parent f42a01f commit feab2fb

1 file changed

+81
-33
lines changed

finance/proposal-calls/cycle3/aperio_fits_dask.md

@@ -1,5 +1,5 @@
11
### Title
2-
Improve FITS compressed image support and Dask integration in `io.fits`.
2+
Improve FITS compressed image performance and Dask integration in `io.fits`.
33

44
### Project Team
55
Stuart Mumford
@@ -19,44 +19,81 @@ This is highly CPU and memory inefficient, and is one of the reasons Astropy is
1919
significantly slower at loading these types of files than the `cfitsio` package
2020
(which uses the same C library as Astropy for this currently).
2121

22-
For example we can compare the loading times of `io.fits` vs `cfitsio`:
22+
For example we can compare the loading times of `io.fits` vs [`fitsio`](https://github.com/esheldon/fitsio):
2323

24-
Loading a whole array with astropy:
24+
<details>
25+
<summary>Setup Code</summary>
2526

27+
```python
28+
import os
29+
from pathlib import Path
30+
31+
import fitsio
32+
from astropy.io import fits
33+
import astropy.units as u
34+
35+
filename = "VISP_2022_06_17T19_17_52_516_00630205_U_BLQRA_L1.fits"
36+
path = Path("~/dkist_data/BLQRA").expanduser() / filename
37+
38+
fio = fitsio.FITS(path)
39+
```
40+
41+
</details>
42+
43+
```python
44+
# Filesize
45+
print((os.stat(path).st_size * u.byte).to(u.Mibyte))
46+
```
2647
```
27-
In [44]: %timeit fits.getdata("VBI_L1_00656282_2018_05_11T14_25_05_466665_I.fits", hdu=1)
28-
183 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
48+
2.04071044921875 Mibyte
2949
```
3050

31-
with cfitsio:
51+
```python
52+
raw_hdul = fits.open(path, disable_image_compression=True)
53+
header = raw_hdul[1].header
54+
55+
print(f"{header['ZNAXIS1']=}")
56+
print(f"{header['ZNAXIS2']=}")
57+
print(f"{header['ZNAXIS3']=}")
58+
print(f"{header['ZTILE1']=}")
59+
print(f"{header['ZTILE2']=}")
60+
print(f"{header['ZTILE3']=}")
61+
```
3262
```
33-
In [45]: %timeit hdu[1][:, :]
34-
131 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
63+
header['ZNAXIS1']=2560
64+
header['ZNAXIS2']=1000
65+
header['ZNAXIS3']=1
66+
header['ZTILE1']=256
67+
header['ZTILE2']=256
68+
header['ZTILE3']=1
3569
```
3670

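As a rough illustration of what these header values imply, the number of tiles along each axis is the axis length divided by the tile length, rounded up. The sketch below uses only the values printed above; the variable and function names are illustrative and are not part of `io.fits`.

```python
import math

# Values copied from the compressed HDU header printed above.
image_shape = [2560, 1000, 1]  # ZNAXIS1, ZNAXIS2, ZNAXIS3
tile_shape = [256, 256, 1]     # ZTILE1, ZTILE2, ZTILE3


def tiles_per_axis(naxis, ztile):
    """Number of tiles needed to cover one axis (edge tiles may be partial)."""
    return math.ceil(naxis / ztile)


counts = [tiles_per_axis(n, t) for n, t in zip(image_shape, tile_shape)]
total_tiles = math.prod(counts)
print(counts, total_tiles)
```

With these values the image is split into 10 x 4 x 1 = 40 tiles. Since 1000 is not a multiple of 256, the last row of tiles covers only 232 pixels along the second axis.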
37-
loading one or more individual tiles with `cfitsio`:
3871
```
39-
In [48]: %timeit hdu[1][0, 0]
40-
21.6 µs ± 191 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
72+
# Time loading the whole array with astropy
73+
%timeit fits.getdata(path, hdu=1)
74+
75.4 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
4175
42-
In [49]: %timeit hdu[1][0, :]
43-
22.3 µs ± 78.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
76+
# Time loading the whole array with fitsio
77+
%timeit fio[1][:,:,:]
78+
25.6 ms ± 311 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
4479
45-
In [50]: %timeit hdu[1][:10, :]
46-
235 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
80+
# Time loading a single chunk
81+
%timeit fio[1][0,0,0]
82+
24.6 µs ± 757 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
4783
```
4884

49-
note that doing the same operations with astropy currently loads the whole file
50-
so takes the same amount of time as the first `getdata` call. This means that
51-
loading a single tile with astropy is approximately 8300x slower than with
52-
`cfitsio`!!
85+
Note that opening the file with astropy currently loads the whole array,
86+
so it takes the same amount of time as the first `getdata` call.
5387

5488
Issue [#3895](https://github.com/astropy/astropy/issues/3895) has been open
5589
since 2015 and describes a set of deep improvements for the compressed image
5690
support in `io.fits`. This includes implementing the compression natively in
57-
Python rather than using the `fitsio` C library to do it. This will have the
91+
Python rather than using the `cfitsio` C library to do it. This will have the
5892
side effect of significantly reducing the compile time complexity of Astropy, as
59-
this is the only part of the code which uses `fitsio`.
93+
the bundled cfitsio library could then be removed from the core package (as it
94+
was only used for the compression). We expect that we would address all of the
95+
points in this issue during development of a native Python tile [de]compression
96+
package.
6097

6198
Some relevant issues:
6299

@@ -67,30 +104,41 @@ Some relevant issues:
67104

68105
#### Dask Integration with reading FITS files
69106

70-
While this is one of the largest performance issues in `io.fits`, this project
71-
is also proposing to increase the integration of Dask with `io.fits` which will
72-
enable significant performance improvements when using FITS files with dask.
107+
While handling compression without loading the whole array is one of the largest
108+
performance issues in `io.fits`, this project is also proposing to increase the
109+
integration of Dask with `io.fits` which will enable significant performance
110+
improvements when using FITS files with Dask.
73111

74-
This proposal is focusing on images in FITS files, so both uncompressed images
75-
and compressed images (which are stored in binary tables underneath), and
112+
This proposal is focusing on image HDUs in FITS files, so both uncompressed
113+
images and compressed images (which are stored in binary tables underneath), and
76114
proposes to add an option to `io.fits` to read both these types of FITS arrays
77115
directly into Dask arrays.
78116

79-
While the proposal team has significant experience with reading various data
117+
The proposal team has significant experience with reading various data
80118
formats into Dask arrays, for example FITS images and CASA images and tables.
81119
A proportion of the development time for this section of the proposal will be
82120
devoted to researching the most effective method of loading FITS files into Dask
83121
arrays.
84122

85123
Currently it is
86-
[possible](https://github.com/sunpy/sunpy/issues/2715#issuecomment-413286821) to
87-
load an image into a Dask array, via the "delayed" functionality in Dask. In
88-
this case, the file is opened when reading a chunk of data from the array, and
89-
then closed again afterwards.
124+
[possible](https://github.com/sunpy/sunpy/issues/2715#issuecomment-413286821)
125+
but not trivial to load an image from a FITS file into a Dask array, via the
126+
"delayed"
127+
functionality in Dask. In this case, the file is opened when reading a chunk of
128+
data from the array, and then closed again afterwards.
90129
This approach works well for a lot of use cases, but is complex; it would be a
91-
lot better if this were integrated into `io.fits` directly.
130+
lot better if this were integrated into `io.fits` directly. For example, something akin to:
131+
132+
```python
133+
hdulist = fits.open(filepath, use_dask=True)
134+
hdulist[0].data
135+
```
136+
```
137+
dask.array<reshape, shape=(4, 490, 1000, 2560), dtype=float64, chunksize=(1, 1, 1000, 2560), chunktype=numpy.ndarray>
138+
```
139+
92140

93-
For compressed images, Dask integration would allow you to process the
141+
For compressed images, Dask integration would allow users to process the
94142
compressed chunks of the image in parallel (either on a single machine or
95143
distributed): if each compressed tile of the image is a Dask chunk, then the
96144
tiles can be processed in parallel using the various Dask schedulers.
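
The per-tile parallelism described above can be sketched without Dask or a FITS file. This is a minimal illustration under stated assumptions, not the proposed implementation: the standard library's `ThreadPoolExecutor` stands in for a Dask scheduler, `zlib` stands in for the FITS tile compression algorithms, and the tile shape and helper names (`make_compressed_tiles`, `decompress_tile`, `decompress_all`) are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tile size (rows, columns), one byte per pixel.
TILE_SHAPE = (4, 8)


def make_compressed_tiles(n_tiles):
    """Build stand-in compressed tiles; a real file would supply these bytes."""
    n_pixels = TILE_SHAPE[0] * TILE_SHAPE[1]
    return [zlib.compress(bytes([i % 256]) * n_pixels) for i in range(n_tiles)]


def decompress_tile(blob):
    """Decompress one tile independently of all the others."""
    return zlib.decompress(blob)


def decompress_all(blobs, max_workers=4):
    """Decompress tiles in parallel; map() preserves the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(decompress_tile, blobs))


compressed = make_compressed_tiles(40)
tiles = decompress_all(compressed)
print(len(tiles), len(tiles[0]))
```

Because each tile is decompressed without reference to any other tile, the same per-tile function could in principle be handed to a Dask scheduler, with one task per tile.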
