
Commit 5e14400

Merge pull request #275 from Cadair/aperio_fits
Phase 3 Funding: Improving Compressed Image performance and Dask integration in `io.fits`.
2 parents fac1575 + abb7ad2 commit 5e14400

1 file changed: 178 additions, 0 deletions

### Title
Improve FITS compressed image performance and Dask integration in `io.fits`.

### Project Team
Stuart Mumford
Thomas Robitaille
Drew Leonard

### Project Description

This project aims to increase the performance of `io.fits`, especially around
large compressed image data.

#### Compressed Image Performance

Currently, when using tiled FITS compression, the Astropy implementation loads
all the tiles into RAM and decompresses them all.
This is highly CPU- and memory-inefficient, and is one of the reasons Astropy is
significantly slower at loading these types of files than the `fitsio` package
(which currently uses the same C library as Astropy for this).

For example, we can compare the loading times of `io.fits` vs [`fitsio`](https://github.com/esheldon/fitsio):

<details>
<summary>Setup Code</summary>

```python
import os
from pathlib import Path

import fitsio
from astropy.io import fits
import astropy.units as u

filename = "VISP_2022_06_17T19_17_52_516_00630205_U_BLQRA_L1.fits"
path = Path("~/dkist_data/BLQRA").expanduser() / filename

fio = fitsio.FITS(path)
```

</details>

```python
# Filesize
print((os.stat(path).st_size * u.byte).to(u.Mibyte))
```
```
2.04071044921875 Mibyte
```

```python
raw_hdul = fits.open(path, disable_image_compression=True)
header = raw_hdul[1].header

print(f"{header['ZNAXIS1']=}")
print(f"{header['ZNAXIS2']=}")
print(f"{header['ZNAXIS3']=}")
print(f"{header['ZTILE1']=}")
print(f"{header['ZTILE2']=}")
print(f"{header['ZTILE3']=}")
```
```
header['ZNAXIS1']=2560
header['ZNAXIS2']=1000
header['ZNAXIS3']=1
header['ZTILE1']=256
header['ZTILE2']=256
header['ZTILE3']=1
```

In other words, the 2560 × 1000 image is stored as a 10 × 4 grid of 256 × 256
pixel tiles, so reading a small cutout should only require decompressing the few
tiles it overlaps rather than the whole array.

```
# Time loading the whole array with astropy
%timeit fits.getdata(path, hdu=1)
75.4 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Time loading the whole array with fitsio
%timeit fio[1][:,:,:]
25.6 ms ± 311 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Time loading a single chunk
%timeit fio[1][0,0,0]
24.6 µs ± 757 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

Note that opening the file with astropy currently loads the whole array, so it
takes the same amount of time as the first `getdata` call.

Issue [#3895](https://github.com/astropy/astropy/issues/3895) has been open
since 2015 and describes a set of deep improvements for the compressed image
support in `io.fits`. This includes implementing the compression natively in
Python rather than using the `cfitsio` C library to do it. This would have the
side effect of significantly reducing the compile-time complexity of Astropy, as
the bundled cfitsio library could then be removed from the core package (it is
only used for the compression). We expect to address all of the points in this
issue during development of a tile [de]compression package independent of
cfitsio.

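As a rough illustration of the tile-by-tile approach (not the proposed
implementation), the sketch below pulls a single tile out of the underlying
binary table, reusing `path` from the setup code above, and decompresses it with
the standard library. It assumes GZIP_1-compressed tiles stored in the standard
`COMPRESSED_DATA` column, an integer `ZBITPIX`, and a full-sized tile; a real
implementation (and RICE_1-compressed files) would need the corresponding codecs
and edge-tile handling.

```python
import gzip

import numpy as np
from astropy.io import fits

# Open the file with the compressed-image machinery disabled so that the
# underlying binary table (one row per compressed tile) is exposed directly.
with fits.open(path, disable_image_compression=True) as raw_hdul:
    hdu = raw_hdul[1]
    header = hdu.header

    # Tile geometry and pixel type from the compression keywords
    # (assumes an integer ZBITPIX; floating-point data needs different handling).
    tile_shape = (header["ZTILE2"], header["ZTILE1"])
    dtype = np.dtype(f">i{header['ZBITPIX'] // 8}")

    # Decompress only the first tile (row 0 of the table); no other tile is
    # read or decompressed. Assumes ZCMPTYPE == 'GZIP_1'; RICE_1, which is
    # common in practice, would need a Rice decoder instead of gzip.
    compressed = hdu.data["COMPRESSED_DATA"][0].tobytes()
    tile = np.frombuffer(gzip.decompress(compressed), dtype=dtype).reshape(tile_shape)
```
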
Some relevant issues:

* Compressed HDU enhancements [#3895](https://github.com/astropy/astropy/issues/3895)
* Writing compressed FITS files is slow when multi-threaded [#11633](https://github.com/astropy/astropy/issues/11633)
* Refactor FITS compressed image headers (CompImageHeader) [#9238](https://github.com/astropy/astropy/issues/9238)

#### Dask Integration for reading FITS files

While the inability to handle compression without loading the whole array is one
of the largest performance issues in `io.fits`, this project also proposes to
deepen the integration of Dask with `io.fits`, which will enable significant
performance improvements when using FITS files with Dask.

This proposal focuses on image HDUs in FITS files, i.e. both uncompressed
images and compressed images (which are stored in binary tables underneath), and
proposes to add an option to `io.fits` for reading both of these types of FITS
arrays directly into Dask arrays.

The proposal team has significant experience with reading various data
formats into Dask arrays, for example FITS images and CASA images and tables.
A proportion of the development time for this section of the proposal will be
devoted to researching the most effective method of loading FITS files into Dask
arrays.

Currently it is
[possible](https://github.com/sunpy/sunpy/issues/2715#issuecomment-413286821),
but not trivial, to load an image from a FITS file into a Dask array via the
"delayed" functionality in Dask. In this approach, the file is opened when a
chunk of data is read from the array and closed again afterwards.
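
A minimal sketch of that open-read-close pattern is shown below, for
illustration only: `filepath` is a placeholder for a FITS file with an image in
its primary HDU, the dtype is simply assumed rather than derived from `BITPIX`,
and the whole HDU is wrapped as a single delayed chunk, whereas a real loader
would create one delayed read per chunk.

```python
import dask.array as da
import numpy as np
from dask import delayed

from astropy.io import fits


def read_data(filepath, hdu=0):
    # Open the file, read the array for this HDU, then close the file again.
    with fits.open(filepath) as hdul:
        return hdul[hdu].data


# The shape (and dtype) must be known up front, because the delayed read is
# only executed when part of the Dask array is actually computed.
with fits.open(filepath) as hdul:
    shape = hdul[0].shape
    dtype = np.float64  # assumed here; in practice derive this from BITPIX

arr = da.from_delayed(delayed(read_data)(filepath), shape=shape, dtype=dtype)

# No pixel data is read until a computation is triggered:
arr.mean().compute()
```
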
This approach works well for a lot of use cases, but it is complex; it would be
much better if this were integrated into `io.fits` directly. For example,
something akin to:

```python
hdulist = fits.open(filepath, use_dask=True)
hdulist
```
```
[<astropy.io.fits.hdu.image.DaskPrimaryHDU object at 0x7f1824914c10>, <astropy.io.fits.hdu.table.DaskCompImageHDU object at 0x7f1824914ee0>]
```
```python
hdulist[1].data
```
```
dask.array<reshape, shape=(4, 490, 1000, 2560), dtype=float64, chunksize=(1, 1, 1000, 2560), chunktype=numpy.ndarray>
```

For compressed images, Dask integration would allow users to process the
compressed chunks of the image in parallel (either on a single machine or
distributed): if each compressed tile of the image is a Dask chunk, it can be
parallelised over using the various Dask schedulers.

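As a rough sketch of what this enables, the example below builds a random
stand-in for an array whose Dask chunks match the 256 × 256 compression tiles
from the example above (the array construction is a placeholder, not the
proposed loader); any per-tile operation can then be scheduled in parallel.

```python
import dask.array as da
import numpy as np

# Stand-in for an array produced by the proposed loader: one Dask chunk per
# 256 x 256 compression tile (dimensions follow the example header above).
tiles = da.random.random((1000, 2560), chunks=(256, 256))


def subtract_tile_median(block):
    # Example per-tile operation; runs independently on each chunk/tile.
    return block - np.median(block)


# The work is spread across the tiles by whichever Dask scheduler is in use
# (threads here, but a distributed cluster works the same way).
detrended = tiles.map_blocks(subtract_tile_median, dtype=tiles.dtype)
result = detrended.compute(scheduler="threads")
```

With chunks aligned to the compression tiles, each task only ever needs to
decompress its own tile, which is what makes the distributed case attractive.
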
This proposal is to get an initial implementation of both of these read use
cases integrated into `io.fits`, and in the process to document any future
improvements that could be made to enhance performance.

### Approximate Budget

For part one, we would request a total of 150h @ $150/hour, broken up into:

* 100h for removing `CompImageHDU`'s dependency on cfitsio. This includes
  reimplementing or adapting the C implementation of the tile (de)compression
  algorithms, and modifying `CompImageHDU` to use the new astropy versions of
  these algorithms rather than passing all the data to `cfitsio` for loading.
* 50h for implementing a new lazy-loading property on `CompImageHDU` which
  allows users to load parts of the whole compressed image array without having
  to load and decompress the whole array (see the sketch after this list).

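Purely to make the second work item concrete, the snippet below shows one way
such lazy access might look, modelled on the `.section` attribute that already
exists on uncompressed image HDUs; the attribute name and semantics shown for
`CompImageHDU` are illustrative rather than a committed API.

```python
from astropy.io import fits

with fits.open(path) as hdul:
    comp_hdu = hdul[1]  # a CompImageHDU
    # Hypothetical lazy access: return a 256 x 256 cutout by decompressing
    # only the tiles that overlap the requested slice.
    cutout = comp_hdu.section[0, :256, :256]
```
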
For part two, we request another 100h @ $150/hour to implement prototype loaders
for FITS image HDUs making use of Dask. At the end of part two, we expect to have
at least one approach to efficiently loading FITS data into Dask, suitable for
large-scale parallelisation, and a plan for how these prototypes could be
improved upon and merged into astropy core.

Our minimal budget is the 150h requested for part one, which would mean leaving
all the Dask work to another proposal.

### Approved Budget
$22,500.00
