|
| 1 | +### Title |
| 2 | +Improve FITS compressed image performance and Dask integration in `io.fits`. |
| 3 | + |
| 4 | +### Project Team |
| 5 | +Stuart Mumford |
| 6 | +Thomas Robitaille |
| 7 | +Drew Leonard |
| 8 | + |
| 9 | +### Project Description |
| 10 | + |
| 11 | +This project aims to increase performance of `io.fits` especially around large |
| 12 | +compressed image data. |
| 13 | + |
| 14 | +#### Compressed Image Performance |
| 15 | + |
| 16 | +Currently when using tiled FITS compression the Astropy implementation loads |
| 17 | +all the tiles into RAM and decompresses them all. |
| 18 | +This is highly CPU and memory inefficient, and is one of the reasons Astropy is |
| 19 | +significantly slower at loading these types of files than the `cfitsio` package |
| 20 | +(which uses the same C library as Astropy for this currently). |
| 21 | + |
| 22 | +For example we can compare the loading times of `io.fits` vs [`fitsio`](https://github.com/esheldon/fitsio): |
| 23 | + |
| 24 | +<details> |
| 25 | +<summary>Setup Code</summary> |
| 26 | + |
| 27 | +```python |
| 28 | +import os |
| 29 | +from pathlib import Path |
| 30 | + |
| 31 | +import fitsio |
| 32 | +from astropy.io import fits |
| 33 | +import astropy.units as u |
| 34 | + |
| 35 | +filename = "VISP_2022_06_17T19_17_52_516_00630205_U_BLQRA_L1.fits" |
| 36 | +path = Path("~/dkist_data/BLQRA").expanduser() / filename |
| 37 | + |
| 38 | +fio = fitsio.FITS(path) |
| 39 | +``` |
| 40 | + |
| 41 | +</details> |
| 42 | + |
| 43 | +```python |
| 44 | +# Filesize |
| 45 | +print((os.stat(path).st_size * u.byte).to(u.Mibyte)) |
| 46 | +``` |
| 47 | +``` |
| 48 | +2.04071044921875 Mibyte |
| 49 | +``` |
| 50 | + |
| 51 | +```python |
| 52 | +raw_hdul = fits.open(path, disable_image_compression=True) |
| 53 | +header = raw_hdul[1].header |
| 54 | + |
| 55 | +print(f"{header['ZNAXIS1']=}") |
| 56 | +print(f"{header['ZNAXIS2']=}") |
| 57 | +print(f"{header['ZNAXIS3']=}") |
| 58 | +print(f"{header['ZTILE1']=}") |
| 59 | +print(f"{header['ZTILE2']=}") |
| 60 | +print(f"{header['ZTILE3']=}") |
| 61 | +``` |
| 62 | +``` |
| 63 | +header['ZNAXIS1']=2560 |
| 64 | +header['ZNAXIS2']=1000 |
| 65 | +header['ZNAXIS3']=1 |
| 66 | +header['ZTILE1']=256 |
| 67 | +header['ZTILE2']=256 |
| 68 | +header['ZTILE3']=1 |
| 69 | +``` |
| 70 | + |
| 71 | +``` |
| 72 | +# Time loading all the array with astropy |
| 73 | +%timeit fits.getdata(path, hdu=1) |
| 74 | +75.4 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) |
| 75 | +
|
| 76 | +# Time loading all the array with fitsio |
| 77 | +%timeit fio[1][:,:,:] |
| 78 | +25.6 ms ± 311 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) |
| 79 | +
|
| 80 | +# Time loading a single chunk |
| 81 | +%timeit fio[1][0,0,0] |
| 82 | +24.6 µs ± 757 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) |
| 83 | +``` |
| 84 | + |
| 85 | +note that opening the file with astropy currently loads the whole array |
| 86 | +so takes the same amount of time as the first `getdata` call. |
| 87 | + |
| 88 | +Issue [#3895](https://github.com/astropy/astropy/issues/3895) has been open |
| 89 | +since 2015 and describes a set of deep improvements for the compressed image |
| 90 | +support in `io.fits`. This includes implementing the compression natively in |
| 91 | +Python rather than using the `cfitsio` C library to do it. This will have the |
| 92 | +side effect of significantly reducing the compile time complexity of Astropy, as |
| 93 | +the bundled cfitsio library could then be removed from the core package (as it |
| 94 | +was only used for the compression). We expect that we would address all of the |
| 95 | +points in this issue during development of a tile [de]compression package |
| 96 | +independent of cfitsio. |
| 97 | + |
| 98 | +Some relevant issues: |
| 99 | + |
| 100 | +* Compressed HDU enhancements [#3895](https://github.com/astropy/astropy/issues/3895) |
| 101 | +* Writing compressed FITS files is slow when multi-threaded [#11633](https://github.com/astropy/astropy/issues/11633) |
| 102 | +* Refactor FITS compressed image headers (CompImageHeader) [#9238](https://github.com/astropy/astropy/issues/9238) |
| 103 | + |
| 104 | + |
| 105 | +#### Dask Integration with reading FITS files |
| 106 | + |
| 107 | +While handling compression without loading the whole array is one of the largest |
| 108 | +performance issues in `io.fits`, this project is also proposing to increase the |
| 109 | +integration of Dask with `io.fits` which will enable significant performance |
| 110 | +improvements when using FITS files with dask. |
| 111 | + |
| 112 | +This proposal is focusing on image HDUs in FITS files, so both uncompressed |
| 113 | +images and compressed images (which are stored in binary tables underneath), and |
| 114 | +proposes to add an option to `io.fits` to read both these types of FITS arrays |
| 115 | +directly into Dask arrays. |
| 116 | + |
| 117 | +The proposal team has significant experience with reading various data |
| 118 | +formats into Dask arrays, for example FITS images and CASA images and tables. |
| 119 | +A proportion of the development time for this section of the proposal will be |
| 120 | +devoted to researching the most effective method of loading FITS files into Dask |
| 121 | +arrays. |
| 122 | + |
| 123 | +Currently it is |
| 124 | +[possible](https://github.com/sunpy/sunpy/issues/2715#issuecomment-413286821) |
| 125 | +but not trivial to load an image from a FITS file into a Dask array, via the |
| 126 | +"delayed" |
| 127 | +functionality in Dask. In this case, the file is opened when reading a chunk of |
| 128 | +data from the array, and then closed again afterwards. |
| 129 | +This approach works well for a lot of use cases, but is complex, it would be a |
| 130 | +lot better if this were integrated into `io.fits` directly. For example, something akin to: |
| 131 | + |
| 132 | +```python |
| 133 | +hdulist = fits.open(filepath, use_dask=True) |
| 134 | +hdulist |
| 135 | +``` |
| 136 | +``` |
| 137 | +[<astropy.io.fits.hdu.image.DaskPrimaryHDU object at 0x7f1824914c10>, <astropy.io.fits.hdu.table.DaskCompImageHDU object at 0x7f1824914ee0>] |
| 138 | +``` |
| 139 | +```python |
| 140 | +hdulist[1].data |
| 141 | +``` |
| 142 | +``` |
| 143 | +dask.array<reshape, shape=(4, 490, 1000, 2560), dtype=float64, chunksize=(1, 1, 1000, 2560), chunktype=numpy.ndarray> |
| 144 | +``` |
| 145 | + |
| 146 | + |
| 147 | +For compressed images, Dask integration would allow users to process the |
| 148 | +compressed chunks of the image in parallel (either on a single machine or |
| 149 | +distributed), as if each compressed tile of the image was a dask chunk then it |
| 150 | +can be parallelised over using the various dask schedulers. |
| 151 | + |
| 152 | +This proposal is to get an initial implementation of both of these read use |
| 153 | +cases integrated into `io.fits`, and in the process to document any future |
| 154 | +improvements that could be made to enhance performance. |
| 155 | + |
| 156 | +### Approximate Budget |
| 157 | + |
| 158 | +For part one, we would request a total of 150h @ $150/hour broken up into: |
| 159 | + |
| 160 | +* 100h for removing the cfitsio dependency on `CompImageHDU`. This includes |
| 161 | + reimplementing or adapting the C implementation of the tile (de)compression |
| 162 | + algorithms, and modifying `CompImageHDU` to use the new astropy versions of |
| 163 | + these algorithms rather than parsing all the data to `cfitsio` for loading. |
| 164 | +* 50h for implementing a new lazy-loading property on `CompImageHDU` which |
| 165 | + allows users to load parts of the whole compressed image array without having |
| 166 | + to load and decompress the whole array. |
| 167 | + |
| 168 | +For part two, we request another 100h @ $150/hour to implement prototype loaders |
| 169 | +for FITS Image HDUs making use of Dask. At the end of part two we expect to have |
| 170 | +at least one approach to efficiently loading FITS data into Dask, suitable for |
| 171 | +large scale parallelisation, and a plan of how these prototypes could be |
| 172 | +improved upon and merged into astropy core. |
| 173 | + |
| 174 | +Our minimal budget is the 150h requested for part one, which would mean leaving |
| 175 | +all the Dask work to another proposal. |
| 176 | + |
| 177 | +### Approved Budget |
| 178 | +$22,500.00 |
0 commit comments