Description
This is a CPU bug: it occurs in gpumode 0 (the factored CPU version) in galaxy.py.
The run crashes because an excessive amount of memory is requested, e.g.
File "/global/cfs/cdirs/desi/users/cdwarner/code/Tractor/legacypipe/py/legacypipe/image.py", line 2299, in getFourierTransform
fft, (cx,cy), shape, (v,w) = super().getFourierTransform(px, py, radius)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "tractor/psf.py", line 333, in tractor.psf.PixelizedPSF.getFourierTransform
File "tractor/psf.py", line 308, in tractor.psf.PixelizedPSF._padInImage
numpy._core._exceptions._ArrayMemoryError: Unable to allocate 4.00 EiB for an array with shape (1073741824, 1073741824) and data type float32
I traced this down to bad values of px, py, and halfsize in galaxy.py:
px0=508422395.67614156 py0=145391495.9114129
px=508422395.67614156 py=145391495.9114129 halfsize=np.float64(508422204.67614156)
H=1073741824 W=1073741824
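As a sanity check on the numbers above, the reported allocation size follows directly from H and W. This is a small sketch, assuming (it is not confirmed here) that the padded size is the next power of two at or above 2 * halfsize; the 4-byte element size comes from the float32 dtype in the error message.

```python
import math

# halfsize as printed in the debug output above
halfsize = 508422204.67614156

# Assumption: the padded side length is the next power of two >= 2 * halfsize.
side = 2 ** math.ceil(math.log2(2 * halfsize))

# float32 uses 4 bytes per element
n_bytes = side * side * 4

print(side)              # 1073741824, matching H and W
print(n_bytes / 2**60)   # 4.0, i.e. 4 EiB
```

So the allocation failure is a downstream symptom: any halfsize in the hundreds of millions would produce a request of this scale.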
It seems to originate in radec2pixelxy, called from positionToPixel in class ConstantFitsWcs(ParamList, ducks.WCS):
def positionToPixel(self, pos, src=None):
    '''
    Converts an :class:`tractor.RaDecPos` to a pixel position.
    Returns: tuple of floats ``(x, y)``
    '''
    X = self.wcs.radec2pixelxy(pos.ra, pos.dec)
    # handle X = (ok,x,y) and X = (x,y) return values
    if len(X) == 3:
        ok, x, y = X
    else:
        assert(len(X) == 2)
        x, y = X
    print(f'PTP3 {x=} {y=} {pos.ra=} {pos.dec=}')
    # MAGIC: subtract 1 to convert from FITS to zero-indexed pixels.
    return x - 1 - self.x0, y - 1 - self.y0
At the moment I have this fix in galaxy.py:
if halfsize > 32768:
    print(f"WARNING: Bad positionToPixel results {px=} {py=} {halfsize=}")
    return None
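One caveat with a plain `halfsize > 32768` test is that it would not catch NaN, because any comparison with NaN is False. A slightly stricter variant might look like the sketch below. The helper name `patch_geometry_is_sane` and its signature are hypothetical and not part of tractor; only the 32768 threshold is taken from the workaround above.

```python
import math

MAX_HALFSIZE = 32768  # same threshold as the workaround above


def patch_geometry_is_sane(px, py, halfsize, max_halfsize=MAX_HALFSIZE):
    """Hypothetical helper: True only if all values are finite and
    halfsize lies within [0, max_halfsize]."""
    values = (float(px), float(py), float(halfsize))
    # Reject NaN and +/-inf explicitly; a bare '>' comparison lets NaN through.
    if not all(math.isfinite(v) for v in values):
        return False
    return 0 <= values[2] <= max_halfsize
```

With the values from the first crash (px=508422395.67614156, py=145391495.9114129, halfsize=508422204.67614156) this returns False, so the caller could warn and return None as in the current fix.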
However, reproducing this error has proved troublesome.
[container] cdwarner@nid001112:/global/cfs/cdirs/desi/users/cdwarner/code/Tractor/legacypipe/bin$ grep positionToPixel $SCRATCH/dr11-gpu-*/logs/*/*.log
/pscratch/sd/c/cdwarner/dr11-gpu-test-cw-gpu0only-redux/logs/014/0145p340.log:WARNING: Bad positionToPixel results px=-29740530052122.24 py=-1271579979548.0696 halfsize=29740530052261.24
/pscratch/sd/c/cdwarner/dr11-gpu-test-cw-gpu0only/logs/014/0145p340.log:WARNING: Bad positionToPixel results px=-29740530052122.24 py=-1271579979548.0696 halfsize=29740530052261.24
I can get it to occur consistently in 0145p340, but only when running with --gpumode 0 and with the current versions of tractor and legacypipe in my environment. It does not occur when running in the image legacypipe:gpu-1.4.3 or legacypipe-gpu:1.4.4, even though both are up to date with legacypipe branch gpu-powered and tractor branch craig_factored_merge. Even when I point PYTHONPATH at my own versions of legacypipe and tractor inside the container, I still can't reproduce the error there.
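Since the difference seems to be environmental, one way to narrow it down might be to dump which module files and versions are actually imported, once inside the container and once outside, and diff the two outputs. The sketch below uses only the standard library. The function name `describe_modules` is hypothetical, and the module names in the usage comment are a guess at what is relevant here.

```python
import importlib
import importlib.metadata


def describe_modules(names):
    """Hypothetical helper: map each module name to the file it was
    imported from and its version, or None if it cannot be imported."""
    report = {}
    for name in names:
        try:
            mod = importlib.import_module(name)
        except ImportError:
            report[name] = None  # not importable in this environment
            continue
        try:
            version = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            # Fall back to a module-level attribute, if there is one.
            version = getattr(mod, "__version__", None)
        report[name] = {
            "file": getattr(mod, "__file__", None),
            "version": version,
        }
    return report


# Example usage (module list is an assumption):
#   import json
#   mods = ["numpy", "scipy", "astrometry", "tractor", "legacypipe"]
#   print(json.dumps(describe_modules(mods), indent=2, sort_keys=True))
```

Comparing the "file" entries would also show whether the PYTHONPATH override is really picking up the intended tractor and legacypipe checkouts inside the container.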