Has anyone else encountered the following "CUDA_ERROR_ILLEGAL_ADDRESS" error?
I switched from multiprocessing to a single process, but the same problem occurs.
My GPU is a GeForce RTX 2080 8GB (driver: 440.33.01), and my software versions are:
tensorflow: 1.12.0
cuda: 9.0
cudnn: 7.5.0
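In case it helps others compare setups, here is a minimal sketch for printing the Python and TensorFlow versions from inside the same virtualenv. The helper name `collect_versions` is only for illustration and is not part of the repository. It does not report the CUDA or cuDNN versions, which I read off the installation separately.

```python
import platform


def collect_versions():
    """Return a dict of version strings relevant to this report."""
    info = {"python": platform.python_version()}
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not importable in this environment.
        info["tensorflow"] = "not installed"
        return info
    info["tensorflow"] = tf.__version__
    # True when the installed TensorFlow binary was compiled with CUDA support.
    info["built_with_cuda"] = str(tf.test.is_built_with_cuda())
    return info


for name, value in collect_versions().items():
    print("{}: {}".format(name, value))
```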
The training command is:
CUDA_VISIBLE_DEVICES=0 python main.py \
    --model_name=model_roerich \
    --batch_size=1 \
    --phase=train \
    --image_size=768 \
    --lr=0.0002 \
    --dsr=0.8 \
    --ptcd=./data/Places2/data_large \
    --ptad=./data/artist/nicholas-roerich
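For anyone trying to reproduce this, here is a rough sketch that assembles the same command as an argument list. The flags are the ones from my command above. The helper names (`build_train_command`, `build_env`) are made up for illustration, and `CUDA_LAUNCH_BLOCKING=1` is an addition that was not in my original run: it is the standard CUDA environment variable that makes kernel launches synchronous, which may help the failure surface closer to the operation that triggers it.

```python
import os
import shlex

# Flags copied from the training command in this report.
TRAIN_FLAGS = [
    ("model_name", "model_roerich"),
    ("batch_size", "1"),
    ("phase", "train"),
    ("image_size", "768"),
    ("lr", "0.0002"),
    ("dsr", "0.8"),
    ("ptcd", "./data/Places2/data_large"),
    ("ptad", "./data/artist/nicholas-roerich"),
]


def build_train_command(flags=TRAIN_FLAGS):
    """Return the training command as a list suitable for subprocess."""
    return ["python", "main.py"] + ["--{}={}".format(k, v) for k, v in flags]


def build_env(base_env=None):
    """Copy the environment, pin GPU 0, and force synchronous launches."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = "0"
    # Assumption: added for debugging only, not part of the original command.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


cmd = build_train_command()
print(" ".join(shlex.quote(part) for part in cmd))
# To actually launch the run from the repository root:
#   import subprocess
#   subprocess.run(cmd, env=build_env(), check=True)
```

Synchronous launches slow training down considerably, so this is only meant for narrowing down where the illegal access happens.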
Finally, the error messages are:
tensorflow::CurrentStackTrace()
stream_executor::cuda::CUDADriver::SynchronizeContext(stream_executor::cuda::CudaContext*)
stream_executor::StreamExecutor::SynchronizeAllActivity()
tensorflow::GPUUtil::SyncAll(tensorflow::Device*)
tensorflow::BaseGPUDevice::Sync()
Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int)
std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
clone
*** End stack trace ***
2020-01-26 00:10:03.045956: E tensorflow/stream_executor/event.cc:34] error destroying CUDA event in context 0x5402c10: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/username/.local/share/virtualenvs/adaptive-style-transfer-PbxNnQ9W/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/username/.local/share/virtualenvs/adaptive-style-transfer-PbxNnQ9W/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/username/.local/share/virtualenvs/adaptive-style-transfer-PbxNnQ9W/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: GPU sync failed