
Memory Bug #71

@raevillena

Description


I am having a memory issue with my setup. Everything works, except that training on larger data crashes the Jupyter notebook kernel.

System: desktop

Ubuntu 22.04 on WSL2
Host: Windows 11
32 GB RAM
AMD Ryzen 7 5700X CPU
Intel Arc A750 8GB

Setup:
miniconda3, in an itex environment

# pip list |grep tensorflow
intel_extension_for_tensorflow     2.15.0.0
intel_extension_for_tensorflow_lib 2.15.0.0.2
tensorflow                         2.15.0
tensorflow-datasets                4.9.3
tensorflow-estimator               2.15.0
tensorflow-io-gcs-filesystem       0.37.0
tensorflow-metadata                1.15.0

Running model.fit with the training data results in the following (especially with VGG; ResNet works fine):

2024-06-27 16:10:04.010287: I external/tsl/tsl/framework/bfc_allocator.cc:1122] Sum Total of in-use chunks: 513.70MiB
2024-06-27 16:10:04.010290: I external/tsl/tsl/framework/bfc_allocator.cc:1124] Total bytes in pool: 982550528 memory_limit_: 7487535513 available bytes: 6504984985 curr_region_allocation_bytes_: 14975071232
2024-06-27 16:10:04.010295: I external/tsl/tsl/framework/bfc_allocator.cc:1129] Stats:
Limit:                      7487535513
InUse:                       538648576
MaxInUse:                    956967680
NumAllocs:                         297
MaxAllocSize:                485714176
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

It crashes no matter what I do, at the point where it tries to allocate that 14 GB in curr_region_allocation_bytes_.

Global memory shows:

# clinfo | grep "Global memory size"
Global memory size                              16723046400 (15.57GiB)
Global memory size                              8319483904 (7.748GiB)
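As a side note (my own arithmetic, not something from the log): the allocator's memory_limit_ of 7487535513 bytes appears to be exactly 90% of the Arc A750's 8319483904-byte global memory reported by clinfo, which would be consistent with TensorFlow-style reservation of a fixed fraction of device memory. A quick check:

```python
# Sanity check: is memory_limit_ just 90% of the device's global memory?
# (The 0.9 fraction is my assumption, based on TF's usual default.)
arc_global_mem = 8_319_483_904   # second "Global memory size" line from clinfo
memory_limit = 7_487_535_513     # memory_limit_ from the BFC allocator log

print(int(arc_global_mem * 0.9) == memory_limit)  # True
```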

By the way, my version of ITEX didn't come with check_env.sh, so I can't run that. I only know the setup works because sometimes it does and sometimes it doesn't.

In Jupyter the device is recognized as:

1 Physical GPUs, [LogicalDevice(name='/device:XPU:0', device_type='XPU')]

Also, the other setups I can read about with BFC allocator issues use the allocator that ships with TensorFlow itself, while mine comes from the ITEX build files.

I can see that the repo is available for rebuilding, and there might be a chance to find out what is happening there, but I don't have the time or the ability to do so.

I just want to know what I am missing here, since it was able to allocate almost 8 GB of memory but was unable to expand beyond it.

I also tried exporting this in the conda environment, with no effect:
export ITEX_LIMIT_MEMORY_SIZE_IN_MB=4096
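One thing worth checking (a sketch, assuming ITEX reads this variable when the library is first loaded): the variable has to be present in the kernel process's environment before tensorflow / intel_extension_for_tensorflow is imported. An export in a shell that did not launch the Jupyter server never reaches an already-running kernel, which could explain the "no effect":

```python
import os

# Set the limit in the kernel's own environment, in the first notebook
# cell, *before* any TensorFlow/ITEX import. (Assumption: ITEX reads
# this variable at library-load time, as the export workflow implies.)
os.environ["ITEX_LIMIT_MEMORY_SIZE_IN_MB"] = "4096"

# import tensorflow as tf  # only import after the variable is set
```

Alternatively, restart the Jupyter server from a shell where the variable is already exported, so every kernel inherits it.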

I said earlier that it works: yes, I can train a ResNet model blazingly fast compared to a Tesla T4 in Colab, but running it twice gives the memory error.

What is consistent is that it tries to allocate that curr_region_allocation_bytes_: 14975071232.
That value was very consistent, and I don't know why. It makes sense that the OOM happens with that, but why allocate 14 GB when TF doesn't even need that much for the current workload?
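A back-of-the-envelope check (my own arithmetic, not taken from the ITEX sources) suggests where that "very consistent" number comes from: 14975071232 is exactly twice the 7487535513-byte memory_limit_, rounded up to a 256-byte boundary. That would match a BFC-style allocator doubling its region size when it grows, regardless of how much the current workload actually needs:

```python
# Check: curr_region_allocation_bytes_ == 2 * memory_limit_, rounded up
# to 256-byte alignment. (The doubling-on-growth behavior is my reading
# of how BFC-style allocators extend their pool, not a confirmed cause.)
memory_limit = 7_487_535_513    # memory_limit_ from the log
curr_region = 14_975_071_232    # curr_region_allocation_bytes_ from the log

def round_up(n: int, align: int = 256) -> int:
    """Round n up to the next multiple of align."""
    return ((n + align - 1) // align) * align

print(round_up(2 * memory_limit) == curr_region)  # True
```

If that reading is right, the crash is the doubled region request (~14 GB) exceeding the Arc's ~7.7 GB of global memory, not the ~0.5 GB actually in use.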

Metadata

Labels: aitce, question (Further information is requested)
