Skip to content

Data validation checker#214

Merged
dkazanc merged 13 commits intomainfrom
datavalidator
Jun 12, 2025
Merged

Data validation checker#214
dkazanc merged 13 commits intomainfrom
datavalidator

Conversation

@dkazanc
Copy link
Collaborator

@dkazanc dkazanc commented Jun 2, 2025

adding data validation modules, tests and correcting all functions in misc

@dkazanc dkazanc added the run-zenodo-tests Run Zenodo tests for each PR label Jun 4, 2025
@dkazanc dkazanc marked this pull request as ready for review June 4, 2025 14:51
@dkazanc dkazanc requested a review from yousefmoazzam June 4, 2025 15:23
Copy link
Contributor

@yousefmoazzam yousefmoazzam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me!

If it hasn't been done so already, I'd recommend running the memory hook tests in httomo-backends against these changes, to double-check that memory estimation of methods which use the new validation function hasn't been affected.

Another thing to mention is that httomolibgpu has a .pre-commit-config.yaml which has black formatting setup, so I'm assuming that httomolibgpu code should be formatted with black? If so, you may want to check if black has been applied to the changed files, I can see some things like missing newline characters and no spaces after commas that I think wouldn't be present if black were applied (though I could be wrong!).

@dkazanc
Copy link
Collaborator Author

dkazanc commented Jun 5, 2025

Very good point about httomo-backends tests, thanks. I see that it does affect them and some of the tests failing. I'll do some research, but it looks there are mainly two lines to blame?
xp.nan_to_num(data, copy=False, nan=0.0, posinf=0.0, neginf=0.0)
and
zero_elements_total = int(xp.count_nonzero(data == 0))

@dkazanc
Copy link
Collaborator Author

dkazanc commented Jun 5, 2025

This also allocates the memory:
xp.all(xp.isfinite(data))

@dkazanc
Copy link
Collaborator Author

dkazanc commented Jun 5, 2025

Yes, looks like both functions (inf's nan's estimation/removal) and zeros calculation create the copy of the data.

This actually opens up more general discussion, sorry for many words.

I can see two ways of dealing with this as modifying every memory estimator doesn't seem appealing at all.

  1. Write bespoke element-wise kernels for CuPy arrays to deal with the data without creating any copies of it. I reckon we will need to loop through arrays and fix OR not nan's/inf's. Same with zeros. It will be probably more computationally efficient as now.
  2. Have data validator as a separate method in the library with its own memory estimator and insert it using the framework between each method. Similar to what we do with data_reducer added after the loader by default. The data validator will be removed from the methods themselves in httomolibgpu library as it becomes a standalone method.

Both approaches have benefits actually:

  • no. 1 Seems like a nice algorithmic workaround and probably not very time consuming to do. Also I don't like that the fact that such a simple method creates copies of data, that is probably not very efficient?
  • no. 2 Has a benefit of having this method for any other backend, it will also take care of httomolib and tomopy methods, for instance. Notably, we do not apply data validator to data_reducer and save_to_images of httomolib. In addition to that, it makes simpler for us to log warnings with the framework coming from the method. We can modify it in a way so that it will also return some info about the presence of nans/infs/zeros.

To me it looks like we need both 1 and 2 :) But we can do this in stages, if no. 1 resolves the memory issue, we can merge it first and then we can do no. 2. Any thoughts?

@dkazanc
Copy link
Collaborator Author

dkazanc commented Jun 6, 2025

Latest update. I've implemented no.1 from above. All tests now pass in the branch and with httomo-backends.

@dkazanc dkazanc merged commit b582acc into main Jun 12, 2025
1 of 3 checks passed
@dkazanc dkazanc deleted the datavalidator branch June 12, 2025 11:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-zenodo-tests Run Zenodo tests for each PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants