@wiredfool wiredfool commented Jul 19, 2025

Addresses #8329 (comment). See also #8329 (comment).
Changes proposed in this pull request:

  • Use the image mode name as the schema name instead of "pixel"
  • Add image band metadata to the Arrow schema
  • Add tests for metadata interop
  • Add client tests with arro3 and nanoarrow, similar to the pyarrow tests

For multiband images, we emit:

ArrowSchema:
  Format: FixedLengthArray[4]
  Children: 
    - ArrowSchema:
       Format: uint8
       name: RGB
       metadata: {image: {bands: [R, G, B, X]}}

In pyarrow, this metadata is accessible using pa.array(img).type.field(0).metadata.

For single band images we emit:

ArrowSchema:
  Format: ...
  name: F
  metadata: {image: {bands: [F]}}

This metadata is apparently not accessible via pyarrow or arro3, but it is accessible via nanoarrow.
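
For readers unfamiliar with where this metadata lives: the Arrow C Data Interface stores ArrowSchema.metadata as a binary blob (an int32 pair count, then a length-prefixed key and value for each pair, in native byte order). The following is a minimal stdlib-only sketch of that layout, not Pillow's implementation. The key name "image" and the JSON-encoded value are assumptions based on the schema dumps above, and the two helper functions are hypothetical names.

```python
import json
import struct


def encode_arrow_metadata(pairs):
    """Pack byte key/value pairs in the Arrow C Data Interface metadata layout."""
    out = [struct.pack("=i", len(pairs))]  # int32 pair count, native byte order
    for key, value in pairs.items():
        out.append(struct.pack("=i", len(key)) + key)      # key length + key bytes
        out.append(struct.pack("=i", len(value)) + value)  # value length + value bytes
    return b"".join(out)


def decode_arrow_metadata(buf):
    """Inverse of encode_arrow_metadata: return a dict of bytes -> bytes."""
    (count,) = struct.unpack_from("=i", buf, 0)
    offset = 4
    pairs = {}
    for _ in range(count):
        items = []
        for _ in range(2):  # read the key, then the value
            (length,) = struct.unpack_from("=i", buf, offset)
            offset += 4
            items.append(buf[offset:offset + length])
            offset += length
        pairs[items[0]] = items[1]
    return pairs


# Assumed payload: key "image" holding JSON with the band names.
band_metadata = {b"image": json.dumps({"bands": ["R", "G", "B", "X"]}).encode()}
blob = encode_arrow_metadata(band_metadata)
decoded = decode_arrow_metadata(blob)
bands = json.loads(decoded[b"image"])["bands"]
```

A consumer that can read the raw schema (as nanoarrow apparently can) would only need the decoding half of this.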

The ultimate goal is for:

  • numpy/pandas/arrow users to be able to call np.array(img) and identify the RGB channel names
  • height/width to be available as well, ideally
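
To illustrate what identifying channel names would buy a consumer, here is a rough stdlib-only sketch that uses the band list to split a flat, interleaved pixel buffer into named channels. The function name split_bands and the sample pixel values are hypothetical, and the JSON shape is assumed to match the "bands" entry shown in the schema dumps above.

```python
import json


def split_bands(flat_pixels, image_metadata_json):
    """Map each band name to its bytes from an interleaved pixel buffer."""
    bands = json.loads(image_metadata_json)["bands"]
    stride = len(bands)  # one byte per band per pixel (uint8 assumed)
    return {name: flat_pixels[i::stride] for i, name in enumerate(bands)}


# Two made-up RGBX pixels: pure red, then mid green.
pixels = bytes([255, 0, 0, 255, 0, 128, 0, 255])
channels = split_bands(pixels, '{"bands": ["R", "G", "B", "X"]}')
```

With numpy the same idea would be a reshape to (n_pixels, len(bands)) followed by column indexing; the band names tell the user which column is which.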

@rok

@wiredfool wiredfool marked this pull request as draft July 19, 2025 15:38

rok commented Jul 20, 2025

@paleolimbot am I right to say array metadata was not really designed to pass application metadata? What are your thoughts on what's being done here?
@wiredfool has the issue here that he can't access the metadata of children in certain cases.
I'm suggesting the use of FixedShapeTensorArray, but I imagine working with arrays directly would simplify things here.

@wiredfool wiredfool force-pushed the pyarrow_band_names branch from e768bbf to 28c7645 Compare July 21, 2025 09:21
@wiredfool wiredfool force-pushed the pyarrow_band_names branch from 14ac76c to 7d2abbd Compare July 21, 2025 09:23
This metadata is available in nanoarrow, but not pyarrow or arro3
@paleolimbot

Cool!

It's true that field metadata at anything other than the top-level struct has only a medium chance of getting propagated through various Arrow operations. If you used the two metadata fields ARROW:extension:name (== pillow.image) and ARROW:extension:metadata (set to the JSON you're proposing above) at the Array level, I think you will have more success getting this to propagate correctly. For pyarrow, you'd have to implement a minimal ExtensionType class that a (pyarrow) user would have to "register" for that type to show up nicely in pyarrow.

Feel free to ping me if you have any questions!
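
As a rough illustration of the extension-metadata suggestion above, the two reserved keys could carry the band information like this. This is a sketch only: the helper names are hypothetical, the JSON payload shape is assumed from the schema dumps earlier in the thread, and a real pyarrow integration would additionally need the registered ExtensionType class mentioned above.

```python
import json

# Reserved Arrow field-metadata keys for extension types.
EXTENSION_NAME_KEY = b"ARROW:extension:name"
EXTENSION_METADATA_KEY = b"ARROW:extension:metadata"


def image_extension_metadata(bands):
    """Build field metadata tagging an array as a 'pillow.image' extension."""
    payload = json.dumps({"image": {"bands": list(bands)}})
    return {
        EXTENSION_NAME_KEY: b"pillow.image",
        EXTENSION_METADATA_KEY: payload.encode("utf-8"),
    }


def read_bands(field_metadata):
    """Return the band names, or None if this is not a pillow.image field."""
    if field_metadata.get(EXTENSION_NAME_KEY) != b"pillow.image":
        return None
    payload = json.loads(field_metadata[EXTENSION_METADATA_KEY])
    return payload["image"]["bands"]


meta = image_extension_metadata(["R", "G", "B", "X"])
recovered = read_bands(meta)
```

Consumers that do not know the extension name would simply see the storage type, which is the usual fallback for unregistered extension types.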

@wiredfool wiredfool marked this pull request as ready for review October 10, 2025 14:40
Comment on lines +158 to +159
img2 = img.copy()
px = img2.load()
Member

Suggested change
- img2 = img.copy()
- px = img2.load()
+ px = img.load()

Or are you using copy() because you're testing that the data in img2 is still correct after img is no longer used?

Member Author

This is to ensure we haven't over-freed the memory, since we're refcounting the Arrow usages.
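
For context, the kind of bug such a check guards against can be shown with a toy reference-counting model. This is purely illustrative and is not Pillow's actual implementation: the SharedBlock class and its methods are hypothetical stand-ins for pixel memory shared with Arrow consumers.

```python
class SharedBlock:
    """Toy stand-in for pixel memory shared with Arrow consumers (hypothetical)."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refcount = 1  # reference held by the owning image
        self.freed = False

    def retain(self):
        """Take an extra reference, as an Arrow export would."""
        if self.freed:
            raise RuntimeError("retain after free")
        self.refcount += 1
        return self

    def release(self):
        """Drop one reference and free the memory when the count hits zero."""
        if self.freed:
            raise RuntimeError("over-free: block released too many times")
        self.refcount -= 1
        if self.refcount == 0:
            self.data = None
            self.freed = True


block = SharedBlock(b"\x01\x02\x03\x04")
exported = block.retain()               # an Arrow export takes a reference
block.release()                         # the original owner goes away
still_readable = bytes(exported.data)   # data must survive until the last release
exported.release()                      # consumer's release callback frees it
```

If any path released the block one time too many, a later read would hit freed memory; in this toy model that surfaces as a RuntimeError instead.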

Co-authored-by: Andrew Murray <[email protected]>