@getChan
Contributor

@getChan getChan commented Oct 1, 2025

Which issue does this PR close?

todo list

  1. Review whether we can remove public APIs from our own implementations.
  2. Apply the arrow-avro 56.0.0 release.
  3. Add more tests.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added common Related to common crate datasource Changes to the datasource crate labels Oct 1, 2025
@alamb
Contributor

alamb commented Oct 2, 2025

❤️ amazing! Thank you @getChan
FYI @jecsand838 and @nathaniel-d-ef

@alamb
Contributor

alamb commented Oct 19, 2025

Hi @getChan -- I am preparing to make an arrow release -- have you hit any blockers while integrating the new arrow-avro crate into DataFusion?

@getChan
Contributor Author

getChan commented Oct 19, 2025

> Hi @getChan -- I am preparing to make an arrow release -- have you hit any blockers while integrating the new arrow-avro crate into DataFusion?

No, not yet. Thanks for the release.

@nathaniel-d-ef

Thanks for jumping on this @getChan; let me know if I can help!

@github-actions github-actions bot removed the common Related to common crate label Oct 27, 2025
@alamb
Contributor

alamb commented Oct 29, 2025

FYI I merged the arrow 57 upgrade to DataFusion -- so if you rebase this PR against main you'll have access to the new arrow-avro crate

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate proto Related to proto crate labels Oct 29, 2025
@nathaniel-d-ef

nathaniel-d-ef commented Nov 18, 2025

@jecsand838

I had a chance to get back to this today to try to find a workaround. Unless I'm missing something, the writer schema that ReaderBuilder infers from the Avro file (with the correct top-level name) isn't exposed in a way that we can use in DataFusion. Have you had any success, @getChan?

I think this effort is blocked until we can make the arrow-avro modifications.

@getChan
Contributor Author

getChan commented Nov 19, 2025

@nathaniel-d-ef
No, not yet. I'm still looking for a solution.
Whether or not projection is applied, I couldn't retrieve the schema metadata when reading the file with ReaderBuilder.
I don't know the arrow-avro internals very well, so I'm still investigating. I'll share a solution once I find one.

let avro_reader = ReaderBuilder::new().build(BufReader::new(reader))?;
println!("AVRO READER METADATA : {:?}", avro_reader.schema().metadata); // {}

@jecsand838

jecsand838 commented Nov 25, 2025

> @nathaniel-d-ef No, not yet. I'm still looking for a solution. Whether or not projection is applied, I couldn't retrieve the schema metadata when reading the file with ReaderBuilder. I don't know the arrow-avro internal implementation very well, so I'm investigating. I'll share it once I find a solution.
>
> let avro_reader = ReaderBuilder::new().build(BufReader::new(reader))?;
> println!("AVRO READER METADATA : {:?}", avro_reader.schema().metadata); // {}

@getChan Try using avro_header instead to get the OCF Header:

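    // Import paths are assumed from the arrow-avro 57 layout; adjust if they differ.
    use std::io::BufReader;
    use arrow_avro::reader::ReaderBuilder;
    use arrow_avro::schema::{AvroSchema, SCHEMA_METADATA_KEY};

    let avro_reader = ReaderBuilder::new().build(BufReader::new(reader))?;
    // The OCF header carries the file-level metadata as raw key/value bytes.
    let header = avro_reader.avro_header();
    println!("\nAVRO HEADER METADATA BYTES: {:?}", header.metadata().collect::<Vec<_>>());
    // The writer schema is the JSON stored under the `avro.schema` key.
    let writer_avro = AvroSchema::new(
        std::str::from_utf8(
            header
                .get(SCHEMA_METADATA_KEY.as_bytes())
                .expect("missing avro.schema metadata"),
        )
        .unwrap()
        .to_string(),
    );
    println!("AVRO HEADER SCHEMA METADATA : {:?}", writer_avro);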

You should see an output like this:

AVRO HEADER METADATA BYTES: [([97, 118, 114, 111, 46, 115, 99, 104, 101, 109, 97], [123, 34, 116, 121, 112, 101, 34, 58, 34, 114, 101, 99, 111, 114, 100, 34, 44, 34, 110, 97, 109, 101, 34, 58, 34, 116, 111, 112, 76, 101, 118, 101, 108, 82, 101, 99, 111, 114, 100, 34, 44, 34, 102, 105, 101, 108, 100, 115, 34, 58, 91, 123, 34, 110, 97, 109, 101, 34, 58, 34, 105, 100, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 105, 110, 116, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 98, 111, 111, 108, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 98, 111, 111, 108, 101, 97, 110, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 116, 105, 110, 121, 105, 110, 116, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 105, 110, 116, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 115, 109, 97, 108, 108, 105, 110, 116, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 105, 110, 116, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 105, 110, 116, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 105, 110, 116, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 98, 105, 103, 105, 110, 116, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 108, 111, 110, 103, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 102, 108, 111, 97, 116, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 102, 108, 111, 97, 116, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 100, 111, 117, 98, 108, 101, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 100, 111, 117, 98, 108, 101, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 100, 97, 116, 101, 95, 115, 116, 114, 105, 110, 103, 
95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 98, 121, 116, 101, 115, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 115, 116, 114, 105, 110, 103, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 34, 98, 121, 116, 101, 115, 34, 44, 34, 110, 117, 108, 108, 34, 93, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 116, 105, 109, 101, 115, 116, 97, 109, 112, 95, 99, 111, 108, 34, 44, 34, 116, 121, 112, 101, 34, 58, 91, 123, 34, 116, 121, 112, 101, 34, 58, 34, 108, 111, 110, 103, 34, 44, 34, 108, 111, 103, 105, 99, 97, 108, 84, 121, 112, 101, 34, 58, 34, 116, 105, 109, 101, 115, 116, 97, 109, 112, 45, 109, 105, 99, 114, 111, 115, 34, 125, 44, 34, 110, 117, 108, 108, 34, 93, 125, 93, 125]), ([111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 118, 101, 114, 115, 105, 111, 110], [51, 46, 49, 46, 50]), ([97, 118, 114, 111, 46, 99, 111, 100, 101, 99], [115, 110, 97, 112, 112, 121])]


AVRO HEADER SCHEMA METADATA : AvroSchema { json_string: "{\"type\":\"record\",\"name\":\"topLevelRecord\",\"fields\":[{\"name\":\"id\",\"type\":[\"int\",\"null\"]},{\"name\":\"bool_col\",\"type\":[\"boolean\",\"null\"]},{\"name\":\"tinyint_col\",\"type\":[\"int\",\"null\"]},{\"name\":\"smallint_col\",\"type\":[\"int\",\"null\"]},{\"name\":\"int_col\",\"type\":[\"int\",\"null\"]},{\"name\":\"bigint_col\",\"type\":[\"long\",\"null\"]},{\"name\":\"float_col\",\"type\":[\"float\",\"null\"]},{\"name\":\"double_col\",\"type\":[\"double\",\"null\"]},{\"name\":\"date_string_col\",\"type\":[\"bytes\",\"null\"]},{\"name\":\"string_col\",\"type\":[\"bytes\",\"null\"]},{\"name\":\"timestamp_col\",\"type\":[{\"type\":\"long\",\"logicalType\":\"timestamp-micros\"},\"null\"]}]}" }
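
The byte arrays in the first line of output are UTF-8 keys and values. As a minimal sketch using only the standard library (the helper name `decode_pairs` is hypothetical, not part of arrow-avro), the two shorter pairs copied from the output above decode like this:

```rust
/// Hypothetical helper: turn raw (key, value) byte pairs into readable strings.
/// Invalid UTF-8 is replaced rather than rejected, which is fine for debugging.
fn decode_pairs(pairs: &[(Vec<u8>, Vec<u8>)]) -> Vec<(String, String)> {
    pairs
        .iter()
        .map(|(k, v)| {
            (
                String::from_utf8_lossy(k).into_owned(),
                String::from_utf8_lossy(v).into_owned(),
            )
        })
        .collect()
}

fn main() {
    // Bytes copied from the header dump above (the long avro.schema pair is omitted).
    let pairs = vec![
        (
            vec![
                111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46,
                118, 101, 114, 115, 105, 111, 110,
            ],
            vec![51, 46, 49, 46, 50],
        ),
        (
            vec![97, 118, 114, 111, 46, 99, 111, 100, 101, 99],
            vec![115, 110, 97, 112, 112, 121],
        ),
    ];
    for (key, value) in decode_pairs(&pairs) {
        // prints "org.apache.spark.version = 3.1.2" then "avro.codec = snappy"
        println!("{key} = {value}");
    }
}
```

The first key in the dump decodes the same way to `avro.schema`, which is the entry the snippet above looks up.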

@jecsand838 jecsand838 left a comment

@getChan @nathaniel-d-ef

I think I found an approach to get this working. I added comments detailing the suggested changes to make and it all seems to work for me locally. With that said, I'm still fairly new to this codebase, so I apologize in advance if I'm missing something.

Let me know what you think and if this solves the projection issue.

----
logical_plan TableScan: avro_table projection=[f1, f2, f3]
physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/avro/simple_enum.avro]]}, projection=[f1, f2, f3], file_type=avro
physical_plan DataSourceExec: file_groups={4 groups: [[WORKSPACE_ROOT/testing/data/avro/simple_enum.avro:0..103], [WORKSPACE_ROOT/testing/data/avro/simple_enum.avro:103..206], [WORKSPACE_ROOT/testing/data/avro/simple_enum.avro:206..309], [WORKSPACE_ROOT/testing/data/avro/simple_enum.avro:309..411]]}, projection=[f1, f2, f3], file_type=avro
Contributor

this is pretty neat
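
The four byte ranges in the second physical plan are consistent with dividing the 411-byte file evenly across four groups, rounding the chunk size up. A rough sketch of that arithmetic, assuming an even ceil-based split (the function name `split_ranges` is hypothetical and this is not DataFusion's actual repartitioning code):

```rust
/// Hypothetical helper: split `file_size` bytes into at most `groups`
/// contiguous half-open ranges of ceil(file_size / groups) bytes each.
fn split_ranges(file_size: u64, groups: u64) -> Vec<(u64, u64)> {
    if groups == 0 {
        return Vec::new();
    }
    let chunk = file_size.div_ceil(groups);
    (0..groups)
        .map(|i| (i * chunk, ((i + 1) * chunk).min(file_size)))
        // Drop empty ranges that appear when the file is smaller than `groups` bytes.
        .filter(|r| r.0 < r.1)
        .collect()
}

fn main() {
    // prints [(0, 103), (103, 206), (206, 309), (309, 411)]
    println!("{:?}", split_ranges(411, 4));
}
```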

# Conflicts:
#	Cargo.lock
#	Cargo.toml
#	datafusion/datasource-avro/src/avro_to_arrow/arrow_array_reader.rs
#	datafusion/datasource-avro/src/avro_to_arrow/schema.rs
#	datafusion/datasource-avro/src/source.rs
@getChan
Contributor Author

getChan commented Dec 20, 2025

There is a compatibility issue with projection. I'm waiting for an arrow-avro release that includes the necessary projection features.

  • DataFusion does not infer the schema from the file when the table schema is explicitly defined.
  • arrow-avro requires reading the Avro file metadata (avro.schema) to perform projection.
  • Consequently, projection is problematic when reading Avro tables with explicitly defined schemas.

Please let me know if my understanding is incorrect or if there is a workaround.
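
To illustrate why the file's writer schema matters here: resolving a projection by column name needs the field order recorded in avro.schema, which a user-declared table schema does not necessarily match. The sketch below is only an illustration in plain Rust, not arrow-avro's implementation; `projection_indices` is a hypothetical helper, and the field names are taken from the header dump earlier in this thread.

```rust
/// Hypothetical helper: map projected column names to their positions in the
/// writer schema's field list, failing if a column is absent from the file.
fn projection_indices(
    writer_fields: &[&str],
    projected: &[&str],
) -> Result<Vec<usize>, String> {
    projected
        .iter()
        .map(|name| {
            writer_fields
                .iter()
                .position(|field| field == name)
                .ok_or_else(|| format!("column `{name}` not found in writer schema"))
        })
        .collect()
}

fn main() {
    // First few fields of the writer schema shown in the header output above.
    let writer_fields = ["id", "bool_col", "tinyint_col", "smallint_col", "int_col"];

    // prints Ok([4, 0])
    println!("{:?}", projection_indices(&writer_fields, &["int_col", "id"]));

    // A column that exists only in the declared table schema cannot be resolved.
    println!("{:?}", projection_indices(&writer_fields, &["extra_col"]));
}
```

Without the writer's field list, the first step (finding each column's position) has nothing to search, which matches the blocker described in the bullets above.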

@alamb
Contributor

alamb commented Dec 20, 2025

Thanks for the update @getChan

If the fix is already included in arrow-avro (and you are waiting on a release), you could rebase this PR against this branch #19355 to get access to the pre-release code

We would have to wait for the arrow release to actually merge it, but it could potentially help unblock your work

I actually would love to get some validation that we can cut over to the new arrow-avro reader before we make the next arrow release (so we can fix any issue that might be found)

@jecsand838

> Thanks for the update @getChan
>
> If the fix is already included in arrow-avro (and you are waiting on a release), you could rebase this PR against this branch #19355 to get access to the pre-release code
>
> We would have to wait for the arrow release to actually merge it, but it could potentially help unblock your work
>
> I actually would love to get some validation that we can cut over to the new arrow-avro reader before we make the next arrow release (so we can fix any issue that might be found)

@alamb I'm going to start working on apache/arrow-rs#8923 early next week and should have a PR up before Jan 1st.

@jecsand838

@getChan @alamb My apologies for the delay, but here's the PR which adds projection to the arrow-avro ReaderBuilder API.

Labels

common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

Use arrow-avro for performance and improved type support

4 participants