@dongjoon-hyun dongjoon-hyun commented Apr 15, 2025

What changes were proposed in this pull request?

This PR aims to support user-specified schema in DataFrameReader.

Why are the changes needed?

For feature parity.

Does this PR introduce any user-facing change?

No. This is a new addition.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun
Member Author

Could you review this PR, @yaooqinn?

#expect(try await spark.read.schema("age SHORT").json(path).dtypes.count == 1)
#expect(try await spark.read.schema("age SHORT").json(path).dtypes[0] == ("age", "smallint"))
#expect(try await spark.read.schema("age SHORT, name STRING").json(path).dtypes[0] == ("age", "smallint"))
#expect(try await spark.read.schema("age SHORT, name STRING").json(path).dtypes[1] == ("name", "string"))
Member

Can we also add a test with a column comment and a NOT NULL constraint?

Member Author

I thought it was supported.

But according to Apache Spark 4.0.0 RC4, there seem to be limitations: a NOT NULL constraint in the schema string does not show up in the resulting schema (nullable = true).
spark-shell

$ bin/spark-shell
WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0
      /_/

Using Scala version 2.13.16 (OpenJDK 64-Bit Server VM, Java 17.0.14)
Type in expressions to have them evaluated.
Type :help for more information.
25/04/15 12:32:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1744687967546).
Spark session available as 'spark'.

scala> spark.read.schema("name STRING NOT NULL").json("examples/src/main/resources/people.json").printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- name: string (nullable = true)

spark-connect-shell

$ bin/spark-connect-shell --remote sc://localhost:15002
25/04/15 12:28:48 INFO DefaultAllocationManagerOption: allocation manager type not specified, using netty as the default type
25/04/15 12:28:48 INFO CheckAllocator: Using DefaultAllocationManager at memory/netty/DefaultAllocationManagerFactory.class
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0
      /_/

Type in expressions to have them evaluated.
Spark connect server version 4.0.0.
Spark session available as 'spark'.

scala> spark.read.schema("name STRING").json("../examples/src/main/resources/people.json").printSchema
root
 |-- name: string (nullable = true)

scala> spark.read.schema("name STRING NOT NULL").json("../examples/src/main/resources/people.json").printSchema
root
 |-- name: string (nullable = true)

scala> spark.read.schema("name STRING NOT NULL").json("../examples/src/main/resources/people.json").show()
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+

Let me dig into that part further, @yaooqinn.
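
A minimal sketch for narrowing this down in the same spark-shell session, assuming Spark's `StructType.fromDDL` is available on the classpath and the shell is started from the Spark home directory as above. It prints the nullability of the field as parsed from the DDL string, then the nullability reported after the JSON read. The names `declared` and `df` are illustrative only and do not come from this PR.

```scala
import org.apache.spark.sql.types.StructType

// Parse the DDL string on its own, without going through a data source.
val declared = StructType.fromDDL("name STRING NOT NULL")
val declaredNullable = declared("name").nullable
println(s"parsed from DDL: nullable = $declaredNullable")

// Read with the parsed schema and inspect what the reader reports back.
// `spark` is the session that spark-shell provides; the path is the one used above.
val df = spark.read.schema(declared).json("examples/src/main/resources/people.json")
val readNullable = df.schema("name").nullable
println(s"after json read: nullable = $readNullable")
```

If the first line reports a non-nullable field while the second reports nullable = true, the constraint is being dropped on the read path rather than by the DDL parser.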

Member

Thank you, @dongjoon-hyun.

@dongjoon-hyun
Member Author

Thank you, @yaooqinn !

@dongjoon-hyun
Member Author

Merged to main.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-51799 branch April 15, 2025 03:35