Auto Loader (cloudFiles) is designed to handle changing data structures automatically when new files arrive.
When new columns appear in incoming data (like JSON/CSV), Auto Loader can:
- Detect new columns
- Add them to the schema
- Continue ingestion without breaking the pipeline
You enable schema evolution using:
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.option("cloudFiles.schemaLocation", "/path/schema")
- Auto Loader stores the inferred schema in schemaLocation
- When a new file arrives, it compares the file's schema against the stored one, detects new columns, and updates the stored schema
- New columns are added as nullable
| Mode | Behavior | Example |
|---|---|---|
| addNewColumns | Adds new columns automatically | discount column added to schema |
| rescue | Unknown columns go into _rescued_data | Extra fields stored in JSON column |
| failOnNewColumns | Pipeline fails if schema changes | Safe for strict schemas |
| none | Ignores schema changes | Drops unknown columns silently |
In rescue mode, _rescued_data is a STRING column containing JSON; you need to parse it to extract values.
👉 "Auto Loader enables schema evolution by incrementally updating the schema metadata stored in schemaLocation, allowing pipelines to adapt to new columns without manual intervention."
Even though Auto Loader handles schema changes, downstream tables/jobs can break because:
- New columns not expected
- Schema mismatch in transformations
- BI tools expecting fixed schema
Bronze layer (flexible layer — schema can change):
- Ingest using Auto Loader
- Allow schema evolution
- No strict validation

Silver layer:
- Apply transformations
- Control schema explicitly
- Select only required columns

df.select("id", "name", "amount")

👉 This isolates downstream from schema changes

Gold layer:
- Aggregated data
- Fixed schema for reporting

expected_schema = ["id", "name", "amount"]
df = df.select(*expected_schema)

👉 Prevents unexpected columns from flowing downstream
Options for handling new columns downstream:
- Ignore new columns
- Add default values
- Version your tables
With Delta Lake:
- Schema enforcement
- Schema evolution (mergeSchema)
- Time travel (rollback if needed)
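A hedged sketch of mergeSchema and time travel; the paths are hypothetical, and this assumes a Delta-enabled Spark session with an existing DataFrame `df` (it is a config-style fragment, not runnable as-is):

```python
# Hypothetical path; assumes a Delta-enabled session and an existing DataFrame `df`
(df.write
   .format("delta")
   .option("mergeSchema", "true")   # let new columns evolve the table schema on write
   .mode("append")
   .save("/path/to/delta/orders"))

# Time travel: read an earlier version for rollback or auditing
old_df = (spark.read
          .format("delta")
          .option("versionAsOf", 0)  # version 0 chosen for illustration
          .load("/path/to/delta/orders"))
```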
.option("cloudFiles.schemaEvolutionMode", "rescue")

👉 Keeps unexpected fields safely without breaking the pipeline
Day 1 file:
{
  "order_id": 1,
  "amount": 500
}

Day 2 file:
{
  "order_id": 2,
  "amount": 700,
  "discount": 50
}

Without schema evolution:
- Pipeline fails ❌
- Manual schema update needed ❌

With addNewColumns:
- Detects discount
- Adds column automatically ✅
- Pipeline continues ✅
Your reporting query:
SELECT order_id, amount FROM orders

Still works ✅

But if someone does:
SELECT * FROM orders

👉 Schema changes → dashboards may break ❌
# Bronze Layer - Auto Loader with schema evolution
bronze_df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
    .option("cloudFiles.schemaLocation", "/path/to/schema") \
    .load("/path/to/source")

# Silver Layer - Schema enforcement: select only the required columns
silver_df = bronze_df.select("order_id", "amount", "discount")

# Write to Delta table (toTable starts the streaming query)
silver_df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/path/checkpoint") \
    .toTable("silver_orders")

What you say in interview:
👉 "We implemented Auto Loader in the Bronze layer with schema evolution enabled. In the Silver layer, we enforced schema by selecting only required columns, which ensured that downstream pipelines remain stable and unaffected by upstream schema changes."
👉 "Auto Loader supports schema evolution by automatically detecting and adding new columns to the schema using schemaLocation metadata. This ensures uninterrupted ingestion of changing data from upstream sources."
👉 "To handle downstream dependency, we follow a Bronze-Silver-Gold architecture. Schema evolution is allowed in Bronze, while in Silver we enforce a controlled schema by selecting required columns only. This pattern ensures BI tools and dependent jobs receive predictable, stable schemas."
Q: "What if you want to track which columns are new?"
- A: Use rescue mode and log the _rescued_data column for auditing purposes.
Q: "How do you handle schema breaking changes?"
- A: Use Delta time travel to roll back, or maintain a version history of schemas.
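A hard rollback can be sketched with Delta's RESTORE; the table name reuses the earlier silver_orders example, the version number is hypothetical, and a Delta-enabled session is assumed (config-style fragment, not runnable as-is):

```python
# Hypothetical version number; assumes a Delta-enabled Spark session
spark.sql("RESTORE TABLE silver_orders TO VERSION AS OF 5")

# Or inspect the table history first to pick the right version
spark.sql("DESCRIBE HISTORY silver_orders").show()
```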
Q: "What's the performance impact of schema evolution?"
- A: Minimal; Auto Loader scans incrementally and updates metadata efficiently.
Q: "Can we reject new columns instead of accepting them?"
- A: Yes, use failOnNewColumns mode to stop the pipeline and alert the data team.
- ✅ Auto Loader automates schema evolution — no manual intervention needed
- ✅ Use Bronze-Silver-Gold to isolate schema changes
- ✅ Silver layer acts as a firewall — shields downstream from schema changes
- ✅ Delta Lake provides schema enforcement & time travel
- ✅ Always test downstream impact before deploying schema changes