Optimized traversal of schema tree for schema cleaning (`GenerateSchema.clean_schema`) #1487

MarkusSintonen · 2024-10-18T10:14:14Z

Change Summary

Adds a schema tree traversal that gathers the schema nodes and information needed for schema inlining and discriminator handling. The traversal collects everything in a single pass, and the result is used by `GenerateSchema.clean_schema`. Required for pydantic/pydantic#10655. This makes schema cleaning much more efficient, since the biggest bottleneck has been the Python-side tree traversal, especially with many models or deeply nested models.
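To illustrate the single-pass idea, here is a minimal sketch in plain Rust. It is not the code from `src/schema_traverse.rs`: the `Schema` enum, the `Gathered` struct, and the `gather` function are hypothetical stand-ins, and the real traversal operates on Python dict schemas through PyO3. The sketch only shows how one walk can collect per-definition reference counts alongside other bookkeeping.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified stand-in for a core schema node.
#[derive(Debug)]
enum Schema {
    /// A leaf type with no children.
    Leaf,
    /// A reference to a named definition.
    DefRef(String),
    /// Any container node (model, list, union, ...) with child schemas.
    Node(Vec<Schema>),
}

/// Everything collected during one walk over the tree (illustrative only).
#[derive(Debug, Default)]
struct Gathered {
    /// How many times each definition name is referenced.
    ref_counts: HashMap<String, usize>,
    /// Total number of nodes visited.
    visited: usize,
}

/// Visit every node exactly once and record what a later cleaning step needs.
fn gather(schema: &Schema, out: &mut Gathered) {
    out.visited += 1;
    match schema {
        Schema::Leaf => {}
        Schema::DefRef(name) => {
            // Count the reference; no second pass over the tree is needed.
            *out.ref_counts.entry(name.clone()).or_insert(0) += 1;
        }
        Schema::Node(children) => {
            for child in children {
                gather(child, out);
            }
        }
    }
}
```

With counts like these, a cleaning step could, for example, treat definitions that are referenced only once as candidates for inlining without walking the tree again.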

Related issue number

See the Pydantic-side PR above (pydantic/pydantic#10655).

Checklist

Unit tests for the changes exist
Documentation reflects the changes where applicable
Pydantic tests pass with this pydantic-core (except for expected changes)
My PR is ready to review, please add a comment including the phrase "please review" to assign reviewers

Selected Reviewer: @sydney-runkle

codecov · 2024-10-18T10:17:42Z

Codecov Report

Attention: Patch coverage is 98.88889% with 1 line in your changes missing coverage. Please review.

Project coverage is 89.63%. Comparing base (ab503cb) to head (29b0ebc).
Report is 274 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/schema_traverse.rs | 98.75% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1487      +/-   ##
==========================================
- Coverage   90.21%   89.63%   -0.58%     
==========================================
  Files         106      113       +7     
  Lines       16339    17984    +1645     
  Branches       36       40       +4     
==========================================
+ Hits        14740    16120    +1380     
- Misses       1592     1844     +252     
- Partials        7       20      +13

| Files with missing lines | Coverage Δ |
| --- | --- |
| python/pydantic_core/__init__.py | `93.10% <100.00%> (+0.51%)` ⬆️ |
| src/lib.rs | `100.00% <100.00%> (+12.85%)` ⬆️ |
| src/schema_traverse.rs | `98.75% <98.75%> (ø)` |

... and 54 files with indirect coverage changes

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

codspeed-hq · 2024-10-18T10:20:51Z

CodSpeed Performance Report

Merging #1487 will not alter performance

Comparing `MarkusSintonen:schema-gather` (29b0ebc) with `main` (6472887)

Summary

✅ 155 untouched benchmarks

🆕 2 new benchmarks

Benchmarks breakdown

| | Benchmark | `main` | `MarkusSintonen:schema-gather` | Change |
| --- | --- | --- | --- | --- |
| 🆕 | `test_nested_schema_inlined` | N/A | 11.2 ms | N/A |
| 🆕 | `test_nested_schema_using_defs` | N/A | 61 µs | N/A |

MarkusSintonen · 2024-10-19T20:02:37Z

please review

davidhewitt

This seems reasonable to me. I haven't thought too much about the individual schema types, but I had a couple of possible optimization ideas :)

pyproject.toml

src/schema_traverse.rs

davidhewitt · 2024-10-21T10:45:21Z

src/schema_traverse.rs

+    meta_with_keys: Option<(Bound<'py, PyDict>, &'a Bound<'py, PySet>)>,
+    def_refs: Bound<'py, PyDict>,
+    recursive_def_refs: Bound<'py, PySet>,
+    recursively_seen_refs: HashSet<String>,


There might be an optimization to store these as Python strings to avoid round-trips:

Suggested change

-    recursively_seen_refs: HashSet<String>,
+    recursively_seen_refs: HashSet<PyBackedStr>,

MarkusSintonen · 2024-10-21

I wonder whether it helps here, since the first `contains` check converts it into a Rust string right away anyway?

MarkusSintonen · 2024-10-21

Changed to use `PySet`, which seems to be the fastest option here (also faster than `HashSet<PyBackedStr>`).

davidhewitt · 2024-10-22

I think Python strings might cache their hash, though I don't think PyO3 uses that right now. Interesting observation 👍
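For readers following this thread, the sketch below shows one way a "currently seen refs" set can detect recursive definitions during a walk. It is an assumption about the general technique, not the PR's implementation: `Schema`, `Ctx`, and `walk` are hypothetical, the field names are borrowed from the diff above purely for illustration, and there is no Python involved, so each `name.clone()` stands in for the Python-to-Rust string conversion the comments discuss.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified schema node (not the real core schema type).
enum Schema {
    Leaf,
    DefRef(String),
    Node(Vec<Schema>),
}

/// Walk state; field names mirror the diff above for illustration only.
struct Ctx<'a> {
    /// Named definitions that `DefRef` nodes point at.
    definitions: &'a HashMap<String, Schema>,
    /// Refs whose definitions are currently being expanded on the call stack.
    recursively_seen_refs: HashSet<String>,
    /// Refs that were reached again while already being expanded.
    recursive_def_refs: HashSet<String>,
}

fn walk(schema: &Schema, ctx: &mut Ctx<'_>) {
    match schema {
        Schema::Leaf => {}
        Schema::Node(children) => {
            for child in children {
                walk(child, ctx);
            }
        }
        Schema::DefRef(name) => {
            if ctx.recursively_seen_refs.contains(name) {
                // We are already inside this definition: it is recursive.
                ctx.recursive_def_refs.insert(name.clone());
                return;
            }
            // Copy the shared reference out so `ctx` can be mutably reborrowed.
            let definitions = ctx.definitions;
            if let Some(def) = definitions.get(name) {
                // Each insert allocates an owned String; this is the cost
                // the thread is weighing against Python-side containers.
                ctx.recursively_seen_refs.insert(name.clone());
                walk(def, ctx);
                ctx.recursively_seen_refs.remove(name);
            }
        }
    }
}
```

Note that this sketch only marks the ref at which a cycle closes. Whether the real traversal marks every ref in the cycle is not stated in this thread.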

davidhewitt

As far as I am concerned, this side looks good to me! Thanks 👍

samuelcolvin · 2024-11-18T14:58:28Z

src/schema_traverse.rs

This looks good in principle, but it needs comprehensive docstrings.

Viicos · 2025-02-03T17:02:27Z

Tackled in pydantic/pydantic#11244.

MarkusSintonen mentioned this pull request Oct 18, 2024

Optimized schema building process without Python _WalkCoreSchema pydantic/pydantic#10655

Closed

5 tasks

MarkusSintonen changed the title ~~Optimized traversal for schema node gathering for schema cleaning~~ Optimized traversal of schema nodes for schema cleaning Oct 18, 2024

MarkusSintonen force-pushed the schema-gather branch from 18a29c4 to f2b4fc2 Compare October 19, 2024 19:51

pydantic-hooky bot added the ready for review label Oct 19, 2024

pydantic-hooky bot assigned sydney-runkle Oct 19, 2024

MarkusSintonen changed the title ~~Optimized traversal of schema nodes for schema cleaning~~ Optimized traversal of schema tree for schema cleaning (GenerateSchema.clean_schema) Oct 19, 2024

MarkusSintonen force-pushed the schema-gather branch from 0ea8bb3 to 7eae801 Compare October 19, 2024 20:21

davidhewitt reviewed Oct 21, 2024

MarkusSintonen force-pushed the schema-gather branch 2 times, most recently from b4aed80 to 14e3137 Compare October 23, 2024 17:16

davidhewitt approved these changes Oct 24, 2024

MarkusSintonen force-pushed the schema-gather branch 2 times, most recently from 1a3071e to 22aa6a2 Compare November 14, 2024 13:35

Add schema tree node gathering for cleaning in pydantic GenerateSchema

5c9c9e9

MarkusSintonen force-pushed the schema-gather branch from 22aa6a2 to 5c9c9e9 Compare November 14, 2024 13:35

MarkusSintonen added 2 commits November 14, 2024 20:56

Remove unneeded InvalidSchema

90418d9

Revert "Remove unneeded InvalidSchema"

29b0ebc

This reverts commit 90418d9.

Viicos closed this Feb 3, 2025
