@gsnedders

This proved to be a perf bottleneck for html5lib, and can trivially
be reimplemented entirely within Python, never calling into lxml.
(The bytes case probably ought to be converted to be internal to
Python too.)

Note that I haven't checked what libxml2 does here, or whether it
has a flag that switches between XML 1.0 and 1.1 validation; the
Python code here implements the XML 1.0 rules.
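
For reference, a minimal sketch of what a pure-Python XML 1.0 check on a `str` can look like. This is not the code in this PR; the names `_INVALID_XML10_CHAR` and `is_xml10_text` are hypothetical, and the character ranges are taken from the `Char` production in the XML 1.0 specification.

```python
import re

# Matches any code point outside the XML 1.0 `Char` production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# (hypothetical name, for illustration only)
_INVALID_XML10_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml10_text(s: str) -> bool:
    """Return True if every code point in `s` is a legal XML 1.0 Char."""
    return _INVALID_XML10_CHAR.search(s) is None


# NUL, most C0 controls, lone surrogates, U+FFFE and U+FFFF are rejected;
# tab, LF, CR and everything else up to U+10FFFF are accepted.
print(is_xml10_text("plain text\n"))
print(is_xml10_text("bad\x00byte"))
```

A precompiled negated character class keeps the whole scan inside the `re` engine, so the check makes a single pass over the string without a Python-level loop.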

Note also that this is stricter than the bytes version, which
rejects only invalid ASCII characters and ignores everything else.
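
To illustrate the looser behaviour of an ASCII-only check, here is a rough sketch. It is not lxml's implementation: the name `is_valid_ascii_only` is hypothetical, and "invalid ASCII" is assumed to mean the C0 control characters other than tab, LF and CR.

```python
import re

# Assumption: only these ASCII control bytes count as invalid.
# Bytes >= 0x80 are never inspected. (hypothetical name, illustration only)
_INVALID_ASCII = re.compile(rb"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def is_valid_ascii_only(b: bytes) -> bool:
    """Return True unless `b` contains a disallowed ASCII control byte."""
    return _INVALID_ASCII.search(b) is None


# U+FFFE is not a legal XML 1.0 character, but its UTF-8 encoding
# (EF BF BE) contains no ASCII control bytes, so this check accepts it.
print(is_valid_ascii_only("\ufffe".encode("utf-8")))
print(is_valid_ascii_only(b"bad\x00byte"))
```

A check that validates decoded code points would reject the first input, which is the sense in which the `str` path is stricter.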
