Add a distinct check_string_utf8 for unicode strings. #1

gsnedders · 2013-07-21T20:23:41Z

This proved to be a perf bottleneck for html5lib, and can trivially
be reimplemented entirely within Python, never calling into lxml.
(The bytes case probably ought to be converted to be internal to
Python too.)

Note that I haven't actually checked what libxml2 does here, or
whether it has a flag that switches between XML 1.0 and 1.1
validation; the Python code here implements the 1.0 validation.

Note also that this is stricter than the bytes version, which treats
only invalid ASCII characters as making the string invalid and
ignores everything else.
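
For illustration, a minimal sketch of what a pure-Python XML 1.0 check
on a unicode string could look like. This is not the code from this
PR: the function name `is_valid_xml10_text` is hypothetical, and it
assumes Python 3, where `str` holds full code points. The character
ranges come from the `Char` production of the XML 1.0 spec.

```python
import re

# Matches any code point that is NOT an XML 1.0 Char:
#   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
#          | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_valid_xml10_text(text):
    """Return True if every code point in `text` is an XML 1.0 Char.

    Hypothetical helper for illustration only; it never calls into
    lxml or libxml2.
    """
    return _INVALID_XML10_CHAR.search(text) is None
```

Unlike a check that only rejects invalid ASCII control characters,
this also rejects non-ASCII code points such as U+FFFE and lone
surrogates.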
