This gets rid of the old 'Literal' type:
    enum Literal {
        Unicode(char),
        Byte(u8),
    }
and replaces it with:
    struct Literal(Box<[u8]>);
I did this primarily because I perceive the new version to be a bit
simpler, and it is very likely to be more space efficient given some of
the changes I have in mind (upcoming in subsequent commits). Namely, I
want to include more analysis information beyond simple booleans, and
this means using up more space. Putting that analysis information on
every single byte/char seems gratuitous. But putting it on every single
sequence of bytes/chars seems more justifiable.
I also have a hand-wavy idea that this might make analysis a bit easier.
And another hand-wavy idea that debug-printing such an HIR will make it
a bit more comprehensible.
Overall, this isn't a completely obvious win and I do wonder whether
I'll regret this. For one thing, the translator is now a fair bit
more complicated in exchange for not creating a 'Vec<u8>' for every
'ast::Literal' node.
This also gives up the Unicode vs byte distinction and just commits to "all
bytes." Instead, we do a UTF-8 check on every 'Hir::literal' call, and
that in turn sets the UTF-8 property. This does seem a bit wasteful, and
indeed, we do another UTF-8 check in the compiler (even though we could
use 'unsafe' correctly and avoid it). However, once the new NFA compiler
lands from regex-automata, it operates purely in byte-land and will not
need to do another UTF-8 check. Moreover, a UTF-8 check, even on every
literal, is likely barely measurable in the grand scheme of things.
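To make the "UTF-8 check on every literal" idea concrete, here is a minimal
sketch. It is not the actual regex-syntax code: the 'Properties' struct, its
'utf8' field, and the free function 'literal' are hypothetical stand-ins for
'Hir::literal' and whatever analysis data the HIR really carries. The only
real API used is 'std::str::from_utf8'.

```rust
/// Hypothetical stand-in for the new representation: raw bytes only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Literal(pub Box<[u8]>);

/// Hypothetical, pared-down analysis data attached to a literal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Properties {
    /// True when the literal's bytes are valid UTF-8.
    pub utf8: bool,
}

/// Build a literal and derive its UTF-8 property with one validation pass.
/// (Assumed shape; the real constructor may differ.)
pub fn literal(bytes: impl Into<Box<[u8]>>) -> (Literal, Properties) {
    let bytes: Box<[u8]> = bytes.into();
    // One linear scan over the bytes; no Unicode/byte tag is stored.
    let utf8 = std::str::from_utf8(&bytes).is_ok();
    (Literal(bytes), Properties { utf8 })
}
```

In this sketch, the Unicode vs byte distinction is recovered from the bytes
themselves: a literal built from "é".as_bytes() gets 'utf8: true', while one
built from vec![0xFF] gets 'utf8: false'.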
I do also worry that this is overwrought. In particular, the AST creates
a node for each character. Then the HIR smooths them out to sequences of
characters (that is, Vec<u8>). And then NFA compilation splits them back
out into states where a state handles at most one character (or range of
characters). Still, I am taking somewhat of a leap of judgment here that
this will make analysis easier and will overall use less space. We'll
see.
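The round trip described above can be sketched roughly as follows. Every name
here ('parse_literals', 'merge_to_bytes', 'ByteState', 'compile_states') is
hypothetical and exists only for illustration, and the one-byte-per-state
model is an assumption based on the byte-oriented NFA compiler mentioned
earlier; the real compiler also handles ranges.

```rust
/// Hypothetical AST step: one literal node per character.
pub fn parse_literals(pattern: &str) -> Vec<char> {
    pattern.chars().collect()
}

/// Hypothetical HIR step: smooth adjacent character nodes into one
/// contiguous byte sequence (the UTF-8 encoding of the characters).
pub fn merge_to_bytes(nodes: &[char]) -> Box<[u8]> {
    let mut buf = Vec::new();
    let mut tmp = [0u8; 4];
    for &ch in nodes {
        buf.extend_from_slice(ch.encode_utf8(&mut tmp).as_bytes());
    }
    buf.into_boxed_slice()
}

/// Hypothetical NFA state: match exactly one byte, then go to `next`.
#[derive(Debug, PartialEq, Eq)]
pub struct ByteState {
    pub byte: u8,
    pub next: usize,
}

/// Hypothetical NFA step: split the sequence back into per-byte states.
pub fn compile_states(lit: &[u8]) -> Vec<ByteState> {
    lit.iter()
        .enumerate()
        .map(|(i, &byte)| ByteState { byte, next: i + 1 })
        .collect()
}
```

For the pattern "aé", this yields two AST nodes, one three-byte HIR literal,
and three NFA states, which is the split-merge-split shape in question.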