parse_html adds unwanted tags like <html><head>...<body></html> #583

@qknight

I want to use parse_document to create DOM/VDOM patches, but parse_document(...) keeps adding <html> and <body>. Is there an option to fine-tune the error-correction level? I do like that it adds the missing </title> in the example below.

But when creating a virtual-DOM patch for a <div id="here">, it is awkward to have to filter the added <html> tags out afterwards.

My current workaround parses the document and keeps only the children of <body>:

/// Parse non-escaped HTML strings such as "Hello world!" into a node tree (see also raw_html(...))
pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
    let dom: RcDom = parse_document(RcDom::default(), Default::default()).one(html);
    if let Some(body) = find_body(&dom.document) {
        let new_document = Rc::new(markup5ever_rcdom::Node {
            data: NodeData::Document,
            parent: Cell::new(None),
            children: body.children.clone(),
        });
        process_handle(&new_document)
    } else {
        Err(ParseError::NoBodyInParsedHtml)
    }
}

// Recursively find the <body> element
fn find_body(handle: &Handle) -> Option<Handle> {
    match &handle.data {
        NodeData::Element { name, .. } if name.local.as_ref() == "body" => Some(handle.clone()),
        _ => {
            for child in handle.children.borrow().iter() {
                if let Some(body) = find_body(child) {
                    return Some(body);
                }
            }
            None
        }
    }
}

However, my problem is that I also want to parse HTML that contains an explicit <html>...</html> element, and this workaround removes it.
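
One possible way around this, as a rough sketch: look at the raw input for an explicit <html> start tag before deciding whether to keep the whole parsed document or only the children of <body>. The names has_explicit_html_tag, Root and choose_root are hypothetical helpers, not part of html5ever, and the scan is a plain string heuristic that does not skip comments, scripts, or attribute values.

```rust
/// Which part of the parsed tree to hand on (hypothetical helper type).
#[derive(Debug, PartialEq)]
enum Root {
    /// Input had its own <html>, so keep the whole document.
    WholeDocument,
    /// Input was a snippet, so keep only the children of <body>.
    BodyChildren,
}

/// Heuristic: does the raw input contain an explicit `<html` start tag?
/// Assumption: a case-insensitive text scan is good enough for the caller.
fn has_explicit_html_tag(input: &str) -> bool {
    let lower = input.to_ascii_lowercase();
    let mut rest = lower.as_str();
    while let Some(pos) = rest.find("<html") {
        let after = &rest[pos + "<html".len()..];
        // The tag name must end here: `<html>`, `<html lang=..>`, `<html/>`.
        // Anything else (e.g. `<html-widget>`) is a different element.
        match after.chars().next() {
            Some(c) if c == '>' || c == '/' || c.is_ascii_whitespace() => return true,
            _ => rest = after,
        }
    }
    false
}

/// Decide what to keep, based only on the raw input text.
fn choose_root(input: &str) -> Root {
    if has_explicit_html_tag(input) {
        Root::WholeDocument
    } else {
        Root::BodyChildren
    }
}
```

In parse_html, choose_root(html) could then select between passing dom.document to process_handle directly and the find_body path; that wiring is not tested here.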

Test from html-driver.rs showing the inserted tags:

#[test]
fn from_utf8() {
    let dom = driver::parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .one("<title>Test".as_bytes());
    let mut serialized = Vec::new();
    let document: SerializableHandle = dom.document.clone().into();
    serialize::serialize(&mut serialized, &document, Default::default()).unwrap();
    assert_eq!(
        String::from_utf8(serialized).unwrap().replace(' ', ""),
        "<html><head><title>Test</title></head><body></body></html>"
    );
}

Update:

parse_fragment also adds unwanted HTML.
