Convert HTML to DatoCMS Structured Text (DAST format). PHP port of the official JavaScript library.
- PHP 8.2+
- DOM extension
- libxml extension
- Composer
composer require dealnews/datocms-html-to-structured-text

<?php
require_once 'vendor/autoload.php';
use DealNews\HtmlToStructuredText\Converter;
// Create converter instance
$converter = new Converter();
// Simple HTML
$html = '<h1>DatoCMS</h1><p>The best <strong>headless CMS</strong>.</p>';
$dast = $converter->convert($html);
// Returns:
// [
// 'schema' => 'dast',
// 'document' => [
// 'type' => 'root',
// 'children' => [...]
// ]
// ]

- ✅ Converts HTML to valid DAST documents
- ✅ Supports all standard HTML elements
- ✅ Custom handlers for specialized conversions
- ✅ DOM preprocessing hooks
- ✅ Configurable allowed blocks, marks, and heading levels
- ✅ Mark extraction from inline CSS styles
- ✅ URL resolution with <base> tag support
- ✅ Type-safe with comprehensive PHPDoc

| HTML | DAST Node | Notes |
|---|---|---|
| <h1> - <h6> | heading | Level extracted from tag |
| <p> | paragraph | |
| <ul>, <ol> | list | Style: bulleted/numbered |
| <li> | listItem | |
| <blockquote> | blockquote | |
| <pre>, <code> | code | Language from class attribute |
| <hr> | thematicBreak | |

| HTML | Mark | Notes |
|---|---|---|
| <strong>, <b> | strong | |
| <em>, <i> | emphasis | |
| <u> | underline | |
| <s>, <strike> | strikethrough | |
| <mark> | highlight | |
| <code> (inline) | code | In paragraph context |
| <a> | link | With URL and optional meta |
| <br> | span with \n | |

Scripts, styles, and media elements are ignored: <script>, <style>, <video>, <audio>, <iframe>, <embed>
Override default conversion for specific elements:
use DealNews\HtmlToStructuredText\Converter;
use DealNews\HtmlToStructuredText\Options;
use DealNews\HtmlToStructuredText\Handlers;
$converter = new Converter();
$options = new Options();
// Custom h1 handler - adds prefix to all h1 headings
$options->handlers['h1'] = function (
    callable $create_node,
    \DOMNode $node,
    $context
) {
    // Use default handler
    $result = Handlers::heading($create_node, $node, $context);
    // Modify result
    if (isset($result['children'][0]['value'])) {
        $result['children'][0]['value'] = '★ ' . $result['children'][0]['value'];
    }
    return $result;
};
$html = '<h1>Important</h1>';
$dast = $converter->convert($html, $options);
// H1 will have "★ Important" as text

Modify the DOM before conversion:
$options = new Options();
// Convert all <div> tags to <p> tags
$options->preprocess = function (\DOMDocument $doc): void {
    // Copy the live node list first so replacing nodes does not disturb iteration
    $divs = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        $divs[] = $div;
    }
    foreach ($divs as $div) {
        $p = $doc->createElement('p');
        while ($div->firstChild) {
            $p->appendChild($div->firstChild);
        }
        $div->parentNode->replaceChild($p, $div);
    }
};
$html = '<div>Content</div>';
$dast = $converter->convert($html, $options);
// Div becomes paragraph in DAST

Control which block types are allowed:
$options = new Options();
$options->allowed_blocks = ['paragraph', 'list']; // Only paragraphs and lists
$html = '<h1>Title</h1><p>Text</p>';
$dast = $converter->convert($html, $options);
// H1 will be converted to paragraph

Control which text marks are allowed:
$options = new Options();
$options->allowed_marks = ['strong']; // Only bold
$html = '<p><strong>Bold</strong> and <em>italic</em></p>';
$dast = $converter->convert($html, $options);
// Only strong mark will be applied, emphasis ignored

Control which heading levels are preserved:
$options = new Options();
$options->allowed_heading_levels = [1, 2]; // Only H1 and H2
$html = '<h1>H1</h1><h3>H3</h3>';
$dast = $converter->convert($html, $options);
// H3 will be converted to paragraph

class Options {
    // Whether to preserve newlines in text
    public bool $newlines = false;
    // Custom handler overrides
    public array $handlers = [];
    // Preprocessing function
    public $preprocess = null;
    // Allowed block types
    public array $allowed_blocks = [
        'blockquote', 'code', 'heading', 'link', 'list'
    ];
    // Allowed mark types
    public array $allowed_marks = [
        'strong', 'code', 'emphasis', 'underline',
        'strikethrough', 'highlight'
    ];
    // Allowed heading levels (1-6)
    public array $allowed_heading_levels = [1, 2, 3, 4, 5, 6];
}

Converts HTML string to DAST document.
Parameters:
- $html - HTML string to convert
- $options - Optional conversion options
Returns: DAST document array or null if empty
Throws: ConversionError if conversion fails
Converts a DOMDocument to DAST (for pre-parsed HTML).
Parameters:
- $doc - DOMDocument to convert
- $options - Optional conversion options
Returns: DAST document array or null if empty
Throws: ConversionError if conversion fails
The library extracts programming language from code block class names:
$html = '<pre><code class="language-javascript">const x = 1;</code></pre>';
$dast = $converter->convert($html);
// Result will have: ['type' => 'code', 'language' => 'javascript', 'code' => 'const x = 1;']

The default prefix is language-, but it can be customized in the context.
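
The prefix rule described above can be illustrated with a small standalone sketch. The function `language_from_class` and its `$prefix` parameter are hypothetical names chosen for this example; they are not part of the library, and the library's own matching logic may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): returns the language named
 * by the first class token that starts with $prefix, or null if none does.
 */
function language_from_class(string $class_attr, string $prefix = 'language-'): ?string
{
    // A class attribute is a whitespace-separated list of tokens.
    $tokens = preg_split('/\s+/', trim($class_attr), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($tokens as $token) {
        // Require at least one character after the prefix.
        if (str_starts_with($token, $prefix) && strlen($token) > strlen($prefix)) {
            return substr($token, strlen($prefix));
        }
    }

    return null;
}

var_dump(language_from_class('hljs language-javascript')); // string(10) "javascript"
var_dump(language_from_class('lang-php', 'lang-'));        // string(3) "php"
var_dump(language_from_class('plain'));                    // NULL
```

Returning null for a missing or empty language keeps the "no language" case distinct from an empty string.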
Link meta attributes (target, rel, title) are extracted:
$html = '<a href="https://example.com" target="_blank" rel="noopener">Link</a>';
$dast = $converter->convert($html);
// Result will have meta array: [['id' => 'target', 'value' => '_blank'], ...]

The library can extract marks from inline CSS styles:
$html = '<span style="font-weight: bold">Bold via style</span>';
$dast = $converter->convert($html);
// Creates span with strong mark

Supported style properties:
- font-weight: bold or font-weight > 400 → strong
- font-style: italic → emphasis
- text-decoration: underline → underline
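
As a rough illustration of the three style rules listed above, the following sketch maps an inline style string to mark names. `marks_from_style` is a hypothetical function written for this example only; it is not the library's implementation, and it assumes exact value matches such as `underline` with no extra keywords.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not the library's code): maps an inline style string
 * to DAST mark names using the three rules listed above.
 */
function marks_from_style(string $style): array
{
    $marks = [];

    foreach (explode(';', $style) as $declaration) {
        // Split "property: value" once; skip fragments without a colon.
        $parts = explode(':', $declaration, 2);
        if (count($parts) !== 2) {
            continue;
        }
        $property = strtolower(trim($parts[0]));
        $value    = strtolower(trim($parts[1]));

        if ($property === 'font-weight'
            && ($value === 'bold' || (is_numeric($value) && (int) $value > 400))) {
            $marks[] = 'strong';
        } elseif ($property === 'font-style' && $value === 'italic') {
            $marks[] = 'emphasis';
        } elseif ($property === 'text-decoration' && $value === 'underline') {
            $marks[] = 'underline';
        }
    }

    // Drop duplicates while keeping first-seen order and sequential keys.
    return array_values(array_unique($marks));
}

print_r(marks_from_style('font-weight: 700; font-style: italic'));
```

Numeric weights are compared after an integer cast, so `font-weight: 400` yields no mark while `font-weight: 500` yields `strong`.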
The <base> tag is respected for relative URL resolution:
$html = '<base href="https://example.com/"><a href="/page">Link</a>';
$dast = $converter->convert($html);
// Link URL will be resolved to: https://example.com/page

The library throws ConversionError exceptions when conversion fails:
use DealNews\HtmlToStructuredText\ConversionError;
try {
    $dast = $converter->convert($html);
} catch (ConversionError $e) {
    echo "Conversion failed: " . $e->getMessage();
    $node = $e->getNode(); // Get problematic DOM node if available
}

- Single whitespace-only spans are removed when wrapped
- Newlines in text are preserved if $options->newlines = true
- In headings, newlines are converted to spaces (headings can't have line breaks)
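
The newline rules above might be sketched as follows. `normalize_text` and its parameters are hypothetical and exist only for illustration. Replacing line breaks with spaces when newlines are not preserved is an assumption about the default behavior, not something this README states.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: applies the newline rules above to one text value.
 * Assumption: when newlines are not kept, each line break becomes one space.
 */
function normalize_text(string $text, bool $keep_newlines, bool $in_heading): string
{
    // Headings never keep line breaks, regardless of the option.
    if ($in_heading || !$keep_newlines) {
        return (string) preg_replace('/\r?\n|\r/', ' ', $text);
    }

    return $text;
}

var_dump(normalize_text("Line one\nLine two", true, false)); // newline kept
var_dump(normalize_text("Line one\nLine two", true, true));  // heading: newline becomes a space
```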
Nested lists are fully supported:
$html = '<ul><li>Item<ul><li>Nested</li></ul></li></ul>';
// Converts correctly to nested list structure

Links and other hybrid elements are handled correctly:
$html = '<a href="#"><span>Inline</span><p>Block</p></a>';
// Properly splits into separate nodes

- No Promises: PHP handlers return directly (synchronous)
- No Hast: Works directly with PHP DOMDocument instead of intermediate tree
- Array Structure: DAST nodes are arrays (not objects)
- Error Handling: Uses exceptions instead of rejection
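
Because DAST nodes are plain arrays, they can be traversed with ordinary array code. The sketch below collects the text of every span in a hand-written document. `dast_span_values` is a hypothetical helper, and the sample array is assumed to follow the shape shown in the usage example rather than being real converter output.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the 'value' of every span node, visiting
 * children depth-first in document order.
 */
function dast_span_values(array $node): array
{
    if (($node['type'] ?? null) === 'span') {
        return [(string) ($node['value'] ?? '')];
    }

    $values = [];
    foreach ($node['children'] ?? [] as $child) {
        $values = array_merge($values, dast_span_values($child));
    }

    return $values;
}

// Hand-written sample document (an assumption, not captured output).
$dast = [
    'schema' => 'dast',
    'document' => [
        'type' => 'root',
        'children' => [
            ['type' => 'heading', 'level' => 1, 'children' => [
                ['type' => 'span', 'value' => 'DatoCMS'],
            ]],
            ['type' => 'paragraph', 'children' => [
                ['type' => 'span', 'value' => 'The best '],
                ['type' => 'span', 'marks' => ['strong'], 'value' => 'headless CMS'],
                ['type' => 'span', 'value' => '.'],
            ]],
        ],
    ],
];

print_r(dast_span_values($dast['document']));
```

Nodes without a `children` key, such as code blocks, are skipped by this sketch.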
composer install
./vendor/bin/phpunit

Current test coverage: 86%+
php examples/basic.php
php examples/custom_handlers.php
php examples/preprocessing.php

BSD 3-Clause License - see LICENSE file for details
This is a PHP port of the official DatoCMS HTML to Structured Text JavaScript library.
Ported and maintained by DealNews.
- datocms-structured-text-to-html-string - Convert DAST to HTML (the inverse operation)