This guide helps AI coding assistants understand the architecture, patterns, and conventions used in this PHP library.
What it does: Converts HTML strings to DatoCMS Structured Text (DAST) format — a JSON-serializable document structure used by the DatoCMS headless CMS.
Why it exists: DatoCMS stores rich content as DAST, not HTML. This library enables importing HTML content (from legacy systems, WYSIWYG editors, or web scraping) into DatoCMS.
Key use cases:
- Migrating content from WordPress/Drupal to DatoCMS
- Converting user-generated HTML to structured content
- Normalizing messy HTML into clean, validated DAST
- Extracting structured data from web pages
Companion library: dealnews/datocms-structured-text-to-html-string does the reverse (DAST → HTML). Both share architecture and conventions.
Upstream reference: JavaScript implementation at datocms/structured-text
HTML String (dirty, unstructured)
↓
DOMDocument (PHP's native parser)
↓
Visitor (traverses DOM tree)
↓
Handlers (convert nodes to DAST)
↓
Wrapper (normalizes structure)
↓
DAST Document (clean, validated)
1. Converter (src/Converter.php)
- Entry point for all conversions
- Orchestrates parsing and handler assembly
- Methods: convert(string $html), convertDocument(\DOMDocument $doc)
- Handles libxml error suppression for malformed HTML
2. Visitor (src/Visitor.php)
- Traverses DOM tree recursively
- Dispatches nodes to appropriate handlers
- Accumulates results from child nodes
- Critical logic: Distinguishes single DAST nodes from arrays of nodes via isset($result['type'])
3. Handlers (src/Handlers.php)
- 27 static methods, one per HTML element type
- Each converts a DOM node to DAST structure
- Examples: root(), heading(), paragraph(), link(), text()
- Handler signature: function(callable $create_node, \DOMNode $node, Context $context): mixed
4. Wrapper (src/Wrapper.php)
- Fixes invalid DAST structure
- Wraps inline nodes (span, link) in paragraphs when needed
- Handles hybrid nodes (links containing block content)
- Methods: wrap(), wrapListItems(), split()
5. Context (src/Context.php)
- State object passed through conversion, treated as immutable by convention
- Tracks: parent node, accumulated marks, handlers, allowed types
- Handlers clone before modifying (prevents side effects)
6. GlobalContext (src/GlobalContext.php)
- Document-level shared state
- Tracks: base URL (from <base> tag), whether base was found
- Mutable (shared across entire conversion)
7. Utils (src/Utils.php)
- DAST validation rules (ALLOWED_CHILDREN constant)
- Type guards: isDastNode(), isDastRoot()
- Validation: isAllowedChild()
8. Options (src/Options.php)
- Configuration value object
- Custom handlers, preprocessing function, allowed types
9. ConversionError (src/ConversionError.php)
- Exception thrown on conversion failures
- Stores problematic \DOMNode for debugging
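The libxml error suppression listed under Converter can be illustrated with PHP's built-in DOM and libxml functions. This is a minimal sketch of one way such handling might look, not the library's actual code; the helper name parseHtmlQuietly is hypothetical.

```php
<?php
/**
 * Hypothetical helper (not part of this library): parses possibly
 * malformed HTML without emitting libxml warnings.
 */
function parseHtmlQuietly(string $html): \DOMDocument {
    // Collect parse errors internally instead of raising PHP warnings
    $previous = libxml_use_internal_errors(true);

    $doc = new \DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected errors and restore the caller's setting
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// The parser recovers from the unclosed <p> and <b> tags
$doc  = parseHtmlQuietly('<p>Unclosed <b>bold');
$bold = $doc->getElementsByTagName('b')->item(0);
echo $bold->textContent, "\n";
```

Restoring the previous libxml setting keeps the helper from changing error behavior for unrelated code that runs afterwards.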
DAST nodes are arrays with required 'type' key:
// Root document
[
    'schema' => 'dast',
    'document' => [
        'type' => 'root',
        'children' => [...]
    ]
]
// Block nodes
['type' => 'heading', 'level' => 1, 'children' => [...]]
['type' => 'paragraph', 'children' => [...]]
['type' => 'list', 'style' => 'bulleted', 'children' => [...]]
['type' => 'code', 'language' => 'php', 'code' => '<?php']
// Inline nodes
['type' => 'span', 'value' => 'text', 'marks' => ['strong']]
['type' => 'link', 'url' => 'https://...', 'children' => [...]]

Allowed children enforced by Utils::ALLOWED_CHILDREN:
public const ALLOWED_CHILDREN = [
    'root' => ['heading', 'paragraph', 'list', 'code', 'blockquote', 'thematicBreak'],
    'heading' => ['inlineNodes'], // Special: span, link, etc.
    'paragraph' => ['inlineNodes'],
    'list' => ['listItem'],
    'listItem' => ['paragraph', 'list'],
    'blockquote' => ['paragraph'],
    // ...
];

Marks are strings that modify text styling:
['strong', 'emphasis', 'underline', 'strikethrough', 'highlight', 'code']

They accumulate in Context::$marks as the Visitor descends through nested inline elements such as <strong><em>text</em></strong>.
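The accumulation of marks through nested inline elements can be sketched in plain PHP. The example below is a simplified illustration under stated assumptions, not the library's code: SketchContext, SKETCH_TAG_MARKS, and sketchSpans are hypothetical names, and only two tags are mapped.

```php
<?php
// Hypothetical stand-in for the library's Context (marks only)
class SketchContext {
    public array $marks = [];
}

// Assumed tag-to-mark map, for illustration only
const SKETCH_TAG_MARKS = ['strong' => 'strong', 'em' => 'emphasis'];

function sketchSpans(\DOMNode $node, SketchContext $context): array {
    $spans = [];
    if ($node instanceof \DOMText) {
        $span = ['type' => 'span', 'value' => $node->nodeValue];
        if ($context->marks !== []) {
            $span['marks'] = $context->marks;
        }
        $spans[] = $span;
    } else {
        // Clone so marks added here never leak to sibling nodes
        $new_context = clone $context;
        if (array_key_exists($node->nodeName, SKETCH_TAG_MARKS)) {
            $new_context->marks[] = SKETCH_TAG_MARKS[$node->nodeName];
        }
        foreach ($node->childNodes as $child) {
            $spans = array_merge(
                $spans,
                sketchSpans($child, $new_context)
            );
        }
    }
    return $spans;
}

$doc = new \DOMDocument();
$doc->loadHTML('<p><strong><em>both</em> bold</strong> plain</p>');
$paragraph = $doc->getElementsByTagName('p')->item(0);
$spans     = sketchSpans($paragraph, new SketchContext());
print_r($spans);
```

Because each element works on a cloned context, the trailing " plain" text carries no marks even though it follows a strong element.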
This codebase follows strict DealNews PHP conventions. Always follow these rules.
Opening braces go on the same line as the statement, except for multi-line conditionals. Always use braces.
// ✓ Correct
public function convert(string $html): ?array {
    if ($html === '') {
        return null;
    }
    // Multi-line conditional
    if (
        $condition_one &&
        $condition_two
    ) {
        doSomething();
    }
}
// ✗ Wrong
public function convert(string $html): ?array
{
    if ($html === '')
        return null; // Missing braces
}

- Variables/properties: snake_case (not camelCase)
- Visibility: protected by default (not private)
- Type hints: Required on all method signatures
// ✓ Correct
protected string $parent_node_type = 'root';
public function convert(string $html, ?Options $options = null): ?array {
    // ...
}
// ✗ Wrong
private $parentNodeType; // Wrong visibility, missing type, camelCase
public function convert($html, $options) { // Missing types
    // ...
}

Prefer single return; early returns OK for validation.
// ✓ Correct
public function findNode(array $nodes, string $type): ?array {
    if (empty($nodes)) {
        return null; // Early validation OK
    }
    $result = null;
    foreach ($nodes as $node) {
        if ($node['type'] === $type) {
            $result = $node;
            break;
        }
    }
    return $result; // Single main return
}
// ✗ Wrong
public function findNode(array $nodes, string $type): ?array {
    foreach ($nodes as $node) {
        if ($node['type'] === $type) {
            return $node; // Multiple returns in main logic
        }
    }
    return null;
}

- Short syntax: [] not array()
- Multi-line: Trailing commas, align associative keys
- Should not return arrays for complex data; use value objects
// ✓ Correct arrays
public const MARKS = [
    'strong',
    'emphasis',
    'underline',
];
protected array $handlers = [
    'h1' => 'heading',
    'p'  => 'paragraph',
];
// ✗ Wrong: Returning complex array
public function getNodeData(): array {
    return ['type' => 'span', 'value' => 'text'];
}
// ✓ Correct: Use value object (or DAST array per this library's design)
// Note: DAST nodes are arrays by design (matching upstream JS library)
// but for non-DAST data, prefer value objects

Should not use pass-by-reference. Return values instead.
// ✗ Wrong
public function modify(\DOMNode &$node): void {
    $node->nodeValue = 'changed';
}
// ✓ Correct
public function modify(\DOMNode $node): \DOMNode {
    $node->nodeValue = 'changed';
    return $node;
}
// ✗ Wrong in loops
foreach ($items as &$item) {
    $item->foo = 'bar';
}
// ✓ Correct
foreach ($items as $key => $item) {
    $item->foo = 'bar';
    $items[$key] = $item;
}

All classes and public methods need PHPDoc with @param, @return, @throws.
/**
 * Converts HTML string to DAST document.
 *
 * Parses HTML using DOMDocument and converts to DatoCMS
 * Structured Text format. Handles malformed HTML gracefully.
 *
 * @param string $html HTML string to convert
 * @param Options|null $options Optional conversion options
 *
 * @return array<string, mixed>|null DAST document or null if empty
 *
 * @throws ConversionError If conversion fails
 */
public function convert(string $html, ?Options $options = null): ?array {
    // ...
}

- Exceptions: Catch \Throwable not \Exception
- Namespace: Everything under DealNews\HtmlToStructuredText\
- Class references: Use Foo::class not string 'Foo'
- Line length: Should be ≤80 chars where reasonable
- File endings: Unix \n only, single newline at end
- Comments: Use // or /* */ (not #); explain why not how
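The rule about catching \Throwable rather than \Exception matters because PHP's engine errors extend \Error, which is outside the \Exception hierarchy. A minimal sketch using only built-in functions follows; the function name sketchDivide is hypothetical and exists only for this illustration.

```php
<?php
function sketchDivide(int $a, int $b): ?int {
    $result = null;
    try {
        $result = intdiv($a, $b);
    } catch (\Throwable $e) {
        // DivisionByZeroError extends \Error, so catch (\Exception)
        // would miss it; \Throwable covers both hierarchies
        $result = null;
    }
    return $result;
}

var_dump(sketchDivide(6, 3)); // int(2)
var_dump(sketchDivide(1, 0)); // NULL
```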
# All tests with coverage
./vendor/bin/phpunit --coverage-text
# Single test file
./vendor/bin/phpunit tests/ConverterTest.php
# Single test method
./vendor/bin/phpunit --filter testBasicParagraph tests/ConverterTest.php
# Target: 85%+ coverage (currently 86.17%)

php examples/basic.php # Simple HTML to DAST
php examples/custom_handlers.php # Custom handler override
php examples/preprocessing.php # DOM manipulation before conversion

composer install

All handlers follow this signature and pattern:
public static function handlerName(
    callable $create_node,
    \DOMNode $node,
    Context $context
): mixed {
    // 1. Clone context to avoid side effects
    $new_context = clone $context;
    $new_context->parent_node_type = 'paragraph';
    // 2. Process children or extract data
    $children = Visitor::visitChildren($create_node, $node, $new_context);
    // 3. Return DAST node using $create_node
    return $create_node('paragraph', ['children' => $children]);
}

$create_node($type, $props) automatically adds 'type' => $type to props.
Return types:
- Single DAST node (array with 'type' key)
- Array of DAST nodes
- null to skip the element
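The $create_node contract and the handler return types above can be tried out in isolation. The sketch below assumes a factory that simply merges 'type' into the props, and a simplified handler that omits the Context argument; both are illustrative stand-ins rather than the library's implementation, and sketchHeading is a hypothetical name.

```php
<?php
// Assumed factory shape: adds 'type' to the given props
$create_node = function (string $type, array $props = []): array {
    return ['type' => $type] + $props;
};

// Hypothetical handler for <h1>..<h6>; returns null to skip empties
function sketchHeading(callable $create_node, \DOMNode $node): ?array {
    $result = null;
    $text   = trim($node->textContent);
    if ($text !== '') {
        $result = $create_node('heading', [
            'level'    => (int) substr($node->nodeName, 1),
            'children' => [$create_node('span', ['value' => $text])],
        ]);
    }
    return $result;
}

$doc = new \DOMDocument();
$doc->loadHTML('<h2>Title</h2><h3> </h3>');

// Single DAST node for a heading with content
$heading = sketchHeading(
    $create_node,
    $doc->getElementsByTagName('h2')->item(0)
);
// null for a heading that contains only whitespace
$skipped = sketchHeading(
    $create_node,
    $doc->getElementsByTagName('h3')->item(0)
);
var_dump($heading, $skipped);
```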
Always clone Context before modification:
// ✓ Correct: Clone prevents side effects
public static function withMark(
    callable $create_node,
    \DOMNode $node,
    Context $context,
    string $mark
): mixed {
    $new_context = clone $context; // Clone first!
    $new_context->marks[] = $mark;
    return Visitor::visitChildren($create_node, $node, $new_context);
}
// ✗ Wrong: Modifying context directly
public static function withMark(...) {
    $context->marks[] = $mark; // Side effect!
    return Visitor::visitChildren($create_node, $node, $context);
}

The Wrapper class detects and wraps inline nodes:
// HTML: <div>Some <strong>text</strong> here</div>
// Produces inline nodes: [span, span, span]
// Wrapper detects and wraps in paragraph:
// [paragraph => [span, span, span]]
// Use Wrapper::wrap() when you have mixed content
$children = Visitor::visitChildren($create_node, $node, $context);
$wrapped = Wrapper::wrap($children); // Wraps inline runs

Marks accumulate as the DOM is traversed through nested inline elements:
// HTML: <strong><em>Bold and italic</em></strong>
//
// 1. <strong> handler adds 'strong' to context.marks
// 2. <em> handler (nested) adds 'emphasis' to context.marks
// 3. Text handler creates span with marks: ['strong', 'emphasis']
// In handlers:
$new_context = clone $context;
$new_context->marks[] = 'strong'; // Add mark
return Visitor::visitChildren($create_node, $node, $new_context);

Use Utils::ALLOWED_CHILDREN to validate structure:
$allowed = Utils::ALLOWED_CHILDREN['paragraph']; // ['inlineNodes']
foreach ($children as $child) {
    if (!Utils::isAllowedChild('paragraph', $child['type'], $allowed)) {
        // Child not allowed in paragraph
        // Wrapper::wrap() will fix this
    }
}

- One test class per source class: UtilsTest.php tests Utils.php
- Integration vs unit: Utils/Wrapper have unit tests; Converter has integration tests
- Method naming: testCamelCaseDescription()
- Coverage target: 85%+ line coverage
Unit tests (Utils, Wrapper):
- Method input/output correctness
- Edge cases (empty arrays, null values)
- Validation logic (allowed children, DAST node detection)
Integration tests (Converter):
- End-to-end HTML → DAST conversions
- Real HTML snippets (not mocked DOM)
- All element types (headings, lists, links, marks, code blocks)
- Custom options (handlers, preprocessing, allowed types)
public function testBasicParagraph(): void {
    $converter = new Converter();
    $result = $converter->convert('<p>Hello world</p>');
    $this->assertIsArray($result);
    $this->assertEquals('dast', $result['schema']);
    $root = $result['document'];
    $this->assertEquals('root', $root['type']);
    $this->assertCount(1, $root['children']);
    $paragraph = $root['children'][0];
    $this->assertEquals('paragraph', $paragraph['type']);
}

Problem: visitChildren() must distinguish single DAST nodes from arrays of nodes.
Symptom: If you use array_merge() blindly, DAST nodes (which are arrays) get merged incorrectly.
// ✗ Wrong: Merges node properties into parent array
$values = array_merge($values, $result);
// ✓ Correct: Check if $result is a DAST node (has 'type' key)
if (is_array($result)) {
    if (isset($result['type'])) {
        $values[] = $result; // Single node
    } else {
        $values = array_merge($values, $result); // Array of nodes
    }
}

Location: src/Visitor.php:79-87
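To see the failure mode concretely, the guarded merge can be run against sample data. This is a stand-alone sketch of the check described above, not a copy of Visitor.php; sketchCollect is a hypothetical name.

```php
<?php
function sketchCollect(array $results): array {
    $values = [];
    foreach ($results as $result) {
        if (is_array($result)) {
            if (isset($result['type'])) {
                $values[] = $result; // Single node
            } else {
                $values = array_merge($values, $result); // List of nodes
            }
        }
    }
    return $values;
}

$span_a = ['type' => 'span', 'value' => 'a'];
$span_b = ['type' => 'span', 'value' => 'b'];

// Blind merge: string keys collide, so $span_b overwrites $span_a
$blind = array_merge($span_a, $span_b);

// Guarded merge keeps each node intact; null results are dropped
$values = sketchCollect([$span_a, [$span_b, $span_a], null]);

var_dump($blind, count($values));
```

The blind merge collapses two nodes into one array with a single 'type' and 'value', while the guarded version yields three separate nodes.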
nodeType mismatch: After loadHTML(), PHP's DOMDocument reports nodeType = 13 (XML_HTML_DOCUMENT_NODE), not 9 (XML_DOCUMENT_NODE) as the DOM spec defines for documents.
// ✗ Wrong
if ($node->nodeType === XML_DOCUMENT_NODE) { ... }
// ✓ Correct
if ($node instanceof \DOMDocument) { ... }

XML Processing Instruction: The <?xml encoding="UTF-8"> prefix (see UTF-8 Handling) ends up in the DOM as a child node. Visitor returns null for unhandled node types, so it gets filtered out automatically.
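The node type reported for an HTML-loaded document can be checked directly with PHP's DOM extension. A short sketch using only built-in classes and constants:

```php
<?php
$doc = new \DOMDocument();
$doc->loadHTML('<p>Hello</p>');

// An HTML-loaded document reports the HTML-specific node type
$is_html_document = $doc->nodeType === XML_HTML_DOCUMENT_NODE;

// instanceof does not depend on which node type constant applies
$is_document = $doc instanceof \DOMDocument;

var_dump($doc->nodeType, $is_html_document, $is_document);
```

The instanceof check is the more robust choice because it holds regardless of how the document was loaded.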
UTF-8 Handling: Wrap HTML with XML encoding declaration for proper UTF-8:
$wrapped = '<?xml encoding="UTF-8">' . $html;
$doc->loadHTML($wrapped, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

If HTML converts to zero DAST nodes, convert() returns null (not empty array):
$result = $converter->convert('<script>ignored</script>');
// Returns: null (scripts are ignored)
$result = $converter->convert('<p>Text</p>');
// Returns: ['schema' => 'dast', 'document' => [...]]

DAST has strict structure rules. Inline nodes (span, link) cannot be direct children of root or blockquote; they need paragraph wrappers.
Wrapper handles this automatically:
// HTML: <div>Some <a href="#">link</a> text</div>
// Without Wrapper: root → [span, link, span] ← INVALID
// With Wrapper: root → [paragraph → [span, link, span]] ← VALID
// In handlers:
$children = Visitor::visitChildren($create_node, $node, $context);
if ($needs_wrapping) {
    $children = Wrapper::wrap($children); // Wraps inline runs
}

Handlers can return:
- DAST node (array with 'type'): Added as child
- Array of DAST nodes: All added as children
- null: Node skipped (e.g., <script>, <style>)
// Single node
return $create_node('heading', ['level' => 1, 'children' => [...]]);
// Multiple nodes (rare: usually for splitting content)
return [
    $create_node('paragraph', [...]),
    $create_node('paragraph', [...]),
];
// Skip node
return null; // Used for <script>, <style>, etc.

- Add handler method in src/Handlers.php:
/**
 * Handler for <custom> element
 *
 * @param callable $create_node Function to create DAST nodes
 * @param \DOMNode $node DOM node
 * @param Context $context Conversion context
 *
 * @return array<string, mixed>|null DAST node
 */
public static function custom(
    callable $create_node,
    \DOMNode $node,
    Context $context
): ?array {
    $new_context = clone $context;
    $children = Visitor::visitChildren($create_node, $node, $new_context);
    return $create_node('customBlock', ['children' => $children]);
}

- Register handler in src/Converter.php::buildHandlers():
protected function buildHandlers(): array {
    return [
        // ...
        'custom' => [Handlers::class, 'custom'],
    ];
}

- Add DAST type to allowed children in src/Utils.php::ALLOWED_CHILDREN:
public const ALLOWED_CHILDREN = [
    'root' => ['heading', 'paragraph', 'customBlock', /* ... */],
    'customBlock' => ['inlineNodes'],
];

- Write tests in tests/ConverterTest.php:
public function testCustomElement(): void {
    $converter = new Converter();
    $result = $converter->convert('<custom>Content</custom>');
    $custom = $result['document']['children'][0];
    $this->assertEquals('customBlock', $custom['type']);
}

- Add mark constant in src/Utils.php::INLINE_NODE_TYPES or create MARK_TYPES constant
- Add handler for HTML element in src/Handlers.php:
public static function customMark(
    callable $create_node,
    \DOMNode $node,
    Context $context
): mixed {
    return self::withMark($create_node, $node, $context, 'customMark');
}

- Register handler in Converter::buildHandlers():
'customtag' => [Handlers::class, 'customMark'],

- Update Options defaults if needed in src/Options.php:
public array $allowed_marks = [
    'strong', 'emphasis', 'customMark', /* ... */
];

- Write failing test that reproduces the bug
- Identify the component (Visitor, Handlers, Wrapper, Utils)
- Make minimal fix following coding standards
- Verify test passes and coverage doesn't drop
- Run full suite to check for regressions
# 1. Write test
vim tests/ConverterTest.php # Add testNewFeature()
# 2. Run test (should fail)
./vendor/bin/phpunit --filter testNewFeature
# 3. Implement feature
vim src/Handlers.php # Add handler
# 4. Verify test passes
./vendor/bin/phpunit --filter testNewFeature
# 5. Check coverage
./vendor/bin/phpunit --coverage-text
# 6. Run full suite
./vendor/bin/phpunit

- Converter.php - Entry point, orchestration
- Visitor.php - DOM tree traversal
- Handlers.php - Element-to-DAST conversion (27 methods)
- Wrapper.php - Structure normalization (wrapping inline nodes)
- Utils.php - DAST validation rules and helpers
- Context.php - Conversion state (immutable)
- GlobalContext.php - Document-level state (mutable)
- Options.php - Configuration value object
- ConversionError.php - Exception class

- Adding element support? → Handlers.php + Converter::buildHandlers()
- Fixing structure issues? → Wrapper.php
- Changing validation rules? → Utils.php::ALLOWED_CHILDREN
- Adding configuration? → Options.php
- Traversal logic? → Visitor.php (rarely needs changes)
- Pre/post processing? → Options::$preprocess (for preprocessing) or add to Converter (for post-processing)
Utils::INLINE_NODE_TYPES // ['span', 'link', ...]
Utils::ALLOWED_CHILDREN // Parent-child validation map

Converter::convert(string $html): ?array
Visitor::visitNode(callable $create_node, \DOMNode $node, Context $context): mixed
Wrapper::wrap(array $nodes): array
Utils::isAllowedChild(string $parent, string $child, array $allowed): bool
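As an illustration of the Wrapper::wrap(array $nodes): array signature listed above, the following sketch groups consecutive inline nodes into paragraphs. It is an assumption about the general technique, not the library's implementation: sketchWrap and SKETCH_INLINE_TYPES are hypothetical, and the real inline type list lives in Utils::INLINE_NODE_TYPES.

```php
<?php
// Assumed inline types for this sketch only
const SKETCH_INLINE_TYPES = ['span', 'link'];

function sketchWrap(array $nodes): array {
    $wrapped = [];
    $run     = [];
    foreach ($nodes as $node) {
        if (in_array($node['type'], SKETCH_INLINE_TYPES, true)) {
            $run[] = $node; // Extend the current inline run
        } else {
            if ($run !== []) {
                // Close the run before the block node
                $wrapped[] = ['type' => 'paragraph', 'children' => $run];
                $run = [];
            }
            $wrapped[] = $node; // Block nodes pass through unchanged
        }
    }
    if ($run !== []) {
        $wrapped[] = ['type' => 'paragraph', 'children' => $run];
    }
    return $wrapped;
}

$nodes = [
    ['type' => 'span', 'value' => 'Some '],
    ['type' => 'link', 'url' => '#', 'children' => []],
    ['type' => 'heading', 'level' => 1, 'children' => []],
    ['type' => 'span', 'value' => 'tail'],
];
$wrapped = sketchWrap($nodes);
print_r(array_column($wrapped, 'type'));
```

Grouping runs rather than wrapping each node separately keeps adjacent text and links in the same paragraph, which matches the [paragraph → [span, link, span]] shape described earlier in this guide.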