Adds re-entrancy semantics to the importer API to enable pausing and
resuming data imports:
```php
$wxr_path = __DIR__ . '/tests/fixtures/wxr-simple.xml';
$importer = WP_Stream_Importer::create_for_wxr_file( $wxr_path );
// Do some work
for ( $i = 0; $i < 10; $i++ ) {
    $importer->next_step();
}
// Save our progress
$cursor = $importer->get_reentrancy_cursor();
// Continue where we left off later on
$new_importer = WP_Stream_Importer::create_for_wxr_file( $wxr_path, [], $cursor );
$new_importer->next_step();
```
## Motivation
Most WordPress importers fail because they assume a happy path: we have
enough memory, we have enough time, all the assets will be available,
and so on.
In Data Liberation, I want to assume the worst possible path through
thorny quicksand in full sun with venomous wasps stinging us. We'll run
out of memory after the first post, all the assets will be 40GB large,
and half of them won't be possible to download.
Pausing, resuming, and recovering from errors should be a basic
primitive of the system. The first step to supporting that is the
ability to suspend the import operation and restart it from the same
spot later on. And that's exactly what this PR adds.
## Re-entrancy interface
This PR doesn't store any information in the database yet. It merely
adds the plumbing for pausing and resuming the `WP_Stream_Importer`
instance.
### WP_Byte_Stream re-entrancy
The `WP_Byte_Stream` interface directly exposes the `tell(): int` and
`seek($offset)` methods. There's no need for anything fancier than that
– we're only interested in an offset in the stream. It seems to work
well for simple byte streams.
My only worry is we may need to revisit this interface later on to
support fetching fixed-size chunks from large files using byte ranges.
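
To make the offset-based idea concrete, here is a minimal sketch built on PHP's native `ftell()` and `fseek()`. `Sketch_File_Byte_Stream` and its `read()` method are hypothetical stand-ins invented for this illustration; they are not the actual `WP_Byte_Stream` implementation, which may differ.

```php
<?php
// Hypothetical stand-in for a seekable byte stream. NOT the real WP_Byte_Stream.
final class Sketch_File_Byte_Stream {
    /** @var resource */
    private $handle;

    public function __construct( string $path ) {
        $handle = fopen( $path, 'rb' );
        if ( false === $handle ) {
            throw new RuntimeException( "Cannot open {$path}" );
        }
        $this->handle = $handle;
    }

    // Reads up to $length bytes from the current position.
    public function read( int $length ): string {
        $bytes = fread( $this->handle, $length );
        return false === $bytes ? '' : $bytes;
    }

    // The only state we need to save in order to resume later.
    public function tell(): int {
        return ftell( $this->handle );
    }

    public function seek( int $offset ): void {
        if ( 0 !== fseek( $this->handle, $offset, SEEK_SET ) ) {
            throw new RuntimeException( "Cannot seek to {$offset}" );
        }
    }
}

$path = tempnam( sys_get_temp_dir(), 'sketch' );
file_put_contents( $path, 'abcdefghij' );

// First run: consume a few bytes, then remember where we stopped.
$first  = new Sketch_File_Byte_Stream( $path );
$head   = $first->read( 4 );
$offset = $first->tell();

// Later run: a brand new stream object picks up at the saved offset.
$second = new Sketch_File_Byte_Stream( $path );
$second->seek( $offset );
$tail = $second->read( 6 );

unlink( $path );
```

The saved state is a single integer, which is why a plain offset seems sufficient for simple byte streams.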
### WP_XML_Processor re-entrancy
`WP_XML_Processor` supports pausing and resuming via:
* A `get_reentrancy_cursor()` method that exports the parser state.
* A static `create($xml, $options, $cursor=null)` that resumes from a
previously exported cursor.
* A `get_token_byte_offset_in_the_input_stream()` method that tells the
caller where to seek the input stream before resuming.
No method in the XML processor API will ever accept the cursor or the
byte offset as a way of moving to another location in the document. You
can only create a new XML processor at `$cursor`.
This is a measure to:
* Discourage using the byte offsets for manual string operations on the
XML document. It's a footgun and most API consumers who would try that
would just introduce bugs into their codebase.
* Make it impossible to misuse the re-entrancy API for `seek()`-ing. We
already have named bookmarks for that.
Usage:
```php
$xml = WP_XML_Processor::create_from_string( $xml_bytes );
for ( $i = 0; $i < 10; $i++ ) {
    $xml->next_step();
}
$cursor = $xml->get_reentrancy_cursor();
$unparsed_xml = substr(
    $xml_bytes,
    $xml->get_token_byte_offset_in_the_input_stream()
);
$xml2 = WP_XML_Processor::create_from_string( $unparsed_xml, $cursor );
$xml2->next_step();
```
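
The toy reader below sketches why the cursor travels together with the byte offset: the unparsed remainder of the document no longer contains the opening tags, so the parser context has to come from somewhere else. `Sketch_Tag_Reader`, `next_tag()`, `get_breadcrumbs()`, and `get_byte_offset()` are hypothetical names, and the JSON cursor holding a stack of open elements is an assumption made for illustration only. It says nothing about what the real `WP_XML_Processor` cursor contains.

```php
<?php
// Toy tag reader illustrating factory-only resumption. NOT WP_XML_Processor.
final class Sketch_Tag_Reader {
    private $input;
    private $offset        = 0;       // Bytes consumed within $input.
    private $open_elements = array(); // Stack of currently open tag names.

    private function __construct( string $input ) {
        $this->input = $input;
    }

    // The only way to apply a cursor: create a new reader with it.
    public static function create_from_string( string $input, ?string $cursor = null ): self {
        $reader = new self( $input );
        if ( null !== $cursor ) {
            $state                 = json_decode( $cursor, true );
            $reader->open_elements = $state['open_elements'];
        }
        return $reader;
    }

    // Advances to the next opening or closing tag and updates the stack.
    public function next_tag(): bool {
        $found = preg_match( '~<(/?)([a-z]+)>~', $this->input, $m, PREG_OFFSET_CAPTURE, $this->offset );
        if ( 1 !== $found ) {
            return false;
        }
        $this->offset = $m[0][1] + strlen( $m[0][0] );
        if ( '/' === $m[1][0] ) {
            array_pop( $this->open_elements );
        } else {
            $this->open_elements[] = $m[2][0];
        }
        return true;
    }

    public function get_breadcrumbs(): array {
        return $this->open_elements;
    }

    public function get_byte_offset(): int {
        return $this->offset;
    }

    // Assumption: the cursor is an opaque JSON string.
    public function get_reentrancy_cursor(): string {
        return json_encode( array( 'open_elements' => $this->open_elements ) );
    }
}

$xml    = '<rss><channel><item></item><item></item></channel></rss>';
$reader = Sketch_Tag_Reader::create_from_string( $xml );
$reader->next_tag(); // <rss>
$reader->next_tag(); // <channel>
$reader->next_tag(); // <item>

$cursor   = $reader->get_reentrancy_cursor();
$unparsed = substr( $xml, $reader->get_byte_offset() );

// The new reader only sees the remaining bytes; the cursor restores context.
$resumed = Sketch_Tag_Reader::create_from_string( $unparsed, $cursor );
$resumed->next_tag(); // </item>
$breadcrumbs = $resumed->get_breadcrumbs();
```

Note that the toy class has no method that accepts a cursor after construction, mirroring the restriction described above.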
### WP_WXR_Reader re-entrancy
The `WP_WXR_Reader` class uses the same `get_reentrancy_cursor()`
interface as `WP_XML_Processor`.
### WP_Stream_Importer re-entrancy
The `WP_Stream_Importer` class uses the same `get_reentrancy_cursor()`
interface as `WP_XML_Processor`. See the example at the top of this
description.
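
Since this PR leaves storage to the caller, here is a rough sketch of how a caller might process an import in small batches and keep the cursor between runs. Everything in it is hypothetical: `Sketch_Importer` is a stand-in and not `WP_Stream_Importer`, `sketch_run_batch()` is a made-up helper, the cursor is assumed to be a string, and a temporary file stands in for whatever storage is eventually chosen.

```php
<?php
// Hypothetical stand-in for a re-entrant importer. NOT WP_Stream_Importer.
final class Sketch_Importer {
    private $records;
    private $position;
    public $imported = array();

    private function __construct( array $records, int $position ) {
        $this->records  = $records;
        $this->position = $position;
    }

    // Starts from the beginning, or from a previously saved cursor.
    public static function create( array $records, ?string $cursor = null ): self {
        $position = null === $cursor ? 0 : (int) json_decode( $cursor, true )['position'];
        return new self( $records, $position );
    }

    // Imports one record; returns false once there is nothing left.
    public function next_step(): bool {
        if ( $this->position >= count( $this->records ) ) {
            return false;
        }
        $this->imported[] = $this->records[ $this->position++ ];
        return true;
    }

    // Assumption: the cursor is an opaque string the caller can store anywhere.
    public function get_reentrancy_cursor(): string {
        return json_encode( array( 'position' => $this->position ) );
    }
}

// Runs at most $max_steps steps, resuming from the cursor in $cursor_file if present.
function sketch_run_batch( array $records, string $cursor_file, int $max_steps ): array {
    $cursor   = is_file( $cursor_file ) ? file_get_contents( $cursor_file ) : null;
    $importer = Sketch_Importer::create( $records, $cursor );
    for ( $i = 0; $i < $max_steps && $importer->next_step(); $i++ ) {
        // Each iteration performs one unit of work.
    }
    file_put_contents( $cursor_file, $importer->get_reentrancy_cursor() );
    return $importer->imported;
}

$records     = array( 'post-1', 'post-2', 'post-3', 'post-4', 'post-5' );
$cursor_file = sys_get_temp_dir() . '/sketch-import-cursor-' . uniqid() . '.json';

// Two separate "requests", each doing a bounded amount of work.
$first_batch  = sketch_run_batch( $records, $cursor_file, 2 );
$second_batch = sketch_run_batch( $records, $cursor_file, 2 );

unlink( $cursor_file );
```

A step budget is used here instead of a time or memory budget only to keep the sketch deterministic.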
## Testing instructions
TBD. We don't have a good way of running PHPUnit in the WordPress
context yet. @zaerl is working on running imports in the CLI; we may need
to wait for that before adding tests to this PR and shipping it.