Commit e4ffd2d

[Data Liberation] Re-entrant WP_Stream_Importer (#2004)
Adds re-entrancy semantics to the importer API to enable pausing and resuming data imports:

```php
$wxr_path = __DIR__ . '/tests/fixtures/wxr-simple.xml';
$importer = WP_Stream_Importer::create_for_wxr_file( $wxr_path );

// Do some work
for ( $i = 0; $i < 10; $i++ ) {
	$importer->next_step();
}

// Save our progress
$cursor = $importer->get_reentrancy_cursor();

// Continue where we left off later on
$new_importer = WP_Stream_Importer::create_for_wxr_file( $wxr_path, [], $cursor );
$new_importer->next_step();
```

## Motivation

Most WordPress importers fail because they assume a happy path: we have enough memory, we have enough time, all the assets will be available, and so on.

In Data Liberation, I want to assume the worst possible path through thorny quicksand in full sun with venomous wasps stinging us. We'll run out of memory after the first post, all the assets will be 40GB large, and half of them won't be possible to download.

Pausing, resuming, and recovering from errors should be a basic primitive of the system. The first step to supporting that is the ability to suspend the import operation and restart it from the same spot later on. And that's exactly what this PR adds.

## Re-entrancy interface

This PR doesn't store any information in the database yet. It merely adds the plumbing for pausing and resuming the `WP_Stream_Importer` instance.

### WP_Byte_Stream re-entrancy

The `WP_Byte_Stream` interface directly exposes the `tell(): int` and `seek( $offset )` methods. There's no need for anything fancier than that: we're only interested in an offset in the stream.

It seems to work well for simple byte streams. My only worry is that we may need to revisit this interface later on to support fetching fixed-size chunks from large files using byte ranges.

### WP_XML_Processor re-entrancy

`WP_XML_Processor` supports exporting state via:

* A `get_reentrancy_cursor()` method
* Resuming via a static `create($xml, $options, $cursor=null)`.
* Seeking the input stream to the correct location via `get_token_byte_offset_in_the_input_stream()`

No method in the XML processor API will ever accept the cursor or the byte offset as a way of moving to another location in the document. You can only create a new XML processor at `$cursor`. This is a measure to:

* Discourage using the byte offsets for manual string operations on the XML document. It's a footgun, and most API consumers who tried that would just introduce bugs into their codebase.
* Make it impossible to misuse the re-entrancy API for `seek()`-ing. We already have named bookmarks for that.

Usage:

```php
$xml = WP_XML_Processor::create_from_string( $xml_bytes );
for ( $i = 0; $i < 10; $i++ ) {
	$xml->next_step();
}

$cursor       = $xml->get_reentrancy_cursor();
$unparsed_xml = substr( $xml_bytes, $xml->get_token_byte_offset_in_the_input_stream() );

$xml2 = WP_XML_Processor::create_from_string( $unparsed_xml, $cursor );
$xml2->next_step();
```

### WP_WXR_Reader re-entrancy

The `WP_WXR_Reader` class uses the same `get_reentrancy_cursor()` interface as `WP_XML_Processor`.

### WP_Stream_Importer re-entrancy

The `WP_Stream_Importer` class uses the same `get_reentrancy_cursor()` interface as `WP_XML_Processor`. See the example at the top of this description.

## Testing instructions

TBD. We don't yet have a good way of running PHPUnit in the WordPress context. @zaerl is working on running the import in the CLI; we may need to wait for that before adding tests to this PR and shipping it.
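The cursor contract described for `WP_XML_Processor` can be illustrated with a small, self-contained sketch. `Toy_Line_Processor`, its methods, and the JSON cursor format are all hypothetical and are not part of this PR; the sketch only assumes the same shape as the usage above: export a cursor, slice the unparsed input at the reported byte offset, and create a fresh instance from both.

```php
<?php
/**
 * Hypothetical toy processor, not part of this PR. It yields one line of
 * input per step and can export a cursor describing where it stopped.
 */
final class Toy_Line_Processor {
	private $bytes;
	private $offset      = 0; // Next unread byte within $bytes.
	private $base_offset = 0; // Bytes consumed before $bytes started.
	private $lines_seen  = 0;
	private $line        = null;

	private function __construct( string $bytes ) {
		$this->bytes = $bytes;
	}

	// The only way to start at a cursor is to create a new instance.
	public static function create_from_string( string $bytes, ?string $cursor = null ): self {
		$processor = new self( $bytes );
		if ( null !== $cursor ) {
			$state                  = json_decode( $cursor, true );
			$processor->base_offset = $state['offset'];
			$processor->lines_seen  = $state['lines_seen'];
		}
		return $processor;
	}

	public function next_step(): bool {
		$length = strlen( $this->bytes );
		if ( $this->offset >= $length ) {
			return false;
		}
		$newline_at   = strpos( $this->bytes, "\n", $this->offset );
		$end          = false === $newline_at ? $length : $newline_at;
		$this->line   = substr( $this->bytes, $this->offset, $end - $this->offset );
		$this->offset = false === $newline_at ? $length : $newline_at + 1;
		++$this->lines_seen;
		return true;
	}

	public function get_line(): ?string {
		return $this->line;
	}

	public function get_line_number(): int {
		return $this->lines_seen;
	}

	// Offset relative to the start of the original, unsliced input.
	public function get_byte_offset_in_the_input_stream(): int {
		return $this->base_offset + $this->offset;
	}

	public function get_reentrancy_cursor(): string {
		return json_encode(
			array(
				'offset'     => $this->get_byte_offset_in_the_input_stream(),
				'lines_seen' => $this->lines_seen,
			)
		);
	}
}

$input = "alpha\nbeta\ngamma\ndelta";
$first = Toy_Line_Processor::create_from_string( $input );
$first->next_step();
$first->next_step();

// Pause: keep the cursor and only the bytes that were not consumed yet.
$cursor   = $first->get_reentrancy_cursor();
$unparsed = substr( $input, $first->get_byte_offset_in_the_input_stream() );

// Resume in a brand new instance.
$second = Toy_Line_Processor::create_from_string( $unparsed, $cursor );
$second->next_step();
echo $second->get_line(), ' is line #', $second->get_line_number(), "\n"; // gamma is line #3
```

Note that the toy class has no public method that accepts an offset, which mirrors the stated goal of keeping the cursor from turning into a general-purpose `seek()`.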
1 parent 7e9a1ac commit e4ffd2d

14 files changed, with 695 additions and 370 deletions.

packages/playground/data-liberation/plugin.php

Lines changed: 56 additions & 50 deletions
@@ -8,13 +8,13 @@

 /**
  * Don't run KSES on the attribute values during the import.
- *
+ *
  * Without this filter, WP_HTML_Tag_Processor::set_attribute() will
  * assume the value is a URL and run KSES on it, which will incorrectly
  * prefix relative paths with http://.
- *
+ *
  * For example:
- *
+ *
  * > $html = new WP_HTML_Tag_Processor( '<img>' );
  * > $html->next_tag();
  * > $html->set_attribute( 'src', './_assets/log-errors.png' );
@@ -25,6 +25,41 @@
     return [];
 });

+/**
+ * Development debug code to run the import manually.
+ * @TODO: Remove this in favor of a CLI command.
+ */
+add_action('init', function() {
+    return;
+    $wxr_path = __DIR__ . '/tests/fixtures/wxr-simple.xml';
+    $importer = WP_Stream_Importer::create_for_wxr_file(
+        $wxr_path
+    );
+    while($importer->next_step()) {
+        // ...
+    }
+    return;
+    $importer->next_step();
+    $paused_importer_state = $importer->get_reentrancy_cursor();
+
+    echo "\n\n";
+    echo "moving to importer2\n";
+    echo "\n\n";
+
+    $importer2 = WP_Stream_Importer::create_for_wxr_file(
+        $wxr_path,
+        array(),
+        $paused_importer_state
+    );
+    $importer2->next_step();
+    $importer2->next_step();
+    $importer2->next_step();
+    // $importer2->next_step();
+    // var_dump($importer2);
+
+    die("YAY");
+});
+
 // Register admin menu
 add_action('admin_menu', function() {
     add_menu_page(
@@ -86,7 +121,7 @@ function data_liberation_admin_page() {
         data_liberation_process_import();
         echo '</pre>';
     }
-
+
     ?>
     <h2>Active import</h2>
     <?php
@@ -148,9 +183,9 @@ function data_liberation_admin_page() {
     >
         <?php wp_nonce_field('data_liberation_import'); ?>
         <input type="hidden" name="action" value="data_liberation_import">
-
+
         <h2>Import Content</h2>
-
+
         <table class="form-table">
             <tr>
                 <th scope="row">Import Type</th>
@@ -175,7 +210,7 @@ function data_liberation_admin_page() {
                     </label>
                 </td>
             </tr>
-
+
             <tr data-wp-context='{ "importType": "wxr_file" }'
                 data-wp-class--hidden="!state.isImportTypeSelected">
                 <th scope="row">WXR File</th>
@@ -184,7 +219,7 @@ function data_liberation_admin_page() {
                     <p class="description">Upload a WordPress eXtended RSS (WXR) file</p>
                 </td>
             </tr>
-
+
             <tr data-wp-context='{ "importType": "wxr_url" }'
                 data-wp-class--hidden="!state.isImportTypeSelected">
                 <th scope="row">WXR URL</th>
@@ -193,7 +228,7 @@ function data_liberation_admin_page() {
                     <p class="description">Enter the URL of a WXR file</p>
                 </td>
             </tr>
-
+
             <tr data-wp-context='{ "importType": "markdown_zip" }'
                 data-wp-class--hidden="!state.isImportTypeSelected">
                 <th scope="row">Markdown ZIP</th>
@@ -210,7 +245,7 @@ function data_liberation_admin_page() {
     <h2>Previous Imports</h2>

     <p>TODO: Show a table of previous imports.</p>
-
+
     <table class="form-table">
         <tr>
             <th scope="row">Date</th>
@@ -329,7 +364,7 @@ function data_liberation_admin_page() {
      */
     // if(is_wp_error(wp_schedule_event(time(), 'data_liberation_minute', 'data_liberation_process_import'))) {
     //     wp_delete_attachment($attachment_id, true);
-    //     // @TODO: More user friendly error message – maybe redirect back to the import screen and
+    //     // @TODO: More user friendly error message – maybe redirect back to the import screen and
     //     //        show the error there.
     //     wp_die('Failed to schedule import – the "data_liberation_minute" schedule may not be registered.');
     // }
@@ -353,20 +388,9 @@ function data_liberation_process_import() {

 function data_liberation_import_step($import) {
     $importer = data_liberation_create_importer($import);
-    // @TODO: Save the last importer state so we can resume it later if interrupted.
-    update_option('data_liberation_import_progress', [
-        'status' => 'Downloading static assets...',
-        'current' => 0,
-        'total' => 0
-    ]);
-    $importer->frontload_assets();
-    // @TODO: Keep track of multiple progress dimensions – posts, assets, categories, etc.
-    update_option('data_liberation_import_progress', [
-        'status' => 'Importing posts...',
-        'current' => 0,
-        'total' => 0
-    ]);
-    $importer->import_entities();
+    while($importer->next_step()) {
+        // ...Twiddle our thumbs...
+    }
     delete_option('data_liberation_active_import');
     // @TODO: Do not echo things. Append to an import log where we can retrace the steps.
     //        Also, store specific import events in the database so the user can react and
@@ -382,25 +406,13 @@ function data_liberation_create_importer($import) {
                 // @TODO: Save the error, report it to the user.
                 return;
             }
-            $entity_iterator_factory = function() use ($wxr_path) {
-                $wxr = new WP_WXR_Reader();
-                $wxr->connect_upstream(new WP_File_Reader($wxr_path));
-
-                return $wxr;
-            };
-            return WP_Stream_Importer::create(
-                $entity_iterator_factory
+            return WP_Stream_Importer::create_for_wxr_file(
+                $wxr_path
             );

         case 'wxr_url':
-            $wxr_url = $import['wxr_url'];
-            $entity_iterator_factory = function() use ($wxr_url) {
-                $wxr = new WP_WXR_Reader();
-                $wxr->connect_upstream(new WP_Remote_File_Reader($wxr_url));
-                return $wxr;
-            };
-            return WP_Stream_Importer::create(
-                $entity_iterator_factory
+            return WP_Stream_Importer::create_for_wxr_url(
+                $import['wxr_url']
             );

         case 'markdown_zip':
@@ -419,18 +431,12 @@ function data_liberation_create_importer($import) {
                 }
             }
             $markdown_root = $temp_dir;
-            $entity_iterator_factory = function() use ($markdown_root) {
-                return new WP_Markdown_Directory_Tree_Reader(
-                    $markdown_root,
-                    1000
-                );
-            };
-            return WP_Markdown_Importer::create(
-                $entity_iterator_factory, [
+            return WP_Markdown_Importer::create_for_markdown_directory(
+                $markdown_root, [
                     'source_site_url' => 'file://' . $markdown_root,
                     'local_markdown_assets_root' => $markdown_root,
                     'local_markdown_assets_url_prefix' => '@site/',
                 ]
             );
     }
-}
+}
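In this commit, `data_liberation_import_step()` still runs `next_step()` until the import is done, and the removed `@TODO` about saving the importer state has no replacement yet. One way the new cursor could eventually be used to split the work into bounded slices is sketched below. `Stub_Importer` and `run_import_slice()` are hypothetical stand-ins written for this example; the real importer's cursor format and step semantics may well differ.

```php
<?php
// Hypothetical stand-in for an importer: it only counts steps.
final class Stub_Importer {
	private $position;
	private $total_steps;

	public function __construct( int $total_steps, ?string $cursor = null ) {
		$this->total_steps = $total_steps;
		// Assumption: the cursor is an opaque string; here it encodes the position.
		$this->position = null === $cursor ? 0 : (int) $cursor;
	}

	public function next_step(): bool {
		if ( $this->position >= $this->total_steps ) {
			return false;
		}
		++$this->position;
		return true;
	}

	public function get_reentrancy_cursor(): string {
		return (string) $this->position;
	}
}

/**
 * Runs at most $max_steps steps. Returns a cursor to resume from,
 * or null once the importer reports that no steps are left.
 */
function run_import_slice( Stub_Importer $importer, int $max_steps ): ?string {
	for ( $i = 0; $i < $max_steps; $i++ ) {
		if ( ! $importer->next_step() ) {
			return null;
		}
	}
	return $importer->get_reentrancy_cursor();
}

// Each loop iteration stands for a separate request or cron run.
$cursor = null;
$slices = 0;
do {
	$importer = new Stub_Importer( 25, $cursor );
	$cursor   = run_import_slice( $importer, 10 );
	++$slices;
} while ( null !== $cursor );

echo "Finished in {$slices} slices\n"; // Finished in 3 slices
```

In a real plugin the cursor would have to be persisted between slices, for example in an option or a database row, which this PR explicitly leaves for later.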

packages/playground/data-liberation/src/byte-readers/WP_Byte_Reader.php

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
 <?php

 interface WP_Byte_Reader {
-    public function pause(): array|bool;
-    public function resume( $paused_state ): bool;
+    public function tell(): int;
+    public function seek( int $offset ): bool;
     public function is_finished(): bool;
     public function next_bytes(): bool;
     public function get_bytes(): string|null;

packages/playground/data-liberation/src/byte-readers/WP_File_Reader.php

Lines changed: 19 additions & 14 deletions
@@ -9,7 +9,8 @@ class WP_File_Reader implements WP_Byte_Reader {
     protected $chunk_size;
     protected $file_pointer;
     protected $offset_in_file;
-    protected $output_bytes;
+    protected $output_bytes = '';
+    protected $last_chunk_size = 0;
     protected $last_error;
     protected $state = self::STATE_STREAMING;

@@ -18,22 +19,24 @@ public function __construct( $file_path, $chunk_size = 8096 ) {
         $this->chunk_size = $chunk_size;
     }

-    /**
-     * Really these are just `tell()` and `seek()` operations, only the state is more
-     * involved than a simple offset. Hmm.
-     */
-    public function pause(): array|bool {
-        return array(
-            'offset_in_file' => $this->offset_in_file,
-        );
+    public function tell(): int {
+        // Save the previous offset, not the current one.
+        // This way, after resuming, the next read will yield the same $output_bytes
+        // as we have now.
+        return $this->offset_in_file - $this->last_chunk_size;
     }

-    public function resume( $paused_state ): bool {
+    public function seek( $offset_in_file ): bool {
+        if ( ! is_int( $offset_in_file ) ) {
+            _doing_it_wrong( __METHOD__, 'Cannot set a file reader cursor to a non-integer offset.', '1.0.0' );
+            return false;
+        }
         if ( $this->file_pointer ) {
-            _doing_it_wrong( __METHOD__, 'Cannot resume a file reader that is already initialized.', '1.0.0' );
+            _doing_it_wrong( __METHOD__, 'Cannot set a file reader cursor on a file reader that is already initialized.', '1.0.0' );
             return false;
         }
-        $this->offset_in_file = $paused_state['offset_in_file'];
+        $this->offset_in_file = $offset_in_file;
+        $this->last_chunk_size = 0;
         return true;
     }

@@ -50,7 +53,8 @@ public function get_last_error(): string|null {
     }

     public function next_bytes(): bool {
-        $this->output_bytes = '';
+        $this->output_bytes = '';
+        $this->last_chunk_size = 0;
         if ( $this->last_error || $this->is_finished() ) {
             return false;
         }
@@ -66,7 +70,8 @@ public function next_bytes(): bool {
             $this->state = static::STATE_FINISHED;
             return false;
         }
-        $this->offset_in_file += strlen( $bytes );
+        $this->last_chunk_size = strlen( $bytes );
+        $this->offset_in_file += $this->last_chunk_size;
         $this->output_bytes .= $bytes;
         return true;
     }
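The `tell()` comment above is easy to misread: the reported offset points at the start of the chunk currently held in `$output_bytes`, not at the end of it. A minimal sketch of that bookkeeping follows. `Sketch_File_Reader` is a hypothetical, simplified class written for this example; unlike `WP_File_Reader` it reopens the file on every read and has no error state.

```php
<?php
// Hypothetical, simplified reader that mirrors the offset bookkeeping in the diff.
final class Sketch_File_Reader {
	private $path;
	private $chunk_size;
	private $offset_in_file  = 0;
	private $last_chunk_size = 0;
	private $output_bytes    = '';

	public function __construct( string $path, int $chunk_size ) {
		$this->path       = $path;
		$this->chunk_size = $chunk_size;
	}

	// Offset of the first byte of the chunk returned by the last read.
	public function tell(): int {
		return $this->offset_in_file - $this->last_chunk_size;
	}

	public function seek( int $offset_in_file ): void {
		$this->offset_in_file  = $offset_in_file;
		$this->last_chunk_size = 0;
	}

	public function next_bytes(): bool {
		$this->output_bytes    = '';
		$this->last_chunk_size = 0;

		// Reopening per read keeps the sketch short; a real reader keeps the handle.
		$handle = fopen( $this->path, 'rb' );
		if ( false === $handle ) {
			return false;
		}
		fseek( $handle, $this->offset_in_file );
		$bytes = fread( $handle, $this->chunk_size );
		fclose( $handle );

		if ( false === $bytes || '' === $bytes ) {
			return false;
		}
		$this->last_chunk_size = strlen( $bytes );
		$this->offset_in_file += $this->last_chunk_size;
		$this->output_bytes    = $bytes;
		return true;
	}

	public function get_bytes(): string {
		return $this->output_bytes;
	}
}

$path = tempnam( sys_get_temp_dir(), 'reader' );
file_put_contents( $path, 'abcdefghij' );

$reader = new Sketch_File_Reader( $path, 4 );
$reader->next_bytes(); // Holds "abcd".
$reader->next_bytes(); // Holds "efgh".
$saved_offset = $reader->tell(); // 4, the start of "efgh".

// A new reader seeked to the saved offset re-reads the same chunk.
$resumed = new Sketch_File_Reader( $path, 4 );
$resumed->seek( $saved_offset );
$resumed->next_bytes();
echo $resumed->get_bytes(), "\n"; // efgh

unlink( $path );
```

Returning the start of the current chunk means a consumer that paused halfway through processing `$output_bytes` gets those bytes again after resuming, at the cost of re-reading one chunk.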

packages/playground/data-liberation/src/byte-readers/WP_Remote_File_Reader.php

Lines changed: 13 additions & 15 deletions
@@ -22,6 +22,19 @@ public function __construct( $url ) {
         $this->url = $url;
     }

+    public function tell(): int {
+        return $this->bytes_already_read + $this->skip_bytes;
+    }
+
+    public function seek( $offset_in_file ): bool {
+        if ( $this->request ) {
+            _doing_it_wrong( __METHOD__, 'Cannot set a remote file reader cursor on a remote file reader that is already initialized.', '1.0.0' );
+            return false;
+        }
+        $this->skip_bytes = $offset_in_file;
+        return true;
+    }
+
     public function next_bytes(): bool {
         if ( null === $this->request ) {
             $this->request = new WordPress\AsyncHttp\Request(
@@ -90,21 +103,6 @@ public function get_bytes(): string|null {
         return $this->current_chunk;
     }

-    public function pause(): array|bool {
-        return array(
-            'offset_in_file' => $this->bytes_already_read + $this->skip_bytes,
-        );
-    }
-
-    public function resume( $paused_state ): bool {
-        if ( $this->request ) {
-            _doing_it_wrong( __METHOD__, 'Cannot resume a remote file reader that is already initialized.', '1.0.0' );
-            return false;
-        }
-        $this->skip_bytes = $paused_state['offset_in_file'];
-        return true;
-    }
-
     public function is_finished(): bool {
         return $this->is_finished;
     }
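The hunks above only show `seek()` storing the offset in `$skip_bytes`; how `WP_Remote_File_Reader` consumes that value is not part of this diff. As an illustration only, one plausible approach (an assumption, not the committed implementation) is to discard the first `$skip_bytes` bytes of the response as chunks arrive. `skip_leading_bytes()` below is a hypothetical helper written for this sketch.

```php
<?php
/**
 * Hypothetical helper: yields the incoming chunks with the first
 * $skip_bytes bytes of the overall stream removed.
 */
function skip_leading_bytes( iterable $chunks, int $skip_bytes ): Generator {
	foreach ( $chunks as $chunk ) {
		if ( $skip_bytes >= strlen( $chunk ) ) {
			// The whole chunk lies before the resume point: drop it.
			$skip_bytes -= strlen( $chunk );
			continue;
		}
		// Drop only the part of the chunk that precedes the resume point.
		yield substr( $chunk, $skip_bytes );
		$skip_bytes = 0;
	}
}

// Simulated network chunks of the 10-byte body "abcdefghij".
$chunks  = array( 'abcd', 'efgh', 'ij' );
$resumed = implode( '', iterator_to_array( skip_leading_bytes( $chunks, 6 ), false ) );
echo $resumed, "\n"; // ghij
```

Discarding bytes still downloads everything before the offset, which is presumably why the description mentions revisiting the interface for byte-range requests.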
