The import is started by the Django management command:
python manage.py import_xml path/to/xml/file.xml parent_page_id
but what happens afterwards is complicated, and highly configurable:
<img src="import-flow-diagram.png">
We assume here that you have a full XML export of a WordPress site. You should at least have considered using analyse and reduce before you run the import.
When you run the import_xml command, it creates an importers.wordpress.WordpressImporter instance, which handles the rest of the import.
WordpressImporter begins by looking at each XML node tag in turn.
If the node is registered in settings for caching until later, it is stored rather than imported; see Cache tags and items for later.
Otherwise, if the node is a WordPress post or page, the import creates a WordPressItem instance to clean and process it.
All other nodes are ignored.
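The three-way decision above (cache, import, or ignore) can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: `CACHE_TAGS`, `IMPORT_POST_TYPES`, `walk_nodes` and the sample XML are all hypothetical names, and the real importer reads its configuration from Django settings.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sample export below.
WP_NS = "http://wordpress.org/export/1.2/"
NS = {"wp": WP_NS}

# Hypothetical configuration: tags to cache, and post types to import.
CACHE_TAGS = {f"{{{WP_NS}}}author"}
IMPORT_POST_TYPES = {"post", "page"}

SAMPLE_XML = f"""<rss xmlns:wp="{WP_NS}"><channel>
  <wp:author><wp:author_login>jane</wp:author_login></wp:author>
  <item><title>Hello</title><wp:post_type>post</wp:post_type></item>
  <item><title>Menu</title><wp:post_type>nav_menu_item</wp:post_type></item>
  <generator>https://wordpress.org/</generator>
</channel></rss>"""


def walk_nodes(xml_text):
    """Sort each child of <channel> into cached, imported, or ignored."""
    cached, imported, ignored = [], [], 0
    channel = ET.fromstring(xml_text).find("channel")
    for node in channel:
        if node.tag in CACHE_TAGS:
            # Registered for caching: keep the node for later processing.
            cached.append(node)
        elif (
            node.tag == "item"
            and node.findtext("wp:post_type", namespaces=NS) in IMPORT_POST_TYPES
        ):
            # A WordPress post or page: this is where an item would be
            # cleaned and turned into a Wagtail page.
            imported.append(node.findtext("title"))
        else:
            ignored += 1
    return cached, imported, ignored
```

Calling `walk_nodes(SAMPLE_XML)` sorts the author node into the cache, imports the one post, and skips the menu item and the generator tag.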
The item is checked to see whether it represents an existing Wagtail Page instance, and a page is either created or updated accordingly.
The item is processed for any categories, if that feature is enabled. See Categories import.
The item is cleaned. This:
- Cleans various fields, including:
  - trimming whitespace from text fields
  - validating dates
  - formatting slugs
  - SEO-related fields (see also Yoast Plugin documentation)
- Pre-filters the body field, to clean up WordPress quirks of the HTML. Pre-filters are a series of cumulative functions. These are configurable. See Pre-filters.
- Converts the body to a series of StreamField blocks. The mapping of HTML to StreamField blocks is configurable. See Block Builder.
  - This stage also includes importing images from the source site, and modifying HTML img tags to link to the imported Image instance instead.
- Converts any wp:post_meta tag nodes to a dict, which we can query later using Django's JSONField.
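As a rough illustration of the cleaning steps above, the sketch below shows field cleaning, a cumulative pre-filter chain, and the conversion of post meta to a dict. Every function name here is hypothetical and the filters are deliberately trivial; the package's own cleaners and pre-filters are configured separately and will differ.

```python
import re
from datetime import datetime


def clean_slug(value):
    """Lowercase the value and collapse non-alphanumeric runs to hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", value.strip().lower())
    return slug.strip("-")


def clean_date(value):
    """Return a datetime, or None if the value is not a valid date."""
    try:
        return datetime.strptime(value.strip(), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


def normalise_line_breaks(html):
    return html.replace("\r\n", "\n")


def strip_empty_paragraphs(html):
    return re.sub(r"<p>\s*</p>", "", html)


# Hypothetical pre-filter chain: each filter receives the previous output.
PREFILTERS = [normalise_line_breaks, strip_empty_paragraphs]


def run_prefilters(html, filters=PREFILTERS):
    for prefilter in filters:
        html = prefilter(html)
    return html


def post_meta_to_dict(pairs):
    """Turn (meta_key, meta_value) pairs into a JSON-serialisable dict."""
    return {key: value for key, value in pairs}
```

Because each pre-filter takes HTML and returns HTML, filters can be added, removed, or reordered without changing the code that runs them.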
The package assumes that your own target Page model has an .import_wordpress_data method that maps the cleaned data to the fields available on that model.
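A minimal sketch of such a mapping is shown below, using a plain Python class as a stand-in for a Wagtail Page model so that it runs without Django. The field names and the keys of the `data` dict are assumptions made for illustration; check the package's own documentation for the data it actually passes.

```python
class BlogPage:
    """Plain-Python stand-in for a Wagtail Page model (hypothetical fields)."""

    def __init__(self):
        self.title = ""
        self.slug = ""
        self.first_published_at = None
        self.body = []
        self.wp_post_meta = {}

    def import_wordpress_data(self, data):
        # Map the cleaned WordPress data onto this model's fields.
        # The dict keys used here are assumed, not taken from the package.
        self.title = data["title"]
        self.slug = data["slug"]
        self.first_published_at = data.get("first_published_at")
        self.body = data.get("body", [])
        self.wp_post_meta = data.get("wp_post_meta", {})
```

Keeping the mapping on the model means each project decides which cleaned values it stores and under which field names.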
The page is saved, and published according to the WordPress live/draft status. The page ID is added to a list for later.
If the XML node is configured in settings for caching until later, it is stored in WordpressImporter.tags_cache (for top-level tags such as authors) or .items_cache. Suitable nodes are those that refer to objects elsewhere in the XML, for example a header image or an author link: we can't create the relationship until both objects have been imported.
The cache uses a key, e.g. "wp:author", to store a list of the unique author entities found.
Your settings will also specify a function to be run after walking the XML; see Run registered functions.
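The idea of a keyed cache holding unique entities can be sketched as follows. `TagCache` is a hypothetical class written for this illustration; the internal structure of tags_cache and items_cache in the package may be different.

```python
from collections import defaultdict


class TagCache:
    """Stores a list of unique entities under each tag key."""

    def __init__(self):
        self._store = defaultdict(list)

    def add(self, key, entity):
        # Only keep the first occurrence of each entity.
        if entity not in self._store[key]:
            self._store[key].append(entity)

    def get(self, key):
        return list(self._store.get(key, []))


cache = TagCache()
cache.add("wp:author", {"login": "jane", "email": "jane@example.com"})
cache.add("wp:author", {"login": "jane", "email": "jane@example.com"})
cache.add("wp:author", {"login": "sam", "email": "sam@example.com"})
```

After these three calls the "wp:author" key holds two entries, because the duplicate entry for jane is dropped.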
Once all nodes have been processed, and either imported as Page objects or cached for later, processing that involves database relations becomes possible.
If a page's body StreamField contains RichText blocks with HTML a elements linking to relative URLs, the importer attempts to convert these to dynamic Wagtail references to Page objects.
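The sketch below illustrates the kind of rewrite involved. Wagtail stores internal page links in rich text as `<a linktype="page" id="...">`; the lookup dict, regular expression, and function name are assumptions for this example, and the importer's own matching logic is likely more thorough.

```python
import re

# Matches the opening of an anchor whose href starts with "/".
RELATIVE_LINK_RE = re.compile(r'<a\s+href="(/[^"]*)"')


def convert_relative_links(html, page_ids_by_path):
    """Replace relative hrefs with Wagtail page references where known."""

    def replace(match):
        page_id = page_ids_by_path.get(match.group(1))
        if page_id is None:
            # No imported page for this path: leave the link untouched.
            return match.group(0)
        return f'<a linktype="page" id="{page_id}"'

    return RELATIVE_LINK_RE.sub(replace, html)


# Hypothetical lookup built from the imported pages.
PAGE_IDS_BY_PATH = {"/about/": 12}

html = (
    '<p><a href="/about/">About</a> '
    '<a href="https://example.org/">External</a> '
    '<a href="/missing/">Missing</a></p>'
)
converted = convert_relative_links(html, PAGE_IDS_BY_PATH)
```

Links with no matching page, and absolute URLs, pass through unchanged, so a failed lookup never corrupts the body.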
Registered functions are configurable functions that process cached tags and items, and create the ForeignKey relationships they represent. If an XML node has been configured for caching until later, a corresponding function will have been registered to process the cached data.
It is up to developers using this package to write these functions. They receive a queryset of imported pages (this query uses the list of page IDs stored earlier by the importer class), and the cached data.
For the earlier author example, this function might get the Author model instance (or create a new one) for each entry found in the cache, then query the imported pages and link them to that author.
As the wp:post_meta tag of pages was cleaned and stored in a JSON field, it is possible to query all pages which refer to this author in their WordPress metadata, and update them with actual database links.
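A plain-Python sketch of such a function is shown below. It uses a dataclass and a list in place of real models and a queryset so that it runs without Django. The `author_login` metadata key, the `ImportedPage` class, and the function signature are all assumptions; a real hook would use the ORM to get or create Author instances and filter on the JSON field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImportedPage:
    """Stand-in for an imported page with its stored WordPress metadata."""

    title: str
    wp_post_meta: dict
    author: Optional[dict] = None


def link_authors(imported_pages, cached_authors):
    """Attach each cached author to the pages whose metadata refers to it."""
    authors_by_login = {author["login"]: author for author in cached_authors}
    linked = 0
    for page in imported_pages:
        # "author_login" is a hypothetical metadata key for this sketch.
        author = authors_by_login.get(page.wp_post_meta.get("author_login"))
        if author is not None:
            page.author = author
            linked += 1
    return linked


pages = [
    ImportedPage("Hello", {"author_login": "jane"}),
    ImportedPage("Untitled", {}),
]
linked_count = link_authors(pages, [{"login": "jane", "name": "Jane Doe"}])
```

Pages whose metadata does not refer to a cached author are left alone, so the function can be run safely over every imported page.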
See Import Hooks.