arcxp/arc-xp-etl-migration-starter

Arc XP ETL Migration Starter Kit

A starter kit for building ETL adapters that migrate content from your CMS to Arc XP. This starter kit provides both standalone and orchestrated (Temporal) implementations for extracting, identifying, transforming, and loading content into Arc XP. Use this as a learning resource and starting point for your own content migration project.

Overview

This starter kit helps you:

  • Extract content from your source CMS
  • Identify content types and generate Arc XP ANS IDs
  • Transform source content into Arc XP ANS (Arc Native Specification) format
  • Load transformed content into Arc XP via the Migration Center API

The starter kit includes two adapter implementations:

  • Standalone Adapter (/adapter) - For iterative development and testing
  • Temporal Adapter (/adapter_temporal) - For production orchestrated execution

Scope and Limitations

This starter kit does not cover every Arc XP object type. It focuses on:

  • Stories - Standard articles with text content
  • Images - Standalone images and images embedded in stories

Content types not addressed include:

  • Videos
  • Image galleries
  • Other Arc XP content types (authors, redirects, etc.)

Additionally, this starter kit does not address all possible content_elements that can be included in Arc XP stories. The examples demonstrate basic content elements (text, headings, lists, images), but you may need to extend the converters to handle additional content element types specific to your use case.

You can extend the starter kit to support additional content types and content elements by following the customization patterns in the Customization Guide.

Quick Start

  1. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

  2. Install dependencies:

    pip install -r requirements.txt

  3. Set up environment variables (create a .env file):

    BEARER_TOKEN=your_arc_xp_token
    ORG=your_organization_id
    ENV=sandbox  # or production
    WEBSITE=your_website_id
    TEMPORAL_SERVER_URL=localhost:7233  # Optional, defaults to localhost:7233
    EXTRACT_API_URL=http://127.0.0.1:8000/  # Optional, defaults to fake-news

  4. Start the mock CMS service (for testing):

    fastapi dev fake-news/main.py

  5. Choose your adapter:
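
As a side note on step 3, the variables could be read into one settings object so that a missing value fails early. This is a minimal sketch, not the starter kit's own loader: the `Settings` class and `load_settings` function are hypothetical names, treating the first four variables as required is an assumption, and the `EXTRACT_API_URL` default assumes the mock CMS runs at the address shown in step 3.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in step 3."""
    bearer_token: str
    org: str
    env: str
    website: str
    temporal_server_url: str
    extract_api_url: str


def load_settings(environ=None) -> Settings:
    """Read settings from a mapping (os.environ by default)."""
    environ = os.environ if environ is None else environ
    # Assumption: these four variables have no usable default.
    required = ("BEARER_TOKEN", "ORG", "ENV", "WEBSITE")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    return Settings(
        bearer_token=environ["BEARER_TOKEN"],
        org=environ["ORG"],
        env=environ["ENV"],
        website=environ["WEBSITE"],
        # Default documented in step 3.
        temporal_server_url=environ.get("TEMPORAL_SERVER_URL", "localhost:7233"),
        # Assumed default: the local fake-news service address from step 3.
        extract_api_url=environ.get("EXTRACT_API_URL", "http://127.0.0.1:8000/"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise in tests without touching the real environment.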

Documentation

Getting Started

  • Customization Guide - START HERE - Guide for adapting this starter kit to your CMS
    • Quick start checklist
    • A non-exhaustive selection of customization points for each component
    • Common customization patterns
    • Examples and best practices

Adapter Implementations

  • Standalone Adapter - Non-orchestrated implementation

    • Best for: Iterative development, debugging, learning, testing modifications
    • Run each script independently
    • Manual data passing between steps
    • See: Content mapping, extract, identify, transform, load processes
  • Temporal Adapter - Orchestrated implementation using Temporal

    • Best for: Production deployments, automated workflows
    • Automatic data flow between functions
    • Built-in error handling and retries
    • Progress tracking via Temporal UI
    • See: Setup, orchestration, workflow architecture

Supporting Services

  • Fake-News Mock CMS - Mock CMS API for development and testing
    • FastAPI service that generates test content
    • Includes error fixtures for testing failure scenarios
    • Reference implementation for CMS data structure

Repository Structure

arc-xp-etl-migration-starter/
├── adapter/                         # Standalone adapter implementation
│   ├── extract.py                   # Extract content from CMS
│   ├── identify.py                  # Identify content types
│   ├── convert_story.py             # Transform stories to ANS
│   ├── convert_image.py             # Transform images to ANS
│   ├── inventory.py                 # Track processed items
│   ├── load.py                      # Load content to Arc XP
│   └── ReadMe.md                    # Standalone adapter documentation
│
├── adapter_temporal/                # Temporal orchestrated adapter
│   ├── activities.py                # Temporal activities (ETL steps)
│   ├── orchestration_workflow.py    # Workflow definitions
│   ├── run_worker.py                # Start Temporal worker
│   ├── run_workflow.py              # Start workflow execution
│   ├── tests/                       # Unit tests
│   └── ReadMe.md                    # Temporal adapter documentation
│
├── fake-news/                       # Mock CMS service
│   ├── main.py                      # FastAPI application
│   ├── fake_content.py              # Content generation logic
│   └── ReadMe.md                    # Mock CMS documentation
│
├── fixtures/                        # Test data and examples
│   ├── fake_news_*.json             # Sample content fixtures
│   └── workflow_logging_result.txt  # Example workflow output
│
├── CUSTOMIZATION_GUIDE.md           # Comprehensive customization guide
├── requirements.txt                 # Python dependencies
└── README.md                        # This file

Key Concepts

ETL Pipeline Stages

  1. Extract - Retrieve content from your source CMS
  2. Identify - Determine content type and generate Arc XP ANS IDs
  3. Transform - Convert source content to Arc XP ANS format
  4. Load - Submit transformed content to Arc XP via Migration Center API
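
The four stages above can be sketched as plain functions chained together. This is an illustration under stated assumptions, not the code in `/adapter`: every function name, the `kind`, `id`, and `title` source fields, and the hash-based ID scheme are hypothetical, and the load step appends to a list instead of calling the Migration Center API.

```python
import base64
import hashlib


def extract(source_items):
    """Stand-in for a CMS call: yields raw source items one at a time."""
    for item in source_items:
        yield item


def generate_arc_id(source_id: str) -> str:
    """Derive a stable 26-character ID from a source ID (assumed scheme)."""
    digest = hashlib.blake2b(source_id.encode("utf-8"), digest_size=16).digest()
    # 16 bytes encode to 26 base32 characters once the padding is stripped.
    return base64.b32encode(digest).decode("ascii").rstrip("=")


def identify(item):
    """Decide the content type and attach a generated ID."""
    content_type = "image" if item.get("kind") == "photo" else "story"
    return {
        "arc_id": generate_arc_id(str(item["id"])),
        "type": content_type,
        "source": item,
    }


def transform(identified):
    """Build a deliberately tiny ANS-like document."""
    source = identified["source"]
    return {
        "_id": identified["arc_id"],
        "type": identified["type"],
        "headlines": {"basic": source.get("title", "")},
    }


def load(document, sink):
    """Stand-in for the API call: records the document locally."""
    sink.append(document)


def run_pipeline(source_items):
    """Run extract, identify, transform, and load for every item."""
    sink = []
    for raw in extract(source_items):
        load(transform(identify(raw)), sink)
    return sink
```

Deriving the ID from the source ID with a hash means that re-running the pipeline produces the same ID for the same item, which makes repeated test runs idempotent.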

Content Mapping

Before coding, you must map your source CMS fields to Arc XP ANS fields. This is critical pre-work that determines:

  • Which source fields map to which ANS fields
  • What transformations are needed (date formats, URL structures, etc.)
  • How to handle different content types

See the Pre-Work section in the Standalone Adapter guide for detailed mapping guidance.
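
One way to keep such a mapping close to the code is a table of source field, target ANS path, and optional conversion. The sketch below is illustrative only: the source field names (`title`, `published`, `slug`), the assumed source date format, and the helper names are hypothetical, and the target paths should be checked against your own mapping document.

```python
from datetime import datetime, timezone


def to_iso8601_utc(value: str) -> str:
    """Convert 'YYYY-MM-DD HH:MM:SS' (assumed to be UTC) to ISO 8601."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def to_url_path(slug: str) -> str:
    """Normalize a slug into a path with leading and trailing slashes."""
    return "/" + slug.strip("/") + "/"


# source field -> (dotted ANS path, optional conversion function)
FIELD_MAP = {
    "title": ("headlines.basic", None),
    "published": ("display_date", to_iso8601_utc),
    "slug": ("canonical_url", to_url_path),
}


def set_path(target: dict, dotted_path: str, value) -> None:
    """Assign value at a nested key such as 'headlines.basic'."""
    keys = dotted_path.split(".")
    for key in keys[:-1]:
        target = target.setdefault(key, {})
    target[keys[-1]] = value


def apply_mapping(source: dict, field_map=FIELD_MAP) -> dict:
    """Copy mapped fields from a source record into a new dictionary."""
    ans = {}
    for source_field, (ans_path, convert) in field_map.items():
        if source_field not in source:
            continue  # Unmapped or absent fields are skipped, not guessed.
        value = source[source_field]
        set_path(ans, ans_path, convert(value) if convert else value)
    return ans
```

Keeping the mapping as data makes it easy to review against the mapping document and to extend one field at a time.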

ANS (Arc Native Specification) Format

Arc XP uses ANS as its native content format. Your adapter must transform your CMS content into ANS format, which includes:

  • Required fields: _id, type, version, etc.
  • Content elements: Story body, images, headlines, lists, etc.
  • Circulation: Sections, URLs, publishing metadata
  • Migration Center payload: Wrapper for loading into Arc XP
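
To make the shape concrete, here is a minimal sketch of a story document and a wrapper around it. Only `_id`, `type`, `version`, and `content_elements` come from the text above. The version string, the other field names, and the wrapper keys (`sourceId`, `sourceType`, `ANS`) are assumptions to verify against the ANS schema and the Migration Center API documentation. Circulation data is omitted.

```python
import json


def build_story_ans(arc_id, headline, paragraphs, website, version="0.10.9"):
    """Assemble a minimal story document (version value is an assumption)."""
    return {
        "_id": arc_id,
        "type": "story",
        "version": version,
        "canonical_website": website,
        "headlines": {"basic": headline},
        # One text element per paragraph of body copy.
        "content_elements": [
            {"type": "text", "content": paragraph} for paragraph in paragraphs
        ],
    }


def wrap_for_migration_center(source_id, source_type, ans):
    """Wrap the document for loading (wrapper key names are assumptions)."""
    return {"sourceId": source_id, "sourceType": source_type, "ANS": ans}


if __name__ == "__main__":
    story = build_story_ans(
        arc_id="EXAMPLEID",  # Placeholder, not a valid generated ID.
        headline="Example headline",
        paragraphs=["First paragraph.", "Second paragraph."],
        website="your_website_id",
    )
    payload = wrap_for_migration_center("cms-123", "story", story)
    print(json.dumps(payload, indent=2))
```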

Customization Workflow

  1. Read the Customization Guide - Understand what needs to be customized
  2. Map your CMS data - Create a mapping document of source → ANS fields
  3. Start with the Standalone Adapter - Test transformations incrementally
  4. Customize converters - Modify set_content_elements() and other set_*() methods
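
Step 4 can be pictured with a small converter whose `set_*()` methods each fill one part of the output. This is a hedged sketch rather than the contents of `convert_story.py`: the `StoryConverter` class, the `_BlockParser` helper, the `title` and `body_html` source fields, and the element shapes are assumptions. It handles only paragraphs and headings and drops inline markup.

```python
from html.parser import HTMLParser


class _BlockParser(HTMLParser):
    """Collects the text of <p> and <h1>-<h6> blocks in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # (tag, text) tuples
        self._tag = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        is_heading = len(tag) == 2 and tag[0] == "h" and tag[1] in "123456"
        if tag == "p" or is_heading:
            self._tag, self._text = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = "".join(self._text).strip()
            if text:
                self.blocks.append((tag, text))
            self._tag = None


class StoryConverter:
    """Hypothetical converter: one set_*() method per part of the output."""

    def __init__(self, source, arc_id):
        self.source = source
        self.ans = {"_id": arc_id, "type": "story"}

    def set_headlines(self):
        self.ans["headlines"] = {"basic": self.source.get("title", "")}

    def set_content_elements(self):
        parser = _BlockParser()
        parser.feed(self.source.get("body_html", ""))
        parser.close()
        elements = []
        for tag, text in parser.blocks:
            if tag == "p":
                elements.append({"type": "text", "content": text})
            else:
                # Heading level is taken from the tag name, e.g. h2 -> 2.
                elements.append(
                    {"type": "header", "level": int(tag[1]), "content": text}
                )
        self.ans["content_elements"] = elements

    def convert(self):
        self.set_headlines()
        self.set_content_elements()
        return self.ans
```

Supporting another element type then means extending `set_content_elements()` with one more branch, which is easy to test in isolation with the standalone adapter.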

Requirements

  • Python 3.8+
  • Arc XP account with Migration Center API access
  • (For Temporal adapter) Temporal server (local dev server or cloud instance)

See requirements.txt for Python package dependencies.

Getting Help
