Confluence Sync

Copies an entire page tree (including images and attachments) from one Confluence instance to another, preserving hierarchy and rewriting internal links.

Current configuration

  • Source: Configured in config.json (source Confluence instance, space, and root page)
  • Target: Configured in config.json (target Confluence instance, space, and parent page — synced pages are placed as siblings of existing content)

Requirements

  • Python 3.10+
  • requests library (pip install requests)

Installation

# Create a virtual environment (or use an existing one)
python3 -m venv .venv
source .venv/bin/activate
pip install requests

# Copy the config template and fill in your tokens
cp config.example.json config.json

Configuration

All sensitive and environment-specific values are stored in config.json:

{
  "source": {
    "base_url": "https://source-confluence.example.com",
    "token": "<SOURCE_BEARER_TOKEN>",
    "space_key": "SOURCEKEY",
    "page_id": "123456789"
  },
  "target": {
    "base_url": "https://target-confluence.example.com",
    "token": "<TARGET_BEARER_TOKEN>",
    "space_key": "TARGETKEY",
    "parent_page_id": "987654321"
  },
  "title_prefix": "[SYNC] ",
  "restrictions": {
    "users": [
      "user@example.com"
    ],
    "groups": []
  }
}
| Field | Description |
| --- | --- |
| source.base_url | URL of the source Confluence instance |
| source.token | Bearer token (API key) for the source |
| source.space_key | Space key of the source space |
| source.page_id | ID of the root page to be copied |
| target.base_url | URL of the target Confluence instance |
| target.token | Bearer token for the target |
| target.space_key | Space key of the target space |
| target.parent_page_id | ID of the parent page under which the copy is placed |
| title_prefix | Prefix prepended to all page titles on the target (e.g. "[SYNC] "). Set to "" to disable. Useful to avoid title conflicts with existing pages. |
| restrictions.users | List of usernames that get read access to the synced pages |
| restrictions.groups | List of groups that get read access to the synced pages |

config.json is listed in .gitignore and will not be committed. Use config.example.json as a template.
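
As an illustration of how this file might be consumed, the following is a minimal sketch of loading and validating config.json. The function names (load_config, validate_config) and the choice to default the optional fields are assumptions made for this example; the actual script may handle the file differently.

```python
import json
from pathlib import Path

# Required fields per section, taken from the table above.
REQUIRED_KEYS = {
    "source": ("base_url", "token", "space_key", "page_id"),
    "target": ("base_url", "token", "space_key", "parent_page_id"),
}


def validate_config(config: dict) -> dict:
    """Check required fields and fill defaults for optional ones (hypothetical helper)."""
    for section, keys in REQUIRED_KEYS.items():
        if section not in config:
            raise ValueError(f"missing section: {section}")
        for key in keys:
            if not config[section].get(key):
                raise ValueError(f"missing value: {section}.{key}")
    # Assumption: optional fields fall back to "no prefix" and "no restrictions".
    config.setdefault("title_prefix", "")
    restrictions = config.setdefault("restrictions", {})
    restrictions.setdefault("users", [])
    restrictions.setdefault("groups", [])
    return config


def load_config(path: str = "config.json") -> dict:
    """Read the JSON file and validate it."""
    return validate_config(json.loads(Path(path).read_text(encoding="utf-8")))
```

Failing early on a missing token or page ID avoids a half-finished sync that only breaks once the first API call is made.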

Usage

# Full sync (fetch + create + attachments + rewrite-links + restrict)
python3 confluence_sync.py --phase all

# Dry-run: preview what would happen without making changes
python3 confluence_sync.py --phase all --dry-run

# Run individual phases
python3 confluence_sync.py --phase fetch
python3 confluence_sync.py --phase create
python3 confluence_sync.py --phase attachments
python3 confluence_sync.py --phase rewrite-links
python3 confluence_sync.py --phase restrict

# Clean up: remove all synced pages from the target
python3 confluence_sync.py --phase delete

Phases

The script operates in 5 sequential phases (plus a separate delete phase):

1. Fetch (--phase fetch)

Recursively retrieves the entire page tree from the source instance:

  • Page content in Confluence storage format (XHTML)
  • List of attachments per page (with download URLs)
  • Hierarchical structure (parent-child relationships)

The data is stored in confluence_sync_state.json so subsequent phases do not need to re-fetch.

2. Create (--phase create)

Creates all pages on the target instance:

  • Root page is created as a child of target.parent_page_id
  • Child pages are recursively created with the same hierarchy
  • Page content (storage format) is transferred as-is, no conversion needed
  • Mapping from source ID to target ID is saved in state

Resume support: if the script stops midway, already created pages are skipped on restart.
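
A minimal sketch of how create with resume support could work, assuming the state layout from the fetch phase (source_pages, source_tree, id_mapping). The create_tree function and the create_page callable are hypothetical; create_page stands in for whatever call creates a page on the target and returns its new ID.

```python
from typing import Callable


def create_tree(
    source_id: str,
    target_parent_id: str,
    source_pages: dict,
    source_tree: dict,
    id_mapping: dict,
    create_page: Callable[[str, str, str], str],
    title_prefix: str = "",
) -> dict:
    """Create a page and its descendants, skipping pages that are already mapped."""
    if source_id not in id_mapping:
        page = source_pages[source_id]
        # Storage-format body is passed through unchanged.
        id_mapping[source_id] = create_page(
            title_prefix + page["title"], page["body"], target_parent_id
        )
    target_id = id_mapping[source_id]
    for child_id in source_tree.get(source_id, []):
        create_tree(child_id, target_id, source_pages, source_tree,
                    id_mapping, create_page, title_prefix)
    return id_mapping
```

Because the mapping is consulted before every creation, re-running after an interruption only creates the pages that are still missing, provided the mapping is persisted after each page.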

3. Attachments (--phase attachments)

Copies all attachments (images, files) from source to target:

  • Downloads each attachment from the source API
  • Caches locally in the cache/ directory
  • Uploads to the corresponding target page
  • Tracks which pages are complete (resume support)
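
The download-and-cache step might look like the sketch below. The cached_download helper, the per-page subdirectory layout inside cache/, and the download callable (which would return the attachment bytes from the source API) are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable


def cached_download(
    page_id: str,
    filename: str,
    download: Callable[[], bytes],
    cache_dir: str = "cache",
) -> Path:
    """Return the cached file, downloading it only if it is not present yet."""
    # Path(filename).name drops any directory components from the file name.
    target = Path(cache_dir) / page_id / Path(filename).name
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file under the final name.
    tmp = target.with_name(target.name + ".part")
    tmp.write_bytes(download())
    tmp.replace(target)
    return target
```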

4. Rewrite Links (--phase rewrite-links)

Rewrites internal links in all target pages:

  • ri:content-id references: source IDs are replaced with target IDs
  • Hardcoded URLs to the source instance are rewritten to the target
  • ri:space-key attributes are updated to the target space key
  • Both relative and absolute URL patterns are handled
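
The rewrites listed above could be approximated with a few regular expressions, as in this sketch. The rewrite_links function and the specific URL patterns it covers (pageId= query parameters, /pages/<id> paths, /spaces/<KEY> and /display/<KEY> prefixes) are assumptions; the real script may recognise more or different patterns.

```python
import re


def rewrite_links(
    body: str,
    id_mapping: dict,
    source_base_url: str,
    target_base_url: str,
    source_space: str,
    target_space: str,
) -> str:
    """Rewrite source references in a storage-format body to target references."""

    def map_id(match: re.Match) -> str:
        # Unmapped IDs (pages outside the synced tree) are left untouched.
        return match.group(1) + id_mapping.get(match.group(2), match.group(2))

    # Page IDs in ri:content-id attributes and in common URL shapes.
    for pattern in (r'(ri:content-id=")(\d+)', r"(pageId=)(\d+)", r"(/pages/)(\d+)"):
        body = re.sub(pattern, map_id, body)
    # Space key attributes.
    body = body.replace(f'ri:space-key="{source_space}"',
                        f'ri:space-key="{target_space}"')
    # Absolute URLs pointing at the source instance.
    body = body.replace(source_base_url.rstrip("/"), target_base_url.rstrip("/"))
    # Space key inside relative or absolute URL paths.
    body = re.sub(
        rf'(/(?:spaces|display)/){re.escape(source_space)}(?=[/"?#]|$)',
        lambda m: m.group(1) + target_space,
        body,
    )
    return body
```

Each ID pattern is applied in a single pass over its own prefix, so a target ID that happens to equal another source ID is not rewritten a second time.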

5. Restrict (--phase restrict)

Sets a read restriction on the synced root page. Confluence uses restriction inheritance: only the listed users and groups can see the root page and all its descendants. Everyone else, including logged-in users, cannot see these pages.

  • Restriction is set on the root page only (children inherit automatically)
  • Users and groups are configured in config.json under restrictions
  • To grant access to additional users later, add them to the restrictions.users list and re-run --phase restrict
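
As a rough sketch, the restrictions block from config.json could be turned into a request body like the one below. The payload shape follows the general form of the Confluence content-restriction REST resource, but it is an assumption here: the exact endpoint and field names differ between Confluence versions and should be checked against the target instance's API documentation. build_read_restriction is a hypothetical helper.

```python
def build_read_restriction(users: list[str], groups: list[str]) -> list[dict]:
    """Build a read-restriction payload from config values (assumed payload shape)."""
    return [
        {
            "operation": "read",
            "restrictions": {
                # dict.fromkeys removes duplicates while keeping the original order.
                "user": [{"type": "known", "username": u} for u in dict.fromkeys(users)],
                "group": [{"type": "group", "name": g} for g in dict.fromkeys(groups)],
            },
        }
    ]
```

Since the payload is rebuilt from the full configured lists each time, re-running the phase after adding a user naturally includes both the old and the new entries.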

6. Delete (--phase delete)

Removes all previously synced pages from the target:

  • Recursively deletes all child pages (bottom-up)
  • Deletes the root page
  • Resets the ID mapping and attachment status in the state
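
Bottom-up deletion amounts to a post-order traversal of the page tree. The sketch below assumes the same source_tree and id_mapping structures as the earlier phases; deletion_order, delete_synced, and the delete_page callable are hypothetical names.

```python
from typing import Callable


def deletion_order(root_id: str, source_tree: dict) -> list[str]:
    """Post-order traversal: every child appears before its parent."""
    order: list[str] = []
    for child_id in source_tree.get(root_id, []):
        order.extend(deletion_order(child_id, source_tree))
    order.append(root_id)
    return order


def delete_synced(
    root_id: str,
    source_tree: dict,
    id_mapping: dict,
    delete_page: Callable[[str], None],
) -> None:
    """Delete all mapped target pages bottom-up and clear them from the mapping."""
    for source_id in deletion_order(root_id, source_tree):
        target_id = id_mapping.pop(source_id, None)
        if target_id is not None:  # pages that were never created are skipped
            delete_page(target_id)
```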

Generated files

| File | Description |
| --- | --- |
| confluence_sync_state.json | Contains the full sync state: fetched pages, ID mapping, progress. Managed automatically. |
| confluence_sync.log | Detailed log file (DEBUG level). Console only shows INFO. |
| cache/ | Local cache of downloaded attachments. Can be deleted after a successful sync. |

All generated files are listed in .gitignore.

How it works

Source Confluence                        Target Confluence
┌─────────────────────┐                  ┌─────────────────────────┐
│ Root page           │   ── fetch ──>   │ State (JSON)            │
│ ├── Section A       │                  │ ├── source_pages        │
│ │   ├── Page 1      │                  │ ├── source_tree         │
│ │   └── Page 2      │                  │ └── id_mapping          │
│ ├── Section B       │                  │                         │
│ └── Section C       │   ── create ──>  │ Target parent page      │
│                     │                  │ ├── Existing content    │
│ Attachments:        │                  │ └── Root page (new)     │
│ ├── image1.png      │ ─ attachments ─> │     ├── Section A       │
│ └── diagram.svg     │                  │     │   ├── Page 1      │
│                     │                  │     │   └── Page 2      │
│ Internal links:     │                  │     ├── Section B       │
│ ri:content-id="123" │ ─ rewrite ────>  │     └── Section C       │
│ href="/spaces/SRC"  │                  │     (links rewritten)   │
└─────────────────────┘                  └─────────────────────────┘
  1. Fetch retrieves everything and stores it locally in the state
  2. Create builds the page tree on the target (storage format 1:1)
  3. Attachments copies all files from source to target
  4. Rewrite updates internal links to point to the target pages
  5. Restrict locks down the root page so only listed users/groups can see it

Important notes

  • Macros: Confluence-specific macros (expand, code, toc, etc.) are copied as-is. This only works if both instances support the same plugins/macros.
  • Permissions: Source permissions are not copied. Instead, the restrict phase sets new read restrictions on the root page based on config.json. All child pages inherit this restriction automatically.
  • Titles: Apart from the configured title_prefix, page titles are transferred unchanged. If a page with the resulting title already exists in the target space, the create phase will fail for that page.
  • Rate limiting: The script has built-in pauses (0.3-0.5s) between API calls and retry logic for HTTP 429/502/503/504.
  • State file: Delete confluence_sync_state.json to start a completely clean sync. Or use --phase delete to clean up the previous sync first.
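
The retry behaviour mentioned under rate limiting can be illustrated with the following sketch. It is not the script's implementation: call_with_retry, the attempt limit, the exponential delays, and the handling of a Retry-After header are assumptions. The send callable is expected to return an object with status_code and headers attributes, as a requests response does.

```python
import time
from typing import Callable

# Status codes named in the rate-limiting note above.
RETRY_STATUSES = {429, 502, 503, 504}


def call_with_retry(
    send: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status_code not in RETRY_STATUSES or attempt == max_attempts:
            return response
        # Prefer the server's hint when it is a plain number of seconds,
        # otherwise back off exponentially: 0.5s, 1s, 2s, ...
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** (attempt - 1)
        sleep(delay)
```

A call site could wrap any request in a lambda, for example call_with_retry(lambda: session.get(url)), where session and url are whatever the caller already has at hand.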