Confluence source connector: add support for extracting inline images, files, etc. from pages (#514)

Paul-Cornell · web-flow · commit 9cf535dd6ade · 2025-03-06T17:01:14.000-08:00
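
Illustration of what the new `extract_images` and `allow_list` settings describe: image URLs in a page's HTML are replaced with Base64-encoded data, and only URLs under an allowed prefix are downloaded. This is a minimal sketch, not the connector's actual code. The `inline_images` function, the `fetch` callback, and the prefix-matching rule are hypothetical and assumed for the example only.

```python
import base64
import re

# Matches the src attribute of an <img> tag (simplified; assumes double-quoted attributes).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html, fetch, base_url, allow_list=None):
    """Replace allowed <img src> URLs with Base64 data URIs.

    fetch(url) is a hypothetical callback returning (bytes, mime_type).
    If allow_list is empty or None, only base_url is allowed (assumed behavior).
    """
    allowed = allow_list or [base_url]

    def replace(match):
        url = match.group(2)
        # Leave images outside the allow list untouched.
        if not any(url.startswith(prefix) for prefix in allowed):
            return match.group(0)
        data, mime = fetch(url)
        encoded = base64.b64encode(data).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{encoded}{match.group(3)}"

    return IMG_SRC.sub(replace, html)


# Usage with a stubbed download function, so no network access is needed.
page = (
    '<p><img src="https://example.atlassian.net/a.png">'
    '<img src="https://other.test/b.png"></p>'
)
result = inline_images(
    page,
    fetch=lambda url: (b"abc", "image/png"),
    base_url="https://example.atlassian.net",
)
```

In this sketch the first image is inlined and the second is left as a plain URL, because its host is not under the base URL.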
diff --git a/snippets/general-shared-text/confluence-api-placeholders.mdx b/snippets/general-shared-text/confluence-api-placeholders.mdx
@@ -3,6 +3,8 @@
 - `<max-num-of-spaces>` - The maximum number of Confluence spaces to access within the Confluence Cloud instance. The default is `500` unless otherwise specified.
 - `<max-num-of-docs-from-each-space>` - The maximum number of documents to access within each space. The default is `150` unless otherwise specified.
 - `spaces` is an array of strings, with each `<space-name>` specifying the name of a space to access, for example: `["luke","paul"]`. By default, if no space names are specified, and the `<max-num-of-spaces>` is exceeded for the instance, be aware that you might get unexpected results.
+- `extract_images` - Set to `true` to download images and replace the HTML content with Base64-encoded images. The default is `false` if not otherwise specified.
+- `extract_files` - Set to `true` to download any embedded files in pages. The default is `false` if not otherwise specified.
 
 For API token authentication:
 
diff --git a/snippets/general-shared-text/confluence-cli-api.mdx b/snippets/general-shared-text/confluence-cli-api.mdx
@@ -23,3 +23,7 @@ Additional settings include:
 - `--max-num-of-spaces` (CLI) or `max_num_of_spaces` (Python): Optionally, the maximum number of spaces to access, expressed as an integer. The default value is `500` if not otherwise specified.
 - `--max-num-of-docs-from-each-space` (CLI) or `max_num_of_docs_from_each_space` (Python): Optionally, the maximum number of documents to access from each space, expressed as an integer. The default value is `100` if not otherwise specified.
 - `--cloud` or `--no-cloud` (CLI) or `cloud` (Python): Optionally, whether to use Confluence Cloud (`--cloud` for CLI or `cloud=True` for Python). The default is `--no-cloud` (CLI) or `cloud=False` (Python) if not otherwise specified.
+- `--extract-images` (CLI) or `extract_images` (Python): Optionally, download images and replace the HTML content with Base64-encoded images. The default is `--no-extract-images` (CLI) or `extract_images=False` (Python) if not otherwise specified.
+- `--extract-files` (CLI) or `extract_files` (Python): Optionally, download any embedded files. The default is `--no-extract-files` (CLI) or `extract_files=False` (Python) if not otherwise specified.
+- `--force-download` (CLI) or `force_download` (Python): Optionally, re-download extracted files even if they already exist locally. The default is `--no-force-download` (CLI) or `force_download=False` (Python) if not otherwise specified.
+- `--allow-list` (CLI) or `allow_list` (Python): Optionally, a comma-separated list (CLI) or an array of strings (Python) of allowed URLs to download. By default, the base URL that the original HTML came from is used, if not otherwise specified.
diff --git a/snippets/general-shared-text/confluence-platform.mdx b/snippets/general-shared-text/confluence-platform.mdx
@@ -4,10 +4,13 @@ Fill in the following fields:
 - **URL** (_required_): The target Confluence site's URL.
 - For personal access token (PAT) authentication: for **Authentication Method**, select **Personal Access Token**. Then enter the PAT into the **Personal Access Token** field.
 - For API token or password authentication: for **Authentication Method**, select **Password or API token**. Then enter the user's name or email address into the **Username** field and the API token or password into the **Password** field. Also, if you are using Confluence Cloud, check the **Cloud** box.
+- **Cloud**: Check this box if you are using Confluence Cloud. By default, this box is unchecked.
 - **Max number of spaces**: The maximum number of Confluence spaces to access within the Confluence Cloud instance. 
   The default is 500 unless otherwise specified.
 - **Max number of docs per space**: The maximum number of documents to access within each space. 
   The default is 150 unless otherwise specified.
 - **List of spaces**: A comma-separated string that lists the names of all of the spaces to access, for example: `luke,paul`. 
   By default, if no space names are specified, and the **Max Number of Spaces** is reached for the instance, be aware that you might get 
   unexpected results.
+- **Extract inline images**: Check this box to download images and replace the HTML content with Base64-encoded images. By default, this box is unchecked.
+- **Extract files**: Check this box to download any embedded files in pages. By default, this box is unchecked.
diff --git a/snippets/source_connectors/confluence.sh.mdx b/snippets/source_connectors/confluence.sh.mdx
@@ -11,6 +11,9 @@ unstructured-ingest \
     --spaces luke,paul \
     --max-num-of-spaces 500 \
     --max-num-of-docs-from-each-space 150 \
+    --extract-images \
+    --extract-files \
+    --force-download \
     --output-dir $LOCAL_FILE_OUTPUT_DIR \
     --partition-by-api \
     --api-key $UNSTRUCTURED_API_KEY \
@@ -26,6 +29,9 @@ unstructured-ingest \
     --spaces luke,paul \
     --max-num-of-spaces 500 \
     --max-num-of-docs-from-each-space 150 \
+    --extract-images \
+    --extract-files \
+    --force-download \
     --output-dir $LOCAL_FILE_OUTPUT_DIR \
     --partition-by-api \
     --api-key $UNSTRUCTURED_API_KEY \
@@ -43,6 +49,9 @@ unstructured-ingest \
     --spaces luke,paul \
     --max-num-of-spaces 500 \
     --max-num-of-docs-from-each-space 150 \
+    --extract-images \
+    --extract-files \
+    --force-download \
     --output-dir $LOCAL_FILE_OUTPUT_DIR \
     --partition-by-api \
     --api-key $UNSTRUCTURED_API_KEY \
diff --git a/snippets/source_connectors/confluence.v2.py.mdx b/snippets/source_connectors/confluence.v2.py.mdx
@@ -26,8 +26,13 @@ if __name__ == "__main__":
             max_num_of_spaces=500,
             max_num_of_docs_from_each_space=150
         ),
-        downloader_config=ConfluenceDownloaderConfig(download_dir=os.getenv("LOCAL_FILE_DOWNLOAD_DIR")),
-        
+        downloader_config=ConfluenceDownloaderConfig(
+            download_dir=os.getenv("LOCAL_FILE_DOWNLOAD_DIR"),
+            extract_images=True,
+            extract_files=True,
+            force_download=True,
+            allow_list=[]
+        ),
         # For API token authentication:
         source_connection_config=ConfluenceConnectionConfig(
             access_config=ConfluenceAccessConfig(
diff --git a/snippets/source_connectors/confluence_rest_create.mdx b/snippets/source_connectors/confluence_rest_create.mdx
@@ -13,6 +13,8 @@ curl --request 'POST' --location \
         "max_num_of_spaces": <max-num-of-spaces>,
         "max_num_of_docs_from_each_space": <max-num-of-docs-from-each-space>,
         "spaces": ["<space-name>", "<space-name>"],
+        "extract_images": <true|false>,
+        "extract_files": <true|false>,
 
         # For API token authentication:
 
diff --git a/snippets/source_connectors/confluence_sdk.mdx b/snippets/source_connectors/confluence_sdk.mdx
@@ -20,6 +20,8 @@ with UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")) as clien
                     max_num_of_spaces=<max-num-of-spaces>,
                     max_num_of_docs_from_each_space=<max-num-of-docs-from-each-space>,
                     spaces=["<space-name>", "<space-name>"],
+                    extract_images=<True|False>,
+                    extract_files=<True|False>,
 
                     # For API token authentication: