Commit 9cf535d

Confluence source connector: add support for extracting inline images, files, etc. from pages (#514)
1 parent 18bf878 commit 9cf535d

7 files changed: 29 additions, 2 deletions

snippets/general-shared-text/confluence-api-placeholders.mdx

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,8 @@
 - `<max-num-of-spaces>` - The maximum number of Confluence spaces to access within the Confluence Cloud instance. The default is `500` unless otherwise specified.
 - `<max-num-of-docs-from-each-space>` - The maximum number of documents to access within each space. The default is `150` unless otherwise specified.
 - `spaces` is an array of strings, with each `<space-name>` specifying the name of a space to access, for example: `["luke","paul"]`. By default, if no space names are specified, and the `<max-num-of-spaces>` is exceeded for the instance, be aware that you might get unexpected results.
+- `extract_images` - Set to `true` to download images and replace the HTML content with Base64-encoded images. The default is `false` if not otherwise specified.
+- `extract_files` - Set to `true` to download any embedded files in pages. The default is `false` if not otherwise specified.
 
 For API token authentication:

snippets/general-shared-text/confluence-cli-api.mdx

Lines changed: 4 additions & 0 deletions
@@ -23,3 +23,7 @@ Additional settings include:
 - `--max-num-of-spaces` (CLI) or `max_num_of_spaces` (Python): Optionally, the maximum number of spaces to access, expressed as an integer. The default value is `500` if not otherwise specified.
 - `--max-num-of-docs-from-each-space` (CLI) or `max_num_of_docs_from_each_space` (Python): Optionally, the maximum number of documents to access from each space, expressed as an integer. The default value is `100` if not otherwise specified.
 - `--cloud` or `--no-cloud` (CLI) or `cloud` (Python): Optionally, whether to use Confluence Cloud (`--cloud` for CLI or `cloud=True` for Python). The default is `--no-cloud` (CLI) or `cloud=False` (Python) if not otherwise specified.
+- `--extract-images` (CLI) or `extract_images` (Python): Optionally, download images and replace the HTML content with Base64-encoded images. The default is `--no-extract-images` (CLI) or `extract_images=False` (Python) if not otherwise specified.
+- `--extract-files` (CLI) or `extract_files` (Python): Optionally, download any embedded files. The default is `--no-extract-files` (CLI) or `extract_files=False` (Python) if not otherwise specified.
+- `--force-download` (CLI) or `force_download` (Python): Optionally, re-download extracted files even if they already exist locally. The default is `--no-force-download` (CLI) or `force_download=False` (Python) if not otherwise specified.
+- `--allow-list` (CLI) or `allow_list` (Python): Optionally, a comma-separated list (CLI) or an array of strings (Python) of allowed URLs to download. By default, the base URL that the original HTML came from is used, if not otherwise specified.
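
The `allow_list` default described in this hunk can be read as an origin check on each download URL. The sketch below only illustrates that reading and is not the connector's implementation: `is_allowed` and `_origin` are hypothetical helpers, and matching on scheme plus host is an assumed rule that the source does not spell out.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlparse


def _origin(url: str) -> Tuple[str, str]:
    """Return the (scheme, host) pair of a URL, with the host lowercased."""
    parts = urlparse(url)
    return parts.scheme, parts.netloc.lower()


def is_allowed(url: str, base_url: str, allow_list: Optional[List[str]] = None) -> bool:
    """Hypothetical check: may `url` be downloaded?

    Assumption: an empty or missing allow list falls back to the base URL
    that the page's HTML came from, as the option description suggests.
    """
    allowed = allow_list or [base_url]
    return _origin(url) in {_origin(entry) for entry in allowed}
```

Under these assumptions, passing `allow_list=[]` behaves the same as omitting it, so only URLs on the page's own host would be fetched.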

snippets/general-shared-text/confluence-platform.mdx

Lines changed: 3 additions & 0 deletions
@@ -4,10 +4,13 @@ Fill in the following fields:
 - **URL** (_required_): The target Confluence site's URL.
 - For personal access token (PAT) authentication: for **Authentication Method**, select **Personal Access Token**. Then enter the PAT into the **Personal Access Token** field.
 - For API token or password authentication: for **Authentication Method**, select **Password or API token**. Then enter the user's name or email address into the **Username** field and the API token or password into the **Password** field. Also, if you are using Confluence Cloud, check the **Cloud** box.
+- **Cloud**: Check this box if you are using Confluence Cloud. By default this box is unchecked.
 - **Max number of spaces**: The maximum number of Confluence spaces to access within the Confluence Cloud instance.
 The default is 500 unless otherwise specified.
 - **Max number of docs per space**: The maximum number of documents to access within each space.
 The default is 150 unless otherwise specified.
 - **List of spaces**: A comma-separated string that lists the names of all of the spaces to access, for example: `luke,paul`.
 By default, if no space names are specified, and the **Max Number of Spaces** is reached for the instance, be aware that you might get
 unexpected results.
+- **Extract inline images**: Check this box to download images and replace the HTML content with Base64-encoded images. By default, this box is unchecked.
+- **Extract files**: Check this box to download any embedded files in pages. By default, this box is unchecked.

snippets/source_connectors/confluence.sh.mdx

Lines changed: 9 additions & 0 deletions
@@ -11,6 +11,9 @@ unstructured-ingest \
     --spaces luke,paul \
     --max-num-of-spaces 500 \
     --max-num-of-docs-from-each-space 150 \
+    --extract-images \
+    --extract-files \
+    --force-download \
     --output-dir $LOCAL_FILE_OUTPUT_DIR \
     --partition-by-api \
     --api-key $UNSTRUCTURED_API_KEY \
@@ -26,6 +29,9 @@ unstructured-ingest \
     --spaces luke,paul \
     --max-num-of-spaces 500 \
     --max-num-of-docs-from-each-space 150 \
+    --extract-images \
+    --extract-files \
+    --force-download \
     --output-dir $LOCAL_FILE_OUTPUT_DIR \
     --partition-by-api \
     --api-key $UNSTRUCTURED_API_KEY \
@@ -43,6 +49,9 @@ unstructured-ingest \
     --spaces luke,paul \
     --max-num-of-spaces 500 \
     --max-num-of-docs-from-each-space 150 \
+    --extract-images \
+    --extract-files \
+    --force-download \
     --output-dir $LOCAL_FILE_OUTPUT_DIR \
     --partition-by-api \
     --api-key $UNSTRUCTURED_API_KEY \

snippets/source_connectors/confluence.v2.py.mdx

Lines changed: 7 additions & 2 deletions
@@ -26,8 +26,13 @@ if __name__ == "__main__":
             max_num_of_spaces=500,
             max_num_of_docs_from_each_space=150
         ),
-        downloader_config=ConfluenceDownloaderConfig(download_dir=os.getenv("LOCAL_FILE_DOWNLOAD_DIR")),
-
+        downloader_config=ConfluenceDownloaderConfig(
+            download_dir=os.getenv("LOCAL_FILE_DOWNLOAD_DIR"),
+            extract_images=True,
+            extract_files=True,
+            force_download=True,
+            allow_list=[]
+        ),
         # For API token authentication:
         source_connection_config=ConfluenceConnectionConfig(
             access_config=ConfluenceAccessConfig(

snippets/source_connectors/confluence_rest_create.mdx

Lines changed: 2 additions & 0 deletions
@@ -13,6 +13,8 @@ curl --request 'POST' --location \
         "max_num_of_spaces": <max-num-of-spaces>,
         "max_num_of_docs_from_each_space": <max-num-of-docs-from-each-space>,
         "spaces": ["<space-name>", "<space-name>"],
+        "extract_images": "<true|false>",
+        "extract_files": "<true|false>",
 
         # For API token authentication:
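
For readers filling in the placeholders in this hunk, the sketch below builds the same config fields in Python and serializes them to JSON. It is illustrative only: `confluence_source_config` is a hypothetical helper, the example values are taken from the CLI snippets in this commit, and emitting `extract_images` and `extract_files` as JSON booleans rather than quoted strings is an assumption that the source does not confirm.

```python
import json


def confluence_source_config(extract_images: bool = False, extract_files: bool = False) -> dict:
    """Hypothetical helper that assembles the config fields shown in the diff."""
    return {
        "max_num_of_spaces": 500,
        "max_num_of_docs_from_each_space": 150,
        "spaces": ["luke", "paul"],
        # Assumption: the API accepts real JSON booleans for these fields.
        "extract_images": extract_images,
        "extract_files": extract_files,
    }


# Python's True/False serialize to JSON true/false.
print(json.dumps(confluence_source_config(extract_images=True), indent=2))
```

Both flags default to `False` here, matching the documented default for the new settings.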

snippets/source_connectors/confluence_sdk.mdx

Lines changed: 2 additions & 0 deletions
@@ -20,6 +20,8 @@ with UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")) as clien
                 max_num_of_spaces=<max-num-of-spaces>,
                 max_num_of_docs_from_each_space=<max-num-of-docs-from-each-space>,
                 spaces=["<space-name>", "<space-name>"],
+                extract_images=<True|False>,
+                extract_files=<True|False>,
 
                 # For API token authentication:
