
Commit 2e8185b

Authored by pamelafox, mattgotteiner, and Copilot

Add ACL support for cloud ingestion pipeline (#2917)
* Add custom role
* Add x-ms-elevated-read
* Conditionally provision role
* Fall back to AZURE_TENANT_ID if auth tenant unavailable
* Implement ACLs for cloud ingestion
* Address PR review comments for ACL cloud ingestion
  - Remove unused AZURE_ADLS_GEN2_FILESYSTEM_PATH from CI/CD files and docs
  - Make ACL fields conditional on use_acls in cloudingestionstrategy.py
  - Add use_acls setting to text_processor for conditional ACL output
  - Make local_files required parameter in setup_list_file_strategy()
  - Add legacy fallback comments for AZURE_CLOUD_INGESTION_STORAGE_ACCOUNT
  - Rename is_adls to storage_is_adls for clarity
  - Update tests for all changes
* Fix markdown error
* Address Copilot PR review comments
  - Fix print statement to use f-string instead of %s placeholder
  - Fix typo in comment (hyphen to colon)
  - Include empty ACL arrays when use_acls=True to distinguish 'no ACLs' from 'not extracted'
  - Add 3 tests for text_processor ACL passthrough behavior
* Upgrade black to 26.1.0 and reformat files
  - Update pre-commit config to black 26.1.0
  - Reformat test_htmlparser.py and test_textparser.py
  - Add github-pr-inline-reply skill
* Add issue links to skill
* Add docs on using verify
* Fix ACL config issues from code review
  - Fix env var: use USE_CLOUD_INGESTION_ACLS instead of AZURE_USE_AUTHENTICATION in setup_cloud_ingestion.py
  - Add warning when AZURE_ENFORCE_ACCESS_CONTROL enabled without USE_CLOUD_INGESTION_ACLS
  - Move DataLakeServiceClient initialization to GlobalSettings for better performance
* Update tests for GlobalSettings data_lake_service_client field
* Address PR review comments: add permission note and USE_CLOUD_INGESTION_ACLS to env vars reference
* Fix: use consistent USE_CLOUD_INGESTION_ACLS env var name across functions and Bicep
* Reorganize login_and_acl.md: remove outdated Testing section, move ACL verification to cloud ingestion, move Programmatic access to top-level, rename Troubleshooting
* Address PR review comments: add Bicep parameter descriptions and documentation
  - Add comment on adlsStorageAccountName explaining it must be specified when useExistingAdlsStorage is true (Bicep assert is experimental)
  - Add @description() decorators to storage-role.bicep parameters
  - Add comment explaining useCloudIngestionAcls requires useCloudIngestion
* Update docs/login_and_acl.md
* Fix markdown linting: add blank line above heading

---------

Co-authored-by: Matt Gotteiner <matthew.gotteiner@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 8292fbe commit 2e8185b

File tree

19 files changed: +1120 −405 lines


.azdo/pipelines/azure-dev.yml

Lines changed: 5 additions & 2 deletions
@@ -87,6 +87,11 @@ steps:
   AZURE_LOG_ANALYTICS: $(AZURE_LOG_ANALYTICS)
   USE_VECTORS: $(USE_VECTORS)
   USE_MULTIMODAL: $(USE_MULTIMODAL)
+  USE_CLOUD_INGESTION: $(USE_CLOUD_INGESTION)
+  USE_CLOUD_INGESTION_ACLS: $(USE_CLOUD_INGESTION_ACLS)
+  USE_EXISTING_ADLS_STORAGE: $(USE_EXISTING_ADLS_STORAGE)
+  AZURE_ADLS_GEN2_STORAGE_ACCOUNT: $(AZURE_ADLS_GEN2_STORAGE_ACCOUNT)
+  AZURE_ADLS_GEN2_STORAGE_RESOURCE_GROUP: $(AZURE_ADLS_GEN2_STORAGE_RESOURCE_GROUP)
   AZURE_VISION_ENDPOINT: $(AZURE_VISION_ENDPOINT)
   VISION_SECRET_NAME: $(VISION_SECRET_NAME)
   AZURE_VISION_SERVICE: $(AZURE_VISION_SERVICE)
@@ -114,8 +119,6 @@ steps:
   ALLOWED_ORIGIN: $(ALLOWED_ORIGIN)
   AZURE_SERVER_APP_SECRET: $(AZURE_SERVER_APP_SECRET)
   AZURE_CLIENT_APP_SECRET: $(AZURE_CLIENT_APP_SECRET)
-  AZURE_ADLS_GEN2_STORAGE_ACCOUNT: $(AZURE_ADLS_GEN2_STORAGE_ACCOUNT)
-  AZURE_ADLS_GEN2_FILESYSTEM_PATH: $(AZURE_ADLS_GEN2_FILESYSTEM_PATH)
   AZURE_ADLS_GEN2_FILESYSTEM: $(AZURE_ADLS_GEN2_FILESYSTEM)
   DEPLOYMENT_TARGET: $(DEPLOYMENT_TARGET)
   AZURE_CONTAINER_APPS_WORKLOAD_PROFILE: $(AZURE_CONTAINER_APPS_WORKLOAD_PROFILE)

.github/workflows/azure-dev.yml

Lines changed: 5 additions & 2 deletions
@@ -83,6 +83,11 @@ jobs:
   AZURE_LOG_ANALYTICS: ${{ vars.AZURE_LOG_ANALYTICS }}
   USE_VECTORS: ${{ vars.USE_VECTORS }}
   USE_MULTIMODAL: ${{ vars.USE_MULTIMODAL }}
+  USE_CLOUD_INGESTION: ${{ vars.USE_CLOUD_INGESTION }}
+  USE_CLOUD_INGESTION_ACLS: ${{ vars.USE_CLOUD_INGESTION_ACLS }}
+  USE_EXISTING_ADLS_STORAGE: ${{ vars.USE_EXISTING_ADLS_STORAGE }}
+  AZURE_ADLS_GEN2_STORAGE_ACCOUNT: ${{ vars.AZURE_ADLS_GEN2_STORAGE_ACCOUNT }}
+  AZURE_ADLS_GEN2_STORAGE_RESOURCE_GROUP: ${{ vars.AZURE_ADLS_GEN2_STORAGE_RESOURCE_GROUP }}
   AZURE_VISION_ENDPOINT: ${{ vars.AZURE_VISION_ENDPOINT }}
   VISION_SECRET_NAME: ${{ vars.VISION_SECRET_NAME }}
   ENABLE_LANGUAGE_PICKER: ${{ vars.ENABLE_LANGUAGE_PICKER }}
@@ -103,8 +108,6 @@ jobs:
   AZURE_SERVER_APP_ID: ${{ vars.AZURE_SERVER_APP_ID }}
   AZURE_CLIENT_APP_ID: ${{ vars.AZURE_CLIENT_APP_ID }}
   ALLOWED_ORIGIN: ${{ vars.ALLOWED_ORIGIN }}
-  AZURE_ADLS_GEN2_STORAGE_ACCOUNT: ${{ vars.AZURE_ADLS_GEN2_STORAGE_ACCOUNT }}
-  AZURE_ADLS_GEN2_FILESYSTEM_PATH: ${{ vars.AZURE_ADLS_GEN2_FILESYSTEM_PATH }}
   AZURE_ADLS_GEN2_FILESYSTEM: ${{ vars.AZURE_ADLS_GEN2_FILESYSTEM }}
   DEPLOYMENT_TARGET: ${{ vars.DEPLOYMENT_TARGET }}
   AZURE_CONTAINER_APPS_WORKLOAD_PROFILE: ${{ vars.AZURE_CONTAINER_APPS_WORKLOAD_PROFILE }}

.github/workflows/evaluate.yaml

Lines changed: 0 additions & 1 deletion
@@ -98,7 +98,6 @@ jobs:
   AZURE_CLIENT_APP_ID: ${{ vars.AZURE_CLIENT_APP_ID }}
   ALLOWED_ORIGIN: ${{ vars.ALLOWED_ORIGIN }}
   AZURE_ADLS_GEN2_STORAGE_ACCOUNT: ${{ vars.AZURE_ADLS_GEN2_STORAGE_ACCOUNT }}
-  AZURE_ADLS_GEN2_FILESYSTEM_PATH: ${{ vars.AZURE_ADLS_GEN2_FILESYSTEM_PATH }}
   AZURE_ADLS_GEN2_FILESYSTEM: ${{ vars.AZURE_ADLS_GEN2_FILESYSTEM }}
   DEPLOYMENT_TARGET: ${{ vars.DEPLOYMENT_TARGET }}
   AZURE_CONTAINER_APPS_WORKLOAD_PROFILE: ${{ vars.AZURE_CONTAINER_APPS_WORKLOAD_PROFILE }}

app/backend/prepdocs.py

Lines changed: 5 additions & 34 deletions
@@ -16,8 +16,6 @@
     IntegratedVectorizerStrategy,
 )
 from prepdocslib.listfilestrategy import (
-    ADLSGen2ListFileStrategy,
-    ListFileStrategy,
     LocalListFileStrategy,
 )
 from prepdocslib.servicesetup import (
@@ -51,33 +49,13 @@ async def check_search_service_connectivity(search_service: str) -> bool:

 def setup_list_file_strategy(
     azure_credential: AsyncTokenCredential,
-    local_files: Optional[str],
-    datalake_storage_account: Optional[str],
-    datalake_filesystem: Optional[str],
-    datalake_path: Optional[str],
-    datalake_key: Optional[str],
+    local_files: str,
     enable_global_documents: bool = False,
 ):
-    list_file_strategy: ListFileStrategy
-    if datalake_storage_account:
-        if datalake_filesystem is None or datalake_path is None:
-            raise ValueError("DataLake file system and path are required when using Azure Data Lake Gen2")
-        adls_gen2_creds: AsyncTokenCredential | str = azure_credential if datalake_key is None else datalake_key
-        logger.info("Using Data Lake Gen2 Storage Account: %s", datalake_storage_account)
-        list_file_strategy = ADLSGen2ListFileStrategy(
-            data_lake_storage_account=datalake_storage_account,
-            data_lake_filesystem=datalake_filesystem,
-            data_lake_path=datalake_path,
-            credential=adls_gen2_creds,
-            enable_global_documents=enable_global_documents,
-        )
-    elif local_files:
-        logger.info("Using local files: %s", local_files)
-        list_file_strategy = LocalListFileStrategy(
-            path_pattern=local_files, enable_global_documents=enable_global_documents
-        )
-    else:
-        raise ValueError("Either local_files or datalake_storage_account must be provided.")
+    logger.info("Using local files: %s", local_files)
+    list_file_strategy = LocalListFileStrategy(
+        path_pattern=local_files, enable_global_documents=enable_global_documents
+    )
     return list_file_strategy


@@ -165,9 +143,6 @@ async def main(strategy: Strategy, setup_index: bool = True):
         required=False,
         help="Optional. Use this Azure Blob Storage account key instead of the current user identity to login (use az login to set current user for Azure)",
     )
-    parser.add_argument(
-        "--datalakekey", required=False, help="Optional. Use this key when authenticating to Azure Data Lake Gen2"
-    )
     parser.add_argument(
         "--documentintelligencekey",
         required=False,
@@ -274,10 +249,6 @@ async def main(strategy: Strategy, setup_index: bool = True):
     list_file_strategy = setup_list_file_strategy(
         azure_credential=azd_credential,
         local_files=args.files,
-        datalake_storage_account=os.getenv("AZURE_ADLS_GEN2_STORAGE_ACCOUNT"),
-        datalake_filesystem=os.getenv("AZURE_ADLS_GEN2_FILESYSTEM"),
-        datalake_path=os.getenv("AZURE_ADLS_GEN2_FILESYSTEM_PATH"),
-        datalake_key=clean_key_if_exists(args.datalakekey),
         enable_global_documents=enable_global_documents,
     )

app/backend/prepdocslib/cloudingestionstrategy.py

Lines changed: 47 additions & 3 deletions
@@ -131,6 +131,9 @@ def _build_skillset(self) -> SearchIndexerSkillset:
         ]
         if self.use_multimodal:
             mappings.append(InputFieldMappingEntry(name="images", source="/document/chunks/*/images"))
+        if self.use_acls:
+            mappings.append(InputFieldMappingEntry(name="oids", source="/document/chunks/*/oids"))
+            mappings.append(InputFieldMappingEntry(name="groups", source="/document/chunks/*/groups"))

         index_projection = SearchIndexerIndexProjection(
             selectors=[
@@ -168,7 +171,16 @@ def _build_skillset(self) -> SearchIndexerSkillset:
             outputs=[
                 OutputFieldMappingEntry(name="pages", target_name="pages"),
                 OutputFieldMappingEntry(name="figures", target_name="figures"),
-            ],
+            ]
+            + (
+                [
+                    # ACL outputs for document-level access control (populated by manual ADLS Gen2 extraction)
+                    OutputFieldMappingEntry(name="oids", target_name="oids"),
+                    OutputFieldMappingEntry(name="groups", target_name="groups"),
+                ]
+                if self.use_acls
+                else []
+            ),
         )

         figure_processor_skill = WebApiSkill(
@@ -232,7 +244,16 @@ def _build_skillset(self) -> SearchIndexerSkillset:
             ),
             InputFieldMappingEntry(name="file_name", source="/document/metadata_storage_name"),
             InputFieldMappingEntry(name="storageUrl", source="/document/metadata_storage_path"),
-            ],
+            ]
+            + (
+                [
+                    # ACL fields from document_extractor's manual ADLS Gen2 ACL extraction
+                    InputFieldMappingEntry(name="oids", source="/document/oids"),
+                    InputFieldMappingEntry(name="groups", source="/document/groups"),
+                ]
+                if self.use_acls
+                else []
+            ),
             outputs=[OutputFieldMappingEntry(name="output", target_name="consolidated_document")],
         )

@@ -272,6 +293,23 @@ async def setup(self) -> None:
         if not isinstance(self.embeddings, OpenAIEmbeddings):
             raise TypeError("Cloud ingestion requires Azure OpenAI embeddings to configure the search index.")

+        # Warn if access control is enforced but ACL extraction is not enabled
+        if self.enforce_access_control and not self.use_acls:
+            logger.warning(
+                "AZURE_ENFORCE_ACCESS_CONTROL is enabled but USE_CLOUD_INGESTION_ACLS is not. "
+                "Documents will not have ACLs extracted automatically from ADLS Gen2. "
+                "If you intend to use document-level access control, either set USE_CLOUD_INGESTION_ACLS=true "
+                "or manually set ACLs using scripts/manageacl.py after ingestion."
+            )
+
+        # Verify the storage container exists before attempting to create the data source
+        container_client = self.blob_manager.blob_service_client.get_container_client(self.blob_manager.container)
+        if not await container_client.exists():
+            raise ValueError(
+                f"Storage container '{self.blob_manager.container}' does not exist in storage account '{self.blob_manager.account}'. "
+                f"Please create the container first, or set AZURE_STORAGE_CONTAINER to an existing container name."
+            )
+
         self._search_manager = SearchManager(
             search_info=self.search_info,
             search_analyzer_name=self.search_analyzer_name,
@@ -287,9 +325,15 @@ async def setup(self) -> None:
         await self._search_manager.create_index()

         async with self.search_info.create_search_indexer_client() as indexer_client:
+            # Use ADLS_GEN2 when ACLs are enabled (requires hierarchical namespace storage)
+            # Note: We do NOT use indexer_permission_options because that's incompatible with
+            # Custom WebAPI skills. Instead, ACLs are extracted manually in document_extractor.
+            data_source_type = (
+                SearchIndexerDataSourceType.ADLS_GEN2 if self.use_acls else SearchIndexerDataSourceType.AZURE_BLOB
+            )
             data_source_connection = SearchIndexerDataSourceConnection(
                 name=self.data_source_name,
-                type=SearchIndexerDataSourceType.AZURE_BLOB,
+                type=data_source_type,
                 connection_string=self.blob_manager.get_managedidentity_connectionstring(),
                 container=SearchIndexerDataContainer(name=self.blob_manager.container),
                 data_deletion_detection_policy=NativeBlobSoftDeleteDeletionDetectionPolicy(),
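The diff appends ACL input/output mappings to a skill only when `use_acls` is set, using list concatenation with a conditional expression. A minimal self-contained sketch of that pattern, using plain dicts in place of the SDK's `OutputFieldMappingEntry` (the helper name `build_skill_outputs` is illustrative, not from the repo):

```python
def build_skill_outputs(use_acls: bool) -> list[dict[str, str]]:
    """Build a skill's output mappings, appending ACL outputs only
    when ACL extraction is enabled (mirrors the list + conditional
    concatenation pattern used in _build_skillset)."""
    outputs = [
        {"name": "pages", "target_name": "pages"},
        {"name": "figures", "target_name": "figures"},
    ]
    # ACL outputs for document-level access control; per the commit message,
    # empty ACL arrays are still emitted when use_acls=True so that
    # "no ACLs" can be distinguished from "not extracted"
    if use_acls:
        outputs += [
            {"name": "oids", "target_name": "oids"},
            {"name": "groups", "target_name": "groups"},
        ]
    return outputs
```

Keeping the base list untouched when `use_acls` is false means existing deployments see an identical skillset definition, so no indexer reset is forced on them.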

app/backend/prepdocslib/listfilestrategy.py

Lines changed: 0 additions & 78 deletions
@@ -3,17 +3,11 @@
 import logging
 import os
 import re
-import tempfile
 from abc import ABC
 from collections.abc import AsyncGenerator
 from glob import glob
 from typing import IO, Optional

-from azure.core.credentials_async import AsyncTokenCredential
-from azure.storage.filedatalake.aio import (
-    DataLakeServiceClient,
-)
-
 logger = logging.getLogger("scripts")


@@ -136,75 +130,3 @@ def check_md5(self, path: str) -> bool:
             md5_f.write(existing_hash)

         return False
-
-
-class ADLSGen2ListFileStrategy(ListFileStrategy):
-    """
-    Concrete strategy for listing files that are located in a data lake storage account
-    """
-
-    def __init__(
-        self,
-        data_lake_storage_account: str,
-        data_lake_filesystem: str,
-        data_lake_path: str,
-        credential: AsyncTokenCredential | str,
-        enable_global_documents: bool = False,
-    ):
-        self.data_lake_storage_account = data_lake_storage_account
-        self.data_lake_filesystem = data_lake_filesystem
-        self.data_lake_path = data_lake_path
-        self.credential = credential
-        self.enable_global_documents = enable_global_documents
-
-    async def list_paths(self) -> AsyncGenerator[str, None]:
-        async with DataLakeServiceClient(
-            account_url=f"https://{self.data_lake_storage_account}.dfs.core.windows.net", credential=self.credential
-        ) as service_client, service_client.get_file_system_client(self.data_lake_filesystem) as filesystem_client:
-            async for path in filesystem_client.get_paths(path=self.data_lake_path, recursive=True):
-                if path.is_directory:
-                    continue
-
-                yield path.name
-
-    async def list(self) -> AsyncGenerator[File, None]:
-        async with DataLakeServiceClient(
-            account_url=f"https://{self.data_lake_storage_account}.dfs.core.windows.net", credential=self.credential
-        ) as service_client, service_client.get_file_system_client(self.data_lake_filesystem) as filesystem_client:
-            async for path in self.list_paths():
-                temp_file_path = os.path.join(tempfile.gettempdir(), os.path.basename(path))
-                try:
-                    async with filesystem_client.get_file_client(path) as file_client:
-                        with open(temp_file_path, "wb") as temp_file:
-                            downloader = await file_client.download_file()
-                            await downloader.readinto(temp_file)
-                        # Parse out user ids and group ids
-                        acls: dict[str, list[str]] = {"oids": [], "groups": []}
-                        # https://learn.microsoft.com/python/api/azure-storage-file-datalake/azure.storage.filedatalake.datalakefileclient?view=azure-python#azure-storage-filedatalake-datalakefileclient-get-access-control
-                        # Request ACLs as GUIDs
-                        access_control = await file_client.get_access_control(upn=False)
-                        acl_list = access_control["acl"]
-                        # https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-access-control
-                        # ACL Format: user::rwx,group::r-x,other::r--,user:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:r--
-                        acl_list = acl_list.split(",")
-                        for acl in acl_list:
-                            acl_parts: list = acl.split(":")
-                            if len(acl_parts) != 3:
-                                continue
-                            if len(acl_parts[1]) == 0:
-                                continue
-                            if acl_parts[0] == "user" and "r" in acl_parts[2]:
-                                acls["oids"].append(acl_parts[1])
-                            if acl_parts[0] == "group" and "r" in acl_parts[2]:
-                                acls["groups"].append(acl_parts[1])
-
-                    if self.enable_global_documents and len(acls["oids"]) == 0 and len(acls["groups"]) == 0:
-                        acls = {"oids": ["all"], "groups": ["all"]}
-
-                    yield File(content=open(temp_file_path, "rb"), acls=acls, url=file_client.url)
-                except Exception as data_lake_exception:
-                    logger.error(f"\tGot an error while reading {path} -> {data_lake_exception} --> skipping file")
-                try:
-                    os.remove(temp_file_path)
-                except Exception as file_delete_exception:
-                    logger.error(f"\tGot an error while deleting {temp_file_path} -> {file_delete_exception}")
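The deleted `ADLSGen2ListFileStrategy.list()` parsed ADLS Gen2 POSIX-style ACL strings into reader OIDs and group IDs; per the commit message, that extraction now happens in the document_extractor skill. A standalone sketch of the same parsing logic, mirroring the removed code (the helper name `parse_adls_acl` is illustrative):

```python
def parse_adls_acl(acl_string: str, enable_global_documents: bool = False) -> dict[str, list[str]]:
    """Parse an ADLS Gen2 ACL string into reader OIDs and group IDs.

    ACL format (per the Azure access-control docs referenced in the
    removed code): user::rwx,group::r-x,other::r--,user:<guid>:r--
    Entries with an empty qualifier (the owning user/group and the
    'other' entry) are skipped; only named users/groups that hold
    read permission are collected.
    """
    acls: dict[str, list[str]] = {"oids": [], "groups": []}
    for entry in acl_string.split(","):
        parts = entry.split(":")
        if len(parts) != 3 or not parts[1]:
            continue  # malformed entry, or owning user/group (empty qualifier)
        kind, qualifier, perms = parts
        if "r" in perms:
            if kind == "user":
                acls["oids"].append(qualifier)
            elif kind == "group":
                acls["groups"].append(qualifier)
    # Files with no explicit readers become globally visible when enabled
    if enable_global_documents and not acls["oids"] and not acls["groups"]:
        acls = {"oids": ["all"], "groups": ["all"]}
    return acls
```

Note that the removed code requested ACLs as GUIDs (`get_access_control(upn=False)`), so the qualifiers collected here are object IDs, not user principal names.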

app/backend/setup_cloud_ingestion.py

Lines changed: 17 additions & 3 deletions
@@ -36,13 +36,18 @@ async def setup_cloud_ingestion_strategy(
     search_service = os.environ["AZURE_SEARCH_SERVICE"]
     index_name = os.environ["AZURE_SEARCH_INDEX"]
     search_user_assigned_identity_resource_id = os.environ["AZURE_SEARCH_USER_ASSIGNED_IDENTITY_RESOURCE_ID"]
-    storage_account = os.environ["AZURE_STORAGE_ACCOUNT"]
     storage_container = os.environ["AZURE_STORAGE_CONTAINER"]
-    storage_resource_group = os.environ["AZURE_STORAGE_RESOURCE_GROUP"]
     subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
     image_storage_container = os.environ.get("AZURE_IMAGESTORAGE_CONTAINER")
     search_embedding_field = os.environ["AZURE_SEARCH_FIELD_NAME_EMBEDDING"]

+    # Cloud ingestion storage account (ADLS Gen2 when ACLs enabled, standard blob otherwise)
+    # Fallback to AZURE_STORAGE_ACCOUNT is for legacy deployments only - may be removed in future
+    storage_account = os.getenv("AZURE_CLOUD_INGESTION_STORAGE_ACCOUNT") or os.environ["AZURE_STORAGE_ACCOUNT"]
+    storage_resource_group = (
+        os.getenv("AZURE_CLOUD_INGESTION_STORAGE_RESOURCE_GROUP") or os.environ["AZURE_STORAGE_RESOURCE_GROUP"]
+    )
+
     # Cloud ingestion specific endpoints
     document_extractor_uri = os.environ["DOCUMENT_EXTRACTOR_SKILL_ENDPOINT"]
     document_extractor_resource_id = os.environ["DOCUMENT_EXTRACTOR_SKILL_AUTH_RESOURCE_ID"]
@@ -53,10 +58,19 @@ async def setup_cloud_ingestion_strategy(

     # Feature flags
     use_multimodal = os.getenv("USE_MULTIMODAL", "").lower() == "true"
-    use_acls = os.getenv("AZURE_USE_AUTHENTICATION", "").lower() == "true"
+    use_acls = os.getenv("USE_CLOUD_INGESTION_ACLS", "").lower() == "true"
     enforce_access_control = os.getenv("AZURE_ENFORCE_ACCESS_CONTROL", "").lower() == "true"
     use_web_source = os.getenv("USE_WEB_SOURCE", "").lower() == "true"

+    # Warn if access control is enforced but ACL extraction is not enabled
+    if enforce_access_control and not use_acls:
+        logger.warning(
+            "AZURE_ENFORCE_ACCESS_CONTROL is enabled but USE_CLOUD_INGESTION_ACLS is not. "
+            "Documents will be indexed without ACLs, so access control filtering will not work. "
+            "Either set USE_CLOUD_INGESTION_ACLS=true to extract ACLs from ADLS Gen2, "
+            "or manually set ACLs using scripts/manageacl.py after ingestion."
+        )
+
     # Setup search info
     search_info = setup_search_info(
         search_service=search_service,