Commit 9dabf06

Update README.md
1 parent f0d16e4 commit 9dabf06

1 file changed: +2 -2 lines changed

README.md

Lines changed: 2 additions & 2 deletions
@@ -75,10 +75,10 @@ Click below to start your App service for MongoDB to Fabric replication:
 1. Note that the code creates two threads for each collection (one for initial_sync and one for delta_sync), so for large collections (~10 million+ records) we should be judicious in selecting the compute size of the App service or VM. As a high-level benchmark, a compute of 4 CPUs and 16 GiB of memory might work for 5 such collections with a high throughput of, say, 1000 records/second. Beyond that, we should monitor the performance and thread count and check the CPU usage.
 2. Azure Storage Explorer is the place to start troubleshooting. Use the files below that start with an underscore to get vital information. (They are not copied to OneLake because they start with an underscore "_".) Also note that these are pickle files, and you can view them using the command "python -mpickle _maxid.pkl" in a terminal.
 a. _max_id file: Tells you the maximum _id value that was captured before initial sync began. Any record with _id > this _max_id comes from real time sync. All records with _id <= this _max_id are copied as part of initial_sync.
-b. _resume_token: Contains the last resume token of the real time change event copied to LZ. Thus, you see this file only if at least one real time changes parquet file was written in LZ.
+b. _resume_token: Contains the last resume token of the real time change event copied to LZ. Thus, you see this file only if at least one real time changes parquet file was written in LZ.
 c. _initial_sync_status: Indicates whether initial_sync is complete. "Y" in this file indicates that initial_sync is complete.
 d. _metadata.json: Has the primary key, which is always "_id". This file should exist in a replicated folder/table for mirroring to work.
 e. _last_id: This is the "_id" value of the last record of the last initial sync batch file written to LZ. This file is deleted when initial sync is completed.
 f. _internal_schema: This is one of the very first files written and has the schema of the records in the collection being replicated.
-3. The restartability of the App service/ replication is guaranteed if _resume_token file is present. This is because if initial sync is not completed and we restart the App service, the delta changes that came in the interim were being accumulated in TEMP parquet files in the APp service whcih will be lost. Thus, as a best practice, if the process fails before initial sync is completed, it is advised to delete all files in the collection folder using Azure Storage explorer and restart the process so that it can get the new max _id and start initial_sync. Once initial_sync is completed and _resume_token file is created we can restart without any worries as it will pick up changes from the last resume_token from the change stream.
+3. The restartability of the App service/ replication is guaranteed if _resume_token file is present. This is because if initial sync is not completed and we restart the App service, the delta changes that came in the interim were being accumulated in TEMP parquet files in the App service which will be lost. Thus, as a best practice, if the process fails before initial sync is completed, it is advised to delete all files in the collection folder using Azure Storage Explorer and restart the process so that it can get the new max _id and start initial_sync. Once initial_sync is completed and _resume_token file is created we can restart without any worries as it will pick up changes from the last resume_token from the change stream.
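
The restart guidance in item 3 can be sketched as a small check over a collection folder downloaded with Azure Storage Explorer. This is only an illustration under stated assumptions: the file names `_resume_token.pkl` and `_initial_sync_status.pkl` (the README does not give the extensions), the idea that the status file unpickles to the string "Y", and the helper names `load_state` and `restart_advice` are all hypothetical and are not taken from the replication code.

```python
import pickle
import tempfile
from pathlib import Path


def load_state(path):
    """Unpickle one state file. Only use this on files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def restart_advice(collection_dir):
    """Map the state files present in a folder to the README's restart guidance."""
    folder = Path(collection_dir)
    # Assumed file names; adjust them to what you actually see in the folder.
    token_file = folder / "_resume_token.pkl"
    status_file = folder / "_initial_sync_status.pkl"

    if token_file.exists():
        return "safe to restart: changes resume from the last resume token"
    # Assumption: the status file unpickles to "Y" once initial sync is done.
    if not status_file.exists() or load_state(status_file) != "Y":
        return "delete all files in the collection folder, then restart"
    return "not covered by the README: initial sync done but no resume token"


# Demo on a throwaway folder that mimics a downloaded collection folder.
with tempfile.TemporaryDirectory() as tmp:
    before = restart_advice(tmp)  # no state files yet
    with open(Path(tmp) / "_resume_token.pkl", "wb") as fh:
        pickle.dump({"_data": "example-token"}, fh)
    after = restart_advice(tmp)  # resume token now present
    token = load_state(Path(tmp) / "_resume_token.pkl")
```

The check is deliberately read-only: it reports what the README recommends and leaves the deletion of files to the operator.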
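
The `_max_id` rule in item 2a amounts to a single comparison, sketched below. The integer `_id` values are stand-ins for whatever orderable `_id` type the collection uses, and `sync_source` is a hypothetical helper name used for illustration only.

```python
def sync_source(record_id, max_id):
    """Classify a record by comparing its _id with the stored _max_id."""
    # _id <= _max_id: copied by initial_sync; _id > _max_id: came from real time sync.
    return "initial_sync" if record_id <= max_id else "delta_sync"


# Stand-in values: assume _max_id was captured as 1000 before initial sync began.
max_id = 1000
records = [{"_id": 998}, {"_id": 1000}, {"_id": 1001}]
sources = [sync_source(r["_id"], max_id) for r in records]
print(sources)  # ['initial_sync', 'initial_sync', 'delta_sync']
```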

0 commit comments
