Commit 30c74d7

fix #563 - update documentation
1 parent 87af195 commit 30c74d7

File tree

1 file changed: +77 -19 lines changed

src/main/java/com/marklogic/client/example/cookbook/datamovement/IncrementalLoadFromJdbc.java

Lines changed: 77 additions & 19 deletions
@@ -49,12 +49,19 @@
  * example source data (accessed via JDBC) continues to grow and evolve, so
  * updates from the source must be regularly incorporated into the target system
  * (MarkLogic Server). These updates include new documents, updated documents,
- * and deleted documents. The source data is too large to ingest completely
- * every time. So this example addresses the more difficult scenario where
- * incremental loads are required to include only the updates. Additionally,
- * this example addresses the more difficult scenario where the source system
- * can provide a list of all current document uris, but cannot provide any
- * information about modified or deleted documents.
+ * and deleted documents.
+ *
+ * The source data is too large to ingest completely every time. So this
+ * example addresses the more difficult scenario where incremental loads are
+ * required to include only the updates.
+ *
+ * Many source systems offer a document version or last-updated timestamp.
+ * This pattern addresses the more difficult scenario where the source system
+ * offers no such option.
+ *
+ * Additionally, this example addresses the more difficult scenario where the
+ * source system can provide a list of all current document uris, but cannot
+ * provide any information about modified or deleted documents.
  *
  * # Solution
  *
@@ -78,12 +85,13 @@
  * Any document written to MarkLogic Server also has a "sidecar" document
  * written alongside it, containing metadata including the document uri, a
  * hashcode, and a jobName. The sidecar document has a collection representing
- * the data source. The hashcode is generated based on the source document
- * contents. The hashcode algorithm is consistent when the source document
- * hasn't changed and different any time the source document has changed. The
- * jobName is any id or timestamp representing the last job which validated
- * the document, and should differ from previous job runs. This sidecar
- * document is updated with each job run to reflect the latest jobName.
+ * the data source. The hashcode is generated based on select portions of the
+ * source document contents. The hashcode algorithm is consistent when the
+ * source document hasn't changed and different any time the source document
+ * has changed. The jobName is any id or timestamp representing the last job
+ * which checked the hashcode of the document, and should differ from previous
+ * job runs. This sidecar document is updated with each job run to reflect the
+ * latest jobName.
  *
  * ## Step 3
  *
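A minimal sketch of the hashcode idea described above, assuming illustrative field values (the field names and the SHA-256 choice are assumptions for the sketch, not taken from the recipe): hash only the relevant source fields, in a fixed order, so the result is stable when the source data is unchanged and different when it changes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of a sidecar hashcode: only the relevant source fields
// are hashed, so incidental columns (e.g. an export timestamp) never trigger
// a false "updated" result.
public class SidecarHash {
  public static String hashFor(String... relevantFields) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      for (String field : relevantFields) {
        digest.update(field.getBytes(StandardCharsets.UTF_8));
        // A separator byte keeps ("ab","c") distinct from ("a","bc").
        digest.update((byte) 0x1F);
      }
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest()) hex.append(String.format("%02x", b));
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String v1 = SidecarHash.hashFor("Jane", "Smith", "Sales");
    String v2 = SidecarHash.hashFor("Jane", "Smith", "Sales");
    String v3 = SidecarHash.hashFor("Jane", "Smith", "Marketing");
    System.out.println(v1.equals(v2)); // unchanged source data, equal hashes
    System.out.println(v1.equals(v3)); // changed source data, different hashes
  }
}
```

Any digest (or even a cheaper non-cryptographic hash) works, as long as the same input always yields the same output across job runs.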
@@ -92,38 +100,88 @@
  * jobName which indicates these documents are in MarkLogic but were missing
  * from this job run and are therefore not in the datasource. After confirming
  * that these documents are legitimately not in the datasource, they are
- * deleted from MarkLogic Server. This is how we stay up-to-date with deletes
- * when the source system offers no way to track deleted documents.
+ * archived in MarkLogic Server. To archive documents we remove the collection
+ * for this datasource and add an "archived" collection. This effectively
+ * removes the documents from queries that are looking for documents in the
+ * collection for this datasource. This is how we stay up-to-date with
+ * deletes when the source system offers no way to track deleted documents.
  *
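The Step 3 logic above can be sketched with in-memory stand-ins for the sidecar documents (in the real recipe an equivalent query runs against MarkLogic Server; the collection names and uris here are hypothetical): any sidecar whose jobName does not match the current run was not seen in the source, so its collection is swapped from the datasource collection to "archived".

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch of Step 3 using in-memory maps instead of a MarkLogic
// query; "datasource-jdbc" and "archived" are assumed collection names.
public class ArchiveStaleSidecars {
  static final String DATASOURCE = "datasource-jdbc";
  static final String ARCHIVED = "archived";

  // sidecars: uri -> jobName of the last run that saw the source document.
  // A stale sidecar is one the current job run did not touch.
  public static Set<String> findStale(Map<String, String> sidecars, String currentJob) {
    return sidecars.entrySet().stream()
        .filter(e -> !currentJob.equals(e.getValue()))
        .map(Map.Entry::getKey)
        .collect(Collectors.toSet());
  }

  // collections: uri -> collections; swap the datasource collection for
  // "archived" so datasource-scoped queries no longer see the document.
  public static void archive(Map<String, Set<String>> collections, Set<String> staleUris) {
    for (String uri : staleUris) {
      Set<String> c = collections.get(uri);
      c.remove(DATASOURCE);
      c.add(ARCHIVED);
    }
  }

  public static void main(String[] args) {
    Map<String, String> sidecars = new HashMap<>();
    sidecars.put("/emp/1.json", "job-42");
    sidecars.put("/emp/2.json", "job-41"); // missing from the current run
    Map<String, Set<String>> collections = new HashMap<>();
    collections.put("/emp/1.json", new HashSet<>(Set.of(DATASOURCE)));
    collections.put("/emp/2.json", new HashSet<>(Set.of(DATASOURCE)));

    Set<String> stale = findStale(sidecars, "job-42");
    archive(collections, stale);
    System.out.println(collections.get("/emp/2.json").contains(ARCHIVED));
  }
}
```

Because the documents are archived rather than deleted, a mistaken "missing" verdict (for example, a partial extract from the source) is recoverable by restoring the collection.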
- * # Solution Alternative
+ * # Alternative Solutions
+ *
+ * ## Alternative Solution 1
  *
  * If your scenario allows you to load all the documents each time, do that
  * because it's simpler. Simply delete in the target all data from that one
  * source then reload the latest data from that source. This addresses new
  * documents, updated documents, and deleted documents.
  *
+ * ## Alternative Solution 2
+ *
+ * Your scenario may be different if it requires a one-time data migration
+ * rather than an ongoing load of updates from the source. For example, a
+ * one-time load for a production cut-over may have significant performance
+ * requirements this solution cannot address. Also, some one-time migrations
+ * will not require comparison of hashcodes nor tracking of deletes.
+ *
+ * # Adjustments
+ *
  * # Solution Adjustment 1
  *
+ * If the source can provide you with last-updated timestamps, compare those
+ * instead of hashcodes. This removes the effort of selecting which portions
+ * of the document to include in the hashcode, and it avoids the processing
+ * cost of calculating hashcodes each time.
+ *
+ * # Solution Adjustment 2
+ *
  * The sidecar document can be written to a different MarkLogic database,
  * cluster, or non-MarkLogic system (including the file system). This will
  * reduce the read load on the database with the actual document contents.
  * This also opens more options, such as writing the sidecar to a database
  * with a different configuration, including forests on less expensive storage.
  *
- * # Solution Adjustment 2
+ * # Solution Adjustment 3
  *
  * For systems that offer a way to track deleted documents, use that instead of
  * step 3. Get the list of uris of source documents deleted since the last job
- * run. Delete those documents (and associated sidecar files) from MarkLogic
- * Server.
+ * run. Archive or delete those documents (and associated sidecar files) from
+ * MarkLogic Server.
  *
- * # Solution Adjustment 3
+ * # Solution Adjustment 4
  *
  * The source documents can be read from a staging area containing at least the
  * uri and the up-to-date hashcode for each document. This will reduce the
  * read load on the source system to only documents found to be missing from
  * MarkLogic or updated from what is in MarkLogic.
  *
+ * # Gotchas
+ *
+ * ## No Staging of Source Documents in Target
+ *
+ * We recommend loading documents to a staging area in MarkLogic without
+ * transformations so we can see the documents in MarkLogic as they look in the
+ * source system. If we don't do that, and we transform the documents in
+ * MarkLogic, it may be confusing how to calculate hashcodes. Nevertheless,
+ * this pattern can still be applied; it just requires more careful design and
+ * documentation so it can reasonably be maintained.
+ *
+ * ## Documents are not 1-1 from Source to Target
+ *
+ * Not all documents (or records, or rows) from a source system map 1-1 to
+ * final documents in a target system. This may make it less obvious how to
+ * apply this pattern. Sometimes mapping source documents to target documents
+ * occurs client-side. Sometimes mapping source documents to target documents
+ * happens server-side, as in the Data Hub Framework. One key to resolving this
+ * is to generate hashcodes that help determine whether relevant source data
+ * changed, so hashcodes should incorporate all relevant source data but not
+ * data generated solely by transformations (or harmonization).
+ *
+ * When all relevant source data comes from multiple records, and no staging
+ * documents match source documents, the source records must of course be
+ * combined prior to calculating hashcodes, as we do in this example. Here we
+ * perform a join in the source relational database to combine all relevant
+ * data into multiple rows. Additionally, we combine multiple rows into a
+ * single Employee object before we calculate the hashcodes.
  */
 public class IncrementalLoadFromJdbc extends BulkLoadFromJdbcWithSimpleJoins {
   // threadCount and batchSize are only small because this recipe ships with a
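The last gotcha, combining multiple source rows into one object before hashing, can be sketched as grouping joined rows by a key and concatenating the relevant values in a stable order (the `Row` shape and field names here are illustrative, not the recipe's `Employee` class); the per-key string is what a hashcode would then be computed over.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: joined rows arrive one per (employee, attribute) pair;
// they must be combined into one input per employee before hashing, so that
// the hash reflects all relevant source data for that employee.
public class CombineRowsBeforeHash {
  record Row(int employeeId, String value) {}

  // Group rows by employeeId and join values with a separator in encounter
  // order; a TreeMap keeps the per-employee iteration order deterministic.
  public static Map<Integer, String> combinedInput(List<Row> rows) {
    return rows.stream().collect(Collectors.groupingBy(
        Row::employeeId,
        TreeMap::new,
        Collectors.mapping(Row::value, Collectors.joining("\u001F"))));
  }

  public static void main(String[] args) {
    List<Row> rows = List.of(
        new Row(1, "Jane"), new Row(1, "Sales"), new Row(2, "Bob"));
    System.out.println(combinedInput(rows)); // one combined string per employee
  }
}
```

The key point is that the row-to-object combination must itself be deterministic (same grouping, same ordering, same separator) or the downstream hashcodes will differ between runs even when the source data has not changed.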
