|
49 | 49 | * example source data (accessed via JDBC) continues to grow and evolve, so |
50 | 50 | * updates from the source must be regularly incorporated into the target system |
51 | 51 | * (MarkLogic Server). These updates include new documents, updated documents, |
52 | | - * and deleted documents. The source data is too large to ingest completely |
53 | | - * every time. So this example addresses the more difficult scenario where |
54 | | - * incremental loads are required to include only the updates. Additionally, |
55 | | - * this example addresses the more difficult scenario where the source system |
56 | | - * can provide a list of all current document uris, but cannot provide any |
57 | | - * information about modified or deleted documents. |
| 52 | + * and deleted documents. |
| 53 | + * |
| 54 | + * The source data is too large to ingest completely every time. So this |
| 55 | + * example addresses the more difficult scenario where incremental loads are |
| 56 | + * required to include only the updates. |
| 57 | + * |
| 58 | + * Many source systems offer a document version or last updated time-stamp. |
| 59 | + * This pattern addresses the more difficult scenario where the source system |
| 60 | + * offers no such option. |
| 61 | + * |
| 62 | + * Additionally, this example addresses the more difficult scenario where the |
| 63 | + * source system can provide a list of all current document uris, but cannot |
| 64 | + * provide any information about modified or deleted documents. |
58 | 65 | * |
59 | 66 | * # Solution |
60 | 67 | * |
|
78 | 85 | * Every document written to MarkLogic Server also gets a "sidecar" document |
79 | 86 | * containing metadata: the document uri, a hashcode, and a jobName. The |
80 | 87 | * sidecar document has a collection representing the data |
81 | | - * source. The hascode is generated based on the source document contents. |
82 | | - * The hascode algorithm is consistent when the source document hasn't changed |
83 | | - * and different any time the source document has changed. The jobName is any |
84 | | - * id or timestamp representing the last job which validated the document, and |
85 | | - * should differ from previous job runs. This sidecar document is updated with |
86 | | - * each job run to reflect the latest jobName. |
| 88 | + * source. The hashcode is generated from select portions of the source |
| 89 | + * document contents. The hashcode algorithm yields the same value when the |
| 90 | + * source document hasn't changed and a different value any time the source |
| 91 | + * document has changed. The jobName is any id or timestamp representing the |
| 92 | + * last job which checked the hashcode of the document, and should differ |
| 93 | + * from previous job runs. This sidecar document is updated with each job |
| 94 | + * run to reflect the latest jobName. |
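| | + * |
| | + * For illustration only, here is a minimal sketch of writing one sidecar |
| | + * document with the MarkLogic Java Client API (com.marklogic.client.*) and |
| | + * Jackson. The sidecar uri suffix, the collection name "datasource-hr", and |
| | + * the property names are assumptions of this sketch, and client is an |
| | + * already-connected DatabaseClient: |
| | + * |
| | + *     ObjectNode sidecar = new ObjectMapper().createObjectNode(); |
| | + *     sidecar.put("uri", docUri);        // uri of the document just written |
| | + *     sidecar.put("hashcode", hashcode); // hash of the source contents |
| | + *     sidecar.put("jobName", jobName);   // id of the job run that wrote it |
| | + * |
| | + *     // the collection marks which datasource this sidecar belongs to |
| | + *     DocumentMetadataHandle meta = |
| | + *         new DocumentMetadataHandle().withCollections("datasource-hr"); |
| | + *     client.newJSONDocumentManager() |
| | + *         .write(docUri + ".sidecar.json", meta, new JacksonHandle(sidecar)); |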
87 | 95 | * |
88 | 96 | * ## Step 3 |
89 | 97 | * |
|
92 | 100 | * jobName, which indicates these documents are in MarkLogic but were missing |
93 | 101 | * from this job run and are therefore not in the datasource. After confirming |
94 | 102 | * that these documents are legitimately not in the datasource, they are |
95 | | - * deleted from MarkLogic Server. This is how we stay up-to-date with deletes |
96 | | - * when the source system offers no way to track deleted documents. |
| 103 | + * archived in MarkLogic Server. To archive documents we remove the collection |
| 104 | + * for this datasource and add an "archived" collection. This effectively |
| 105 | + * hides the documents from queries that look for documents in this |
| 106 | + * datasource's collection. This is how we stay up-to-date with deletes |
| 107 | + * when the source system offers no way to track deleted documents. |
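| | + * |
| | + * As a sketch of this step (the collection names and the "jobName" property |
| | + * are the same assumptions as above), a Data Movement SDK QueryBatcher can |
| | + * find sidecars whose jobName was not refreshed by the current run and swap |
| | + * their collections; the document each sidecar references would be archived |
| | + * the same way: |
| | + * |
| | + *     StructuredQueryBuilder sqb = new StructuredQueryBuilder(); |
| | + *     StructuredQueryDefinition stale = sqb.and( |
| | + *         sqb.collection("datasource-hr"), |
| | + *         sqb.not(sqb.value(sqb.jsonProperty("jobName"), currentJobName))); |
| | + * |
| | + *     DataMovementManager dmm = client.newDataMovementManager(); |
| | + *     GenericDocumentManager docMgr = client.newDocumentManager(); |
| | + *     QueryBatcher batcher = dmm.newQueryBatcher(stale) |
| | + *         .onUrisReady(batch -> { |
| | + *             for (String uri : batch.getItems()) { |
| | + *                 // swap collections: out of the datasource, into "archived" |
| | + *                 DocumentMetadataHandle meta = |
| | + *                     docMgr.readMetadata(uri, new DocumentMetadataHandle()); |
| | + *                 meta.getCollections().remove("datasource-hr"); |
| | + *                 meta.getCollections().add("archived"); |
| | + *                 docMgr.writeMetadata(uri, meta); |
| | + *             } |
| | + *         }); |
| | + *     dmm.startJob(batcher); |
| | + *     batcher.awaitCompletion(); |
| | + *     dmm.stopJob(batcher); |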
97 | 108 | * |
98 | | - * # Solution Alternative |
| 109 | + * # Alternative Solutions |
| 110 | + * |
| 111 | + * ## Alternative Solution 1 |
99 | 112 | * |
100 | 113 | * If your scenario allows you to load all the documents each time, do that, |
101 | 114 | * because it's simpler: delete from the target all data from that one source, |
102 | 115 | * then reload the latest data from that source. This addresses new documents, |
103 | 116 | * updated documents, and deleted documents. |
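| | + * |
| | + * A sketch of that delete step, assuming the documents from this source |
| | + * carry the (hypothetical) collection "datasource-hr": |
| | + * |
| | + *     QueryManager qm = client.newQueryManager(); |
| | + *     DeleteQueryDefinition byCollection = qm.newDeleteDefinition(); |
| | + *     byCollection.setCollections("datasource-hr"); |
| | + *     qm.delete(byCollection); // removes every document from this source |
| | + *     // ...then reload the latest data from the source |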
104 | 117 | * |
| 118 | + * ## Alternative Solution 2 |
| 119 | + * |
| 120 | + * Your scenario may be different if it requires a one-time data migration |
| 121 | + * rather than an ongoing load of updates from the source. For example, a |
| 122 | + * one-time load for a production cut-over may have significant performance |
| 123 | + * requirements this solution cannot address. Also, some one-time migrations |
| 124 | + * will require neither comparison of hashcodes nor tracking of deletes. |
| 125 | + * |
| 126 | + * # Adjustments |
| 127 | + * |
105 | 128 | * ## Solution Adjustment 1 |
106 | 129 | * |
| 130 | + * If the source can provide you with last-updated timestamps, compare those |
| 131 | + * instead of hashcodes. This removes the effort of selecting which portions |
| 132 | + * of the document to include in the hashcode, and it avoids the cost of |
| 133 | + * calculating hashcodes on each run. |
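| | + * |
| | + * A sketch of the comparison, assuming the sidecar keeps the lastUpdated |
| | + * value seen on the previous run and the source row exposes a last_updated |
| | + * column (both names are hypothetical): |
| | + * |
| | + *     Instant previous = Instant.parse(sidecar.get("lastUpdated").asText()); |
| | + *     Instant current = sourceRow.getTimestamp("last_updated").toInstant(); |
| | + *     if (current.isAfter(previous)) { |
| | + *         // the source changed since the last run: re-ingest the document |
| | + *         // and update the sidecar |
| | + *     } |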
| 134 | + * |
| 135 | + * ## Solution Adjustment 2 |
| 136 | + * |
107 | 137 | * The sidecar document can be written to a different MarkLogic database, |
108 | 138 | * cluster, or non-MarkLogic system (including the file system). This will |
109 | 139 | * reduce the read load on the database with the actual document contents. |
110 | 140 | * This also opens more options to write sidecar to a database with a different |
111 | 141 | * configuration including forests on less expensive storage. |
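| | + * |
| | + * For example, the Java Client API can target another database by name when |
| | + * creating the client (the host, port, database, and credentials below are |
| | + * placeholders): |
| | + * |
| | + *     DatabaseClient sidecarClient = DatabaseClientFactory.newClient( |
| | + *         "localhost", 8000, "sidecar-db", |
| | + *         new DatabaseClientFactory.DigestAuthContext("user", "password")); |
| | + *     // write sidecar documents through sidecarClient instead of client |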
112 | 142 | * |
113 | | - * # Solution Adjustment 2 |
| 143 | + * ## Solution Adjustment 3 |
114 | 144 | * |
115 | 145 | * For systems that offer a way to track deleted documents, use that instead of |
116 | 146 | * step 3. Get the list of uris of source documents deleted since the last job |
117 | | - * run. Delete those documents (and associated sidecar files) from MarkLogic |
118 | | - * Server. |
| 147 | + * run. Archive or delete those documents (and associated sidecar files) from |
| 148 | + * MarkLogic Server. |
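| | + * |
| | + * A sketch, assuming the source hands back the deleted uris as a list; the |
| | + * Data Movement SDK DeleteListener deletes each batch of uris (the matching |
| | + * sidecar documents must be archived or deleted as well): |
| | + * |
| | + *     QueryBatcher deleter = dmm.newQueryBatcher(deletedUris.iterator()) |
| | + *         .onUrisReady(new DeleteListener()); |
| | + *     dmm.startJob(deleter); |
| | + *     deleter.awaitCompletion(); |
| | + *     dmm.stopJob(deleter); |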
119 | 149 | * |
120 | | - * # Solution Adjustment 3 |
| 150 | + * ## Solution Adjustment 4 |
121 | 151 | * |
122 | 152 | * The source documents can be read from a staging area containing at least the |
123 | 153 | * uri and the up-to-date hashcode for each document. This will reduce the |
124 | 154 | * read load on the source system to only documents found to be missing from |
125 | 155 | * MarkLogic or updated from what is in MarkLogic. |
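| | + * |
| | + * A sketch, assuming a (hypothetical) staging table of uri and hashcode |
| | + * pairs, plus hypothetical helpers for the sidecar lookup and the full |
| | + * fetch from the source: |
| | + * |
| | + *     try (ResultSet rs = |
| | + *             stmt.executeQuery("SELECT uri, hashcode FROM staging")) { |
| | + *         while (rs.next()) { |
| | + *             String uri = rs.getString("uri"); |
| | + *             String stagedHash = rs.getString("hashcode"); |
| | + *             // fetch the full document only when the hashcode differs |
| | + *             if (!stagedHash.equals(sidecarHashFor(uri))) { |
| | + *                 ingest(fetchFullDocument(uri)); |
| | + *             } |
| | + *         } |
| | + *     } |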
126 | 156 | * |
| 157 | + * # Gotchas |
| 158 | + * |
| 159 | + * ## No Staging of Source Documents in Target |
| 160 | + * |
| 161 | + * We recommend loading documents to a staging area in MarkLogic without |
| 162 | + * transformations so we can see the documents in MarkLogic as they look in the |
| 163 | + * source system. If we don't do that, and we instead transform the documents |
| 164 | + * in MarkLogic, it may be unclear how to calculate hashcodes. Nevertheless, |
| 165 | + * this pattern can still be applied; it just requires more careful design and |
| 166 | + * documentation so it can reasonably be maintained. |
| 167 | + * |
| 168 | + * ## Documents are not 1-1 from Source to Target |
| 169 | + * |
| 170 | + * Not all documents (or records, or rows) from a source system map 1-1 to |
| 171 | + * final documents in a target system. This may make it less obvious how to |
| 172 | + * apply this pattern. Sometimes mapping source documents to target documents |
| 173 | + * occurs client-side. Sometimes mapping source documents to target documents |
| 174 | + * happens server-side, as in the Data Hub Framework. One key to resolving this |
| 175 | + * is to generate hashcodes that help determine whether relevant source data |
| 176 | + * changed, so hashcodes should incorporate all relevant source data but not |
| 177 | + * data generated solely by transformations (or harmonization). |
| 178 | + * |
| 179 | + * When all relevant source data comes from multiple records, and no staging |
| 180 | + * documents match source documents, the source records must of course be |
| 181 | + * combined prior to calculating hashcodes, as we do in this example. Here we |
| 182 | + * perform a join in the source relational database to gather all relevant |
| 183 | + * data as multiple rows, then combine those rows into a single Employee |
| 184 | + * object before we calculate the hashcode. |
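| | + * |
| | + * As a sketch, hash only the source-derived fields of the combined Employee |
| | + * (the accessors below are hypothetical). The joined rows must be appended |
| | + * in a stable order, for example sorted by date, so the hashcode is |
| | + * reproducible: |
| | + * |
| | + *     StringBuilder source = new StringBuilder() |
| | + *         .append(employee.getEmployeeId()).append('|') |
| | + *         .append(employee.getGivenName()).append('|') |
| | + *         .append(employee.getSurname()).append('|'); |
| | + *     for (Salary salary : employee.getSalaries()) { // one row per salary |
| | + *         source.append(salary.getAmount()).append('|') |
| | + *               .append(salary.getFromDate()).append('|'); |
| | + *     } |
| | + *     // MessageDigest.getInstance declares NoSuchAlgorithmException |
| | + *     byte[] digest = MessageDigest.getInstance("SHA-256") |
| | + *         .digest(source.toString().getBytes(StandardCharsets.UTF_8)); |
| | + *     String hashcode = Base64.getEncoder().encodeToString(digest); |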
127 | 185 | */ |
128 | 186 | public class IncrementalLoadFromJdbc extends BulkLoadFromJdbcWithSimpleJoins { |
129 | 187 | // threadCount and batchSize are only small because this recipe ships with a |
|