@@ -157,22 +157,22 @@ To use the above transform, verify that your user has been granted the MarkLogic
157157
158158## Compressing content
159159
160- The ` --compression ` option is used to write files either to Gzip or ZIP files.
160+ The `--compression` option is used to write files to either gzip or ZIP files.
161161
162- To Gzip each file, include ` --compression GZIP ` .
162+ To gzip each file, include `--compression GZIP`.
163163
164- To write multiple files to one or more ZIP files, include ` --compression ZIP ` . A zip file will be created for each
164+ To write multiple files to one or more ZIP files, include `--compression ZIP`. A ZIP file will be created for each
165165partition that was created when reading data via Optic. You can include `--zip-file-count 1` to force all documents to be
166166written to a single ZIP file. See the "Understanding partitions" section below for more information.
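As a rough sketch of the two compression modes (the `--connection-string` and `--path` options and their values below are assumptions for illustration, not taken from the examples above):

```shell
# Gzip each exported document individually; each file is written with a .gz suffix.
./bin/flux export-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --path ./export \
    --compression GZIP

# Write all exported documents to a single ZIP file instead of one ZIP per partition.
./bin/flux export-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --path ./export \
    --compression ZIP \
    --zip-file-count 1
```

Both invocations require a running MarkLogic instance and a user with the roles described earlier, so treat these as templates rather than runnable commands.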
167167
168- ### Windows-specific issues with zip files
168+ ### Windows-specific issues with ZIP files
169169
170- In the likely event that you have one or more URIs with a forward slash - ` / ` - in them, then creating a zip file
170+ In the likely event that you have one or more URIs containing a forward slash (`/`), creating a ZIP file
171171with those URIs - which are used as the ZIP entry names - will produce confusing behavior on Windows. If you open the
172- zip file via Windows Explorer, Windows will erroneously think the zip file is empty. If you open the zip file using
172+ ZIP file via Windows Explorer, Windows will erroneously think the file is empty. If you open the file using
1731737-Zip, you will see a top-level entry named `_` if one or more of your URIs begin with a forward slash. These are
174174effectively issues that only occur when viewing the file within Windows and do not reflect the actual contents of the
175- zip file. The contents of the file are correct and if you were to import them with Flux via the ` import-files `
175+ ZIP file. The contents of the file are correct, and if you were to import them with Flux via the `import-files`
176176command, you would get the expected results.
177177
178178## Specifying an encoding
@@ -202,6 +202,27 @@ bin\flux export-files ^
202202{% endtabs %}
203203
204204
205+ ## Exporting large binary documents
206+
207+ MarkLogic's [support for large binary documents](https://docs.marklogic.com/guide/app-dev/binaries#id_93203) allows
208+ for storing binary files of any size. To ensure that large binary documents can be exported to a file path, consider
209+ using the `--streaming` option introduced in Flux 1.1.0. When this option is set, Flux will stream each document
210+ from MarkLogic directly to the file path, thereby avoiding reading the contents of a file into memory. This option
211+ can also be used when exporting documents to gzip or ZIP files via the `--compression` option.
212+
213+ As streaming to a file requires Flux to retrieve one document at a time from MarkLogic, you should not use this option
214+ when exporting smaller documents that can easily fit into the memory available to Flux.
215+
216+ When using `--streaming`, the following options behave differently:
217+
218+ - `--batch-size` will still affect how many URIs are retrieved from MarkLogic in a single request, but will not affect
219+ the number of documents retrieved in a single request, which will always be 1.
220+ - `--encoding` will be ignored, as applying an encoding requires reading the document into memory.
221+ - `--pretty-print` will have no effect, as the contents of a document will never be read into memory.
222+
223+ You typically will not want to use the `--transform` option, as applying a REST transform in MarkLogic to a
224+ large binary document may exhaust the memory available to MarkLogic.
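A sketch of streaming an export of large binaries (the `--connection-string`, `--collections`, and `--path` options and their values are assumptions for illustration, not shown in the surrounding examples):

```shell
# Stream each document directly from MarkLogic to disk, one document per
# request, so file contents are never read into Flux's memory.
./bin/flux export-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --collections large-binaries \
    --path ./export \
    --streaming \
    --compression ZIP
```

Note that per the behavior described above, any `--encoding` or `--pretty-print` options added to this command would be ignored.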
225+
205226## Understanding partitions
206227
207228As Flux is built on top of Apache Spark, it is heavily influenced by how Spark
@@ -237,9 +258,9 @@ bin\flux export-files ^
237258{% endtab %}
238259{% endtabs %}
239260
240- The ` ./export ` directory will have 12 zip files in it. This count is due to how Flux reads data from MarkLogic,
261+ The `./export` directory will have 12 ZIP files in it. This count is due to how Flux reads data from MarkLogic,
241262which involves creating 4 partitions by default per forest in the MarkLogic database. The example application has 3
242- forests in its content database, and thus 12 partitions are created, resulting in 12 separate zip files.
263+ forests in its content database, and thus 12 partitions are created, resulting in 12 separate ZIP files.
243264
244265You can use the `--partitions-per-forest` option to control how many partitions - and thus workers - read documents
245266from each forest in your database:
@@ -272,7 +293,7 @@ bin\flux export-files ^
272293{% endtabs %}
273294
274295
275- This approach will produce 3 zip files - one per forest.
296+ This approach will produce 3 ZIP files - one per forest.
276297
277298You can also use the `--repartition` option, available on every command, to force the number of partitions used when
278299writing data, regardless of how many were used to read the data:
@@ -303,7 +324,7 @@ bin\flux export-files ^
303324{% endtabs %}
304325
305326
306- This approach will produce a single zip file due to the use of a single partition when writing files.
327+ This approach will produce a single ZIP file due to the use of a single partition when writing files.
307328The `--zip-file-count` option is effectively an alias for `--repartition`. Both options produce the same outcome.
308329`--zip-file-count` is included as a more intuitive option for the common case of configuring how many files should
309330be written.
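For example, the two invocations below should produce the same single ZIP file (the `--connection-string` and `--path` options and their values are placeholder assumptions, not taken from the examples above):

```shell
# Force a single output file via the alias option...
./bin/flux export-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --path ./export \
    --compression ZIP \
    --zip-file-count 1

# ...or equivalently via a single write partition.
./bin/flux export-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --path ./export \
    --compression ZIP \
    --repartition 1
```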