Releases: marklogic/flux
2.0.0
This major release upgrades the underlying Apache Spark platform to version 4.1.0 and thus requires Java 17 to run. Please see below for a complete list of breaking changes, enhancements, and bug fixes, as well as a migration guide.
Breaking Changes
- Single-letter options from Flux 1.x (such as `-P` and `-C`) are no longer supported. These have been replaced with their standard long-form option names. This change fixes a bug where argument values starting with a dash followed by a single letter (e.g., `-Psome-value`) were incorrectly interpreted as option flags. Please see the migration guide below for a mapping of removed option names to new option names.
- When splitting text and creating chunks, Flux now defaults to creating one sidecar document per chunk, as opposed to defaulting to adding all chunks to the source document.
- Vector embeddings in XML documents now default to a QName of `vector` with a namespace of `http://marklogic.com/vector`, matching upcoming default index exclusions in the MarkLogic server.
- Vector embeddings in JSON documents now default to a name of `_vector`, also matching upcoming default index exclusions in the MarkLogic server.
- Removed usage of the Spark `fs.s3n` options, as Spark 4 no longer supports those for connecting to AWS S3.
Enhancements
- When a URI template specified via `--uri-template` has an expression that cannot be resolved for a given document, the new `--uri-template-warn-on-missing-field` option can be included to log a warning instead of failing. The expression will have its value replaced with `UNRESOLVED-` prepended to a random UUID.
- When reading files, Flux now defaults the number of partitions to Spark's underlying parallelism default value, helping avoid performance issues due to large numbers of very small partitions.
- When aggregating rows during import, aggregated rows can now be sorted via the new `--aggregate-order-by` and `--aggregate-order-desc` options.
- When classifying text via a Semaphore instance in Progress Data Cloud (PDC), the PDC token will be renewed if it expires during the course of a connector job.
- When exporting documents to a zip file, a warning will be logged once a zip file contains 500,000 entries. Writing multiple large zip files at once can lead to heap space exhaustion in the JVM; users can avoid this by increasing the number of partitions.
- When reprocessing items and defining partitions via custom code, the new `--read-partitions-var` option can be used to specify custom variables to send to the code. Previously, the variables defined via `--read-var` were sent to the custom partition code, but that is no longer the case.
- When connecting to Progress Data Cloud (PDC), `--auth-type cloud` no longer needs to be specified; the use of `--cloud-api-key` is sufficient for enabling authentication with PDC.
- When copying data, the `--output-database` option can be used without any other "output" options to copy from one database to another using the same app server connection.
- When importing from or exporting to AWS S3, an S3 session token can now be specified via `--s3-session-token`.
- The `version` command now has aliases of `-v` and `-version`.
- Improved documentation on tuning performance with better explanations of partitions and parallelism.
Bug Fixes
- When importing triples, `datatype` is now only set if `lang` does not exist.
- When copying data, classifier options (e.g., `--classifier-host`) are now properly applied.
- When importing files, a partition will no longer have zero files, which previously caused an error. Files are now evenly distributed across partitions.
- Documents can be exported to zip files on Windows without encountering path errors.
Migration Guide
Updating Command-Line Arguments
If you have existing Flux 1.x scripts or commands, replace any single-letter options with their standard equivalents using the below table:
| Single letter option | Standard option |
|---|---|
| -E | --embedder-prop |
| -L | --classifier-prop |
| -M | --doc-metadata |
| -P | --spark-prop |
| -R | --doc-prop |
| -S | --splitter-prop |
| -X | --xpath-namespace |
For example, if you had an option `-Pkey=value`, you would change that to `--spark-prop key=value`.
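To illustrate that the rewrite is a mechanical prefix substitution, the following sketch rewrites a hypothetical 1.x argument with `sed` (the `spark.sql.shuffle.partitions` value is just an illustrative example, not a required setting):

```shell
# Rewrite a hypothetical Flux 1.x argument to its 2.x long form:
# "-P" becomes "--spark-prop " followed by the original key=value pair.
old_arg='-Pspark.sql.shuffle.partitions=8'
new_arg=$(printf '%s' "$old_arg" | sed -E 's/^-P/--spark-prop /')
echo "$new_arg"   # --spark-prop spark.sql.shuffle.partitions=8
```

The same substitution pattern applies to each row of the table above; only the single-letter prefix and its long-form replacement change.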
Java Version Requirement
Ensure your environment has Java 17 or 21 installed. Java 11 is no longer supported.
URI Template Behavior
If you rely on --uri-template and want to continue failing on unresolved expressions, no action is needed. To allow missing fields and generate UUIDs instead, add the --uri-template-warn-on-missing-field flag.
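As a sketch of the fallback value's shape (this is not Flux's internal implementation; `uuidgen` stands in here for whatever UUID generator Flux uses):

```shell
# Sketch of the fallback value an unresolved URI-template expression
# receives: the literal prefix "UNRESOLVED-" followed by a random UUID.
fallback="UNRESOLVED-$(uuidgen)"
echo "$fallback"
```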
Text Splitting Default Change
If you previously relied on chunks being added to source documents (the 1.x default), you will now need to explicitly configure this behavior. Flux 2.0 defaults to creating separate sidecar documents for each chunk. To preserve the 1.x behavior, include the following option:
--splitter-sidecar-max-chunks 0
Embedding Data Structure Changes
If you are using Flux to generate embeddings in JSON documents and wish to preserve the 1.x default data structure, use the following option:
--embedder-embedding-name embedding
To preserve the 1.x default data structure for embeddings in XML documents, use the following options:
--embedder-embedding-name embedding
--embedder-embedding-namespace "http://marklogic.com/appservices/model"
1.4.0
This minor release addresses the following items:
- Flux now supports reading from and writing to Azure Blob Storage.
- When importing from structured data sources such as JDBC or delimited text files, you can now generate and load a TDE template based on the data source, allowing you to immediately query data using SQL or Optic. See the guide on TDE generation for more information.
- When exporting documents, you can specify a secondary query to include URIs in addition to those retrieved by your primary query. See the guide on exporting documents for more information on secondary queries.
- When generating an embedding, you can now specify a prompt to provide context or instructions to the embedding model. See the guide on generating embeddings for more information.
- When generating an embedding, you can now choose to encode vectors within documents via the new `--embedder-base64-encode` option. See the guide on generating embeddings for more information.
- The `import-jdbc` command now accepts a table name, specified via `--table`, instead of requiring a SQL query.
- The `import-archive-files` command now supports a `--document-type` option to force the type of each document written to MarkLogic.
- Bug fix - classifying text and generating embeddings now works properly when importing from structured data sources (JDBC, delimited text, Avro, ORC, and Parquet).
- Bug fix - can now export documents with URIs containing multiple colons to files.
- Bug fix - document URIs containing schemes were incorrectly modified when documents were written as entries in a zip file. URIs are now used as zip entry names.
1.3.0
This release provides the following enhancements:
- Documents and chunks can now have their text classified via Progress Semaphore. Please see the text classifier documentation for more information.
- Text can now be extracted when importing files.
- Document properties and metadata values can be configured when writing documents to MarkLogic.
- All Apache Spark configuration options can now be configured when running any Flux command.
1.2.1
This patch release addresses the following items:
- Out-of-memory errors occurring when reading data with the `reprocess` command have been addressed via a fix in the underlying MarkLogic Java Client library.
- The `--embedder` option for specifying an embedding model no longer lowercases its value when it is a class name instead of an abbreviation.
1.2.0
This release provides the following enhancements:
- Support for retrieval-augmented generation (RAG) with MarkLogic via support for splitting text and adding embeddings during any import or copy operation. See the Import guide for more information on splitting text and adding embeddings.
- When exporting rows, you can now use any Optic data access function, not just `op.fromView`.
- When exporting documents, Flux no longer requires any query options to be specified and defaults to exporting all documents that a user can read.
- When exporting data to files, Flux will automatically create the given path if it does not exist, regardless of the target file type (prior to 1.2.0, this only happened for export commands).
- The `reprocess` command has a new `--thread-count` option to simplify parallelizing the processing of data in MarkLogic.
To get started, download the marklogic-flux-1.2.0.zip file below and visit our Getting Started guide.
1.1.3
This patch release addresses a single issue - an unused transitive dependency (via Spark and Hadoop) on log4j 1.2.17 is no longer included in Flux. Flux did not make use of this dependency in any of its prior releases, and it can be safely removed from the ./lib folder in prior releases.
Note that while this unused log4j dependency has several open CVEs assigned to it, it is not impacted by the Log4Shell log4j vulnerability. Flux has never been impacted by this vulnerability, as it has used log4j 2.19.0 or higher since its 1.0.0 release.
This release is otherwise identical to the Flux 1.1.2 release, and in fact is equivalent to Flux 1.1.2 once the log4j 1.2.17 jar is removed from the Flux 1.1.2 ./lib folder.
1.1.2
This patch release addresses the following two issues:
- Export commands that do not require a consistent snapshot when querying MarkLogic can now use the new `--no-snapshot` option so that Flux queries MarkLogic at multiple points in time for data. Please see the user guide for more information on consistent snapshots and when to use them when exporting data.
- JSON Lines files can now be imported "as is" via the new `--json-lines-raw` option. Please see the user guide for more information on this fix for avoiding modifications to the JSON documents associated with each line in a JSON Lines file.
1.1.1
1.1.0
This minor release provides the following enhancements and fixes:
- The `import-files`, `import-archive-files`, `export-files`, and `export-archive-files` commands all support a new `--streaming` option that ensures files and documents from MarkLogic are never read into memory. This is intended to be used when importing or exporting large binary files that Flux and/or MarkLogic cannot read into memory. Please see the user guide for more details - you can search on "streaming" to find the relevant sections.
- Commands that export data to files now default to a save mode of `append` instead of `overwrite`.
- All commands that import files now support filenames with spaces in them.
For users running Flux on Linux - please use the 1.1.1 release as this fixes a logging configuration issue that only occurs on Linux.
1.0.0
This is the first major release of the Progress MarkLogic Flux application for supporting data movement use cases with the MarkLogic data platform.
Please see the Flux user guide for information on how to use Flux.
To obtain the Flux application, download the marklogic-flux-1.0.0.zip file below and see the Getting Started guide.
If you are interested in using Flux with the Apache Spark spark-submit tool, download the marklogic-flux-1.0.0-all.jar and see the Spark Integration guide.
For any questions, enhancement requests, or bug reports, you can file an issue in this repository or contact Progress MarkLogic Support if you are a licensed customer. Progress MarkLogic provides technical support for this release to licensed customers under the terms outlined in the Support Handbook.