Releases: marklogic/flux

2.0.0

23 Jan 18:11
284cddf

This major release upgrades the underlying Apache Spark platform to version 4.1.0 and thus requires Java 17 to run. Please see below for a complete list of breaking changes, enhancements, and bug fixes, as well as a migration guide.

Breaking Changes

  1. Single-letter options from Flux 1.x (such as -P and -C) are no longer supported. These have been replaced with their standard long-form option names. This change fixes a bug where argument values starting with a dash followed by a single letter (e.g., -Psome-value) were incorrectly interpreted as option flags. Please see the migration guide below for a mapping of removed option names to new option names.
  2. When splitting text and creating chunks, Flux now defaults to creating one sidecar document per chunk rather than adding all chunks to the source document.
  3. Vector embeddings in XML documents now default to a QName of vector with a namespace of http://marklogic.com/vector, matching upcoming default index exclusions in the MarkLogic server.
  4. Vector embeddings in JSON documents now default to a name of _vector, also matching upcoming default index exclusions in the MarkLogic server.
  5. Removed usage of the Spark fs.s3n options, as Spark 4 no longer supports those for connecting to AWS S3.

Enhancements

  1. When a URI template specified via --uri-template has an expression that cannot be resolved for a given document, the new --uri-template-warn-on-missing-field option can be included to log a warning instead of failing the job. The unresolved expression's value is replaced with UNRESOLVED- followed by a random UUID.
  2. When reading files, Flux now defaults the number of partitions to Spark's underlying parallelism default value, helping avoid performance issues due to large numbers of very small partitions.
  3. When aggregating rows during import, aggregated rows can now be sorted via the new --aggregate-order-by and --aggregate-order-desc options.
  4. When classifying text via a Semaphore instance in Progress Data Cloud (PDC), the PDC token will be renewed if it expires during the course of a connector job.
  5. When exporting documents to zip files, a warning will be logged once a zip file contains 500,000 entries. Writing multiple large zip files at once can lead to heap space exhaustion in the JVM; users can avoid this by increasing the number of partitions.
  6. When reprocessing items and defining partitions via custom code, the new --read-partitions-var option can be used to specify custom variables to send to the code. Previously, the variables defined via --read-var were sent to the custom partition code, but that is no longer the case.
  7. When connecting to Progress Data Cloud (PDC), --auth-type cloud no longer needs to be specified; the use of --cloud-api-key is sufficient for enabling authentication with PDC.
  8. When copying data, the --output-database option can be used without any other "output" options to copy from one database to another using the same app server connection.
  9. When importing from or exporting to AWS S3, an S3 session token can now be specified via --s3-session-token.
  10. The version command now has aliases of -v and -version.
  11. Improved documentation on tuning performance with better explanations of partitions and parallelism.
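As a sketch of how the new URI template options fit together, the command below combines them in a hypothetical import; the command, path, and connection string are placeholders, and only the --uri-template* options come from this release:

```shell
# Hypothetical import - the path and connection string are placeholders.
# If a document is missing the "id" field referenced by the template,
# Flux logs a warning and substitutes UNRESOLVED- plus a random UUID
# for the unresolved expression rather than failing the job.
./bin/flux import-files \
  --path /data/json \
  --connection-string "flux-user:password@localhost:8004" \
  --uri-template "/records/{id}.json" \
  --uri-template-warn-on-missing-field
```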

Bug Fixes

  1. When importing triples, a literal's datatype is now set only if the literal does not have a lang value.
  2. When copying data, classifier options (e.g. --classifier-host) are now properly applied.
  3. When importing files, a partition will no longer contain zero files, which previously caused an error; files are now evenly distributed across partitions.
  4. Documents can be exported to zip files on Windows without encountering path errors.

Migration Guide

Updating Command-Line Arguments

If you have existing Flux 1.x scripts or commands, replace any single-letter options with their standard equivalents using the table below:

Single-letter option    Standard option
-E                      --embedder-prop
-L                      --classifier-prop
-M                      --doc-metadata
-P                      --spark-prop
-R                      --doc-prop
-S                      --splitter-prop
-X                      --xpath-namespace

For example, if you had an option -Pkey=value, you would change that to --spark-prop key=value.
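Putting the table to work, a 1.x command might be migrated like this (the command, path, and property values are illustrative; only the option renames come from the table above):

```shell
# Flux 1.x (single-letter options, no longer supported in 2.0):
#   ./bin/flux import-files --path /data -Pspark.sql.shuffle.partitions=10
# Flux 2.0 equivalent:
./bin/flux import-files --path /data \
  --spark-prop spark.sql.shuffle.partitions=10
```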

Java Version Requirement

Ensure your environment has Java 17 or 21 installed. Java 11 is no longer supported.

URI Template Behavior

If you rely on --uri-template and want to continue failing on unresolved expressions, no action is needed. To allow missing fields and generate UUIDs instead, add the --uri-template-warn-on-missing-field flag.

Text Splitting Default Change

If you previously relied on chunks being added to source documents (the 1.x default), you will now need to explicitly configure this behavior. Flux 2.0 defaults to creating separate sidecar documents for each chunk. To preserve the 1.x behavior, include the following option:

--splitter-sidecar-max-chunks 0
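For example, a hypothetical 2.0 import that splits text while keeping the 1.x chunk placement might look like the following; every option other than --splitter-sidecar-max-chunks is a placeholder, and the options that select what text to split are omitted for brevity:

```shell
# Keep all chunks in the source document, as Flux 1.x did by default.
# (Options selecting the text to split are omitted; the path and
# connection string are placeholders.)
./bin/flux import-files \
  --path /data/text \
  --connection-string "flux-user:password@localhost:8004" \
  --splitter-sidecar-max-chunks 0
```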

Embedding Data Structure Changes

If you are using Flux to generate embeddings in JSON documents and wish to preserve the 1.x default data structure, use the following option:

--embedder-embedding-name embedding

To preserve the 1.x default data structure for embeddings in XML documents, use the following options:

--embedder-embedding-name embedding
--embedder-embedding-namespace "http://marklogic.com/appservices/model"

1.4.0

06 Aug 13:39
54aabb3

This minor release addresses the following items:

  1. Flux now supports reading from and writing to Azure Blob Storage.
  2. When importing from structured data sources such as JDBC or delimited text files, you can now generate and load a TDE template based on the data source, allowing you to immediately query data using SQL or Optic. See the guide on TDE generation for more information.
  3. When exporting documents, you can specify a secondary query to include URIs in addition to those retrieved by your primary query. See the guide on exporting documents for more information on secondary queries.
  4. When generating an embedding, you can now specify a prompt to provide context or instructions to the embedding model. See the guide on generating embeddings for more information.
  5. When generating an embedding, you can now choose to encode vectors within documents via the new --embedder-base64-encode option. See the guide on generating embeddings for more information.
  6. The import-jdbc command now accepts a table name, specified via --table, instead of requiring a SQL query.
  7. The import-archive-files command now supports a --document-type option to force the type of each document written to MarkLogic.
  8. Bug fix: classifying text and generating embeddings now work properly when importing from structured data sources (JDBC, delimited text, Avro, ORC, and Parquet).
  9. Bug fix: documents with URIs containing multiple colons can now be exported to files.
  10. Bug fix: document URIs containing schemes were incorrectly modified when documents were written as entries in a zip file. URIs are now used as zip entry names.
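As a sketch of the new --table option on import-jdbc, the command below imports a table without a SQL query; the JDBC URL, table name, and connection string are placeholders, and --jdbc-url is assumed from Flux's existing JDBC options:

```shell
# Import every row of the "customers" table without writing a SQL query.
./bin/flux import-jdbc \
  --jdbc-url "jdbc:postgresql://localhost:5432/example" \
  --table customers \
  --connection-string "flux-user:password@localhost:8004"
```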

1.3.0

02 May 15:23
fb9be37

This release provides the following enhancements:

  1. Documents and chunks can now have their text classified via Progress Semaphore. Please see the text classifier documentation for more information.
  2. Text can now be extracted when importing files.
  3. Document properties and metadata values can be configured when writing documents to MarkLogic.
  4. All Apache Spark configuration options can now be configured when running any Flux command.

1.2.1

07 Jan 22:12
eed8198

This patch release addresses the following items:

  1. Out-of-memory errors occurring when reading data with the reprocess command have been addressed via a fix in the underlying MarkLogic Java Client library.
  2. The --embedder option for specifying an embedding model no longer lowercases its value when it is a class name instead of an abbreviation.

1.2.0

19 Dec 20:46

This release provides the following enhancements:

  1. Support for retrieval-augmented generation (RAG) with MarkLogic via splitting text and adding embeddings during any import or copy operation. See the Import guide for more information on splitting text and adding embeddings.
  2. When exporting rows, you can now use any Optic data access function, not just op.fromView.
  3. When exporting documents, Flux no longer requires any query options to be specified and defaults to exporting all documents that a user can read.
  4. When exporting data to files, Flux will automatically create the given path if it does not exist, regardless of the target file type (prior to 1.2.0, this only happened for export commands).
  5. The reprocess command has a new --thread-count option to simplify parallelizing the processing of data in MarkLogic.

To get started, download the marklogic-flux-1.2.0.zip file below and visit our Getting Started guide.

1.1.3

21 Oct 18:38
5610ce5

This patch release addresses a single issue: an unused transitive dependency (via Spark and Hadoop) on log4j 1.2.17 is no longer included in Flux. Flux did not make use of this dependency in any of its prior releases, and it can be safely removed from the ./lib folder in prior releases.

Note that while this unused log4j dependency has several open CVEs assigned to it, it is not affected by the Log4Shell vulnerability. Flux has never been impacted by Log4Shell, as it has used log4j 2.19.0 or higher since its 1.0.0 release.

This release is otherwise identical to the Flux 1.1.2 release, and in fact is equivalent to Flux 1.1.2 once the log4j 1.2.17 jar is removed from the Flux 1.1.2 ./lib folder.

1.1.2

17 Oct 19:30
8916c8d

This patch release addresses the following two issues:

  1. Export commands that do not require a consistent snapshot when querying MarkLogic can now use the new --no-snapshot option, which allows Flux to query MarkLogic at multiple points in time for data. Please see the user guide for more information on consistent snapshots and when to use them when exporting data.
  2. JSON Lines files can now be imported "as is" via the new --json-lines-raw option, which avoids modifications to the JSON documents associated with each line in a JSON Lines file. Please see the user guide for more information.

1.1.1

07 Oct 23:51
e4e7660

This patch release addresses an issue with logging when running Flux on Linux. It is otherwise equivalent to the 1.1.0 release.

1.1.0

02 Oct 20:08
6257b6c

This minor release provides the following enhancements and fixes:

  1. The import-files, import-archive-files, export-files, and export-archive-files commands all support a new --streaming option that ensures files and documents from MarkLogic are never read fully into memory. This is intended for importing or exporting large binary files that Flux and/or MarkLogic cannot read into memory. Please see the user guide for more details; search on "streaming" to find the relevant sections.
  2. Commands that export data to files now default to a save mode of append instead of overwrite.
  3. All commands that import files now support filenames with spaces in them.
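A sketch of the streaming option on export; the connection details, collection name, and output path are placeholders, and only --streaming comes from this release:

```shell
# Stream large binary documents from MarkLogic to disk without ever
# reading a full document into Flux's memory.
./bin/flux export-files \
  --connection-string "flux-user:password@localhost:8004" \
  --collections binaries \
  --path /backup/binaries \
  --streaming
```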

For users running Flux on Linux: please use the 1.1.1 release, as it fixes a logging configuration issue that only occurs on Linux.

1.0.0

28 Aug 18:07

This is the first major release of the Progress MarkLogic Flux application for supporting data movement use cases with the MarkLogic data platform.

Please see the Flux user guide for information on how to use Flux.

To obtain the Flux application, download the marklogic-flux-1.0.0.zip file below and see the Getting Started guide.

If you are interested in using Flux with the Apache Spark spark-submit tool, download the marklogic-flux-1.0.0-all.jar and see the Spark Integration guide.

For any questions, enhancement requests, or bug reports, you can file an issue in this repository or contact Progress MarkLogic Support if you are a licensed customer. Progress MarkLogic provides technical support for this release to licensed customers under the terms outlined in the Support Handbook.