Releases · marklogic/marklogic-spark-connector

05 Aug 21:05

rjrudin

2.7.0

d61de9b

2.7.0 Latest

Latest

This minor release addresses the following items:

Can now provide a secondary query when reading documents from MarkLogic. This is supported via the following new options:
- spark.marklogic.read.secondaryUris.invoke
- spark.marklogic.read.secondaryUris.javascript
- spark.marklogic.read.secondaryUris.javascriptFile
- spark.marklogic.read.secondaryUris.xquery
- spark.marklogic.read.secondaryUris.xqueryFile
- spark.marklogic.read.secondaryUris.vars.
Can now provide a prompt when generating an embedding via the new spark.marklogic.write.embedder.prompt option.
Can now encode vectors in documents when generating embeddings via the new spark.marklogic.write.embedder.base64encode option.
Fixed a bug where classifying text and generating embeddings did not work when data was read from a structured data source such as JDBC or a delimited text file.
Fixed a bug where a document with a URI containing multiple colons could not be read from MarkLogic and written to a file.
Fixed a bug where URIs were incorrectly modified when documents were written as entries in a zip file. URIs are now used as the zip entry name.

Assets 4

02 May 14:43

rjrudin

2.6.0

b03d3a1

2.6.0

This release addresses the following items:

Can now extract text from binary documents via Apache Tika .
Can now classify text via Progress Semaphore.
Can now specify document properties and metadata values when writing documents to MarkLogic.

Assets 4

07 Jan 19:04

rjrudin

2.5.1

a5f9bbe

2.5.1

This patch release addresses the following items:

Depends on the MarkLogic Java Client 7.1.0 release, which includes an important bug fix that affects how the connector reads data via custom code.
Added debug-level logging for reading and writing data via custom code.
Fixed an issue with logging progress when reading rows via an Optic query.

Assets 4

17 Dec 21:49

rjrudin

2.5.0

d0d6d9c

2.5.0

This release addresses the following items:

Can now split text in documents when writing them to MarkLogic. Chunks of text can be added to the source document itself or written to separate sidecar documents.
Can now add embeddings to chunks in documents before writing them to MarkLogic. You can reuse the Flux embedding model integrations available from the Flux releases site by adding one or more of these JAR files to your Spark classpath.
When reading rows via an Optic query, the Optic query no longer requires the use of op.fromView. However, when not using op.fromView, the Optic query will be executed in a single call to MarkLogic.
When writing files to a directory, the given path will be created automatically if it does not exist, matching the behavior of Spark file-based data sources.

Please see the writing guide for more information on the splitter and embedder features.

Assets 4

17 Oct 18:44

rjrudin

2.4.2

afd19a3

2.4.2

This patch release addresses the following two issues:

spark.marklogic.read.snapshot was added to allow a user to configure a non-consistent snapshot when reading documents by setting the option to false. This avoids bugs where a consistent snapshot is not feasible and the downsides of reading at multiple times are not a concern.
Issues with importing JSON Lines files via Flux - such as keys being reordered and added - can be avoided by setting the existing
spark.marklogic.read.files.type option to a value of json_lines. The connector will read each line as a separate JSON document and will not perform any modifications on any line, thereby avoiding the issue in Flux of JSON documents being unexpectedly altered.

Assets 4

17 Oct 17:06

rjrudin

2.4.1

7ff9e0f

2.4.1

This patch release addresses a single issue:

The org.slf4j:slf4j-api transitive dependency is forced to be version 2.0.13, ensuring that no occurrences of the 1.x version of that dependency are included in the connector jar. This resolves a logging issue in the Flux application.

Assets 4

02 Oct 19:25

rjrudin

2.4.0

168cf5f

2.4.0

This minor release addresses the following items:

Can now stream regular files, ZIP files, gzip files, and archive files by setting the new spark.marklogic.streamFiles option to a value of true. Using this option in the reader phase results in the reading of files being deferred until the writer phase. Using this option in the writer phase results in each file being streamed to MarkLogic in a separate request to MarkLogic, thus avoiding ever reading the contents of the file or zip entry into memory.
Can now stream documents from MarkLogic to regular files, ZIP files, gzip files, and archive files by setting the same option above - spark.marklogic.streamFiles - to a value of `true. Using this option in the reader phase results in the reading of documents being deferred until the writer phase. Using this option in the writer phase results in each document being streamed from MarkLogic to a file or zip entry, thus avoiding ever reading the contents of the document into memory.
Files with spaces in the path are now handled correctly when reading files into MarkLogic. However, when streaming files into MarkLogic, the spaces in the path will be encoded due to a pending server fix.
Archive files - zip files containing content and metadata - now contain the metadata entry followed by the content entry for each document. This supports streaming archive files. Archive files generated by version 2.3.x of the connector - with the content entry followed by the metadata entry - can still be read, though they cannot be streamed.
Now compiled and tested against Spark 3.5.3.

Assets 4

22 Aug 10:10

rjrudin

2.3.1

0235348

2.3.1

This patch release addresses the following issues:

Can now read document URIs that include non-US-ASCII characters. This was fixed via an upgrade of the Java Client to its 7.0.0 release, whose breaking changes do not have impact on this connector release.
Registered collatedString as a known TDE type, thereby avoiding warnings when reading rows from a TDE that uses that type.
Significantly improved performance when reading aggregate XML files and extracting a URI value from an element.
Fixed bug where a message of "Wrote failed documents to archive file at" was logged when no documents failed.

Assets 4

26 Jul 17:55

rjrudin

2.3.0

0eaf6b1

2.3.0

This minor release provides significant new functionality in support of the 1.0.0 release of the new MarkLogic Flux data movement tool. Much of this functionality is documented in the Flux documentation. We will soon have complete documentation of all the new options in this repository's documentation as well.

In the meantime, the new options in this release are listed below.

Read Options

spark.marklogic.read.javascriptFile and spark.marklogic.read.xqueryFile allow for custom code to be read from a file path.
spark.marklogic.read.partitions.javascriptFile and spark.marklogic.read.partitions.xqueryFile allow for custom code to be read from a file path.
Can now read document rows by specifying a list of newline-delimited URIs via the spark.marklogic.read.documents.uris option.
Can now read rows containing semantic triples in MarkLogic via spark.marklogic.read.triples.graphs, spark.marklogic.read.triples.collections, spark.marklogic.read.triples.query, spark.marklogic.read.triples.stringQuery, spark.marklogic.read.triples.uris, spark.marklogic.read.triples.directory, spark.marklogic.read.triples.options, spark.marklogic.read.triples.filtered, and spark.marklogic.read.triples.baseIri.
Can now read Flux and MLCP archives by setting spark.marklogic.read.files.type to archive or mlcp_archive.
Can control which categories of metadata are read from Flux archives via spark.marklogic.read.archives.categories.
Can now specify the encoding of a file to read via spark.marklogic.read.files.encoding.
Can now see progress logged of reading data from MarkLogic via spark.marklogic.read.logProgress.
Can specify whether to fail on a file read error via spark.marklogic.read.files.abortOnFailure.

Write Options

spark.marklogic.write.threadCount has been altered to reflect the common user understanding of "number of threads used to connect to MarkLogic". If you need to specify a thread count per partition, use spark.marklogic.write.threadCountPerPartition.
Can now see progress logged of data written to MarkLogic via spark.marklogic.write.logProgress.
spark.marklogic.write.javascriptFile and spark.marklogic.write.xqueryFile allow for custom code to be read from a file path.
Settingspark.marklogic.write.archivePathForFailedDocuments to a file path will result in any failed documents being added to an archive zip file at that file path.
spark.marklogic.write.jsonRootName allows for a root field to be added to a JSON document constructed from an arbitrary row.
spark.marklogic.write.xmlRootName and spark.marklogic.write.xmlNamespace allow for an XML document to be constructed from an arbitrary row.
Options starting with spark.marklogic.write.json. will be used to configure how the connector serializes a Spark row into a JSON object.
Can use spark.marklogic.write.graph and spark.marklogic.write.graphOverride to specify the graph when writing RDF triples to MarkLogic.
Deprecated spark.marklogic.write.fileRows.documentType in favor of using spark.marklogic.write.documentType to force a document type on documents written to MarkLogic with an extension unrecognized by MarkLogic.
Can use spark.marklogic.write.files.prettyPrint to pretty-print JSON and XML files written by the connector.
Can use spark.marklogic.write.files.encoding to write files in a different encoding.
Can use spark.marklogic.write.files.rdf.format to specify an RDF type when writing triples to RDF files.
Can use spark.marklogic.write.files.rdf.graph to specify a graph when writing RDF files.

Assets 4

22 Feb 21:10

rjrudin

2.2.0

f1bbf9c

2.2.0

This minor release provides the following enhancements:

Document rows can now be read via MarkLogic search queries.
Document rows can also be written to MarkLogic, thereby allowing for copy operations that read document rows from one database and write them to another database.
Now depends on the MarkLogic Java Client 6.5.0 release, which eliminates some security vulnerabilities via upgrades to OkHttp and Jackson.

Assets 4

Releases: marklogic/marklogic-spark-connector

2.7.0

Uh oh!

2.6.0

Uh oh!

2.5.1

Uh oh!

2.5.0

Uh oh!

2.4.2

Uh oh!

2.4.1

Uh oh!

2.4.0

Uh oh!

2.3.1

Uh oh!

2.3.0

Read Options

Write Options

Uh oh!

2.2.0

Uh oh!