
Commit e56706f

Merge pull request #495 from marklogic/release/2.6.0
Merge release/2.6.0 into master
2 parents 7e83179 + e5541b5 commit e56706f

349 files changed: +4592 -2271 lines changed


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -20,3 +20,4 @@ docker/marklogic
 docker/sonarqube/data
 docker/sonarqube/logs
 export
+.run

CODEOWNERS

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@
 # Each line is a file pattern followed by one or more owners.
 
 # These owners will be the default owners for everything in the repo.
-* @anu3990 @billfarber @rjrudin
+* @anu3990 @billfarber @rjrudin @stevebio

CONTRIBUTING.md

Lines changed: 35 additions & 43 deletions
@@ -1,8 +1,12 @@
 This guide covers how to develop and test this project. It assumes that you have cloned this repository to your local
 workstation.
 
-You must use Java 11 or higher for developing, testing, and building this project. If you wish to use Sonar as
-described below, you must use Java 17 or higher.
+**You must use Java 17 for developing, testing, and building this project**, even though the connector supports
+running on Java 11. For users, Java 17 is only required if using the splitting and embedding features, as those
+depend on a third-party module that requires Java 17.
+
+**You also need Java 11 installed** so that the subprojects in this repository that require Java 11 have access to a
+Java 11 SDK. [sdkman](https://sdkman.io/) is highly recommended for installing multiple JDKs.
 
 # Setup
 
@@ -40,49 +44,37 @@ To run the tests against the test application, run the following Gradle task:
 
     ./gradlew test
 
-## Generating code quality reports with SonarQube
-
-In order to use SonarQube, you must have used Docker to run this project's `docker-compose.yml` file, and you must
-have the services in that file running and you must use Java 17 to run the Gradle `sonar` task.
-
-To configure the SonarQube service, perform the following steps:
+**To run the tests in Intellij**, you must configure your JUnit template to include a few JVM args:
 
-1. Go to http://localhost:9000 .
-2. Login as admin/admin. SonarQube will ask you to change this password; you can choose whatever you want ("password" works).
-3. Click on "Create project manually".
-4. Enter "marklogic-spark" for the Project Name; use that as the Project Key too.
-5. Enter "develop" as the main branch name.
-6. Click on "Next".
-7. Click on "Use the global setting" and then "Create project".
-8. On the "Analysis Method" page, click on "Locally".
-9. In the "Provide a token" panel, click on "Generate". Copy the token.
-10. Add `systemProp.sonar.login=your token pasted here` to `gradle-local.properties` in the root of your project, creating
-that file if it does not exist yet.
+1. Go to Run -> Edit Configurations.
+2. Delete any JUnit configurations you already have.
+3. Click on "Edit configuration templates" and click on "JUnit".
+4. Click on "Modify options" and select "Add VM options" if it's not already selected.
+5. In the VM options text input, add the following:
+--add-exports=java.base/sun.nio.ch=ALL-UNNAMED --add-exports=java.base/sun.util.calendar=ALL-UNNAMED --add-exports=java.base/sun.security.action=ALL-UNNAMED
+6. Click "Apply".
+7. In the dropdown that has "Class" selected, change that to "Method" and hit "Apply" again.
 
-To run SonarQube, run the following Gradle tasks using Java 17, which will run all the tests with code coverage and
-then generate a quality report with SonarQube:
+You may need to repeat steps 6 and 7. I've found Intellij to be a little finicky with actually applying these changes.
 
-    ./gradlew test sonar
+The net effect should be that when you run a JUnit class or method or suite of tests, those VM options are automatically
+added to the run configuration that Intellij creates for the class/method/suite. Those VM options are required to give
+Spark access to certain JVM modules. They are applied automatically when running the tests via Gradle.
 
-If you do not add `systemProp.sonar.login` to your `gradle-local.properties` file, you can specify the token via the
-following:
+**Alternatively**, you can open Preferences in Intellij and go to
+"Build, Execution, and Deployment" -> "Build Tools" -> "Gradle". Then change "Build and run using" and "Run tests using"
+to "Gradle". This should result in Intellij using the `test` configuration in the `marklogic-spark-connector/build.gradle`
+file that registers the required JVM options, allowing for tests to run on Java 17.
 
-    ./gradlew test sonar -Dsonar.login=paste your token here
+## Testing text classification
 
-When that completes, you will see a line like this near the end of the logging:
+See the `ClassifyAdHocTest` class for instructions on how to test the text classification feature with a
+valid connection to Semaphore.
 
-    ANALYSIS SUCCESSFUL, you can find the results at: http://localhost:9000/dashboard?id=marklogic-spark
-
-Click on that link. If it's the first time you've run the report, you'll see all issues. If you've run the report
-before, then SonarQube will show "New Code" by default. That's handy, as you can use that to quickly see any issues
-you've introduced on the feature branch you're working on. You can then click on "Overall Code" to see all issues.
-
-Note that if you only need results on code smells and vulnerabilities, you can repeatedly run `./gradlew sonar`
-without having to re-run the tests.
-
-You can also force Gradle to run `sonar` if any tests fail:
+## Generating code quality reports with SonarQube
 
-    ./gradlew clean test sonar --continue
+Please see our internal Wiki page - search for "Developer Experience SonarQube" -
+for information on setting up SonarQube and using it with this repository.
 
 # Testing with PySpark
 
@@ -98,7 +90,7 @@ This will produce a single jar file for the connector in the `./build/libs` dire
 
 You can then launch PySpark with the connector available via:
 
-    pyspark --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.5-SNAPSHOT.jar
+    pyspark --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.6-SNAPSHOT.jar
 
 The below command is an example of loading data from the test application deployed via the instructions at the top of
 this page.
@@ -158,8 +150,8 @@ spark.read.option("header", True).csv("marklogic-spark-connector/src/test/resour
 When you run PySpark, it will create its own Spark cluster. If you'd like to try against a separate Spark cluster
 that still runs on your local machine, perform the following steps:
 
-1. Use [sdkman to install Spark](https://sdkman.io/sdks#spark). Run `sdk install spark 3.4.3` since we are currently
-building against Spark 3.4.3.
+1. Use [sdkman to install Spark](https://sdkman.io/sdks#spark). Run `sdk install spark 3.5.5` since we are currently
+building against Spark 3.5.5.
 2. `cd ~/.sdkman/candidates/spark/current/sbin`, which is where sdkman will install Spark.
 3. Run `./start-master.sh` to start a master Spark node.
 4. `cd ../logs` and open the master log file that was created to find the address for the master node. It will be in a
@@ -174,7 +166,7 @@ The Spark master GUI is at <http://localhost:8080>. You can use this to view det
 
 Now that you have a Spark cluster running, you just need to tell PySpark to connect to it:
 
-    pyspark --master spark://NYWHYC3G0W:7077 --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.5-SNAPSHOT.jar
+    pyspark --master spark://NYWHYC3G0W:7077 --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.6-SNAPSHOT.jar
 
 You can then run the same commands as shown in the PySpark section above. The Spark master GUI will allow you to
 examine details of each of the commands that you run.
@@ -193,15 +185,15 @@ You will need the connector jar available, so run `./gradlew clean shadowJar` if
 You can then run a test Python program in this repository via the following (again, change the master address as
 needed); note that you run this outside of PySpark, and `spark-submit` is available after having installed PySpark:
 
-    spark-submit --master spark://NYWHYC3G0W:7077 --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.5-SNAPSHOT.jar marklogic-spark-connector/src/test/python/test_program.py
+    spark-submit --master spark://NYWHYC3G0W:7077 --jars marklogic-spark-connector/build/libs/marklogic-spark-connector-2.6-SNAPSHOT.jar marklogic-spark-connector/src/test/python/test_program.py
 
 You can also test a Java program. To do so, first move the `com.marklogic.spark.TestProgram` class from `src/test/java`
 to `src/main/java`. Then run the following:
 
 ```
 ./gradlew clean shadowJar
 cd marklogic-spark-connector
-spark-submit --master spark://NYWHYC3G0W:7077 --class com.marklogic.spark.TestProgram build/libs/marklogic-spark-connector-2.5-SNAPSHOT.jar
+spark-submit --master spark://NYWHYC3G0W:7077 --class com.marklogic.spark.TestProgram build/libs/marklogic-spark-connector-2.6-SNAPSHOT.jar
 ```
 
 Be sure to move `TestProgram` back to `src/test/java` when you are done.
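
Editor's note: the Intellij guidance above depends on the `test` configuration in `marklogic-spark-connector/build.gradle` registering the required JVM options. That block is not part of this diff; the following is a minimal sketch of the idea, where the three `--add-exports` flags mirror the instructions above and everything else is assumed:

```groovy
// Sketch only - the actual test block in marklogic-spark-connector/build.gradle is not shown
// in this diff. It illustrates registering the --add-exports flags that Spark needs on
// Java 17 so that they apply to every Gradle-driven test run.
test {
    jvmArgs(
        '--add-exports=java.base/sun.nio.ch=ALL-UNNAMED',
        '--add-exports=java.base/sun.util.calendar=ALL-UNNAMED',
        '--add-exports=java.base/sun.security.action=ALL-UNNAMED'
    )
}
```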

Jenkinsfile

Lines changed: 3 additions & 1 deletion
@@ -8,6 +8,7 @@ def runtests(String javaVersion){
 cd marklogic-spark-connector
 echo "Waiting for MarkLogic server to initialize."
 sleep 30s
+./gradlew clean
 ./gradlew -i mlDeploy
 echo "Loading data a second time to try to avoid Optic bug with duplicate rows being returned."
 ./gradlew -i mlLoadData
@@ -85,8 +86,9 @@ pipeline{
 export JAVA_HOME=$JAVA17_HOME_DIR
 export GRADLE_USER_HOME=$WORKSPACE/$GRADLE_DIR
 export PATH=$GRADLE_USER_HOME:$JAVA_HOME/bin:$PATH
-cp ~/.gradle/gradle.properties $GRADLE_USER_HOME;
 cd marklogic-spark-connector
+./gradlew clean
+cp ~/.gradle/gradle.properties $GRADLE_USER_HOME;
 ./gradlew publish
 '''
 }
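
Editor's note: the shell fragments in the first hunk execute inside the `runtests(String javaVersion)` function named in its header. A minimal sketch of that shape, assuming the steps are wrapped in a standard `sh` step (the full Jenkinsfile is not shown in this diff, and the trailing `./gradlew test` is an assumption based on the function's name):

```groovy
// Assumed shape for illustration only; just the shell fragments appear in this diff.
// The new "./gradlew clean" step runs before mlDeploy so each build starts from a clean workspace.
def runtests(String javaVersion) {
    // javaVersion selects the JDK home exported below (interpolated via a Groovy GString).
    sh """
        export JAVA_HOME=${javaVersion}
        cd marklogic-spark-connector
        echo "Waiting for MarkLogic server to initialize."
        sleep 30s
        ./gradlew clean
        ./gradlew -i mlDeploy
        echo "Loading data a second time to try to avoid Optic bug with duplicate rows being returned."
        ./gradlew -i mlLoadData
        ./gradlew test
    """
}
```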

NOTICE.txt

Lines changed: 18 additions & 4 deletions
@@ -13,17 +13,20 @@ Third Party Notices
 jackson-dataformat-xml 2.15.2 (Apache-2.0)
 jdom2 2.0.6.1 (Apache-2.0)
 jena-arq 4.10.0 (Apache-2.0)
-langchain4j 0.35.0 (Apache-2.0)
+langchain4j 1.0.0-beta2 (Apache-2.0)
 marklogic-client-api 7.1.0 (Apache-2.0)
 okhttp 4.12.0 (Apache-2.0)
+Semaphore-CS-Client 5.6.1 (Apache-2.0)
+Semaphore-Cloud-Client 5.6.1 (Apache-2.0)
+tika-core 3.1.0 (Apache-2.0)
 
 Common Licenses
 
 Apache License 2.0 (Apache-2.0)
 
 Third-Party Components
 
-The following is a list of the third-party components used by the MarkLogic® Spark connector 2.5.1 (last updated January 7, 2025):
+The following is a list of the third-party components used by the MarkLogic® Spark connector 2.6.0 (last updated May 1, 2025):
 
 jackson-dataformat-xml 2.15.2 (Apache-2.0)
 https://repo1.maven.org/maven2/com/fasterxml/jackson/dataformat/jackson-dataformat-xml/
@@ -37,7 +40,7 @@ jena-arq 4.10.0 (Apache-2.0)
 https://repo1.maven.org/maven2/org/apache/jena/jena-arq/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-langchain4j 0.35.0 (Apache-2.0)
+langchain4j 1.0.0-beta2 (Apache-2.0)
 https://repo1.maven.org/maven2/dev/langchain4j/langchain4j/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
@@ -49,10 +52,21 @@ okhttp 4.12.0 (Apache-2.0)
 https://repo1.maven.org/maven2/com/squareup/okhttp3/okhttp/
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
+Semaphore-CS-Client 5.6.1 (Apache-2.0)
+https://repo1.maven.org/maven2/com/smartlogic/csclient/Semaphore-CS-Client/
+For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
+
+Semaphore-Cloud-Client 5.6.1 (Apache-2.0)
+https://repo1.maven.org/maven2/com/smartlogic/cloud/Semaphore-Cloud-Client/
+For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
+
+tika-core 3.1.0 (Apache-2.0)
+https://repo1.maven.org/maven2/org/apache/tika/tika-core/
+For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
 Common Licenses
 
-The following is a list of the third-party components used by the MarkLogic® Spark connector 2.5.1 (last updated January 7, 2025):
+The following is a list of the third-party components used by the MarkLogic® Spark connector 2.6.0 (last updated May 1, 2025):
 
 Apache License 2.0 (Apache-2.0)
 https://spdx.org/licenses/Apache-2.0.html

build.gradle

Lines changed: 58 additions & 6 deletions
@@ -1,4 +1,5 @@
 plugins {
+    id 'net.saliman.properties' version '1.5.2'
     id "java-library"
     id "org.sonarqube" version "6.0.1.5171"
 }
@@ -12,15 +13,25 @@ sonar {
 }
 
 subprojects {
+    apply plugin: "net.saliman.properties"
     apply plugin: "java-library"
     apply plugin: "jacoco"
 
     group = "com.marklogic"
-    version "2.5.1"
+    version "2.6.0"
 
+    // Defaults to the Java 11 toolchain. Overridden by subprojects that require Java 17.
+    // See https://docs.gradle.org/current/userguide/toolchains.html .
     java {
-        sourceCompatibility = 11
-        targetCompatibility = 11
+        toolchain {
+            languageVersion = JavaLanguageVersion.of(11)
+        }
+    }
+
+    // Allows for quickly identifying compiler warnings.
+    tasks.withType(JavaCompile) {
+        options.compilerArgs << '-Xlint:unchecked'
+        options.deprecation = true
     }
 
     repositories {
@@ -32,11 +43,36 @@ subprojects {
     }
 
     configurations.all {
-        // Ensures that slf4j-api 1.x does not appear on the Flux classpath in particular, which can lead to this
-        // issue - https://www.slf4j.org/codes.html#StaticLoggerBinder .
+        resolutionStrategy.eachDependency { DependencyResolveDetails details ->
+            // Added after upgrading langchain4j to 1.0.0-beta2, which brought Jackson 2.18.2 in.
+            if (details.requested.group.startsWith('com.fasterxml.jackson')) {
+                details.useVersion '2.15.2'
+                details.because 'Need to match the version used by Spark.'
+            }
+            if (details.requested.group.equals("org.slf4j")) {
+                details.useVersion "2.0.16"
+                details.because "Ensures that slf4j-api 1.x does not appear on the Flux classpath in particular, which can " +
+                    "lead to this issue - https://www.slf4j.org/codes.html#StaticLoggerBinder."
+            }
+            if (details.requested.group.equals("org.apache.logging.log4j")) {
+                details.useVersion "2.24.3"
+                details.because "Need to match the version used by Apache Tika. Spark uses 2.20.0 but automated tests confirm " +
+                    "that Spark seems fine with 2.24.3."
+            }
+        }
+
         resolutionStrategy {
-            force "org.slf4j:slf4j-api:2.0.13"
+            // Avoids a classpath conflict between Spark and the tika-parser-microsoft-module. Tika needs a
+            // more recent version, and Spark (and Jena as well) both seem fine with this (as they should be per semver).
+            force "org.apache.commons:commons-compress:1.27.1"
        }
+
+        // Excluded from Flux for size reasons, so excluded here as well to ensure we don't need it when running tests.
+        exclude module: "rocksdbjni"
+    }
+
+    task allDeps(type: DependencyReportTask) {
+        description = "Allows for generating dependency reports for every subproject in a single task."
    }
 
    test {
@@ -46,6 +82,9 @@ subprojects {
            events 'started', 'passed', 'skipped', 'failed'
            exceptionFormat 'full'
        }
+        environment "SEMAPHORE_API_KEY", semaphoreApiKey
+        environment "SEMAPHORE_HOST", semaphoreHost
+        environment "SEMAPHORE_PATH", semaphorePath
    }
 
    // See https://docs.gradle.org/current/userguide/jacoco_plugin.html .
@@ -55,4 +94,17 @@ subprojects {
            xml.required = true
        }
    }
+
+    // Turning off coverage verification, as we haven't figured out how to configure Sonar correctly to pick up the jacoco
+    // report from the "code-coverage-report" subproject and accurately capture coverage.
+    // See https://docs.gradle.org/current/userguide/jacoco_plugin.html#sec:jacoco_report_violation_rules
+    jacocoTestCoverageVerification {
+        violationRules {
+            rule {
+                limit {
+                    minimum = 0.0
+                }
+            }
+        }
+    }
 }
code-coverage-report/build.gradle

Lines changed: 5 additions & 2 deletions
@@ -6,10 +6,13 @@ plugins {
 }
 
 dependencies {
-    jacocoAggregation project(':marklogic-langchain4j')
     jacocoAggregation project(':marklogic-spark-api')
-    jacocoAggregation project(':marklogic-spark-langchain4j')
+    jacocoAggregation project(':marklogic-langchain4j')
     jacocoAggregation project(':marklogic-spark-connector')
+
+    // With the separation of tests into this project, we aren't getting any jacoco code coverage data.
+    // A problem to be fixed later.
+    jacocoAggregation project(':tests')
 }
 
 reporting {
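
Editor's note: the diff cuts off at `reporting {`. With the `jacoco-report-aggregation` plugin implied by the `jacocoAggregation` dependencies, the remainder of that block typically looks like the following (an assumption, since the rest of the file is elided here):

```groovy
// Assumed continuation; the rest of code-coverage-report/build.gradle is elided in this diff.
// The jacoco-report-aggregation plugin merges coverage data from the subprojects listed in
// the jacocoAggregation configuration into a single report, generated by running the
// testCodeCoverageReport task.
reporting {
    reports {
        testCodeCoverageReport(JacocoCoverageReport) {
            testType = TestSuiteType.UNIT_TEST
        }
    }
}
```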

docker-compose.yaml

Lines changed: 1 addition & 29 deletions
@@ -1,4 +1,4 @@
-name: marklogic_spark
+name: docker-tests-marklogic-spark
 
 services:
 
@@ -27,31 +27,3 @@ services:
       - "8000-8002:8000-8002"
       - "8015:8015"
       - "8016:8016"
-
-  # Copied from https://docs.sonarsource.com/sonarqube/latest/setup-and-upgrade/install-the-server/#example-docker-compose-configuration .
-  sonarqube:
-    image: sonarqube:lts-community
-    depends_on:
-      - postgres
-    environment:
-      SONAR_JDBC_URL: jdbc:postgresql://postgres:5432/sonar
-      SONAR_JDBC_USERNAME: sonar
-      SONAR_JDBC_PASSWORD: sonar
-    volumes:
-      - sonarqube_data:/opt/sonarqube/data
-    ports:
-      - "9000:9000"
-
-  postgres:
-    image: postgres:15
-    environment:
-      POSTGRES_USER: sonar
-      POSTGRES_PASSWORD: sonar
-    volumes:
-      - postgresql:/var/lib/postgresql
-      - postgresql_data:/var/lib/postgresql/data
-
-volumes:
-  sonarqube_data:
-  postgresql:
-  postgresql_data:

docs/getting-started/jupyter.md

Lines changed: 2 additions & 2 deletions
@@ -32,15 +32,15 @@ connector and also to initialize Spark:
 
 ```
 import os
-os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars "/path/to/marklogic-spark-connector-2.5.1.jar" pyspark-shell'
+os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars "/path/to/marklogic-spark-connector-2.6.0.jar" pyspark-shell'
 
 from pyspark.sql import SparkSession
 spark = SparkSession.builder.master("local[*]").appName('My Notebook').getOrCreate()
 spark.sparkContext.setLogLevel("WARN")
 spark
 ```
 
-The path of `/path/to/marklogic-spark-connector-2.5.1.jar` should be changed to match the location of the connector
+The path of `/path/to/marklogic-spark-connector-2.6.0.jar` should be changed to match the location of the connector
 jar on your filesystem. You are free to customize the `spark` variable in any manner you see fit as well.
 
 Now that you have an initialized Spark session, you can run any of the examples found in the
