This guide covers how to develop and test this project. It assumes that you have cloned this repository to your local
workstation.

Due to the use of the Sonar plugin for Gradle, you must use Java 11 or higher for developing and testing the project.
The `build.gradle` file for this project ensures that the connector is built to run on Java 8 or higher.

# Setup

To begin, you need to deploy the test application in this project to MarkLogic. You can do so either on your own
installation of MarkLogic, or you can use `docker-compose` to install MarkLogic, optionally as a 3-node cluster with
a load balancer in front of it.

## Installing MarkLogic with docker-compose

Run `docker-compose up -d --build`. The above will result in a new MarkLogic instance with a single node.

Alternatively, if you would like to test against a 3-node MarkLogic cluster with a load balancer in front of it,
run `docker-compose -f docker-compose-3nodes.yaml up -d --build`.

## Accessing MarkLogic logs in Grafana

This project's `docker-compose-3nodes.yaml` file includes
[Grafana, Loki, and promtail services](https://grafana.com/docs/loki/latest/clients/promtail/), primarily so that
MarkLogic log files can be collected and then viewed and searched via Grafana.

You can then run the tests from within the Docker environment via the following:

    ./gradlew dockerTest

## Generating code quality reports with SonarQube

In order to use SonarQube, you must have used Docker to run this project's `docker-compose.yml` file and you must
have the services in that file running.

To configure the SonarQube service, perform the following steps:

1. Go to http://localhost:9000.
2. Log in as admin/admin. SonarQube will ask you to change this password; you can choose whatever you want ("password" works).
3. Click on "Create project manually".
4. Enter "marklogic-spark" for the Project Name; use that as the Project Key too.
5. Enter "develop" as the main branch name.
6. Click on "Next".
7. Click on "Use the global setting" and then "Create project".
8. On the "Analysis Method" page, click on "Locally".
9. In the "Provide a token" panel, click on "Generate". Copy the token.
10. Add `systemProp.sonar.token=your token pasted here` to `gradle-local.properties` in the root of your project, creating
that file if it does not exist yet.

To run SonarQube, run the following Gradle tasks, which will run all the tests with code coverage and then generate
a quality report with SonarQube:

    ./gradlew test sonar

If you do not add `systemProp.sonar.token` to your `gradle-local.properties` file, you can specify the token via the
following:

    ./gradlew test sonar -Dsonar.token=paste your token here

When that completes, you will see a line like this near the end of the logging:

    ANALYSIS SUCCESSFUL, you can find the results at: http://localhost:9000/dashboard?id=marklogic-spark

Click on that link. If it's the first time you've run the report, you'll see all issues. If you've run the report
before, then SonarQube will show "New Code" by default. That's handy, as you can use that to quickly see any issues
you've introduced on the feature branch you're working on. You can then click on "Overall Code" to see all issues.

Note that if you only need results on code smells and vulnerabilities, you can repeatedly run `./gradlew sonar`
without having to re-run the tests.

# Testing with PySpark

The documentation for this project
describes how to use the connector with PySpark. To test a locally built copy of the connector instead, first run
`./gradlew clean shadowJar` if you have not already done so. This will produce a single jar file for the connector in
the `./build/libs` directory.

You can then launch PySpark with the connector available via:

    pyspark --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar

The below command is an example of loading data from the test application deployed via the instructions at the top of
this page.

```
df = spark.read.format("marklogic")\
    .option("spark.marklogic.client.uri", "spark-test-user:spark@localhost:8016")\
    .option("spark.marklogic.read.opticQuery", "op.fromView('Medical', 'Authors')")\
    .option("spark.marklogic.read.numPartitions", 8)\
    .load()
```

You now have a Spark dataframe - try some commands out on it:
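
As a starting point, here are a few standard DataFrame operations; nothing below is specific to the connector, so any
Spark dataframe command will work:

```
df.count()
df.show(5)
df.printSchema()
```
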
Check out the [PySpark docs](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html) for
more commands you can try out.

You can query for documents as well - the following shows a simple example along with a technique for converting the
binary content of each document into a string of JSON.

```
import json
from pyspark.sql import functions as F

df = spark.read.format("marklogic")\
    .option("spark.marklogic.client.uri", "spark-test-user:spark@localhost:8016")\
    .option("spark.marklogic.read.documents.collections", "author")\
    .load()
df.show()

df2 = df.select(F.col("content").cast("string"))
df2.head()
json.loads(df2.head()['content'])
```
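
If you want to parse every matching document rather than just the first row, one option is the small sketch below; it
builds on the snippet above and assumes that each document in the "author" collection is JSON and that the results fit
in driver memory:

```
rows = df2.collect()
authors = [json.loads(row['content']) for row in rows]
print(len(authors))
print(authors[0])
```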

# Testing against a local Spark cluster

When you run PySpark, it will create its own Spark cluster. If you'd like to try against a separate Spark cluster
that still runs on your local machine, perform the following steps:

1. Use [sdkman to install Spark](https://sdkman.io/sdks#spark). Run `sdk install spark 3.4.1` since we are currently
building against Spark 3.4.1.
2. `cd ~/.sdkman/candidates/spark/current/sbin`, which is where sdkman will install Spark.
3. Run `./start-master.sh` to start a master Spark node.
4. `cd ../logs` and open the master log file that was created to find the address for the master node. It will be in a
log message similar to `Starting Spark master at spark://NYWHYC3G0W:7077` - copy that address at the end of the message.
5. `cd ../sbin`.
6. Run `./start-worker.sh spark://NYWHYC3G0W:7077`, changing that address as necessary.

You can of course simplify the above steps by adding `SPARK_HOME` to your environment and adding `$SPARK_HOME/sbin` to
your path, which avoids having to change directories. The log files in `./logs` are useful to tail as well.

The Spark master GUI is at <http://localhost:8080>. You can use this to view details about jobs running in the cluster.

Now that you have a Spark cluster running, you just need to tell PySpark to connect to it:

    pyspark --master spark://NYWHYC3G0W:7077 --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar

You can then run the same commands as shown in the PySpark section above. The Spark master GUI will allow you to
examine details of each of the commands that you run.
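
If you would rather connect to the cluster from a standalone Python script instead of the PySpark shell, the sketch
below shows one way to do it; the master address, jar path, and application name are assumptions that you should adjust
to match your own environment:

```
from pyspark.sql import SparkSession

# Connect to the local standalone master started above and make the connector jar
# available to the session. Adjust both values to match your setup.
spark = SparkSession.builder \
    .master("spark://NYWHYC3G0W:7077") \
    .config("spark.jars", "build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar") \
    .appName("marklogic-spark-sanity-check") \
    .getOrCreate()

# The same read options shown in the PySpark section above can now be used with this session.
```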

The above approach is ultimately a sanity check to ensure that the connector works properly with a separate cluster
process.

## Testing spark-submit

Once you have the above Spark cluster running, you can test out
[spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html) which enables submitting a program
and an optional set of jars to a Spark cluster for execution.

You will need the connector jar available, so run `./gradlew clean shadowJar` if you have not already.

You can then run a test Python program in this repository via the following (again, change the master address as
needed); note that you run this outside of PySpark, and `spark-submit` is available after having installed PySpark:

    spark-submit --master spark://NYWHYC3G0W:7077 --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar src/test/python/test_program.py

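The actual program lives at `src/test/python/test_program.py`; purely as a rough illustration of the shape of a program
run via `spark-submit` (this sketch is not that file's contents), a minimal program might look something like the
following, with the connection details assumed to match the test application:

```
from pyspark.sql import SparkSession

# A program run via spark-submit builds its own SparkSession; the master URL is
# supplied on the command line, so it is not set here.
spark = SparkSession.builder.appName("marklogic-spark-test-program").getOrCreate()

df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-test-user:spark@localhost:8016") \
    .option("spark.marklogic.read.opticQuery", "op.fromView('Medical', 'Authors')") \
    .load()

print(df.count())
spark.stop()
```
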
You can also test a Java program. To do so, first move the `com.marklogic.spark.TestProgram` class from `src/test/java`
to `src/main/java`. Then run `./gradlew clean shadowJar` to rebuild the connector jar. Then run the following:

    spark-submit --master spark://NYWHYC3G0W:7077 --class com.marklogic.spark.TestProgram build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar

Be sure to move `TestProgram` back to `src/test/java` when you are done.

# Testing the documentation locally

See the section with the same name in the