
Commit b2a7d7c

Commit message: Add documentation for Scala 2.13
1 parent: 991a0d6

File tree: 2 files changed (+395 -14 lines changed)

docs/en/install.md

Lines changed: 31 additions & 14 deletions
@@ -47,7 +47,7 @@ Spark NLP {{ site.sparknlp_version }} is built with ONNX 1.17.0 and TensorFlow 2
 
 ### Scala 2.13
 
-Note that Spark NLP from PyPI can not start a PySpark Scala 2.13 session. Please use the instructions above.
+**NOTE**: PySpark from PyPI is based on Scala 2.12 by default, and you can use our Scala 2.12 version. If you need to start a Scala 2.13 instance, you can set the `SPARK_HOME` environment variable to a Spark Scala 2.13 installation, or install PySpark from the official Spark archives.
 
 ```bash
 # Load Spark NLP with Spark Shell
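The `SPARK_HOME` note above can be sketched concretely. This is an illustrative sketch only: the install path and version numbers are assumptions, not part of the commit.

```shell
# Sketch only: the path and version numbers below are assumed, not from the docs.
# Point SPARK_HOME at a Spark distribution built for Scala 2.13.
export SPARK_HOME=/opt/spark-3.5.1-bin-hadoop3-scala2.13

# The Spark NLP artifact must match Spark's Scala binary version:
SCALA_BIN=2.13
SPARKNLP_PKG="com.johnsnowlabs.nlp:spark-nlp_${SCALA_BIN}:6.3.2"
echo "$SPARKNLP_PKG"

# Launching the session would then look like:
# "$SPARK_HOME/bin/pyspark" --packages "$SPARKNLP_PKG"
```

Building the coordinate from the Scala binary version makes the 2.12/2.13 mismatch harder to introduce by hand.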
@@ -121,6 +121,7 @@ spark = SparkSession.builder \
 .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:{{ site.sparknlp_version }}") \
 .getOrCreate()
 ```
+
 If using local jars, you can use `spark.jars` instead for comma-delimited jar files. For cluster setups, of course,
 you'll have to put the jars in a reachable location for all driver and executor nodes.
 
@@ -268,14 +269,24 @@ Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https:/
 
 If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your projects [Spark NLP SBT Starter](https://github.com/maziyarpanahi/spark-nlp-starter)
 
-#### Scala 2.13 Support
+### Scala 2.13 Support
+
+**NOTE**: PyPI-installed PySpark only runs on Scala 2.12, so the following section does not apply to it. If you need to start a Scala 2.13 instance, set the `SPARK_HOME` environment variable to a Spark Scala 2.13 installation, or install PySpark from the official Spark archives.
+
+If you are using `DependencyParserModel` or `TextMatcherModel` in your pipelines and wish to import them from the Scala 2.12 version into 2.13, you will need to export them manually. For this, see the example notebook [Converting Spark NLP Scala 2.12 models to Scala 2.13](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/scala213/converting_models_from_212.ipynb).
+
+`spark-nlp` with Scala 2.13 support has been published to [Maven Central](https://central.sonatype.com/artifact/com.johnsnowlabs.nlp/spark-nlp_2.13). You can use these coordinates to set up your Spark instance with the `--packages` option or download the jar directly. For example:
+
+```sh
+# Load Spark NLP with Spark Submit
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.13:6.3.2
+```
 
-**NOTE**: PyPi installed Pyspark only runs on Scala 2.12, so the following section will not apply for it. If you need to start a Scala 2.13 instance, you can set the `SPARK_HOME` environment variable to a Spark Scala 2.13 installation.
+See our [cheat sheet](#spark-nlp-cheatsheet) for more examples.
 
-The `spark-nlp` with Scala 2.13 support has been published to
-the [Maven Central](https://central.sonatype.com/artifact/com.johnsnowlabs.nlp/spark-nlp_2.13).
+To use spark-nlp Scala 2.13 as a dependency, change the `2.12` string in our dependencies to `2.13`.
 
-For Scala 2.13 support, change the `2.12` string in our dependencies to `2.13`.
+**spark-nlp:**
 
 ```xml
 <dependency>
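Before choosing between the `_2.12` and `_2.13` artifacts, it helps to check which Scala binary version a Spark installation targets. A hedged sketch, parsing the `Using Scala version ...` line that `spark-submit --version` typically prints; the sample line here is assumed for illustration:

```shell
# Sample output line from `spark-submit --version` (assumed for illustration):
VERSION_LINE="Using Scala version 2.13.8, OpenJDK 64-Bit Server VM, 17.0.9"

# Extract the binary version (2.12 or 2.13) and build the artifact name:
SCALA_BIN=$(echo "$VERSION_LINE" | sed -n 's/.*Scala version \(2\.1[23]\)\..*/\1/p')
echo "spark-nlp_${SCALA_BIN}"
```

In practice you would pipe the real `spark-submit --version 2>&1` output into the same `sed` expression.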
@@ -317,6 +328,8 @@ For Scala 2.13 support, change the `2.12` string in our dependencies to `2.13`.
 
 If you are running an sbt project in Scala 2.13, you don't require any changes, as the sbt syntax handles it automatically:
 
+**spark-nlp:**
+
 ```scala
 libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "{{ site.sparknlp_version }}"
 ```
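The `%%` in the sbt line above is what makes this automatic: sbt appends the project's Scala binary version to the artifact name. A small sketch of the equivalent expansion, with the binary version assumed to be 2.13:

```shell
# What sbt's %% operator does, sketched by hand: append the Scala binary
# version (assumed 2.13 here) to the artifact name.
SCALA_BINARY_VERSION=2.13
ARTIFACT="spark-nlp_${SCALA_BINARY_VERSION}"
echo "$ARTIFACT"
```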
@@ -727,9 +740,10 @@ Note: You can import these notebooks by using their URLs.
 Microsoft Fabric notebooks run on managed Spark 3.4 clusters, so you need to provide the Spark NLP fat JARs through OneLake/ABFSS and wire them into the runtime via Spark properties.
 
 ### Spark NLP on Microsoft Fabric
+
 1. Inside Fabric, go to a workspace, click the `+New Item` button, type `lake` in the search bar, choose `Lakehouse`, and type a name for it.
 <img class="image image--xl" src="/assets/images/installation/ms-fabric-lake-house-item.png" style="width:100%; align:center; box-shadow: 0 3px 6px rgba(0,0,0,0.16), 0 3px 6px rgba(0,0,0,0.23);"/>
-<img class="image image--xl" src="/assets/images/installation/ms-fabric-lake-house.png" style="width:100%; align:center; box-shadow: 0 3px 6px rgba(0,0,0,0.16), 0 3px 6px rgba(0,0,0,0.23);"/>
+   <img class="image image--xl" src="/assets/images/installation/ms-fabric-lake-house.png" style="width:100%; align:center; box-shadow: 0 3px 6px rgba(0,0,0,0.16), 0 3px 6px rgba(0,0,0,0.23);"/>
 2. Inside Fabric, go to a workspace, click the `+New Item` button, type `env` in the search bar, choose `Environment`, and type a name for it.
 <img class="image image--xl" src="/assets/images/installation/ms-fabric-spark-env.png" style="width:100%; align:center; box-shadow: 0 3px 6px rgba(0,0,0,0.16), 0 3px 6px rgba(0,0,0,0.23);"/>
 3. Choose **Fabric Runtime 1.2** (Spark 3.4 + Delta 2.4), then go to `Spark properties` and set `spark.jars`
@@ -738,7 +752,9 @@ Microsoft Fabric notebooks run on managed Spark 3.4 clusters, so you need to pro
 5. Create a Notebook and attach it to the environment you created before.
 
 ### Spark NLP ONNX compatibility on Microsoft Fabric
+
 Follow the steps above to set up Spark NLP, then add the following additional steps to enable ONNX inference support:
+
 1. In `Spark properties`, point `spark.executor.extraClassPath` and `spark.driver.extraClassPath` to the ABFSS jar directory to ensure the ONNX classes are visible: `abfss://workspace@storage.dfs.core.windows.net/jars/spark-nlp-assembly-{{ site.sparknlp_version }}.jar`.
 2. In `Spark properties`, enable `spark.executor.userClassPathFirst=true` and `spark.driver.userClassPathFirst=true` so the Spark NLP/ONNX classes take precedence over the Fabric runtime defaults.
 
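For local verification outside Fabric, the two class-path-precedence properties above map onto plain `spark-submit` flags. This is a hypothetical invocation: the jar path and application file are placeholders, not real paths.

```shell
# Hypothetical local equivalent of the Fabric Spark properties above.
# The jar path and application file are placeholders, not real paths.
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars /path/to/spark-nlp-assembly.jar \
  your_app.py
```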
@@ -867,16 +883,16 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
 --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:{{ site.sparknlp_version }}
 ```
 
-2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
-
-3. Now, you can attach your notebook to the cluster and use the Spark NLP!
+1. On an existing cluster, you need to install the spark-nlp and spark-nlp-display packages from PyPI.
 
+2. Now you can attach your notebook to the cluster and use Spark NLP!
 
 ## Apache Spark Support
 
 Spark NLP *{{ site.sparknlp_version }}* has been built on top of Apache Spark 3.4 while fully supporting Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
 
 {:.table-model-big}
+
 | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
 | --------- | ------------------ | ------------------ | ------------------ | ------------------ | ------------------ | ------------------ | ------------------ | ------------------ |
 | 5.4.x | YES | YES | YES | YES | YES | YES | NO | NO |
@@ -895,6 +911,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
 ## Scala and Python Support
 
 {:.table-model-big}
+
 | Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | Scala 2.11 | Scala 2.12 |
 | --------- | ---------- | ---------- | ---------- | ---------- | ----------- | ---------- | ---------- |
 | 5.3.x | NO | YES | YES | YES | YES | NO | YES |
@@ -907,12 +924,12 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
 | 4.1.x | YES | YES | YES | YES | NO | NO | YES |
 | 4.0.x | YES | YES | YES | YES | NO | NO | YES |
 
-
 ## Databricks Support
 
 Spark NLP {{ site.sparknlp_version }} has been tested and is compatible with the following runtimes:
 
 {:.table-model-big}
+
 | CPU | GPU |
 |--------------------|--------------------|
 | 9.1 / 9.1 ML | 9.1 ML & GPU |
@@ -1081,9 +1098,9 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
 --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh
 ```
 
-2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
+1. On an existing cluster, you need to install the spark-nlp and spark-nlp-display packages from PyPI.
 
-3. Now, you can attach your notebook to the cluster and use the Spark NLP!
+2. Now you can attach your notebook to the cluster and use Spark NLP!
 
 </div><div class="h3-box" markdown="1">
 
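The `pip-install.sh` initialization action used in the `gcloud` command above reads the packages to install from cluster metadata; the `PIP_PACKAGES` key name is my assumption based on that init action's conventions. A sketch of assembling the flag:

```shell
# Sketch: assemble the metadata flag consumed by the pip-install.sh
# initialization action (the PIP_PACKAGES key name is assumed).
PIP_PACKAGES="spark-nlp spark-nlp-display"
echo "--metadata PIP_PACKAGES=${PIP_PACKAGES}"
```

The resulting `--metadata` flag would be passed alongside `--initialization-actions` when creating the cluster.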
@@ -1114,7 +1131,6 @@ You can pick the index number (I am using java-8 as default - index 2):
 
 <img class="image image--xl" src="/assets/images/installation/amazon-linux.png" style="width:100%; align:center; box-shadow: 0 3px 6px rgba(0,0,0,0.16), 0 3px 6px rgba(0,0,0,0.23);"/>
 
-
 If you don't have java-11 or java-8 on your system, you can easily install it via:
 
 ```bash
@@ -1252,6 +1268,7 @@ Follow the below steps to set up Spark NLP with Spark 3.2.3:
 7. Create folders `C:\tmp` and `C:\tmp\hive`
 - If you encounter issues with permissions to these folders, you might need
 to change the permissions by running the following commands:
+
 ```
 %HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive
 %HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/
