sbt is your friend - that's how you execute Scala:

```shell
sbt
sbt:learning-scala> run
```

I use a CLI argument dispatcher, so you select an exercise by its object name and run it: `sbt "run Ex01"`. View all available exercises by running `sbt run`.
Create a new Ex<> file in hello/src/main/scala and add an Exercise object with a run method. Then register it in the exercise list in hello/src/main/scala/hello and run it with `sbt "run Ex<>"`.
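As a sketch, a new exercise file could look like the following (the object name `Ex99` and the exact `run` signature are assumptions; mirror an existing exercise such as `Ex01` for the real shape expected by the dispatcher):

```scala
// Hypothetical skeleton for hello/src/main/scala/Ex99.scala.
// The name and signature are illustrative, not the repo's exact contract.
object Ex99 {
  // Keeping the computed value separate from the side effect makes the
  // exercise easy to test.
  def greeting: String = "Hello from Ex99"

  def run(): Unit =
    println(greeting)
}
```

After saving the file, the remaining step is registering `Ex99` in the exercise list so `sbt "run Ex99"` can find it.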
A Spark standalone cluster is started automatically when the devcontainer is created, available at `spark://localhost:7077`.
You can monitor the Spark cluster at http://localhost:8080, the history server at http://localhost:18080, and inspect logs more granularly under `$SPARK_HOME/logs/`.
You have three ways to experiment interactively with Scala: `spark-shell` / `sbt console`, JupyterLab, or worksheets.
You can experiment interactively in a Spark-configured console using:

```shell
spark-shell --master spark://localhost:7077
```

To have all the project dependencies available, use this instead:

```shell
sbt console
```

```scala
scala> import org.apache.spark.sql.SparkSession
scala> val spark = SparkSession.builder()
         .appName("console")
         .master("spark://localhost:7077")
         .getOrCreate()
```

A JupyterLab server is also started automatically (with the Almond Scala kernel, Metals LSP support, and jupytext).
Open it by looking for the Jupyter URL in the .jupyter.log file, or run:

```shell
jupyter server list
# Currently running servers:
# http://127.0.0.1:8888/?token=<TOKEN> :: /workspaces/learning-scala
```

All notebooks are stored in ./notebooks as jupytext files (those are easier to manage with git).
To stop Jupyter, run `./scripts/stop-jupyter.sh`.
Note: VSCode notebooks do not communicate correctly with the Almond kernel, leading to nonexistent autocomplete support, so I recommend using JupyterLab directly.
Worksheets provide interactive computation with tight sbt integration. The ./worksheets directory supports worksheets via VSCode's Metals extension and automatically provides all the required dependencies.
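For instance, a minimal worksheet might look like this (the file name is illustrative; Metals evaluates each top-level expression and shows the result inline next to the line):

```scala
// worksheets/demo.worksheet.sc (illustrative name)
// Each top-level value is evaluated and its result displayed inline.
val xs = (1 to 10).toList
val evens = xs.filter(_ % 2 == 0) // List(2, 4, 6, 8, 10)
val total = evens.sum             // 30
```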
TLDR:

```shell
source ./.venv/bin/activate
./scripts/download-taxi.sh
./scripts/download-himalayas.sh
./scripts/download-natural-earth.sh
./scripts/download-himalayas-vectors.sh
```

Exercise 6 (./src/main/scala/Ex06.scala) requires part of the NYC TLC dataset. Fetch it using `./scripts/download-taxi.sh`.
Exercise 7 (./src/main/scala/Ex07.scala) requires the Copernicus dataset. Fetching it is complex enough to require Python scripting. The Python environment is configured automatically, but the .venv must be activated! If VSCode doesn't activate it when you open the integrated terminal, run `source ./.venv/bin/activate` before running `./scripts/download-himalayas.sh`.
In exercise 7 (Ex07) I used a custom projection to perform projected analysis of the Himalayas without UTM stitching. To evaluate it (and other projections) I performed CRS distortion analysis, which you can read about here: Projection Considerations.
Use the Graphviz Interactive Preview extension to analyze & debug sbt's dependency graph:

```shell
sbt dependencyDot
# ...
# [info] Wrote dependency graph to '/workspaces/learning-scala/target/dependencies-compile.dot'
```

The quickest build-debug cycle for the devcontainer is: edit ./.devcontainer content, run `docker build -f .devcontainer/Dockerfile .`, debug, and repeat. Once it builds, run VSCode's Dev Containers: Rebuild and Reopen in Container.
Driver logs are persisted in /var/spark-events/driverLog. Unfortunately I couldn't manage to display them in the History Server UI, so you'll have to read them from your terminal.
All logging is configured in src/main/resources/log4j2.properties, which is symlinked to /opt/spark/conf/log4j2.properties so it also applies to the Spark cluster. When testing, a dedicated set of logs is produced, one per JVM test process (each test is a JVM fork, to guarantee that each JVM manages only one local Spark instance).
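A build.sbt fragment along these lines would produce that per-test forking (the exact settings used in this repo are an assumption; `Tests.Group` with `Tests.SubProcess` is sbt's standard mechanism for giving each test its own JVM):

```scala
// Hypothetical build.sbt fragment: fork tests out of the sbt JVM...
Test / fork := true
// ...and put each test in its own group so every test gets a fresh JVM,
// ensuring each JVM hosts at most one local Spark instance.
Test / testGrouping := (Test / definedTests).value.map { test =>
  Tests.Group(test.name, Seq(test), Tests.SubProcess(ForkOptions()))
}
```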
To enable debug-level logging, change `logger.learningscala.level = warn` to `debug` in src/main/resources/log4j2.properties.
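That is, the relevant line becomes:

```properties
# src/main/resources/log4j2.properties
logger.learningscala.level = debug
```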
- https://github.com/datablist/sample-csv-files for ./data/customers-100.csv.
- https://file-examples.com/index.php/sample-images-download/sample-tiff-download/ for ./data/0x00000001-0x00000002-0x00000001.tiff (downsampled).
- https://people.math.sc.edu/Burkardt/data/tif/tif.html (at3_1m4_01.tif, biological cells, frame 1) for ./data/0x00000001-0x00000002-0x00000003.tiff (downsampled).
- https://portal.opentopography.org/raster?opentopoID=OTSDEM.032021.4326.3 for ./data/Everest_COP30.tif. Extent details (lon, lat): TL [86.899444, 28.031667], TR [86.983056, 28.031667], BR [86.983056, 27.957778], BL [86.899444, 27.957778].
- https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf for the NYC-TLC data dictionary PDF.
- https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/physical/ne_10m_geography_regions_polys.zip for ./data/himalayas.geojson (extracted with `ogr2ogr himalayas.geojson ne_10m_geography_regions_polys.shp -where "name = 'Himalayas'"`).