
aloni636/learning-scala


Workflow

sbt is your friend - it's how you compile and run Scala:

sbt
sbt:learning-scala> run

Running Exercises

I use a CLI argument dispatcher, so you select an exercise by its object name and run it: sbt "run Ex01". View all available exercises by running sbt run.
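A minimal sketch of what such a dispatcher can look like (the `Exercise` trait and the registry below are illustrative, not the repo's actual code):

```scala
// Hypothetical sketch of a CLI argument dispatcher for exercises.
trait Exercise {
  def run(): Unit
}

object Ex01 extends Exercise {
  def run(): Unit = println("running Ex01")
}

object Dispatcher {
  // Registry mapping object names to runnable exercises; Ex01 is a stand-in.
  val exercises: Map[String, Exercise] = Map("Ex01" -> Ex01)

  def main(args: Array[String]): Unit = args.headOption match {
    case Some(name) if exercises.contains(name) =>
      exercises(name).run()
    case _ =>
      // No (or unknown) argument: list what's available.
      println(s"Available exercises: ${exercises.keys.mkString(", ")}")
  }
}
```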

Creating Exercises

Create a new Ex<> file in hello/src/main/scala and add an exercise object with a run method. Then add it to the exercise list in hello/src/main/scala/hello, and run it with sbt "run Ex<>".
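A new exercise file might look like this (Ex99 is a placeholder name, and the exact `run` signature should match the existing exercises):

```scala
// hello/src/main/scala/Ex99.scala — hypothetical skeleton for a new exercise.
object Ex99 {
  def run(): Unit = {
    println("Hello from Ex99")
  }
}
```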

Running Spark Exercises

A Spark standalone cluster is automatically started when the devcontainer is created, available at spark://localhost:7077.

You can monitor the Spark cluster at http://localhost:8080, the history server at http://localhost:18080, and, more granularly, the logs in $SPARK_HOME/logs/.

Interactive Shells

You have 3 ways to experiment interactively with Scala: spark-shell / sbt console, JupyterLab, or worksheets.

Shell

You can experiment interactively in a Spark-configured console using:

spark-shell --master spark://localhost:7077

To have all the project's dependencies on the classpath, use this instead:

sbt console
...
scala> import org.apache.spark.sql.SparkSession
scala> val spark = SparkSession.builder()
        .appName("console")
        .master("spark://localhost:7077")
        .getOrCreate()

JupyterLab

A JupyterLab server is also started automatically (with the Almond Scala kernel, Metals LSP support, and jupytext).

Open it by looking for the Jupyter URL in the .jupyter.log file, or run:

jupyter server list
# Currently running servers:
# http://127.0.0.1:8888/?token=<TOKEN> :: /workspaces/learning-scala

All notebooks are stored in ./notebooks as jupytext files (those are easier to manage with git).

To stop Jupyter, run ./scripts/stop-jupyter.sh.

Note: VSCode notebooks do not communicate correctly with the Almond kernel, leading to nonexistent autocomplete support, so I recommend using JupyterLab directly.

Worksheets

Worksheets provide interactive evaluation with tight sbt integration. The ./worksheets directory supports worksheets via VSCode's Metals extension and automatically provides all the required dependencies.
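For example, a worksheet file such as ./worksheets/scratch.worksheet.sc (the file name is illustrative) evaluates every top-level expression and shows its value inline:

```scala
// In a Metals worksheet, each expression's result is displayed next to it.
val xs = List(1, 2, 3)
val doubled = xs.map(_ * 2) // List(2, 4, 6)
val total = doubled.sum     // 12
```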

Downloading Data

TLDR:

source ./.venv/bin/activate
./scripts/download-taxi.sh
./scripts/download-himalayas.sh
./scripts/download-natural-earth.sh      
./scripts/download-himalayas-vectors.sh  

Exercise 6 (./src/main/scala/Ex06.scala) requires part of the NYC TLC dataset. Fetch it using ./scripts/download-taxi.sh.
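Once downloaded, reading the data follows the usual Spark pattern. A hedged sketch (the data path, app name, and column access below are assumptions, not the exercise's actual code):

```scala
import org.apache.spark.sql.SparkSession

object TaxiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("taxi-sketch")
      .master("spark://localhost:7077")
      .getOrCreate()

    // The NYC TLC data ships as parquet; the exact path is an assumption.
    val trips = spark.read.parquet("data/taxi")
    trips.printSchema()
    println(s"rows: ${trips.count()}")

    spark.stop()
  }
}
```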

Exercise 7 (./src/main/scala/Ex07.scala) requires the Copernicus dataset. Fetching it is complex enough to require Python scripting. The Python environment is automatically configured, but the .venv must be activated! If VSCode doesn't activate it when you open the integrated terminal, run source ./.venv/bin/activate before running ./scripts/download-himalayas.sh.

Projections

In exercise 7 (Ex07) I used a custom projection to perform projected analysis of the Himalayas without UTM stitching. To evaluate it (and other projections), I performed a CRS distortion analysis, which you can read about here: Projection Considerations.
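As an illustration of how a custom projection can be defined in GeoTrellis, a CRS can be built from a proj4 string. The parameters below (an Albers equal-area centered roughly on the Himalayas) are a generic example, not the projection actually used in Ex07:

```scala
import geotrellis.proj4.CRS

// A custom CRS from a proj4 string; the projection parameters here are
// illustrative only, not the ones used in the exercise.
val himalayaAlbers: CRS = CRS.fromString(
  "+proj=aea +lat_1=27 +lat_2=32 +lat_0=29 +lon_0=84 +datum=WGS84 +units=m"
)
```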

Debugging

Dependencies Hell

Use the Graphviz Interactive Preview extension to analyze & debug the project's dependency graph:

sbt dependencyDot
# ...
# [info] Wrote dependency graph to '/workspaces/learning-scala/target/dependencies-compile.dot'

Docker

The quickest build-debug cycle for the devcontainer is: edit ./.devcontainer content, run docker build -f .devcontainer/Dockerfile ., debug, and repeat. Once it builds, run VSCode's Dev Containers: Rebuild and Reopen in Container.

Spark Driver logs

Driver logs are persisted in /var/spark-events/driverLog. Unfortunately I couldn't manage to display them in the History Server UI, so you'll have to read them from your terminal.

Logs

All logging is configured in src/main/resources/log4j2.properties, which is symlinked to /opt/spark/conf/log4j2.properties so it also applies to the Spark cluster. When testing, a dedicated set of logs is produced, one per JVM test process (each test is a JVM fork, to guarantee each JVM manages only one local Spark instance).
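The per-test JVM forking described above is typically configured in build.sbt along these lines (a sketch, not necessarily the repo's exact settings):

```scala
// build.sbt (sketch): fork tests out of sbt's JVM, and give each test
// class its own forked JVM so every fork manages exactly one local
// Spark instance.
Test / fork := true

Test / testGrouping := (Test / definedTests).value.map { test =>
  Tests.Group(
    name = test.name,
    tests = Seq(test),
    runPolicy = Tests.SubProcess(ForkOptions())
  )
}
```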

To enable debug-level logging, change logger.learningscala.level = warn to debug in src/main/resources/log4j2.properties.

Credits

About

Scala, GeoTrellis, GIS, Spark, RDDs - Real data, custom exercises
