
aloni636/learning-scala


Workflow

sbt is your friend - it's how you compile and run Scala:

sbt
sbt:learning-scala> run

Running Exercises

I use a CLI argument dispatcher, so you select an exercise by its object name and run it: sbt "run Ex01". View all available exercises by running sbt run.
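A minimal sketch of what such a dispatcher can look like (the `Exercise` trait and the registry below are illustrative, not the repo's actual code):

```scala
// Hypothetical sketch of a CLI argument dispatcher for exercises.
trait Exercise {
  def run(): Unit
}

object Ex01 extends Exercise {
  def run(): Unit = println("running Ex01")
}

object Dispatcher {
  // Registry mapping object names to runnable exercises; Ex01 is a stand-in.
  val exercises: Map[String, Exercise] = Map("Ex01" -> Ex01)

  def main(args: Array[String]): Unit = args.headOption match {
    case Some(name) if exercises.contains(name) =>
      exercises(name).run()
    case _ =>
      // No (or unknown) argument: list what's available.
      println(s"Available exercises: ${exercises.keys.mkString(", ")}")
  }
}
```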

Creating Exercises

Create a new Ex<> file in hello/src/main/scala and add an exercise object with a run method. Then add it to the exercise list in hello/src/main/scala/hello, and run it with sbt "run Ex<>".
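A new exercise file might look like this (Ex99 is a placeholder name, and the exact `run` signature should match the existing exercises):

```scala
// hello/src/main/scala/Ex99.scala — hypothetical skeleton for a new exercise.
object Ex99 {
  def run(): Unit = {
    println("Hello from Ex99")
  }
}
```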

Running Spark Exercises

A Spark standalone cluster is automatically started when the devcontainer is created, available at spark://localhost:7077.

You can monitor the Spark cluster at http://localhost:8080, the history server at http://localhost:18080, and, more granularly, the logs in $SPARK_HOME/logs/.

Interactive Shells

You have 3 ways to experiment interactively with Scala: spark-shell / sbt console, JupyterLab, or worksheets.

Shell

You can experiment interactively in a Spark-configured console using:

spark-shell --master spark://localhost:7077

To have all the project's dependencies on the classpath, use this instead:

sbt console
...
scala> import org.apache.spark.sql.SparkSession
scala> val spark = SparkSession.builder()
        .appName("console")
        .master("spark://localhost:7077")
        .getOrCreate()

JupyterLab

A JupyterLab server is also started automatically (with the Almond Scala kernel, Metals LSP support, and jupytext).

Open it by looking for the Jupyter URL in the .jupyter.log file, or run:

jupyter server list
# Currently running servers:
# http://127.0.0.1:8888/?token=<TOKEN> :: /workspaces/learning-scala

All notebooks are stored in ./notebooks as jupytext files (those are easier to manage with git).

To stop Jupyter, run ./scripts/stop-jupyter.sh.

Note: VSCode notebooks do not communicate correctly with the Almond kernel, leading to nonexistent autocomplete support, so I recommend using JupyterLab directly.

Worksheets

Worksheets provide interactive evaluation with tight sbt integration. The ./worksheets directory supports worksheets via VSCode's Metals extension and automatically provides all the required dependencies.
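For example, a worksheet file such as ./worksheets/scratch.worksheet.sc (the file name is illustrative) evaluates every top-level expression and shows its value inline:

```scala
// In a Metals worksheet, each expression's result is displayed next to it.
val xs = List(1, 2, 3)
val doubled = xs.map(_ * 2) // List(2, 4, 6)
val total = doubled.sum     // 12
```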

Downloading Data

TLDR:

source ./.venv/bin/activate
./scripts/download-taxi.sh
./scripts/download-himalayas.sh
./scripts/download-natural-earth.sh      
./scripts/download-himalayas-vectors.sh  

Exercise 6 (./src/main/scala/Ex06.scala) requires part of the NYC TLC dataset. Fetch it using ./scripts/download-taxi.sh.
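Once downloaded, reading the data follows the usual Spark pattern. A hedged sketch (the data path, app name, and column access below are assumptions, not the exercise's actual code):

```scala
import org.apache.spark.sql.SparkSession

object TaxiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("taxi-sketch")
      .master("spark://localhost:7077")
      .getOrCreate()

    // The NYC TLC data ships as parquet; the exact path is an assumption.
    val trips = spark.read.parquet("data/taxi")
    trips.printSchema()
    println(s"rows: ${trips.count()}")

    spark.stop()
  }
}
```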

Exercise 7 (./src/main/scala/Ex07.scala) requires the Copernicus dataset. Fetching it is complex enough to require Python scripting. The Python environment is automatically configured, but the .venv must be activated! If VSCode doesn't activate it when you open the integrated terminal, run source ./.venv/bin/activate before running ./scripts/download-himalayas.sh.

Projections

In exercise 7 (Ex07) I used a custom projection to perform projected analysis of the Himalayas without UTM stitching. To evaluate it (and other projections), I performed a CRS distortion analysis, which you can read about here: Projection Considerations.
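As an illustration of how a custom projection can be defined in GeoTrellis, a CRS can be built from a proj4 string. The parameters below (an Albers equal-area centered roughly on the Himalayas) are a generic example, not the projection actually used in Ex07:

```scala
import geotrellis.proj4.CRS

// A custom CRS from a proj4 string; the projection parameters here are
// illustrative only, not the ones used in the exercise.
val himalayaAlbers: CRS = CRS.fromString(
  "+proj=aea +lat_1=27 +lat_2=32 +lat_0=29 +lon_0=84 +datum=WGS84 +units=m"
)
```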

Debugging

Dependencies Hell

Use the Graphviz Interactive Preview extension to analyze & debug the project's dependency graph:

sbt dependencyDot
# ...
# [info] Wrote dependency graph to '/workspaces/learning-scala/target/dependencies-compile.dot'

Docker

The quickest build-debug cycle for the devcontainer is: edit ./.devcontainer content, run docker build -f .devcontainer/Dockerfile ., debug, and repeat. Once it builds, run VSCode's Dev Containers: Rebuild and Reopen in Container.

Spark Driver logs

Driver logs are persisted in /var/spark-events/driverLog. Unfortunately I couldn't manage to display them in the History Server UI, so you'll have to read them from your terminal.

Logs

All logging is configured in src/main/resources/log4j2.properties, which is symlinked to /opt/spark/conf/log4j2.properties so it also applies to the Spark cluster. When testing, a dedicated set of logs is produced, one per JVM test process (each test is a JVM fork, to guarantee each JVM manages only one local Spark instance).
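The per-test JVM forking described above is typically configured in build.sbt along these lines (a sketch, not necessarily the repo's exact settings):

```scala
// build.sbt (sketch): fork tests out of sbt's JVM, and give each test
// class its own forked JVM so every fork manages exactly one local
// Spark instance.
Test / fork := true

Test / testGrouping := (Test / definedTests).value.map { test =>
  Tests.Group(
    name = test.name,
    tests = Seq(test),
    runPolicy = Tests.SubProcess(ForkOptions())
  )
}
```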

To enable debug-level logging, change logger.learningscala.level = warn to debug in src/main/resources/log4j2.properties.

Credits

About

Scala, GeoTrellis, GIS, Spark, RDDs - Real data, custom exercises
