COM6012 Scalable Machine Learning 2026 by Shuo Zhou at The University of Sheffield
- Task 1: To finish in the lab session on 11th Feb. Critical
- Task 2: To finish in the lab session on 11th Feb. Critical
- Task 3: To finish in the lab session on 11th Feb. Essential
- Task 4: To finish in the lab session on 11th Feb. Essential
- Task 5: To finish by the following Monday 16th Feb. Exercise
- Task 6: To explore further. Optional
Suggested reading:
- Spark Overview
- Spark Quick Start (Choose Python rather than the default Scala)
- Chapters 2 and 4 of Learning Apache Spark with Python
- Reference: PySpark 4.1.0 documentation
- Reference: PySpark source code
Note - Please READ before proceeding:
- HPC nodes are shared resources (like buses/trains) relying on considerate usage of every user. When requesting resources, if you ask for too much (e.g. 50 cores), it will take a long time to get allocated, particularly during "rush hours" (e.g. close to deadlines) and once allocated, it will leave much less for the others. If everybody is asking for too much, the system won't work and everyone suffers.
- Please follow ALL steps (step by step without skipping) unless you are very confident in handling problems by yourself.
- Please try your best to follow the study schedule above to finish the tasks on time. If you start early/on time, you will find your problems early, so you can make good use of the labs and online sessions to get help from the instructors and teaching assistants, rather than panicking close to an assessment deadline. Based on our experience from the past five years, rushing towards an assessment deadline in this module is likely to make you fall, sometimes painfully.
Follow the official instructions from our university. Your HPC account has already been created for this module. You have been asked to complete and pass the HPC Driving License test by 11:00 on Wednesday 11th Feb. If you have not done so, please do it as soon as possible.
To access Stanage, log in using SSH with your university username, such as abc1de, and the associated password. Whether connecting on campus via Eduroam or from off campus, you must keep the university's VPN connected the whole time. Multifactor authentication (MFA) is mandatory; the standard University DUO MFA is used.
Mac OS/X and Linux users, following the official connection instructions, open a terminal and connect to Stanage via SSH by
ssh $USER@stanage.shef.ac.uk # Use lowercase for your username, without `$`
You need to replace `$USER` with your username. Assuming it is abc1de, use the command `ssh abc1de@stanage.shef.ac.uk` (lowercase, without `$`).
Windows users are recommended to use MobaXterm for SSH. Download the portable edition. Unzip MobaXterm, launch it, and click Session --> SSH. Then, enter the following details:
If successful, you will be on the login node and should see
[abc1de@login2 [stanage] ~]$
`abc1de` should be your username.
If you have problems logging in, do the following in sequence:
- Check the Frequently Asked Questions to see whether you have a similar problem listed there, e.g. `bash-4.x$` being displayed instead of your username at the bash prompt.
- Change your password through Muse > My Services > Account Settings.
- Come to the labs on Thursdays or office hours on Wednesdays for in-person help, and online sessions on Thursdays for online help.
- You can save the host, username (and password if your computer is secure) as a Session if you want to save time in future.
- You can edit Settings --> Keyboard shortcuts to customise the keyboard shortcuts, e.g. change the paste shortcut from the default `Shift + Insert` to the more familiar `Ctrl + V`.
- You can open multiple sessions (but do not open more than what you need as these are shared resources).
During the lab sessions, you can access the reserved nodes for this module via
srun --account=rse-com6012 --reservation=rse-com6012-$LAB_ID --time=00:30:00 --pty /bin/bash
Replace `$LAB_ID` with the session number of the lab you are taking. For example, if you are in Lab 1, you should use
srun --account=rse-com6012 --reservation=rse-com6012-1 --time=00:30:00 --pty /bin/bash
The reservation ends at the end of the lab session. You can also access the general queue via `srun --pty bash -i`. If successful, you should see
[abc1de@node*** [stanage] ~]$ # *** is the node number
Otherwise, try `srun --account=rse-com6012 --reservation=rse-com6012-1 --time=00:30:00 --pty /bin/bash` or `srun --pty bash -i` again. You will not be able to run the following commands if you are still on the login node.
Note: you can only access the reserved nodes during the lab sessions. Outside the lab sessions, you can only access the general queue.
module load Java/17.0.4
module load Anaconda3/2024.02-1
conda create -n myspark python=3.13
When you are asked whether to proceed, say `y`. When seeing `Please update conda by running ...`, do NOT try to update conda following the given command. As a regular user on HPC, you will NOT be able to update conda.
source activate myspark
The prompt says to use `conda activate myspark`, but that does not always work. You must see `(myspark)` at the front of your prompt before proceeding. Otherwise, you did not get the proper environment; check the above steps.
pip install pyspark==4.1.0
You should see the last line of the output as
Successfully installed py4j-0.10.9.9 pyspark-4.1.0
py4j enables Python programs to access Java objects. We need it because Spark is written in Scala, a JVM (Java virtual machine) language.
If you find that you have messed up your environment and encountered seemingly unrecoverable errors, please follow the steps below to reset your environment:
- Log out of Stanage and log back in, then start an interactive session via `srun --pty bash -i`.
- Restore the default files with the `resetenv` command. If `resetenv` is not working (possibly because it was removed from `$PATH`), you can run it directly: `/opt/site/bin/resetenv`. See the instructions here for more details.
- Remove the conda environment files via `rm -rf ~/.conda` and `rm -rf ~/.condarc`.
- Log out fully and then back in again.
- Start an interactive session and follow the steps above to create the `myspark` environment.
pyspark
You should see Spark version 4.1.0 displayed like below
......
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 4.1.0
/_/
Using Python version 3.13.5 (main, Jun 12 2025 16:09:02)
Spark context Web UI available at http://node001.pri.stanage.alces.network:4040
Spark context available as 'sc' (master = local[*], app id = local-1769974318182).
SparkSession available as 'spark'.
>>>
Bingo! Now you are in pyspark! Quit the pyspark shell with `Ctrl + D`.
If you are experiencing a segmentation fault when entering the pyspark interactive shell, run export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 to fix it.
You are expected to have passed the HPC Driving License test and become familiar with the HPC environment.
Learn the basic use of the command line in Linux, e.g. use pwd to find out your current directory.
Learn how to transfer files to/from Stanage HPC. For easier file transfer, Stanage recommends using the FileZilla SFTP client, which can be downloaded for Windows, Mac and Linux from filezilla-project.org.
Instructions on configuring FileZilla for Stanage can be found here. (Warning: Remember to change the logon type to "interactive", and not to let FileZilla store your password on shared machines.)
NOTE: While MobaXterm also supports SFTP, it has been reported that this does not work properly on the Stanage cluster
Line ending WARNING!!!: if you are using Windows, you should be aware that line endings differ between Windows and Linux. If you edit a shell script (below) in Windows, make sure that you use a Unix/Linux compatible editor or do the conversion before using it on HPC.
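If `dos2unix` (introduced below) is not to hand, line endings can also be fixed programmatically. A minimal Python sketch, assuming a hypothetical script named `myfile.sh` (here written out first just to make the demo self-contained):

```python
from pathlib import Path

# Demo file written with Windows (CRLF) line endings; "myfile.sh" is a
# placeholder name for illustration.
path = Path("myfile.sh")
path.write_bytes(b"module load Java/17.0.4\r\nsource activate myspark\r\n")

data = path.read_bytes()
if b"\r\n" in data:  # CRLF present: Windows line endings detected
    # Rewrite the file with Unix (LF) line endings.
    path.write_bytes(data.replace(b"\r\n", b"\n"))
```

After running this, `file myfile.sh` on HPC should no longer report "CRLF line terminators".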
File recovery: Stanage currently does not support recovering files from snapshots, so take care when deleting files; deletions cannot be undone.
From this point on, we will assume that you are using the HPC terminal unless otherwise stated. Running the PySpark shell on your own machine can do the same job.
Once PySpark has been installed, after each log-in, you need to do the following to run PySpark.
- Get a node via `srun --account=rse-com6012 --reservation=rse-com6012-$LAB_ID --time=00:30:00 --pty /bin/bash` or `srun --pty bash -i`.
- Activate the environment by
module load Java/17.0.4
module load Anaconda3/2024.02-1
source activate myspark
Alternatively, put `HPC/myspark.sh` under your root directory (see above on how to transfer files) and run the above three commands in sequence via `source myspark.sh` (see more details here). You could modify it further to suit yourself. You can also include `export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8` in the script to fix the segmentation fault problem.
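The three activation commands can be collected into such a script. A minimal sketch of what `myspark.sh` might contain (its exact contents may differ from the copy in the repository; the `export` line is the optional segmentation-fault fix):

```shell
#!/bin/bash
# Load Java and Anaconda, then activate the myspark conda environment.
module load Java/17.0.4
module load Anaconda3/2024.02-1
source activate myspark
# Optional: avoid the segmentation fault problem in the pyspark shell.
export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
```

Run it with `source myspark.sh` (not `bash myspark.sh`), so that the environment changes apply to your current shell.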
Run pyspark (optionally, specify to use multiple cores):
pyspark
You will see the Spark splash above. `spark` (SparkSession) and `sc` (SparkContext) are created automatically.
Check your SparkSession and SparkContext object and you will see something like
>>> spark
<pyspark.sql.session.SparkSession object at 0x7f82156b1750>
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
Let us do some simple computing (squares):
>>> nums = sc.parallelize([1,2,3,4])
>>> nums.map(lambda x: x*x).collect()
[1, 4, 9, 16]
NOTE: Review the two common causes of the "file not found" or "cannot open file" errors below (line ending and relative path problems), and how to deal with them.
This example deals with Semi-Structured data in a text file.
Firstly, you need to make sure the file is in the proper directory, and change the file path if necessary, on either HPC or your local machine, e.g. using `pwd` to see the current directory and `ls` (or `dir` in Windows) to see the contents. Also review how to transfer files to HPC and the MobaXterm tips for Windows users.
Now quit pyspark by Ctrl + D. Take a look at where you are
(myspark) [abc1de@node*** [stanage] ~]$ pwd
/users/abc1de
`abc1de` should be your username. Let us make a new directory called com6012 and go to it:
mkdir com6012
cd com6012
Let us make a copy of our teaching materials in this directory via
git clone --depth 1 https://github.com/COM6012/ScalableML
If ScalableML is not empty (e.g. you have already cloned a copy), this will give you an error. You need to delete the cloned version (the whole folder) via `rm -rf ScalableML`. Be careful: you can NOT undo this delete, so make sure you do not have anything valuable (e.g. your assignment) there before deleting.
You are advised to create a separate folder for your own work under com6012, e.g. mywork.
Let us check
(myspark) [abc1de@node*** [stanage] com6012]$ ls
ScalableML
(myspark) [abc1de@node*** [stanage] com6012]$ cd ScalableML
(myspark) [abc1de@node*** [stanage] ScalableML]$ ls
Code Data HPC Lab 1 - Introduction to Spark and HPC.md Output README.md Slides
(myspark) [abc1de@node*** [stanage] ScalableML]$ pwd
/users/abc1de/com6012/ScalableML
You can see that the files from GitHub have been downloaded to your HPC directory /users/abc1de/com6012/ScalableML. In some cases, you may see only the conda environment `(myspark)`, without the `[abc1de@node*** [stanage] ~]$` part. You can still proceed with the `ls` and `cd` commands. Now start the PySpark shell with
pyspark
again, you should see the splash, and now we
- read the log file `NASA_Aug95_100.txt` under the folder `Data`
- count the number of lines
- take a look at the first line
>>> logFile=spark.read.text("Data/NASA_Aug95_100.txt")
>>> logFile
DataFrame[value: string]
>>> logFile.count()
100
>>> logFile.first()
Row(value='in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')
You may open the text file to verify that pyspark is doing the right thing.
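Each line follows the Common Log Format, in which the requesting host is the first whitespace-separated field. As a plain-Python illustration (not part of the lab script), using the first line above:

```python
# Pull the requesting host out of a Common Log Format line: the host is the
# first whitespace-separated field.
line = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')
host = line.split(" ")[0]
print(host)  # in24.inetnebr.com
```

Session 2 will introduce more convenient ways to split such fields inside Spark DataFrames.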
Question: How many accesses are from Japan?
Now suppose you are asked to answer the question above. What do you need to do?
- Find those logs from Japan (by the domain suffix `.jp`)
- Show the first 5 logs to check whether you are getting what you want.
>>> hostsJapan = logFile.filter(logFile.value.contains(".jp"))
>>> hostsJapan.show(5,False)
+--------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------+
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:17 -0400] "GET / HTTP/1.0" 200 7280 |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:18 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 200 5866|
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:21 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0 |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:21 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 304 0 |
|kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:22 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 304 0 |
+--------------------------------------------------------------------------------------------------------------+
only showing top 5 rows
>>> hostsJapan.count()
11
Now you have used pyspark for a (very) simple data analytics task.
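One caveat worth keeping in mind: `contains(".jp")` matches the substring anywhere in the line, not only in the host name, so in principle a request for a `.jpg` file would also match. A plain-Python illustration of the difference (the host names here are made up for the example):

```python
# Substring matching vs. matching on the host field only.
lines = [
    'kgtyk4.kj.yamagata-u.ac.jp - - [01/Aug/1995:00:00:17 -0400] "GET / HTTP/1.0" 200 7280',
    'www.example.com - - [01/Aug/1995:00:00:30 -0400] "GET /images/photo.jpg HTTP/1.0" 200 512',
]
substring_hits = [l for l in lines if ".jp" in l]                    # matches both lines
host_hits = [l for l in lines if l.split(" ")[0].endswith(".jp")]    # matches only the first
print(len(substring_hits), len(host_hits))  # 2 1
```

For this lab the simple `contains` approach is what we use; just be aware of what it actually matches.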
To run a self-contained application, you need to exit your shell, by Ctrl+D first.
Create your own personal folder in the /mnt/parscratch/users area. As this doesn’t exist by default, you can create it with safe permissions by running the command:
mkdir -m 0700 /mnt/parscratch/users/YOUR_USERNAME
See Managing your files in fastdata areas for more details.
Create a file LogMining100.py under /users/abc1de/com6012/ScalableML directory.
Tip: You can use nano or vim to create the file. If you are not familiar with these editors, you can create the file on your local machine and transfer it to HPC following the section on transferring files.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[2]") \
.appName("COM6012 Spark Intro") \
.config("spark.local.dir","/mnt/parscratch/users/YOUR_USERNAME") \
.getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("WARN") # This can only affect the log level after it is executed.
logFile=spark.read.text("./Data/NASA_Aug95_100.txt").cache()
hostsJapan = logFile.filter(logFile.value.contains(".jp")).count()
print("\n\nHello Spark: There are %i hosts from Japan.\n\n" % (hostsJapan))
spark.stop()
Change YOUR_USERNAME in /mnt/parscratch/users/YOUR_USERNAME to your username. Actually, the file has already been created for you under the folder Code, so you can just run it:
spark-submit Code/LogMining100.py
You will see lots of logging info output, such as
26/02/01 19:56:58 INFO SparkContext: Running Spark version 4.1.0
26/02/01 19:56:58 INFO SparkContext: OS info Linux, 3.10.0-1160.142.1.el7.x86_64, amd64
26/02/01 19:56:58 INFO SparkContext: Java version 17.0.4+8
26/02/01 19:56:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
26/02/01 19:56:58 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in standalone/kubernetes and LOCAL_DIRS in YARN).
26/02/01 19:56:58 INFO ResourceUtils: ==============================================================
26/02/01 19:56:58 INFO ResourceUtils: No custom resources configured for spark.driver.
26/02/01 19:56:58 INFO ResourceUtils: ==============================================================
26/02/01 19:56:58 INFO SparkContext: Submitted application: COM6012 Spark Intro
26/02/01 19:56:58 INFO SecurityManager: Changing view acls to: your_username
26/02/01 19:56:58 INFO SecurityManager: Changing modify acls to: your_username
26/02/01 19:56:58 INFO SecurityManager: Changing view acls groups to: your_username
26/02/01 19:56:58 INFO SecurityManager: Changing modify acls groups to: your_username
26/02/01 19:56:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: your_username groups with view permissions: EMPTY; users with modify permissions: your_username; groups with modify permissions: EMPTY; RPC SSL disabled
26/02/01 19:56:58 INFO Utils: Successfully started service 'sparkDriver' on port 45008.
26/02/01 19:56:58 INFO SparkEnv: Registering MapOutputTracker
26/02/01 19:56:58 INFO SparkEnv: Registering BlockManagerMaster
26/02/01 19:56:58 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
26/02/01 19:56:58 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
26/02/01 19:56:58 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
26/02/01 19:56:58 INFO DiskBlockManager: Created local directory at /mnt/parscratch/users/your_username/blockmgr-ecfb3543-b130-4f31-93d7-545210c77f9e
26/02/01 19:56:58 INFO SparkEnv: Registering OutputCommitCoordinator
26/02/01 19:56:59 INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
26/02/01 19:56:59 INFO Utils: Successfully started service 'SparkUI' on port 4040.
26/02/01 19:56:59 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
26/02/01 19:56:59 INFO ResourceProfile: Limiting resource is cpu
26/02/01 19:56:59 INFO ResourceProfileManager: Added ResourceProfile id: 0
26/02/01 19:56:59 INFO SecurityManager: Changing view acls to: your_username
26/02/01 19:56:59 INFO SecurityManager: Changing modify acls to: your_username
26/02/01 19:56:59 INFO SecurityManager: Changing view acls groups to: your_username
26/02/01 19:56:59 INFO SecurityManager: Changing modify acls groups to: your_username
26/02/01 19:56:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: your_username groups with view permissions: EMPTY; users with modify permissions: your_username; groups with modify permissions: EMPTY; RPC SSL disabled
26/02/01 19:56:59 INFO Executor: Starting executor ID driver on host node001.pri.stanage.alces.network
26/02/01 19:56:59 INFO Executor: OS info Linux, 3.10.0-1160.142.1.el7.x86_64, amd64
26/02/01 19:56:59 INFO Executor: Java version 17.0.4+8
26/02/01 19:56:59 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''
26/02/01 19:56:59 INFO Executor: Created or updated repl class loader org.apache.spark.util.MutableURLClassLoader@1ee4605 for default.
26/02/01 19:56:59 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41878.
26/02/01 19:56:59 INFO NettyBlockTransferService: Server created on node001.pri.stanage.alces.network:41878
26/02/01 19:56:59 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
26/02/01 19:56:59 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, node001.pri.stanage.alces.network, 41878, None)
26/02/01 19:56:59 INFO BlockManagerMasterEndpoint: Registering block manager node001.pri.stanage.alces.network:41878 with 413.9 MiB RAM, BlockManagerId(driver, node001.pri.stanage.alces.network, 41878, None)
26/02/01 19:56:59 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, node001.pri.stanage.alces.network, 41878, None)
26/02/01 19:56:59 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, node001.pri.stanage.alces.network, 41878, None)
Hello Spark: There are 11 hosts from Japan.
We can set the log level easily after the SparkContext is created, but not before (that is a bit complicated). Two blank lines are printed before the result so that it is easy to spot.
Data: Download the August data in gzip format (NASA_access_log_Aug95.gz) from the NASA HTTP server access log (this file has also been uploaded to ScalableML/Data in case you have problems downloading it, so it is actually already on your HPC account) and put it into your Data folder. NASA_Aug95_100.txt above is the first 100 lines of the August data.
Question: How many accesses are from Japan and UK respectively?
Create a file LogMiningBig.py under /users/abc1de/com6012/ScalableML/Code directory
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[2]") \
.appName("COM6012 Spark Intro") \
.config("spark.local.dir","/mnt/parscratch/users/YOUR_USERNAME") \
.getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("WARN") # This can only affect the log level after it is executed.
logFile=spark.read.text("./Data/NASA_access_log_Aug95.gz").cache()
hostsJapan = logFile.filter(logFile.value.contains(".jp")).count()
hostsUK = logFile.filter(logFile.value.contains(".uk")).count()
print("\n\nHello Spark: There are %i hosts from UK.\n" % (hostsUK))
print("Hello Spark: There are %i hosts from Japan.\n\n" % (hostsJapan))
spark.stop()
Spark can read gzip files directly; you do not need to unzip them into big files. Also note the use of `cache()` above.
See how to submit batch jobs to Stanage and follow the instructions for SLURM. Reminder: The more resources you request, the longer you need to queue.
Interactive mode is good for learning, exploring and debugging with smaller data. For big data, it is more convenient to use batch processing: you submit a job to the queue, and once resources are allocated, your job runs, with output properly recorded. This is done via a shell script.
Create a file Lab1_SubmitBatch.sh under /users/abc1de/com6012/ScalableML/HPC directory for reserved nodes. For using general queue outside the lab sessions, please remove the two lines in the script as indicated in the comments.
#!/bin/bash
#SBATCH --job-name=JOB_NAME # Replace JOB_NAME with a name you like
#SBATCH --account=rse-com6012 # Remove this line for the *general queue*
#SBATCH --reservation=rse-com6012-1 # Replace 1 with the real Lab_ID in the future lab sessions, or remove this line for the *general queue*
#SBATCH --time=00:30:00 # Change this to a longer time if you need more time
#SBATCH --nodes=1 # Specify a number of nodes
#SBATCH --mem=4G # Request 4 gigabytes of real memory (mem)
#SBATCH --output=./Output/COM6012_Lab1.txt # This is where your output and errors are logged
#SBATCH --mail-user=username@sheffield.ac.uk # Request job update email notifications, remove this line if you don't want to be notified
module load Java/17.0.4
module load Anaconda3/2024.02-1
source activate myspark
spark-submit ./Code/LogMiningBig.py # . is a relative path, meaning the current directory
Go to the /users/abc1de/com6012/ScalableML directory to submit your job via the following sbatch command (it can be run on the login node):
sbatch HPC/Lab1_SubmitBatch.sh
Check your output file, which is COM6012_Lab1.txt in the Output folder specified with the `--output` option above. You can change it to a name you like. A sample output file named COM6012_Lab1_SAMPLE.txt is in the GitHub Output folder for your reference. The results are
Hello Spark: There are 35924 hosts from UK.
Hello Spark: There are 71600 hosts from Japan.
Common causes of and fixes for "file not found" or "cannot open file" errors:
- Make sure that your `.sh` file, e.g. `myfile.sh`, has Linux/Unix rather than Windows line endings. To check, do the following on HPC:
[abc1de@node001 [stanage] HPC]$ file myfile.sh
myfile.sh: ASCII text, with CRLF line terminators # Output
In the above example, the file has "CRLF line terminators", which will not be recognised by Linux/Unix. You can fix it with:
[abc1de@node001 [stanage] HPC]$ dos2unix myfile.sh
dos2unix: converting file myfile.sh to Unix format ... # Output
Now check again; the output no longer shows "CRLF line terminators", which means the file now has Linux/Unix line endings and is ready to go.
[abc1de@node001 [stanage] HPC]$ file myfile.sh
myfile.sh: ASCII text # Output
- Make sure that you are in the correct directory and that the file exists, using `pwd` (print the current working directory) and `ls` (list the contents). Check the status of your queuing/running job(s) using `squeue --me` (jobs not shown have already finished). Check the SLURM job status (see the details of the status codes) and use `scancel job-id` to delete a job you want to terminate. If you want to print out the working directory while your code is running, use
import os
print(os.getcwd())
If you have verified that you can run the same command in interactive mode but cannot run it in batch mode, it may be because the environment you are using has been corrupted. I suggest removing and re-installing the environment. You can do this as follows:
- Remove the `myspark` environment by running `conda remove --name myspark --all`, following conda's managing environments documentation, and redo Lab 1 (i.e. install everything) to see whether you can run spark-submit in batch mode again.
- If the above does not work, delete the `myspark` environment folder at `/users/abc1de/.conda/envs/myspark` via the file browser on the left of the MobaXterm window, or using a Linux command. Then redo Lab 1 (i.e. install everything) to see whether you can run spark-submit in batch mode again.
- If the above still does not work, you may have installed `pyspark==4.1.0` incorrectly, e.g. before rather than after activating the `myspark` environment. If you made this mistake, when reinstalling `pyspark==4.1.0` you may be prompted with `Requirement already satisfied: pyspark==4.1.0` and `Requirement already satisfied: py4j==0.10.9.5`. To fix the problem, uninstall `pyspark` and `py4j` before activating the `myspark` environment via `pip uninstall pyspark==4.1.0` and `pip uninstall py4j==0.10.9.5`, then activate the `myspark` environment via `source activate myspark` and reinstall pyspark via `pip install pyspark==4.1.0`.
The analytic task you did above is log mining. You can imagine that nowadays log files are big and manual analysis would be time-consuming. Following the examples above, answer the following questions on NASA_access_log_Aug95.gz.
- How many requests are there in total?
- How many requests are from `gateway.timken.com`?
- How many requests are on 15th August 1995?
- How many 404 (page not found) errors are there in total?
- How many 404 (page not found) errors are there on 15th August?
- How many 404 (page not found) errors from `gateway.timken.com` are there on 15th August?
You are encouraged to try things out in the pyspark shell first to figure out the right solutions, and then write a Python script, e.g. Lab1_exercise.py, with a batch file (e.g. Lab1_Exercise_Batch.sh) to produce the output neatly under Output, e.g. in a file Lab1_exercise.txt.
You are encouraged to explore these more challenging questions by consulting the pyspark.sql APIs to learn more. We will not provide solutions but Session 2 will make answering these questions easier.
- How many unique hosts on a particular day (e.g., 15th August)?
- How many unique hosts in total (i.e., in August 1995)?
- Which host is the most frequent visitor?
- How many different types of return codes?
- How many requests per day on average?
- How many requests per host on average?
- Any other question that you (or your imagined clients) are interested in to find out.
- Compare the time taken to complete your jobs with and without `cache()`.
- Compare the time taken to complete your jobs with 2, 4, 8, 16, and 32 cores.
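For the timing comparisons above, one simple approach is to wrap the counting step with `time.perf_counter()` in your script. A minimal helper sketch (plain Python, nothing Spark-specific; the lambda below is just a stand-in for e.g. a Spark `count()` call):

```python
import time

def timed(f):
    """Run f() with no arguments and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = f()
    return result, time.perf_counter() - start

# Example usage; replace the lambda with your Spark action of interest.
result, elapsed = timed(lambda: sum(x * x for x in range(1000)))
```

Remember that Spark is lazy: only actions such as `count()` trigger computation, so time the action, and run it more than once to see the effect of `cache()`.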
Many thanks to Haiping, Mauricio, Twin, Will, Mike, Xianyuan, Desmond, and Vamsi for their kind help and all those kind contributors of open resources.
The log mining problem is adapted from UC Berkeley cs105x L3.
