---
layout: post
title: "Getting Started with Bash Scripting"
description: Linux
date: 2025-03-29 15:00:00 +0530
image: '/images/blogs/rebecca_Intro_to_bash_scripting_square.png'
tags: [linux, bash, opensource]
author-name: "Sangam Swadi K"
author-image: "/images/people/sangam.jpeg"
author-linkedin: "https://www.linkedin.com/in/sangam-swadi-k/"
author-website: "https://github.com/SangamSwadiK"
---

This post explores the fundamentals of Bash scripting, focusing on how it's used in data science workflows and how to write more effective scripts. It builds upon the core concepts presented in the Data Umbrella webinar, [Intro to Bash Scripting](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=189s).

## Intro to Bash Scripting by Rebecca BurWei

<p>
<iframe src="https://www.youtube.com/embed/1pQ527fGhVQ" loading="lazy" frameborder="0" allowfullscreen></iframe>
</p>

## Resources
- Repo: [https://github.com/rebecca-burwei/intro-to-bash-scripting/](https://github.com/rebecca-burwei/intro-to-bash-scripting/)
- Slides: [https://docs.google.com/presentation/d/1X9pOOEFOIK2oI26VvuNKRBC8psIM8HnGqOZ6fOUF8jM/](https://docs.google.com/presentation/d/1X9pOOEFOIK2oI26VvuNKRBC8psIM8HnGqOZ6fOUF8jM/)
- Bash file examples: [https://github.com/rebecca-burwei/intro-to-bash-scripting/tree/main/bin](https://github.com/rebecca-burwei/intro-to-bash-scripting/tree/main/bin)

## Section Timestamps of Video

- [00:00](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=0s) Data Umbrella Introduction
- [04:05](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=245s) Rebecca begins presentation
- [05:24](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=324s) Agenda
- [05:39](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=339s) What is bash? A brief history of shells
- [06:56](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=416s) What is bash used for?
- [07:58](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=476s) Scripting basics + resources
- [09:36](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=576s) What does this code do? Analyzing an ETL script
- [11:20](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=680s) (three min. of quiet time begins)
- [14:24](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=864s) Begin code walkthrough
- [15:33](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=933s) Deeper dive on curl and ssconvert
- [18:33](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1113s) Continuing code walkthrough
- [20:16](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1216s) Quick summary
- [20:45](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1245s) Making the code into a script
- [22:45](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1365s) Making the script executable
- [24:17](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1457s) Running the script
- [25:52](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1552s) Turning code into a script - best practices
- [26:01](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1561s) Ways to customize your bash environment
- [26:54](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1614s) Customize bash environment - .bashrc example
- [28:36](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1716s) Song break + interactive chat time
- [33:06](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1986s) Returning from break, reviewing chat
- [34:27](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=2067s) Improving the script - error handling + demo
- [41:04](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=2464s) Error handling - more resources
- [41:24](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=2484s) Improving the script - logging + demo
- [46:15](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=2775s) Improving the script - options + demo
- [50:00](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=3000s) (two min. of quiet time begins)
- [52:05](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=3125s) Continuing options demo
- [54:16](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=3256s) Summary + mentorship/collaboration
- [55:07](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=3307s) Q&A - ssconvert/gnumeric, GNU, esac

## What is Bash and Why Use It?

Bash is a command-line interpreter that lets users interact with Unix-like operating systems (e.g., macOS, Linux). [A fun fact: Bash stands for "Bourne Again Shell", which is a play on the name of the Bourne shell (`sh`), one of the earliest Unix shells](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=387s). It's a powerful tool for tasks like file management, process control, and job automation. While not typically used for core data analysis tasks like model building or statistical analysis, Bash is invaluable for setting up environments, managing files, and automating repetitive tasks.
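
For a feel of what that looks like day to day, here is a minimal sketch of the kind of one-liners Bash handles well (the directory and file names are made up for the example):

```bash
mkdir -p data/raw data/processed       # set up a project directory structure
mv ~/Downloads/*.csv data/raw/         # gather downloaded CSVs in one place
wc -l data/raw/*.csv                   # count rows in each file
grep -c "ERROR" logs/pipeline.log      # count error lines in a (hypothetical) log file
```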

## A Basic ETL Script in Bash

Let's examine a simple script that performs an Extract, Transform, Load (ETL) process. This script downloads an Excel file, converts it to CSVs, extracts a specific column, and then cleans up the downloaded files.

```bash
#!/bin/bash
# Title: Process Data
# Date: 2024-03-14
# Author: Rebecca BurWei
# Version: 1.0
# Description: Download data, convert to CSV, extract column, and clean up.
# Options: None

URL="http://www.econ.yale.edu/~shiller/data/ie_data.xls"
echo "Downloading data from $URL"            # Print a message indicating the download source.
curl -o stock_data.xls "$URL"                # Download the file; -o saves it under the given name.
ssconvert -S stock_data.xls stock_data.csv   # Convert the Excel file to multiple CSV files (one per sheet).
cut -d, -f10 stock_data.csv.4 | head         # Extract the 10th column (comma-delimited) and show the first few lines.
rm stock_data*                               # Remove all downloaded and created files (using wildcard *).
```

The script begins with a shebang (`#!/bin/bash`), which specifies the interpreter. [The first comment is a special one. It's a special type of comment called a shebang or a hashbang.](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1304s) It then defines a variable `URL` containing the download link. The `echo` command prints a message to the console. The `curl` command downloads the file, and `ssconvert` converts it into multiple CSV files. [So I think this will not work if you don't have ssconvert installed.](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1016s) The `cut` command extracts the 10th column, using a comma (`,`) as the delimiter, and `head` displays the first few lines of the output. Finally, `rm` removes the downloaded and created files.
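
Because `ssconvert` comes from the Gnumeric project rather than being preinstalled, a quick guard at the top of the script can save confusion. A minimal sketch (the package-manager hint is an assumption and will vary by platform):

```bash
# Check whether ssconvert is available before running the ETL steps.
if ! command -v ssconvert > /dev/null; then
    echo "ssconvert not found - it is provided by the Gnumeric package" >&2
    echo "(for example, 'sudo apt install gnumeric' on Debian/Ubuntu)" >&2
    exit 1
fi
```
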
<figure>
<a href="https://www.youtube.com/watch?v=1pQ527fGhVQ&t=505.82s" target="_blank"><img src="/images/blogs/rebecca_linux_image_at_505.82.png" alt="So here's some code." width="450"/></a>
 <figcaption>Figure 1: Resources</figcaption>
</figure>

To make this code executable as a script, save it in a file (e.g., `process.sh`) and use the `chmod` command to grant execute permissions:

```bash
chmod u+x process.sh
```
[So I'm going to give the owner execute permissions so that the script can be run by using `chmod`.](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=119s)
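
You can confirm the change with `ls -l`: the `x` in the owner's permission bits shows the file is now executable. The listing below is illustrative output, not captured from the webinar:

```bash
ls -l process.sh
# -rwxrw-r-- 1 user user 412 Mar 29 15:00 process.sh
#  ^^^ "rwx" means the owner can now read, write, and execute the script
```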

You can run the script using either `source process.sh` (runs in the current shell) or `./process.sh` (runs in a subshell). [If you type just the name, then the current shell will create a subshell and run your script in just that subshell.](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=178s) The latter is generally preferred for cleaner execution.

```bash
source process.sh   # run in the current shell
./process.sh        # run in a subshell
```
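
The difference matters when a script sets variables: values assigned in a sourced script stick around in your current shell, while values assigned in a subshell vanish when it exits. A small sketch with a hypothetical `setup.sh`:

```bash
echo 'PROJECT_DIR="$HOME/etl-project"' > setup.sh   # one-line script used only for this demo
chmod u+x setup.sh

./setup.sh            # runs in a subshell
echo "$PROJECT_DIR"   # prints an empty line - the variable died with the subshell

source setup.sh       # runs in the current shell
echo "$PROJECT_DIR"   # prints your home directory followed by /etl-project
```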

## Customizing Your Bash Environment

Bash provides several files to [customize your environment](https://youtu.be/1pQ527fGhVQ?si=lGkFFhyLODqmZLEP&t=1559). Key files include:

1. **`.bash_profile`**: Executed when you log in (i.e., for login shells).
2. **`.bashrc`**: Executed when a new subshell is started.
3. **`.bash_logout`**: Executed upon logout, useful for cleanup tasks.

These files can be used to set aliases, define and export environment variables, and configure shell behavior.
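
As a concrete illustration, here is the kind of snippet you might add to `~/.bashrc`; the alias names and values are just examples, not taken from the webinar:

```bash
# ~/.bashrc (example additions)
alias ll='ls -lah'              # detailed directory listing
alias gs='git status'           # quicker git status
export EDITOR=vim               # default editor for programs that ask for one
export PATH="$HOME/bin:$PATH"   # put your own scripts on the PATH
```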

## Enhancing the Script: Error Handling

By default, Bash continues execution even if a command encounters an error. This behavior can be changed. One approach is to use `set -e` (or `set -o errexit`), which makes the script exit immediately when a command fails. [You can set errexit and this will make bash stop when it runs into an error.](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=796s)
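
A minimal sketch of the `set -e` approach (the URL is deliberately invalid so the download fails):

```bash
#!/bin/bash
set -e                                            # exit as soon as any command fails

curl -o stock_data.xls "bad://not-a-real-url"     # this download fails ...
echo "This line is never reached"                 # ... so the script stops before this echo
```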

A more robust approach is to use `trap`. `trap` lets you define custom actions to run when specific signals or conditions occur. A common use case is the `ERR` condition, which fires whenever a command exits with a non-zero status (indicating an error).

Here's how to implement error handling with `trap` in the script:

```bash
#!/bin/bash
# ... (previous comments) ...

handle_error() {
    echo "Error on line $1 with exit status $?"
    exit 1
}

trap 'handle_error $LINENO' ERR

URL="invalid_url"                              # Introduce an error for demonstration.
echo "Attempting to download data from $URL"
curl -o stock_data.xls "$URL"                  # Fails, so the ERR trap fires and the script exits here.
ssconvert -S stock_data.xls stock_data.csv
cut -d, -f10 stock_data.csv.0 | head
rm stock_data*
```
<figure>
<a href="https://www.youtube.com/watch?v=1pQ527fGhVQ&t=895.0s" target="_blank"><img src="/images/blogs/rebecca_linux_image_at_895.00.png" alt="OK, so in this file we have the comments at the top as before." width="450"/></a>
<figcaption>Figure 2: Code explanation</figcaption>
</figure>

In this modified script, `handle_error` is a function that prints an error message including the line number (`$1`, passed from `$LINENO`) and the exit status (`$?`). [$1 refers to the first argument or the first positional parameter and $? refers to the status code or the exit status of the last command that was run.](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1071s) The `trap` command associates this function with the `ERR` condition. Setting `URL` to an invalid value demonstrates the error handling in action.

## Enhancing the Script: Logging

Instead of printing output directly to the terminal, it's often beneficial to redirect output and errors to files. This is called logging.

You can achieve this using redirection operators:

* `>` redirects standard output (stdout) to a file.
* `2>` redirects standard error (stderr) to a file.
* `>>` appends stdout to a file instead of overwriting it.

Here's how to modify the script execution to implement logging:

```bash
./process.sh > output.txt 2> errors.txt
```

[So anything that would have gone to the terminal on standard out, send that to this file instead.](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1236s)

This command runs `process.sh` and redirects its standard output to `output.txt` and its standard error to `errors.txt`.
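
To keep a running history across runs, `>>` appends instead of overwriting, and `2>&1` sends stderr to the same place as stdout. A small sketch (the log file name `etl.log` is just an example):

```bash
./process.sh >> etl.log 2>&1   # append both stdout and stderr to one log file
tail -n 20 etl.log             # inspect the most recent log lines
```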

To encapsulate the logging within a separate script, you can use a wrapper such as the following `logger.sh`:

```bash
#!/bin/bash
# ... (comments) ...
# Options:
#   $1: Path to the script to be logged.

"$1" > "${1}_output.txt" 2> "${1}_errors.txt"
```
<figure>
 <a href="https://www.youtube.com/watch?v=1pQ527fGhVQ&t=1298.28s" target="_blank"><img src="/images/blogs/rebecca_linux_image_at_1298.28.png" alt="OK, so I've already prepared an example of logging in a script called logger." width="450"/></a>
 <figcaption>Figure 3: Making the code into a script</figcaption>
</figure>
This `logger.sh` script takes the path to another script (passed as the first positional parameter, `$1`) and executes it, redirecting output and errors to files with matching names.
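
Usage is then a single command. Assuming both files are executable, it would look something like:

```bash
chmod u+x logger.sh process.sh
./logger.sh ./process.sh   # writes ./process.sh_output.txt and ./process.sh_errors.txt
```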

## Enhancing the Script: Adding Options

Using positional parameters (`$1`, `$2`, etc.) can be fragile because the order matters. Options (e.g., `-f`, `-d`) provide a more robust and user-friendly way to configure script behavior. The `getopts` built-in command helps parse options.

Here's a modified version of the script that accepts an `-f` option to specify the columns to extract:

```bash
#!/bin/bash
# ... (previous comments) ...
# Options:
#   -f <fields>: Comma-separated list of fields to extract.

while getopts "f:" opt; do
    case $opt in
        f)
            fields="$OPTARG"
            ;;
        \?)
            echo "Usage: $0 [-f <fields>]"
            exit 1
            ;;
    esac
done

URL="http://www.econ.yale.edu/~shiller/data/ie_data.xls"
echo "Downloading data from $URL"
curl -o stock_data.xls "$URL"
ssconvert -S stock_data.xls stock_data.csv
cut -d, -f"$fields" stock_data.csv.0 | head
rm stock_data*
```

This version uses `getopts "f:" opt` to parse the `-f` option. [That string f colon tells you about the valid options.](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=537s) The colon after `f` indicates that the option requires an argument. The `case` statement handles the option: if `-f` is provided, its value (accessed via `$OPTARG`) is stored in the `fields` variable. [So fields would be 10 or 10,1.](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=564s) If an invalid option is provided, a usage message is printed and the script exits. The `cut` command now uses `-f"$fields"` to extract the specified columns, and `esac` closes the case statement. [When you create a lot of these control structures, you type case to start it, and then esac, which is case backwards, to end the case statement, to tell Bash the case part is done.](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=982s)
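
Invoking the script might then look like this (the file name `process.sh` follows the earlier sections; the column numbers are illustrative):

```bash
./process.sh -f 10     # extract column 10, as in the original script
./process.sh -f 10,1   # extract columns 1 and 10 (cut prints fields in file order)
./process.sh -x        # invalid option: getopts complains, the usage message prints, and the script exits
```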

These examples show how to build upon a simple Bash script to add functionality, improve robustness, and make it more user-friendly.

## About the Speaker: Rebecca BurWei

Rebecca BurWei is a Staff Data Scientist at Mozilla. She has a patent in computer vision and a PhD in mathematics. She learned to code in open-source communities, and is passionate about developing the technical leadership of others.

### Connect with the Speaker
- GitHub: [rebecca-burwei](https://github.com/rebecca-burwei)
- LinkedIn: [Rebecca BurWei](https://www.linkedin.com/in/rebecca-burwei/)