Add example code for regular expressions

gjbex · gjbex · commit d1758a72c28f · 2019-11-14T07:20:19.000+01:00
diff --git a/source-code/regexes/README.md b/source-code/regexes/README.md
@@ -0,0 +1,15 @@
+# Regexes
+
+Regular expressionn are a formalism to describe sets of strings. They
+can be used to check whether a string matches a given pattern, for
+exstracting parts of strings, or substituting part of strings in a
+way that is much more powerful and flexible than the `str` methods
+for those purposes.
+
+## What is it?
+1. `regexes.ipynb`: Jupiter notebook illustrating various aspects of
+    using regular expressions in string-related tasks.  This conveys
+    the flavor, rather than being a comprehensive introduction.
+
+More information on regular expressions can be found in the Python
+introduction slides.
diff --git a/source-code/regexes/regexes.ipynb b/source-code/regexes/regexes.ipynb
@@ -0,0 +1,247 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Regular expressions in Python"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Regular expressions are very useful in many situations, and not exclusive to Python.  In fact, once you grasp the concepts, you'll find them indispensible and use them (or miss) them for many programming and data management tasks.  This notebook intends to give you a flavor of the possibilities, it doesn't intend to be a comprehensive overview."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In Python, regular expressions are implemented in the standard library's `re` module."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import re"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Match making"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One of the tasks regular expressions are useful for is verifying whether a (collection of) string(s) matches a certain pattern."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Example: for a list of file names, select only the ones that start with `dev_`, and end with `.txt`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "file_list = ['dev_counter.txt', 'dev_reset.txt', 'shm_counter.txt', 'dev_start.txt']\n",
+    "for file_name in file_list:\n",
+    "    if re.match(r'dev_.*\\.txt', file_name):\n",
+    "        print(file_name)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A somewhat more complex example, select file names that have a base name ending in digits, and extension either `.txt`, or `.dat`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "file_list = ['exp_01.txt', 'exp.txt', 'exp_02.dat', 'exp.dat', 'exp05.dat', 'exp_03.jpg']\n",
+    "for file_name in file_list:\n",
+    "    if re.search(r'\\d+\\.(?:txt|dat)', file_name):\n",
+    "        print(file_name)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Is this really correct?  Let's try something nasty."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "if re.search(r'\\d+\\.(?:txt|dat)', 'exp_09.data'):\n",
+    "    print('Oops!')\n",
+    "else:\n",
+    "    print(\"Yay!\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's ensure that the strings have to end with either `.txt`, or `.dat`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "if re.search(r'\\d+\\.(?:txt|dat)$', 'exp_09.data'):\n",
+    "    print('Oops!')\n",
+    "else:\n",
+    "    print(\"Yay!\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Extracting stuff"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Regular expressions can also be used to capture parts of a string while matching."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Suppose we are only interested in the numbers in file names like `exp_01.dat`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "file_list = ['exp_01.dat', 'meta.txt', 'exp_02.dat', 'exp_10.dat', 'exp_05.dat', 'exp_03.jpg']\n",
+    "for file_name in file_list:\n",
+    "    match = re.search(r'exp_(\\d+)\\.dat', file_name)\n",
+    "    if match:\n",
+    "        print(match.group(1))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note the difference between grouping brackets such as `(?:txt|dat)`, and capturing brackets such as `(\\d+)`.  Capturing brackets also group, but grouping brackets don't capture."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Substitution"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Regular expressions can also be used to substitute parts of strings that match a given pattern.  For instance, replace all extensions in file names by `.txt`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "file_list = ['exp_01.dat', 'exp_03.txt', 'exp_02.dat', 'exp_10.text']\n",
+    "for file_name in file_list:\n",
+    "    new_file_name = re.sub(r'\\.\\w+$', '.txt', file_name)\n",
+    "    print('{old:15s} -> {new}'.format(old=file_name, new=new_file_name))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The substituion can infact include part of the string captured in the regular expression.  We can replace a file name such as `exp_03.txt` by `03_exp.txt`, and `dev_05.dat` by `05_dev.dat`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "file_list = ['exp_01.dat', 'dev_03.txt', 'exp_02.txt', 'exp_10.text']\n",
+    "for file_name in file_list:\n",
+    "    new_file_name = re.sub(r'(\\w+)_(\\d+)\\.', r'\\2_\\1.', file_name)\n",
+    "    print('{old:15s} -> {new}'.format(old=file_name, new=new_file_name))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.5.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}