|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Regular expressions in Python" |
| 8 | + ] |
| 9 | + }, |
| 10 | + { |
| 11 | + "cell_type": "markdown", |
| 12 | + "metadata": {}, |
| 13 | + "source": [ |
| 14 | + "Regular expressions are useful in many situations and are not exclusive to Python. In fact, once you grasp the concepts, you'll find them indispensable and use them (or miss them) for many programming and data management tasks. This notebook is intended to give you a flavor of the possibilities; it is not a comprehensive overview." |
| 15 | + ] |
| 16 | + }, |
| 17 | + { |
| 18 | + "cell_type": "markdown", |
| 19 | + "metadata": {}, |
| 20 | + "source": [ |
| 21 | + "In Python, regular expressions are implemented in the standard library's `re` module." |
| 22 | + ] |
| 23 | + }, |
| 24 | + { |
| 25 | + "cell_type": "code", |
| 26 | + "execution_count": null, |
| 27 | + "metadata": { |
| 28 | + "collapsed": true |
| 29 | + }, |
| 30 | + "outputs": [], |
| 31 | + "source": [ |
| 32 | + "import re" |
| 33 | + ] |
| 34 | + }, |
| 35 | + { |
| 36 | + "cell_type": "markdown", |
| 37 | + "metadata": {}, |
| 38 | + "source": [ |
| 39 | + "## Match making" |
| 40 | + ] |
| 41 | + }, |
| 42 | + { |
| 43 | + "cell_type": "markdown", |
| 44 | + "metadata": {}, |
| 45 | + "source": [ |
| 46 | + "One of the tasks regular expressions are useful for is checking whether a string, or each string in a collection, matches a given pattern." |
| 47 | + ] |
| 48 | + }, |
| 49 | + { |
| 50 | + "cell_type": "markdown", |
| 51 | + "metadata": {}, |
| 52 | + "source": [ |
| 53 | + "Example: from a list of file names, select only the ones that start with `dev_` and end with `.txt`." |
| 54 | + ] |
| 55 | + }, |
| 56 | + { |
| 57 | + "cell_type": "code", |
| 58 | + "execution_count": null, |
| 59 | + "metadata": { |
| 60 | + "collapsed": false |
| 61 | + }, |
| 62 | + "outputs": [], |
| 63 | + "source": [ |
| 64 | + "file_list = ['dev_counter.txt', 'dev_reset.txt', 'shm_counter.txt', 'dev_start.txt']\n", |
| 65 | + "for file_name in file_list:\n", |
| 66 | + "    if re.match(r'dev_.*\\.txt', file_name):\n", |
| 67 | + "        print(file_name)" |
| 68 | + ] |
| 69 | + }, |
| 70 | + { |
| 71 | + "cell_type": "markdown", |
| 72 | + "metadata": {}, |
| 73 | + "source": [ |
| 74 | + "A somewhat more complex example: select file names whose base name ends in digits and whose extension is either `.txt` or `.dat`." |
| 75 | + ] |
| 76 | + }, |
| 77 | + { |
| 78 | + "cell_type": "code", |
| 79 | + "execution_count": null, |
| 80 | + "metadata": { |
| 81 | + "collapsed": false |
| 82 | + }, |
| 83 | + "outputs": [], |
| 84 | + "source": [ |
| 85 | + "file_list = ['exp_01.txt', 'exp.txt', 'exp_02.dat', 'exp.dat', 'exp05.dat', 'exp_03.jpg']\n", |
| 86 | + "for file_name in file_list:\n", |
| 87 | + "    if re.search(r'\\d+\\.(?:txt|dat)', file_name):\n", |
| 88 | + "        print(file_name)" |
| 89 | + ] |
| 90 | + }, |
| 91 | + { |
| 92 | + "cell_type": "markdown", |
| 93 | + "metadata": {}, |
| 94 | + "source": [ |
| 95 | + "Is this really correct? Let's try something nasty: a file name with the extension `.data`." |
| 96 | + ] |
| 97 | + }, |
| 98 | + { |
| 99 | + "cell_type": "code", |
| 100 | + "execution_count": null, |
| 101 | + "metadata": { |
| 102 | + "collapsed": false |
| 103 | + }, |
| 104 | + "outputs": [], |
| 105 | + "source": [ |
| 106 | + "if re.search(r'\\d+\\.(?:txt|dat)', 'exp_09.data'):\n", |
| 107 | + "    print('Oops!')\n", |
| 108 | + "else:\n", |
| 109 | + "    print(\"Yay!\")" |
| 110 | + ] |
| 111 | + }, |
| 112 | + { |
| 113 | + "cell_type": "markdown", |
| 114 | + "metadata": {}, |
| 115 | + "source": [ |
| 116 | + "Let's anchor the pattern with `$` so that the string has to end with either `.txt` or `.dat`." |
| 117 | + ] |
| 118 | + }, |
| 119 | + { |
| 120 | + "cell_type": "code", |
| 121 | + "execution_count": null, |
| 122 | + "metadata": { |
| 123 | + "collapsed": false |
| 124 | + }, |
| 125 | + "outputs": [], |
| 126 | + "source": [ |
| 127 | + "if re.search(r'\\d+\\.(?:txt|dat)$', 'exp_09.data'):\n", |
| 128 | + "    print('Oops!')\n", |
| 129 | + "else:\n", |
| 130 | + "    print(\"Yay!\")" |
| 131 | + ] |
| 132 | + }, |
| 133 | + { |
| 134 | + "cell_type": "markdown", |
| 135 | + "metadata": {}, |
| 136 | + "source": [ |
| 137 | + "## Extracting stuff" |
| 138 | + ] |
| 139 | + }, |
| 140 | + { |
| 141 | + "cell_type": "markdown", |
| 142 | + "metadata": {}, |
| 143 | + "source": [ |
| 144 | + "Regular expressions can also be used to capture parts of a string while matching." |
| 145 | + ] |
| 146 | + }, |
| 147 | + { |
| 148 | + "cell_type": "markdown", |
| 149 | + "metadata": {}, |
| 150 | + "source": [ |
| 151 | + "Suppose we are only interested in the numbers in file names like `exp_01.dat`." |
| 152 | + ] |
| 153 | + }, |
| 154 | + { |
| 155 | + "cell_type": "code", |
| 156 | + "execution_count": null, |
| 157 | + "metadata": { |
| 158 | + "collapsed": false |
| 159 | + }, |
| 160 | + "outputs": [], |
| 161 | + "source": [ |
| 162 | + "file_list = ['exp_01.dat', 'meta.txt', 'exp_02.dat', 'exp_10.dat', 'exp_05.dat', 'exp_03.jpg']\n", |
| 163 | + "for file_name in file_list:\n", |
| 164 | + "    match = re.search(r'exp_(\\d+)\\.dat', file_name)\n", |
| 165 | + "    if match:\n", |
| 166 | + "        print(match.group(1))" |
| 167 | + ] |
| 168 | + }, |
| 169 | + { |
| 170 | + "cell_type": "markdown", |
| 171 | + "metadata": {}, |
| 172 | + "source": [ |
| 173 | + "Note the difference between non-capturing groups such as `(?:txt|dat)` and capturing groups such as `(\\d+)`. Capturing parentheses also group, but non-capturing parentheses only group; they don't capture." |
| 174 | + ] |
| 175 | + }, |
| 176 | + { |
| 177 | + "cell_type": "markdown", |
| 178 | + "metadata": {}, |
| 179 | + "source": [ |
| 180 | + "## Substitution" |
| 181 | + ] |
| 182 | + }, |
| 183 | + { |
| 184 | + "cell_type": "markdown", |
| 185 | + "metadata": {}, |
| 186 | + "source": [ |
| 187 | + "Regular expressions can also be used to substitute parts of strings that match a given pattern. For instance, replace every file name extension with `.txt`." |
| 188 | + ] |
| 189 | + }, |
| 190 | + { |
| 191 | + "cell_type": "code", |
| 192 | + "execution_count": null, |
| 193 | + "metadata": { |
| 194 | + "collapsed": false |
| 195 | + }, |
| 196 | + "outputs": [], |
| 197 | + "source": [ |
| 198 | + "file_list = ['exp_01.dat', 'exp_03.txt', 'exp_02.dat', 'exp_10.text']\n", |
| 199 | + "for file_name in file_list:\n", |
| 200 | + "    new_file_name = re.sub(r'\\.\\w+$', '.txt', file_name)\n", |
| 201 | + "    print('{old:15s} -> {new}'.format(old=file_name, new=new_file_name))" |
| 202 | + ] |
| 203 | + }, |
| 204 | + { |
| 205 | + "cell_type": "markdown", |
| 206 | + "metadata": {}, |
| 207 | + "source": [ |
| 208 | + "The substitution can in fact include parts of the string captured by the regular expression. We can rename a file such as `exp_03.txt` to `03_exp.txt`, and `dev_05.dat` to `05_dev.dat`." |
| 209 | + ] |
| 210 | + }, |
| 211 | + { |
| 212 | + "cell_type": "code", |
| 213 | + "execution_count": null, |
| 214 | + "metadata": { |
| 215 | + "collapsed": false |
| 216 | + }, |
| 217 | + "outputs": [], |
| 218 | + "source": [ |
| 219 | + "file_list = ['exp_01.dat', 'dev_03.txt', 'exp_02.txt', 'exp_10.text']\n", |
| 220 | + "for file_name in file_list:\n", |
| 221 | + "    new_file_name = re.sub(r'(\\w+)_(\\d+)\\.', r'\\2_\\1.', file_name)\n", |
| 222 | + "    print('{old:15s} -> {new}'.format(old=file_name, new=new_file_name))" |
| 223 | + ] |
| 224 | + } |
| 225 | + ], |
| 226 | + "metadata": { |
| 227 | + "kernelspec": { |
| 228 | + "display_name": "Python 3", |
| 229 | + "language": "python", |
| 230 | + "name": "python3" |
| 231 | + }, |
| 232 | + "language_info": { |
| 233 | + "codemirror_mode": { |
| 234 | + "name": "ipython", |
| 235 | + "version": 3 |
| 236 | + }, |
| 237 | + "file_extension": ".py", |
| 238 | + "mimetype": "text/x-python", |
| 239 | + "name": "python", |
| 240 | + "nbconvert_exporter": "python", |
| 241 | + "pygments_lexer": "ipython3", |
| 242 | + "version": "3.5.1" |
| 243 | + } |
| 244 | + }, |
| 245 | + "nbformat": 4, |
| 246 | + "nbformat_minor": 0 |
| 247 | +} |