diff --git a/README.md b/README.md index 181066d..e38130b 100644 --- a/README.md +++ b/README.md @@ -1,26 +1,26 @@ ## Preface -This GitBook was written by David Backus, Sarah Beckett-Hile, Chase Coleman, and Spencer Lyon for [a course](http://databootcamp.nyuecon.com/) at [NYU](http://www.nyu.edu/)'s [Stern School of Business](http://www.stern.nyu.edu/). We plan to give students experience with economic and financial data and introduce programming newbies to the benefits of moving beyond Excel. We use the Python programming language, specifically Python's data management and graphics tools. If that doesn't whet your appetite, we have a [more elaborate sales pitch](http://databootcamp.nyuecon.com/bootcamp_faq/). +This GitBook was written by David Backus, Sarah Beckett-Hile, Chase Coleman, and Spencer Lyon for [a course](http://databootcamp.nyuecon.com/) at [NYU](http://www.nyu.edu/)'s [Stern School of Business](http://www.stern.nyu.edu/). We plan to give students experience with economic and financial data and introduce programming newbies to the benefits of moving beyond Excel. We use the Python programming language, specifically Python's data management and graphics tools. If that doesn't whet your appetite, we have a [more elaborate sales pitch](http://databootcamp.nyuecon.com/bootcamp_faq/). -We designed the book to accompany a live class. We've tried to make it self-contained, but the written word is a poor substitute for the interaction you get in a classroom. +We designed the book to accompany a live class. We've tried to make it self-contained, but the written word is a poor substitute for the interaction you get in a classroom. -The book comes in multiple formats. You can access it on the internet. Or you can download (and print) a pdf file. The former comes with links, which we think is a huge advantage, and can be updated quickly, but if you like paper by all means try the pdf. All of them are available at +The book comes in multiple formats. 
You can access it on the internet. Or you can download (and print) a pdf file. The former comes with links, which we think is a huge advantage, and can be updated quickly, but if you like paper by all means try the pdf. All of them are available at https://www.gitbook.com/book/davebackus/test/details -We welcome suggestions. Send them to Dave Backus at [db3@nyu.edu](mailto:db3@nyu.edu). Or, even better, post an issue on our [GitHub repository](https://github.com/DaveBackus/Data_Bootcamp_Book/issues). +We welcome suggestions. Send them to Dave Backus at [db3@nyu.edu](mailto:db3@nyu.edu). Or, even better, post an issue on our [GitHub repository](https://github.com/DaveBackus/Data_Bootcamp_Book/issues). ## Warning -This is **work in progress**. We've written seven chapters so far, more are on the way. +This is **work in progress**. We've written seven chapters so far, more are on the way. -## Acknowledgements +## Acknowledgements -This project was Glenn Okun's idea. He really should have done it himself, but we thank him for the idea and his ongoing support. Paul Backus, Hersh Iyer (MBA17), Matt McKay, Kim Ruhl, and Itamar Snir (MBA17) contributed technical support and applications. Ian Stewart provided his usual expert advice on teaching methods. You may also notice a family resemblance to Tom Sargent and John Stachurski's [Quantitative Economics](http://quant-econ.net/), a Python- and Julia-based course in dynamic macroeconomic theory. We thank them for their advice and encouragement. +This project was Glenn Okun's idea. He really should have done it himself, but we thank him for the idea and his ongoing support. Paul Backus, Hersh Iyer (MBA17), Matt McKay, Kim Ruhl, and Itamar Snir (MBA17) contributed technical support and applications. Ian Stewart provided his usual expert advice on teaching methods. 
You may also notice a family resemblance to Tom Sargent and John Stachurski's [Quantitative Economics](http://quant-econ.net/), a Python- and Julia-based course in dynamic macroeconomic theory. We thank them for their advice and encouragement. -## License +## License This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ diff --git a/SUMMARY.md b/SUMMARY.md index 9cca266..4a5aedd 100644 --- a/SUMMARY.md +++ b/SUMMARY.md @@ -2,10 +2,10 @@ * [Preface](README.md) * [Where are we headed?](intro.md) -* [The data mentality](data-mentality.md) +* [The data mentality](data-mentality.md) * [Installing Python](installing-python.md) -* [Python fundamentals 1](py-fun1.md) -* [Python fundamentals 2](py-fun2.md) +* [Python fundamentals 1](py-fun1.md) +* [Python fundamentals 2](py-fun2.md) * [Data input: Packages and Pandas](pandas-input.md) * [Python graphics: Matplotlib fundamentals](graphs1.md) @@ -19,6 +19,6 @@ * [Business cycle indicators](indicators.md) * [Describing data 1: Distributions of things](random.md) * [Other cool stuff](other.md) ---> +--> * [Glossary](glossary.md) diff --git a/conda-pip.md b/conda-pip.md index b7ac088..ab60b69 100644 --- a/conda-pip.md +++ b/conda-pip.md @@ -1,64 +1,64 @@ -# Updating Python: Conda and Pip +# Updating Python: Conda and Pip --- **Overview.** We describe the tools used to update Anaconda and Python, and for adding new packages. -**Python tools.** conda, pip. +**Python tools.** conda, pip. -**Buzzwords.** +**Buzzwords.** -**Applications.** +**Applications.** --- -## Conda +## Conda -This is the Anaconda tool, useful for updating and extending Anaconda. +This is the Anaconda tool, useful for updating and extending Anaconda. Link: http://conda.pydata.org/docs/ Cheat sheet: http://conda.pydata.org/docs/_downloads/conda-cheatsheet.pdf -Command line... +Command line... 
-conda info -conda update conda -conda update anaconda +conda info +conda update conda +conda update anaconda conda install [package] -## Pip +## Pip Link: https://pip.readthedocs.org/en/stable/ -## Quandl +## Quandl -Access to lots of economic and financial data... +Access to lots of economic and financial data... -## Seaborn +## Seaborn -A terrific interface for Matplotlib... +A terrific interface for Matplotlib... ## tqdm -Progress bar for data loads... +Progress bar for data loads... ## Pyopendata https://pypi.python.org/pypi/pyopendata/0.0.2 -Use to get OECD data? +Use to get OECD data? ## Flappy bird -https://www.youtube.com/watch?v=h2Uhla6nLDU -https://github.com/Max00355/FlappyBird/blob/master/flappybird.py \ No newline at end of file +https://www.youtube.com/watch?v=h2Uhla6nLDU +https://github.com/Max00355/FlappyBird/blob/master/flappybird.py diff --git a/data-mentality.md b/data-mentality.md index ada1cf7..11d5e39 100644 --- a/data-mentality.md +++ b/data-mentality.md @@ -1,9 +1,9 @@ # The data mentality --- -**Overview.** Thinking about data, ideas for projects. Things to remember: (1) Ideas aren't discovered, they're developed. (2) Ideas have friends: when you find one, there are others nearby. +**Overview.** Thinking about data, ideas for projects. Things to remember: (1) Ideas aren't discovered, they're developed. (2) Ideas have friends: when you find one, there are others nearby. -**Buzzwords.** Questions, data, idea machines. +**Buzzwords.** Questions, data, idea machines. **Code.** Related [examples](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/IPython/bootcamp_examples.ipynb). @@ -11,80 +11,80 @@ Data analysis starts with a question. Generally, we want to learn something. In our world, we might ask: -* How is the US economy doing? -* What emerging market countries offer the best business opportunities? -* How do returns on US and European stocks compare? +* How is the US economy doing? 
+* What emerging market countries offer the best business opportunities? +* How do returns on US and European stocks compare? -You'll notice that the starting point is a question, something we'd like to know more about. We provide a toolkit for working effectively with data to find answers. Most of our examples are about economics and finance -- that's what we know -- but the same tools can be used to address any data we like. +You'll notice that the starting point is a question, something we'd like to know more about. We provide a toolkit for working effectively with data to find answers. Most of our examples are about economics and finance -- that's what we know -- but the same tools can be used to address any data we like. +* What data would be helpful in answering our question? +* Where can we find it? +* What should we do with it once we have it? +--> -## Thinking about data +## Thinking about data -It's not that we have no lives or anything, but we think about data all the time. If we see an interesting graphic in *The Economist* -- or the *Wall Street Journal*, or the *New York Times* -- it triggers a series of questions. +It's not that we have no lives or anything, but we think about data all the time. If we see an interesting graphic in *The Economist* -- or the *Wall Street Journal*, or the *New York Times* -- it triggers a series of questions. * What did we learn from the graph? -* What else would we like to know? -* Where does the data come from? +* What else would we like to know? +* Where does the data come from? -Following up on these questions often leads to interesting insights. And it's fun. +Following up on these questions often leads to interesting insights. And it's fun. -Let's give it a try: +Let's give it a try: **Exercise.** The 538 blog has a nice summary of [salaries of recent college graduates](http://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/). Skip to the bottom to sort by major and play around. 
Answer these questions: -* What did you learn from their table? -* What else would you like to know? -* Where did they get their data? +* What did you learn from their table? +* What else would you like to know? +* Where did they get their data? -**Exercise.** What kinds of things would you like to know more about? Think of this as improv, there are no bad answers. +**Exercise.** What kinds of things would you like to know more about? Think of this as improv, there are no bad answers. -## Generating project ideas +## Generating project ideas -One of our goals is for you to produce a piece of work -- data and graphics -- that you can show potential employers. There's nothing like a concrete example (regardless of the topic) to show off your skill set. We still have lots of time, but it can't hurt to start thinking about it now. +One of our goals is for you to produce a piece of work -- data and graphics -- that you can show potential employers. There's nothing like a concrete example (regardless of the topic) to show off your skill set. We still have lots of time, but it can't hurt to start thinking about it now. -**Idea machines.** The question is how to find a good project idea. That's not something you run across a lot in modern education, where our job is typically to absorb what's taught rather than come up with our ideas. So how would we get started? +**Idea machines.** The question is how to find a good project idea. That's not something you run across a lot in modern education, where our job is typically to absorb what's taught rather than come up with our ideas. So how would we get started? We're looking for a topic that covers two bases: (1) we find it interesting and (2) we have access to data related to it. We can start with either one, or with an existing example we would like to reproduce and extend: -* **Start with what interests you.** Economics, finance, marketing, emerging markets, movies, sports. You be the judge. 
Be specific: You want a topic, not a category. +* **Start with what interests you.** Economics, finance, marketing, emerging markets, movies, sports. You be the judge. Be specific: You want a topic, not a category. * **Start with data.** Take a dataset you find interesting, ask what you might do with it. If you're not sure where to look, try our list of [data sources](http://databootcamp.nyuecon.com/bootcamp_data/). -* **Start with an example.** Find an analyst report, blog post, or graphic you like. Ask where the data comes from and think about whether you can replicate and/or extend it. +* **Start with an example.** Find an analyst report, blog post, or graphic you like. Ask where the data comes from and think about whether you can replicate and/or extend it. -If you're not sure how this works, watch Steve Levitt's [video](https://youtu.be/r5jATFtKtI8?t=5m10s) about working with company data. It's an entertaining and informative 50 minutes. +If you're not sure how this works, watch Steve Levitt's [video](https://youtu.be/r5jATFtKtI8?t=5m10s) about working with company data. It's an entertaining and informative 50 minutes. -Keep in mind: we're not looking for a perfect idea. Perfection takes time, and we may never get there. Long experience has shown us: +Keep in mind: we're not looking for a perfect idea. Perfection takes time, and we may never get there. Long experience has shown us: -* **Ideas have friends.** If you have an idea, even a not very good one, it often triggers thoughts of other ideas, sometimes even better ones. +* **Ideas have friends.** If you have an idea, even a not very good one, it often triggers thoughts of other ideas, sometimes even better ones. -* **Ideas aren't discovered, they're developed.** Allow your ideas to mature, to evolve and improve. Like kimchi and red wine, they often get better with time. +* **Ideas aren't discovered, they're developed.** Allow your ideas to mature, to evolve and improve. 
Like kimchi and red wine, they often get better with time. -**Common mistakes -- and how to fix them.** We mean this in a good way, but in our experience there are a number of things students do that make this harder than it should be. Here's a list, with suggestions for overcoming them: +**Common mistakes -- and how to fix them.** We mean this in a good way, but in our experience there are a number of things students do that make this harder than it should be. Here's a list, with suggestions for overcoming them: -* **Reject an idea too soon,** before you’ve given it enough thought. Solution: Don't be critical too early, you don't want to inhibit your creativity. Collect ideas first, whittle them down later. +* **Reject an idea too soon,** before you’ve given it enough thought. Solution: Don't be critical too early, you don't want to inhibit your creativity. Collect ideas first, whittle them down later. -* **Choose a project that’s too large**. Solution: Keep it simple. Think it over for a while, and choose a small part of a larger project that is interesting on its own. You can always do more later. +* **Choose a project that’s too large**. Solution: Keep it simple. Think it over for a while, and choose a small part of a larger project that is interesting on its own. You can always do more later. -* **Your dataset doesn't have everything you want.** To be honest, that's pretty much every dataset we've ever seen. Solution: Make do with what you have. +* **Your dataset doesn't have everything you want.** To be honest, that's pretty much every dataset we've ever seen. Solution: Make do with what you have. -* **Pick a dataset that's not available.** Solution: Start with what you have, ask what you can do with it. We call this the [Jeopardy](https://en.wikipedia.org/wiki/Jeopardy!) approach: start with the answer, come up with the question. If that fails, find another dataset. 
+* **Pick a dataset that's not available.** Solution: Start with what you have, ask what you can do with it. We call this the [Jeopardy](https://en.wikipedia.org/wiki/Jeopardy!) approach: start with the answer, come up with the question. If that fails, find another dataset. -**Bottom line.** Projects are less structured than most things you'll run across in the academic world. It's challenging, at first, to work with so little structure, but most students find that the freedom to develop their own projects is one of the most rewarding things they can do. +**Bottom line.** Projects are less structured than most things you'll run across in the academic world. It's challenging, at first, to work with so little structure, but most students find that the freedom to develop their own projects is one of the most rewarding things they can do. -**Exercise.** Write down three project ideas. Don't overthink this, one or two lines each will do. \ No newline at end of file +**Exercise.** Write down three project ideas. Don't overthink this, one or two lines each will do. diff --git a/emerging.md b/emerging.md index d0dfefb..3b0ac29 100644 --- a/emerging.md +++ b/emerging.md @@ -12,11 +12,11 @@ Include cases from Global.... -Data from +Data from -* Penn World Table +* Penn World Table * World Bank -* Doing Business +* Doing Business * Maddison ## Assessing the business climate diff --git a/glossary.md b/glossary.md index 607f3ae..9d7e24c 100644 --- a/glossary.md +++ b/glossary.md @@ -1,9 +1,9 @@ -# Glossary +# Glossary -function -list -package -slicing -string -variable \ No newline at end of file +function +list +package +slicing +string +variable diff --git a/graphs1.md b/graphs1.md index 2b5dbf4..6f75bf4 100644 --- a/graphs1.md +++ b/graphs1.md @@ -1,31 +1,31 @@ # Python graphics: Matplotlib fundamentals --- -**Overview.** We introduce and apply Python's popular graphics package, Matplotlib. We do this in Jupyter, using an IPython notebook. 
+**Overview.** We introduce and apply Python's popular graphics package, Matplotlib. We do this in Jupyter, using an IPython notebook. -**Python tools.** Graphing with Matplotlib: dataframe methods, the `plot(x,y)` function, and figure/axis objects. +**Python tools.** Graphing with Matplotlib: dataframe methods, the `plot(x,y)` function, and figure/axis objects. -**Buzzwords.** Data visualization. +**Buzzwords.** Data visualization. -**Applications.** US GDP, GDP per capita and life expectancy, Fama-French asset returns, PISA math scores. +**Applications.** US GDP, GDP per capita and life expectancy, Fama-French asset returns, PISA math scores. **Code.** [Link](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/IPython/bootcamp_graphics_1.ipynb). --- -Computer graphics are one of the great advances of the modern world. Graphs have always been helpful in describing data or concepts, but they're a lot easier to produce now than they were a few years ago. In fact, we've gotten so good at drawing pictures that we invented a new term for it: **visualization**. If done well, a good graph can tell us something new -- and get us thinking about other things we'd like to know. +Computer graphics are one of the great advances of the modern world. Graphs have always been helpful in describing data or concepts, but they're a lot easier to produce now than they were a few years ago. In fact, we've gotten so good at drawing pictures that we invented a new term for it: **visualization**. If done well, a good graph can tell us something new -- and get us thinking about other things we'd like to know. -That's the good news. The bad news is that graphics are inherently complicated. Programs like Excel do their best to hide this fact, but if you ever try to customize a chart it quickly rears its ugly head. Have you ever spent a couple hours trying to fine-tune an Excel graph? More? 
The problem is that even simple graphs have lots of moving parts: the type (line, bar, scatter, etc); the color and thickness of lines, bars, or markers; title and axis labels; their location, fonts, and font sizes; tick marks (location, size); background color; grid lines (on or off); and so on. That's not an Excel problem, it's a problem with graphics in general. +That's the good news. The bad news is that graphics are inherently complicated. Programs like Excel do their best to hide this fact, but if you ever try to customize a chart it quickly rears its ugly head. Have you ever spent a couple hours trying to fine-tune an Excel graph? More? The problem is that even simple graphs have lots of moving parts: the type (line, bar, scatter, etc); the color and thickness of lines, bars, or markers; title and axis labels; their location, fonts, and font sizes; tick marks (location, size); background color; grid lines (on or off); and so on. That's not an Excel problem, it's a problem with graphics in general. -Our goal here is to produce graphs with **Matplotlib**, Python's leading graphics package. Matplotlib can be used in several ways. We show how they work and produce a few visualizations that we hope bring data to life. There's a lot here, but don't panic, it gets easier with experience. +Our goal here is to produce graphs with **Matplotlib**, Python's leading graphics package. Matplotlib can be used in several ways. We show how they work and produce a few visualizations that we hope bring data to life. There's a lot here, but don't panic, it gets easier with experience. -One more thing before we start: **Save the IPython notebook** at the Code link above in your `Data_Bootcamp` directory/folder. The link goes to a display of the notebook; you need to click on the Raw button to get the real file. +One more thing before we start: **Save the IPython notebook** at the Code link above in your `Data_Bootcamp` directory/folder. 
The link goes to a display of the notebook; you need to click on the Raw button to get the real file. ## Reminders -* Packages. Collections of tools that extend Python's capabilities. We add them with `import` statements. +* Packages. Collections of tools that extend Python's capabilities. We add them with `import` statements. * Pandas. Python's data management package. We typically add it to our programs with @@ -33,41 +33,41 @@ One more thing before we start: **Save the IPython notebook** at the Code link import pandas as pd ``` -* Objects and methods. Recall -- again! -- that we apply the method `justdoit` to the object `x` with `x.justdoit`. +* Objects and methods. Recall -- again! -- that we apply the method `justdoit` to the object `x` with `x.justdoit`. -* Dataframe. A data structure like a spreadsheet that includes a table of data plus row and column labels. Typically columns are variables and rows are observations. Common dataframe methods include `columns` (column labels), `index` (row labels), and `plot()` (graph the columns). -* Series. A single variable `x` in a dataframe `df` can be expressed as the series `df['x']`. +* Dataframe. A data structure like a spreadsheet that includes a table of data plus row and column labels. Typically columns are variables and rows are observations. Common dataframe methods include `columns` (column labels), `index` (row labels), and `plot()` (graph the columns). +* Series. A single variable `x` in a dataframe `df` can be expressed as the series `df['x']`. * Reading spreadsheets. We "read" spreadsheet data into Python using the `read_csv` and `read_excel` functions in Pandas. -* API. Or we can use the APIs for FRED, the World Bank, Fama-French, and other data sources. +* API. Or we can use the APIs for FRED, the World Bank, Fama-French, and other data sources. -* Jupyter. A Python environment in which we create IPython notebooks. These notebooks combine Python code with text and output, including graphics. 
It's the ideal medium for this topic. +* Jupyter. A Python environment in which we create IPython notebooks. These notebooks combine Python code with text and output, including graphics. It's the ideal medium for this topic. ## IPython notebooks in Jupyter -We're going to change programming environments from Spyder to Jupyter and work with IPython notebooks. We had a brief introduction to Jupyter when we installed Anaconda, but we'll go through it again to make sure we're all on the same page. +We're going to change programming environments from Spyder to Jupyter and work with IPython notebooks. We had a brief introduction to Jupyter when we installed Anaconda, but we'll go through it again to make sure we're all on the same page. -**Creating an IPython notebook.** We can open a new IPython notebook by retracing the steps we took in the first class: +**Creating an IPython notebook.** We can open a new IPython notebook by retracing the steps we took in the first class: -* Start the Anaconda Launcher. -* Click on ipython-notebook to launch Jupyter. This will open a tab in your browser with the word Jupyter at the top and your computer's directory structure below it. -* In the browser tab, navigate to your `Data_Bootcamp` directory/folder. -* Click on the New button in the upper right and choose Python 3. +* Start the Anaconda Launcher. +* Click on ipython-notebook to launch Jupyter. This will open a tab in your browser with the word Jupyter at the top and your computer's directory structure below it. +* In the browser tab, navigate to your `Data_Bootcamp` directory/folder. +* Click on the New button in the upper right and choose Python 3. -We now have a new empty Python notebook we can use to play around with. +We now have a new empty Python notebook we can use to play around with. -**Jupyter essentials.** In your browser, you should have an empty notebook with the word Jupyter at the top. 
Below it is a **menubar** with the words File, Edit, View, Cell, Kernel, and Help. Below that is a **toolbar** with various buttons. If you have a few minutes, click on Help in the menubar at the top and choose User Interface Tour. +**Jupyter essentials.** In your browser, you should have an empty notebook with the word Jupyter at the top. Below it is a **menubar** with the words File, Edit, View, Cell, Kernel, and Help. Below that is a **toolbar** with various buttons. If you have a few minutes, click on Help in the menubar at the top and choose User Interface Tour. -Let's put some of these tools to work: +Let's put some of these tools to work: * Change the notebook name. Click on the name (`Untitled` if we just created a new notebook) to the right of the word Jupyter at the top. A textbox should open up. Use it to change the name to `bootcamp_sandbox`. -* Toolbar buttons. Let your mouse hover over one of them to see what it does. +* Toolbar buttons. Let your mouse hover over one of them to see what it does. -* Add a cell. Click on the `+` in the toolbar to create a new cell. Choose Code in the toolbar's dropdown menu. Type this code in the cell: +* Add a cell. Click on the `+` in the toolbar to create a new cell. Choose Code in the toolbar's dropdown menu. Type this code in the cell: ```python import datetime @@ -75,23 +75,23 @@ print('Welcome to Data Bootcamp!') print('Today is: ', datetime.date.today()) ``` -Now click on Cell in the menubar and choose Run cell. What do you see? +Now click on Cell in the menubar and choose Run cell. What do you see? -* Add another cell. Click on the `+` to create another cell and choose Markdown in the toolbar's dropdown menu. Markdown is text; more on it shortly. +* Add another cell. Click on the `+` to create another cell and choose Markdown in the toolbar's dropdown menu. Markdown is text; more on it shortly. 
Type this in the cell: ``` Your name Data Bootcamp sandbox for playing around with IPython notebooks ``` -Run this cell as well. +Run this cell as well. -You get the idea. To get a sense of what's possible, take a look at these [two](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/IPython/bootcamp_test.ipynb) [notebooks](http://nbviewer.ipython.org/github/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb). +You get the idea. To get a sense of what's possible, take a look at these [two](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/IPython/bootcamp_test.ipynb) [notebooks](http://nbviewer.ipython.org/github/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb). -**Markdown essentials.** Markdown is a simplified version of html ("hypertext markup language"), the language used to construct basic websites. html was a great thing in 1990, but now that the excitement has worn off we find it painful. Markdown, however, has a zen-like simplicity. Here are some things we can do with it: +**Markdown essentials.** Markdown is a simplified version of html ("hypertext markup language"), the language used to construct basic websites. html was a great thing in 1990, but now that the excitement has worn off we find it painful. Markdown, however, has a zen-like simplicity. Here are some things we can do with it: -* Headings. Large bold headings are marked by hashes (`#`). One hash for first level (very large), two for second level (a little smaller), three for third level (smaller still), four for fourth (the smallest). Try these in a Markdown cell to see how they look: +* Headings. Large bold headings are marked by hashes (`#`). One hash for first level (very large), two for second level (a little smaller), three for third level (smaller still), four for fourth (the smallest). Try these in a Markdown cell to see how they look: ``` # Data Bootcamp sandbox @@ -99,38 +99,38 @@ You get the idea. 
To get a sense of what's possible, take a look at these [two] ### Data Bootcamp sandbox ``` - Be sure to run the cell when you're done (`shift enter`). + Be sure to run the cell when you're done (`shift enter`). -* Bold and italics. If we put a word or phrase between double asterisks, it's displayed in bold. Thus `**bold**` is displayed as **bold**. If we use single asterisks, we get italics: `*italics*` displays as *italics*. +* Bold and italics. If we put a word or phrase between double asterisks, it's displayed in bold. Thus `**bold**` is displayed as **bold**. If we use single asterisks, we get italics: `*italics*` displays as *italics*. -* Bullet lists. If we want a list of items marked by bullets, we start with a blank line and mark each item with an asterisk on a new line. +* Bullet lists. If we want a list of items marked by bullets, we start with a blank line and mark each item with an asterisk on a new line. -* Links. We construct a link with the text in square brackets and the url in parentheses immediately afterwards. Try this one: +* Links. We construct a link with the text in square brackets and the url in parentheses immediately afterwards. Try this one: ``` [Data Bootcamp course](http://databootcamp.nyuecon.com/) ``` - You can find more information about Markdown under Help. Or use your Google fu. + You can find more information about Markdown under Help. Or use your Google fu. -Markdown is ubiquitous. This book, for example, is written in Markdown. Go [here](https://github.com/DaveBackus/Data_Bootcamp_Book/blob/master/data-mentality.md) for a list of chapter files. Click on one to see how it displays. Click on the Raw button at the top to see the Markdown file that produced it. +Markdown is ubiquitous. This book, for example, is written in Markdown. Go [here](https://github.com/DaveBackus/Data_Bootcamp_Book/blob/master/data-mentality.md) for a list of chapter files. Click on one to see how it displays. 
Click on the Raw button at the top to see the Markdown file that produced it. -**Exercise.** Create a description cell in Markdown near the top of your notebook. It should include your name and a description of what you're doing in the notebook. For example: "Joan Watson's notes on the Data Bootcamp Matplotlib notebook" and a date. *Bonus points:* Add a link. +**Exercise.** Create a description cell in Markdown near the top of your notebook. It should include your name and a description of what you're doing in the notebook. For example: "Joan Watson's notes on the Data Bootcamp Matplotlib notebook" and a date. *Bonus points:* Add a link. -**IPython help.** We can access documentation just as we did in Spyder's IPython console: Type a function or method and add a question mark. For example: `print?` or `df.plot?`. +**IPython help.** We can access documentation just as we did in Spyder's IPython console: Type a function or method and add a question mark. For example: `print?` or `df.plot?`. ## Getting ready -We need to do a few things before we're ready to produce graphs. +We need to do a few things before we're ready to produce graphs. -**Open the graphics notebook.** If you followed instructions -- and we're confident you did -- you saved the notebook for this chapter in your `Data_Bootcamp` directory. Return to the Jupyter tab in your browser that points to that directory. Look for the file named `bootcamp_graphics_1.ipynb`. Click to open it. That will open the notebook in a new tab. The notebook will say at the top: "Python graphics: Matplotlib fundamentals" in large bold letters. +**Open the graphics notebook.** If you followed instructions -- and we're confident you did -- you saved the notebook for this chapter in your `Data_Bootcamp` directory. Return to the Jupyter tab in your browser that points to that directory. Look for the file named `bootcamp_graphics_1.ipynb`. Click to open it. That will open the notebook in a new tab. 
The notebook will say at the top: "Python graphics: Matplotlib fundamentals" in large bold letters. -**Import packages.** We need to tell our Python program what packages we plan to use. The following code also checks their versions and prints the date: +**Import packages.** We need to tell our Python program what packages we plan to use. The following code also checks their versions and prints the date: ```python import sys # system module @@ -149,7 +149,7 @@ print('Today: ', dt.date.today()) All of these statements generally go at the top of our program. -**Process data.** We use three dataframes to illustrate Matplotlib graphics. +**Process data.** We use three dataframes to illustrate Matplotlib graphics. *US GDP.* The first one is several years of US GDP and Consumption. We got the numbers from FRED, but have written them out here for simplicity. The code is @@ -164,10 +164,10 @@ us = pd.DataFrame({'gdp': gdp, 'pce': pce}, index=year) print(us) ``` -Note that we created a dataframe from a dictionary. That's convenient here, but in most real applications we'll read in spreadsheets or access APIs. +Note that we created a dataframe from a dictionary. That's convenient here, but in most real applications we'll read in spreadsheets or access APIs. -*World Bank.* Our second dataframe contains 2013 data for GDP per capita (basically income per person) for several countries: +*World Bank.* Our second dataframe contains 2013 data for GDP per capita (basically income per person) for several countries: ```python code = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX'] @@ -179,7 +179,7 @@ wbdf = pd.DataFrame({'gdppc': gdppc, 'country': country}, index=code) wbdf ``` -In IPython, the last line -- the dataframe name `wbdf` on its own -- results in the display of `wbdf`. That works as long as it's the last statement in the cell. +In IPython, the last line -- the dataframe name `wbdf` on its own -- results in the display of `wbdf`. 
That works as long as it's the last statement in the cell. *Fama-French returns.* Our third dataframe reads annual returns from Fama and French: @@ -192,48 +192,48 @@ ff = ff[['rm', 'rf']] # extract rm and rf (return on market, riskfree rate) ff.head(5) ``` -This gives us a dataframe with two variables: `rm` is the return on the equity market overall and `rf` is the riskfree return. +This gives us a dataframe with two variables: `rm` is the return on the equity market overall and `rf` is the riskfree return. -**Exercise.** What kind of object is `wbdf`? How can you tell? How would you access its column and row labels? What are they? +**Exercise.** What kind of object is `wbdf`? How can you tell? How would you access its column and row labels? What are they? ## Three approaches to graphics in Matplotlib -Back to graphics. Python's leading graphics package is **Matplotlib**. Matplotlib can be used in a number of different ways: +Back to graphics. Python's leading graphics package is **Matplotlib**. Matplotlib can be used in a number of different ways: -* Approach #1: Apply plot methods to dataframes. -* Approach #2: Use the `plot(x,y)` function to plot `y` against `x`. -* Approach #3: Create figure objects and apply methods to them. +* Approach #1: Apply plot methods to dataframes. +* Approach #2: Use the `plot(x,y)` function to plot `y` against `x`. +* Approach #3: Create figure objects and apply methods to them. -All three call on the same functionality, but they use different syntax. We use all three at times but favor #1 and #3. +All three call on the same functionality, but they use different syntax. We use all three at times but favor #1 and #3. ## Digression: Graphing in Excel -Before charging ahead, let's review how we would create what Excel calls a "chart". We need to choose: +Before charging ahead, let's review how we would create what Excel calls a "chart". We need to choose: -* Data. Typically we would highlight a block of cells in a spreadsheet. 
Typical data would be in columns with labels at the top, much like a dataframe. -* Chart type. Lines, bars, scatter plots, and so on. -* `x` and `y` variables. Typically we graph some `y` variable -- or perhaps several of them -- against an `x` variable, with `x` on the horizontal axis and `y` on the vertical axis. We need to tell Excel which is which. +* Data. Typically we would highlight a block of cells in a spreadsheet. Typical data would be in columns with labels at the top, much like a dataframe. +* Chart type. Lines, bars, scatter plots, and so on. +* `x` and `y` variables. Typically we graph some `y` variable -- or perhaps several of them -- against an `x` variable, with `x` on the horizontal axis and `y` on the vertical axis. We need to tell Excel which is which. -This might be followed by a long list of fine-tuning: what the lines look like, how the axes are labeled, and so on. We'll see the same in Matplotlib. +This might be followed by a long list of fine-tuning: what the lines look like, how the axes are labeled, and so on. We'll see the same in Matplotlib. ## Approach #1: Apply plot methods to dataframes -The simplest way to produce graphics from a dataframe is to apply a method to it. We like simple, and do this a lot. +The simplest way to produce graphics from a dataframe is to apply a method to it. We like simple, and do this a lot. If we compare this to Excel, we will see that a number of things are preset for us: -* Data. By default (meaning, if we don't do anything to change it) the data consists of the whole dataframe. -* Chart type. We'll see below that we have options for lines, bars, or other things. -* `x` and `y` variables. By default, the `x` variable is the dataframe's index and the `y` variables are the columns of the dataframe -- all of them. +* Data. By default (meaning, if we don't do anything to change it) the data consists of the whole dataframe. +* Chart type. We'll see below that we have options for lines, bars, or other things. 
+* `x` and `y` variables. By default, the `x` variable is the dataframe's index and the `y` variables are the columns of the dataframe -- all of them. -We can change all of these things, just as we can in Excel, but that's the starting point. +We can change all of these things, just as we can in Excel, but that's the starting point. -**Example (line plot).** Enter the statement `us.plot()` into an IPython cell and run it. This plots every column of the dataframe `us` as a line against the index, the year of the observation. The lines have different colors. We didn't ask for this, it's built in. A legend associates each variable name with a line color. This is also built in. +**Example (line plot).** Enter the statement `us.plot()` into an IPython cell and run it. This plots every column of the dataframe `us` as a line against the index, the year of the observation. The lines have different colors. We didn't ask for this, it's built in. A legend associates each variable name with a line color. This is also built in. **Example (single line plot).** We just plotted all the variables -- all two of them -- in the dataframe `us`. To plot one line, we apply the same method to a single variable. The statement `us['gdp'].plot()` plots GDP alone. The first part -- `us['gdp']` -- is the single variable GDP. The second part -- `.plot()` -- plots it. @@ -241,7 +241,7 @@ We can change all of these things, just as we can in Excel, but that's the start **Example (bar chart).** The statement `us.plot(kind='bar')` produces a bar chart of the same data. -**Exercise.** Show that the statement `us.plot.bar()` produces the same bar chart. +**Exercise.** Show that the statement `us.plot.bar()` produces the same bar chart. **Example (scatter plot).** In a scatter plot we need to be explicit about `x` and `y`. We'll use `gdp` as `x` and `pce` (consumption) as `y`. The general syntax for a dataframe `df` is `df.plot.scatter(x,y)`. 
In this case we use @@ -250,13 +250,13 @@ We can change all of these things, just as we can in Excel, but that's the start us.plot.scatter('gdp', 'pce') ``` -The scatter here is not far from a straight line; evidently consumption and GDP go up and down together. +The scatter here is not far from a straight line; evidently consumption and GDP go up and down together. -**Exercise.** We have lots of choices for dressing this up. Use the IPython help by typing `us.plot?` in an empty cell and running it. What arguments/parameters look interesting to you? +**Exercise.** We have lots of choices for dressing this up. Use the IPython help by typing `us.plot?` in an empty cell and running it. What arguments/parameters look interesting to you? -**Exercise.** Add each of these arguments to `us.plot()` in the code cell below and describe what they do: +**Exercise.** Add each of these arguments to `us.plot()` in the code cell below and describe what they do: * `kind='area'` * `subplots=True` @@ -264,39 +264,39 @@ The scatter here is not far from a straight line; evidently consumption and GDP * `figsize=(3,6)` * `xlim=(0,16000)` -
+
We can do the same things with the Fama-French dataframe `ff`. The basic plot statement is ```python ff.plot() ``` -This has one series (the equity market return `rm`) that varies a lot and one (the riskfree return `rf`) that does not. +This has one series (the equity market return `rm`) that varies a lot and one (the riskfree return `rf`) that does not. -**Exercise.** Let's see if we can dress this one up a little. Try adding, one at a time, the arguments `title='Fama-French returns'`, `grid=True`, and `legend=False`. What does the documentation say about them? What do they do? +**Exercise.** Let's see if we can dress this one up a little. Try adding, one at a time, the arguments `title='Fama-French returns'`, `grid=True`, and `legend=False`. What does the documentation say about them? What do they do? -Let's think about the returns a little. What does the data tell us about them? That's an easier question to answer if we use a different plot. We like histograms because they describe all the outcomes in a convenient form. Try this code: +Let's think about the returns a little. What does the data tell us about them? That's an easier question to answer if we use a different plot. We like histograms because they describe all the outcomes in a convenient form. Try this code: ```python ff.plot(kind='hist', bins=20, subplots=True) ``` -It produces separate histograms of the two variables with 20 "bins" in each. +It produces separate histograms of the two variables with 20 "bins" in each. -**Exercise.** What do the histograms tell us about the two returns? How do they differ? +**Exercise.** What do the histograms tell us about the two returns? How do they differ? -**Exercise.** Use the World Bank dataframe `wbdf` to create a bar chart of GDP per capita, the variable `'gdppc'`. *Bonus points:* Create a horizontal bar chart. +**Exercise.** Use the World Bank dataframe `wbdf` to create a bar chart of GDP per capita, the variable `'gdppc'`. 
*Bonus points:* Create a horizontal bar chart. ## Approach #2: `plot(x,y)` -Next up: the popular `plot(x,y)` function from the pyplot module of Matplotlib. We used pyplot a lot when we started out, and suspect you will, too. +Next up: the popular `plot(x,y)` function from the pyplot module of Matplotlib. We used pyplot a lot when we started out, and suspect you will, too. -We import the module with +We import the module with ```python import matplotlib.pyplot as plt @@ -308,20 +308,20 @@ This is a more explicit version of Matplotlib graphics in which we specify the ` plt.plot(x, y) ``` -The `plt.` identifies `plot()` as a pyplot function. This produces the same kinds of figures we -saw earlier, but we get there by a different route. +The `plt.` identifies `plot()` as a pyplot function. This produces the same kinds of figures we +saw earlier, but we get there by a different route. -**Digression.** We're doing this in an IPython notebook, where it will work fine. But if we use the same code in Spyder, we need to add the statement `plt.show()` to display the graph. In IPython/Jupyter, this happens automatically when the cell ends. **End of digression.** +**Digression.** We're doing this in an IPython notebook, where it will work fine. But if we use the same code in Spyder, we need to add the statement `plt.show()` to display the graph. In IPython/Jupyter, this happens automatically when the cell ends. **End of digression.** -**Basic plots.** Compare these plots to our earlier ones. We start with GDP on its own: +**Basic plots.** Compare these plots to our earlier ones. We start with GDP on its own: ```python plt.plot(us.index, us['gdp']) ``` -Remind yourself what the `x` and `y` variables are here. +Remind yourself what the `x` and `y` variables are here. 
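To make the roles of `x` and `y` concrete, here is a minimal sketch that asks the plotted line what it was given. The numbers are made up for illustration (they are not the FRED values used in this chapter), `demo` is a hypothetical dataframe name, and the `Agg` backend line is only there so the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# illustrative numbers only, NOT the FRED data used in the chapter
demo = pd.DataFrame({'gdp': [100.0, 104.0, 109.0]}, index=[2011, 2012, 2013])

# plt.plot returns a list of line objects, one per plotted line
lines = plt.plot(demo.index, demo['gdp'])
line = lines[0]

# ask the line for the data it was drawn from
print('x data:', [int(v) for v in line.get_xdata()])     # the index (years)
print('y data:', [float(v) for v in line.get_ydata()])   # the column values
```

The first argument ends up on the horizontal axis and the second on the vertical axis, which is the same convention we described for Excel.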
If we want two variables in the same graph, we simply add another line: @@ -366,18 +366,18 @@ plt.tick_params(labelcolor='red') # change tick labels to red plt.legend(['GDP', 'Consumption']) # more descriptive variable names ``` -In this way we add a title (14-point type, left justified), add a label to the y axis, change the limits of the x axis, make the tick labels red, and use more descriptive names in the legend. +In this way we add a title (14-point type, left justified), add a label to the y axis, change the limits of the x axis, make the tick labels red, and use more descriptive names in the legend. -**Exercise.** Add a `plt.ylim()` statement that starts the `y` axis at zero. *Hint:* Use `plt.ylim?` to get the documentation. *Bonus points:* Change the color of the line to magenta and the linewidth to 2. *Hint:* Use `plt.plot?` to get the documentation. +**Exercise.** Add a `plt.ylim()` statement that starts the `y` axis at zero. *Hint:* Use `plt.ylim?` to get the documentation. *Bonus points:* Change the color of the line to magenta and the linewidth to 2. *Hint:* Use `plt.plot?` to get the documentation. -**Exercise.** Create a line plot for the Fama-French dataframe `ff` that includes both returns. *Bonus points:* Add a title. +**Exercise.** Create a line plot for the Fama-French dataframe `ff` that includes both returns. *Bonus points:* Add a title. ## Approach #3: Create figure objects and apply methods -This approach is the most foreign to beginners, but now that we're used to it we like it a lot. We either use it on its own, or adapt its functionality to the dataframe plot methods we saw in Approach #1. The idea is to generate an object -- two objects, in fact -- and apply methods to them to produce the various elements of a graph: the data, their axes, their labels, and so on. +This approach is the most foreign to beginners, but now that we're used to it we like it a lot. 
We either use it on its own, or adapt its functionality to the dataframe plot methods we saw in Approach #1. The idea is to generate an object -- two objects, in fact -- and apply methods to them to produce the various elements of a graph: the data, their axes, their labels, and so on. **Create objects.** This is how we do it: @@ -389,14 +389,14 @@ fig, ax = plt.subplots() This produces a blank figure, which is displayed in the IPython notebook. The names `fig` and `ax` can be anything, but these are standard. We say `fig` is a **figure object** and `ax` is an **axis object** (hint: try `type(fig)` and `type(ax)` to see why). This means: -* `fig` is a blank canvas for creating a figure. +* `fig` is a blank canvas for creating a figure. -* `ax` is everything in it: axes, labels, lines or bars, legend, and so on. +* `ax` is everything in it: axes, labels, lines or bars, legend, and so on. -We apply methods to both objects to create data graphs. +We apply methods to both objects to create data graphs. -**Create graphs.** We create graphs by applying plot-like methods to `ax`. This is an example that uses pyplot's `plot(x,y)` syntax: +**Create graphs.** We create graphs by applying plot-like methods to `ax`. This is an example that uses pyplot's `plot(x,y)` syntax: ```python # create objects @@ -408,31 +408,31 @@ ax.set_title('US GDP', fontsize=14, loc='left') ax.set_ylabel('Billions of USD') ``` -The first line creates a line plot with the usual `plot(x,y)` syntax. The next two lines add a title and y-axis label. We have access to the usual set of methods for refining figures. We can get a list using tab completion: type `ax.[tab]` in an IPython code cell. +The first line creates a line plot with the usual `plot(x,y)` syntax. The next two lines add a title and y-axis label. We have access to the usual set of methods for refining figures. We can get a list using tab completion: type `ax.[tab]` in an IPython code cell. 
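Tab completion only works interactively. As a rough equivalent that works anywhere, the sketch below lists the `set_` methods of an axis object with Python's built-in `dir()`. The variable name `setters` is ours, and the exact list depends on the installed version of Matplotlib.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; not needed in Jupyter
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# collect the names of the methods that set some property of the axis object
setters = [name for name in dir(ax) if name.startswith('set_')]

print(len(setters), 'set_ methods, including:')
print([name for name in setters
       if name in ('set_title', 'set_xlabel', 'set_ylabel')])
```

Once a name looks promising, `ax.set_title?` (or `help(ax.set_title)` outside IPython) shows its documentation.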
-We don't use figure methods much, but here's one: +We don't use figure methods much, but here's one: ```python # a figure method: save figure as a pdf fig.savefig('us_gdp.pdf') ``` -This saves the figure as a pdf file that we can use in a report or slide. +This saves the figure as a pdf file that we can use in a report or slide. -**Exercise.** Create a bar chart of variable `rm` in the `ff` dataframe. *Bonus points:* Make the bars red. +**Exercise.** Create a bar chart of variable `rm` in the `ff` dataframe. *Bonus points:* Make the bars red. **Multiple plots.** We use the same idea -- create and use figure and axis objects. The difference is that the axis object has the same more than one component, one for each of the subplots. -Here's an example that reproduces our separate graphs of US GDP and consumption. We star by creating the objects: +Here's an example that reproduces our separate graphs of US GDP and consumption. We star by creating the objects: ```python -fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True) +fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True) print('Object ax has dimension', len(ax)) ``` -The `subplot` statement asks for a graph with two rows (top and bottom) and one column. That is, two graphs, one on top of the other. The `sharex=True` argument makes the `x` axes the same. The `print` statement tells us: "Object ax has dimension 2". There's one for the GDP graph, and one for the consumption graph. +The `subplot` statement asks for a graph with two rows (top and bottom) and one column. That is, two graphs, one on top of the other. The `sharex=True` argument makes the `x` axes the same. The `print` statement tells us: "Object ax has dimension 2". There's one for the GDP graph, and one for the consumption graph. 
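Here is a minimal sketch of what `plt.subplots` hands back when we ask for more than one plot, and of what `sharex=True` does. It uses no data at all; the axis limits 2003 and 2013 are arbitrary values chosen for illustration, and the `Agg` line is only there so the code runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; not needed in Jupyter
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)

# with more than one subplot, ax is a NumPy array of axis objects
print(type(ax).__name__, 'with shape', ax.shape)

# because the x axes are shared, changing the limits on the top plot
# changes them on the bottom plot as well
ax[0].set_xlim(2003, 2013)
bottom_limits = tuple(float(v) for v in ax[1].get_xlim())
print('bottom x limits:', bottom_limits)
```

This is why we index `ax` with `[0]` and `[1]` when there are two plots, but use `ax` directly when there is only one.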
Now add the content: @@ -444,12 +444,12 @@ ax[0].plot(us.index, us['gdp'], color='green') # first plot ax[1].plot(us.index, us['pce'], color='red') # second plot ``` -(Note that we start numbering the components of `ax` at zero, which should be getting familiar by now.) This gives us a double graph, with GDP at the top and consumption at the bottom. +(Note that we start numbering the components of `ax` at zero, which should be getting familiar by now.) This gives us a double graph, with GDP at the top and consumption at the bottom. -**Approach #1 revisited.** In approach #1, we applied plot methods to dataframes. We also used arguments to fix up the graph, but that got complicated pretty quickly. +**Approach #1 revisited.** In approach #1, we applied plot methods to dataframes. We also used arguments to fix up the graph, but that got complicated pretty quickly. -Here we combine Approaches #1 and #3. If we look at the documentation for `df.plot()`, we see that it "returns" an axis object. (Do it and see) Once we have an axis object, we can apply methods to it that do just about anything. +Here we combine Approaches #1 and #3. If we look at the documentation for `df.plot()`, we see that it "returns" an axis object. (Do it and see) Once we have an axis object, we can apply methods to it that do just about anything. The first step is to grab the axis object. We change the dataframe-based plot statement `us.plot()` to @@ -457,16 +457,16 @@ The first step is to grab the axis object. We change the dataframe-based plot s ax = us.plot() ``` -Now that we have `ax`, we dress up the graph by applying methods to it: +Now that we have `ax`, we dress up the graph by applying methods to it: ```python -ax = us.plot() +ax = us.plot() ax.set_title('US GDP and Consumption', fontsize=14, loc='left') ax.set_ylabel('Billions of 2013 USD') ax.legend(loc='center right') ``` -(Note again that we need to create and use the axis object in the same IPython cell.) 
We can add as many of these as we like. +(Note again that we need to create and use the axis object in the same IPython cell.) We can add as many of these as we like. If we want the figure object, we apply a method to the axis object `ax`: @@ -474,15 +474,15 @@ If we want the figure object, we apply a method to the axis object `ax`: fig = ax.get_figure() ``` -We don't see ourselves doing this much, but it ties up a loose end. +We don't see ourselves doing this much, but it ties up a loose end. ## Let's review -Take a deep breath. We've covered a lot of ground, it's time to review. +Take a deep breath. We've covered a lot of ground, it's time to review. -We looked at three ways to use Matplotlib: +We looked at three ways to use Matplotlib: * Approach #1: Apply plot methods to dataframes. * Approach #2: Use the `plot(x,y)` function. @@ -496,19 +496,19 @@ us['gdp'].plot() # Approach #1 plt.plot(us.index, us['gdp']) # Approach #2 fig, ax = plt.subplots() # Approach #3 -ax.plot(us.index, us['gdp']) +ax.plot(us.index, us['gdp']) ``` -Each one produces the same graph. +Each one produces the same graph. -Which one should we use? We prefer Approach #1, with the additions just mentioned for creating and using axis objects. You're welcome to use whichever you prefer, but we recommend you choose and stick to one until you have more experience. This is a case where having more choices can be a bad thing. +Which one should we use? We prefer Approach #1, with the additions just mentioned for creating and using axis objects. You're welcome to use whichever you prefer, but we recommend you choose and stick to one until you have more experience. This is a case where having more choices can be a bad thing. -We also suggest you not commit any of this to memory. If you use end up using it a lot, you'll remember it. If you don't, it's not worth remembering. We typically start with examples anyway rather than creating new graphs from scratch. 
+We also suggest you not commit any of this to memory. If you use end up using it a lot, you'll remember it. If you don't, it's not worth remembering. We typically start with examples anyway rather than creating new graphs from scratch. ## Examples -We conclude with examples that take data from the previous chapter and make better graphs than we did there. +We conclude with examples that take data from the previous chapter and make better graphs than we did there. **PISA test scores.** Recall that we had a simple plot, but it didn't look very good. The code was @@ -529,16 +529,16 @@ pisa.columns = ['Math', 'Reading', 'Science'] # simplify variable names pisa['Math'].plot(kind='barh') # create bar chart ``` -**Comment.** Yikes! That's horrible! What can we do about it? Any suggestions? +**Comment.** Yikes! That's horrible! What can we do about it? Any suggestions? -The problem seems to be that the bars and labels are squeezed together, so perhaps we should make the figure taller. We set the figure's dimensions with the argument `figsize=(width, height)`. The sizes are measured in inches, which get shrunk a bit when we display them in IPython. Here's a version with a much larger `height` that we discovered by experimenting: +The problem seems to be that the bars and labels are squeezed together, so perhaps we should make the figure taller. We set the figure's dimensions with the argument `figsize=(width, height)`. The sizes are measured in inches, which get shrunk a bit when we display them in IPython. Here's a version with a much larger `height` that we discovered by experimenting: ```python -ax = pisa['Math'].plot(kind='barh', figsize=(4,13)) +ax = pisa['Math'].plot(kind='barh', figsize=(4,13)) ax.set_title('PISA Math Score', loc='left') ``` -This creates a figure that is 4 inches wide and 13 inches tall. We added a title, too, to be clear about what we have. +This creates a figure that is 4 inches wide and 13 inches tall. 
We added a title, too, to be clear about what we have. Here's a more advanced version in which we made the US bar red. This is ridiculously complicated, but we used our Google fu and found [a solution](http://stackoverflow.com/questions/18973404/setting-different-bar-color-in-matplotlib-python). (Remember: The solution to many programming problems is a combination of Google fu and patience.) The code is @@ -549,7 +549,7 @@ ax.set_title('PISA Math Score', loc='left') ax.get_children()[38].set_color('r') ``` -The `38` comes from experimenting. We count from the bottom starting with zero. +The `38` comes from experimenting. We count from the bottom starting with zero. **World Bank data.** Our second example comes from using the World Bank's API, which gives us access to a huge amount of data for countries. We use it to produce two kinds of graphs and illustrate some tools we haven't seen yet: @@ -558,7 +558,7 @@ The `38` comes from experimenting. We count from the bottom starting with zero. * Scatter plot (bubble plot) of life expectancy v GDP per capita -We start with the data: +We start with the data: ```python # load packages (redundancy is ok) @@ -567,7 +567,7 @@ from pandas.io import wb # World Bank api import matplotlib.pyplot as plt # plotting tools # variable list (GDP, GDP per capita, life expectancy) -var = ['NY.GDP.PCAP.PP.KD', 'NY.GDP.MKTP.PP.KD', 'SP.DYN.LE00.IN'] +var = ['NY.GDP.PCAP.PP.KD', 'NY.GDP.MKTP.PP.KD', 'SP.DYN.LE00.IN'] # country list (ISO codes) iso = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX'] year = 2013 @@ -586,7 +586,7 @@ df = df.sort_values(by='order', ascending=False) df ``` -Note that the index here is the country name. +Note that the index here is the country name. Here's a horizontal bar chart for (total) GDP: @@ -597,12 +597,12 @@ ax.set_xlabel('Trillions of US Dollars') ax.set_ylabel('') ``` -What do you see? What's the takeaway? +What do you see? What's the takeaway? 
We think the horizontal bar chart looks better than the usual vertical bar chart, which we'd get if we replaced `barh` above with `bar`. (Try it and see what you think.) -Here's a similar chart for GDP per capita: +Here's a similar chart for GDP per capita: ```python ax = df['gdppc'].plot(kind='barh', color='m', alpha=0.5) @@ -611,7 +611,7 @@ ax.set_xlabel('Thousands of US Dollars') ax.set_ylabel('') ``` -What do you see here? What's the takeway? +What do you see here? What's the takeway? And just because it's fun, here's an example of Tufte-like axes from [Matplotlib examples](http://matplotlib.org/examples/ticks_and_spines/spines_demo_dropped.html): @@ -638,22 +638,22 @@ We finish off with a bubble plot: a scatter plot in which the size of the dots ```python plt.scatter(df['gdppc'], df['life'], # x,y variables s=df['pop']/10**6, # size of bubbles - alpha=0.5) + alpha=0.5) plt.title('Life expectancy vs. GDP per capita', loc='left', fontsize=14) plt.xlabel('GDP Per Capita') plt.ylabel('Life Expectancy') plt.text(58, 66, 'Bubble size represents population', horizontalalignment='right',) ``` -The only odd thing is the `10**6` "scaling" on the second line. The bubble size is a little tricky to calibrate. Without the scaling, the bubbles are larger than the graph. We played around until they looked reasonable. +The only odd thing is the `10**6` "scaling" on the second line. The bubble size is a little tricky to calibrate. Without the scaling, the bubbles are larger than the graph. We played around until they looked reasonable. ## Styles -Ok, we lied, that wasn't the conclusion. But we think this is fun, and it's optional in any case. +Ok, we lied, that wasn't the conclusion. But we think this is fun, and it's optional in any case. -Matplotlib has a lot of basic settings for graphs. If we find some we like, we can set them once and be done with it. Or we can use some of their preset combinations, which they call **styles**. 
+Matplotlib has a lot of basic settings for graphs. If we find some we like, we can set them once and be done with it. Or we can use some of their preset combinations, which they call **styles**. We'll start with one of the bar charts we produced with World Bank data: @@ -670,10 +670,10 @@ Now recreate the same graph with this statement at the top: plt.style.use('fivethirtyeight') ``` -(And note that once we execute this statement, it stays executed.) +(And note that once we execute this statement, it stays executed.) -**Exercise.** Try one of these styles: `ggplot`, `bmh`, `dark_background`, and `grayscale`. +**Exercise.** Try one of these styles: `ggplot`, `bmh`, `dark_background`, and `grayscale`. Here's another one, for fans of the popular [xkcd webcomic](http://xkcd.com/1235/): @@ -686,7 +686,7 @@ ax.set_xlabel('Trillions of US Dollars') ax.set_ylabel('') ``` -Note the wiggly lines, perfect for a suggesting hand-drawn graph. +Note the wiggly lines, perfect for a suggesting hand-drawn graph. Which styles do you like? Why? @@ -702,28 +702,28 @@ mpl.rcParams.update(mpl.rcParamsDefault) ## Review -Coming... +Coming... ## Resources -We haven't found many non-technical resources we like. +We haven't found many non-technical resources we like. -* One of the best is Matplotlib's [gallery of examples](http://matplotlib.org/gallery.html). It's a great starting point for learning new things. Find an example you like, download the code, and adapt it to your needs. -* [Chris Moffitt](http://pbpython.com/simple-graphing-pandas.html) does his usual nice job looking at (mostly) dataframe methods. He also has a [nice overview](http://pbpython.com/visualization-tools-1.html) of other Python graphics packages. +* One of the best is Matplotlib's [gallery of examples](http://matplotlib.org/gallery.html). It's a great starting point for learning new things. Find an example you like, download the code, and adapt it to your needs. 
+* [Chris Moffitt](http://pbpython.com/simple-graphing-pandas.html) does his usual nice job looking at (mostly) dataframe methods. He also has a [nice overview](http://pbpython.com/visualization-tools-1.html) of other Python graphics packages. -If you find other resources you like, let us know. +If you find other resources you like, let us know. -One kind of component is a user-interface or **environment** for writing code and executing it. Here's an analogy: Word and Google docs are "environments" to produce text documents. Both work. We use the one we find more convenient. The same is true of Python environments. Think to yourself: An environment is analogous to Microsoft Word and a Python program is analogous to a Word document. +One kind of component is a user-interface or **environment** for writing code and executing it. Here's an analogy: Word and Google docs are "environments" to produce text documents. Both work. We use the one we find more convenient. The same is true of Python environments. Think to yourself: An environment is analogous to Microsoft Word and a Python program is analogous to a Word document. + +We'll use two Python environments in this class: -We'll use two Python environments in this class: + * **Spyder** is a graphical interface that includes an editor, a button to run code, and windows for experimenting and checking documentation. - * **Spyder** is a graphical interface that includes an editor, a button to run code, and windows for experimenting and checking documentation. - - * **Jupyter** is a web-based interface for running **IPython notebooks**, which combine code, output, and documentation. + * **Jupyter** is a web-based interface for running **IPython notebooks**, which combine code, output, and documentation. -We will write and run Python programs in both environments. +We will write and run Python programs in both environments. -This is a lot of jargon to swallow at one time. Don't panic, it will become familiar with use. 
And anything we don't use you can safely ignore. +This is a lot of jargon to swallow at one time. Don't panic, it will become familiar with use. And anything we don't use you can safely ignore. -## Installing Anaconda +## Installing Anaconda -Follow these instructions. By which we mean: **follow these instructions exactly!** Creativity is a wonderful thing, but here it will cost you dearly. +Follow these instructions. By which we mean: **follow these instructions exactly!** Creativity is a wonderful thing, but here it will cost you dearly. -**Step 1. Download the Anaconda installer.** +**Step 1. Download the Anaconda installer.** -* Click **[HERE](http://continuum.io/downloads)** or Google "Anaconda downloads." +* Click **[HERE](http://continuum.io/downloads)** or Google "Anaconda downloads." * Scroll to find "Anaconda for Windows" or further down for Macs "Anaconda for OS X." -* Find the option for **Python 3.5.** **NOT** Python 2.7! If you get 2.7, you'll have to start over. -* Click the **64-bit Graphical Installer** to start the download. +* Find the option for **Python 3.5.** **NOT** Python 2.7! If you get 2.7, you'll have to start over. +* Click the **64-bit Graphical Installer** to start the download. -**Step 2. Run the installer.** Click on the Anaconda installer you just downloaded to install the Anaconda distribution of Python. Do what it says. +**Step 2. Run the installer.** Click on the Anaconda installer you just downloaded to install the Anaconda distribution of Python. Do what it says. -**Step 3. Find and run Launcher.** Look wherever programs are on your computer. +**Step 3. Find and run Launcher.** Look wherever programs are on your computer. - * Windows: Click on the Start button and type "Launcher" in the search box. - * Macs: Finder, Spotlight Search, and Launchpad all work. + * Windows: Click on the Start button and type "Launcher" in the search box. + * Macs: Finder, Spotlight Search, and Launchpad all work. 
+**Pro tip.** Put a shortcut to Launcher in a convenient place so you can find it easily next time. In Windows, that would be the launchpad or the desktop. +--> Once Launcher is running -- be patient, it can take 60 seconds or more -- you should see something like this: @@ -60,147 +60,147 @@ Once Launcher is running -- be patient, it can take 60 seconds or more -- you sh -If so, you now have Python installed and ready to run. Congratulations! +If so, you now have Python installed and ready to run. Congratulations! -## Coding environments +## Coding environments -Coding environments are pieces of software we use to write and run code. The best ones make coding easy, even pleasurable, strange as that might sound. We'll use two: **Spyder** and **Jupyter**. We access both through Launcher, where Spyder is labelled "spyder-app" and Jupyter is labelled "ipython-notebook". See the picture above. +Coding environments are pieces of software we use to write and run code. The best ones make coding easy, even pleasurable, strange as that might sound. We'll use two: **Spyder** and **Jupyter**. We access both through Launcher, where Spyder is labelled "spyder-app" and Jupyter is labelled "ipython-notebook". See the picture above. -If Launcher is open, great. If not, please start it up (Step 3 above). +If Launcher is open, great. If not, please start it up (Step 3 above). -**Spyder.** Spyder is a graphical environment with an editor for writing programs, a console for trying out one line at a time, and access to help. It’s our preferred Python environment. Experts often use other editors, but unless you’re one of them this is where you should start. +**Spyder.** Spyder is a graphical environment with an editor for writing programs, a console for trying out one line at a time, and access to help. It’s our preferred Python environment. Experts often use other editors, but unless you’re one of them this is where you should start. 
To start Spyder from Launcher, **click on the blue Launch button** to the right of spyder-app. We find it a little slow, but it should start up eventually. You should then see something that looks like this: ![Spyder environment](figs/spyder_plain.png "Spyder") -You see here that Spyder has a number of different components. It's overwhelming at first, but give it some time. The most important components are: +You see here that Spyder has a number of different components. It's overwhelming at first, but give it some time. The most important components are: -* **Editor.** This is on the left. We can write and edit programs here and save them to our hard drive. +* **Editor.** This is on the left. We can write and edit programs here and save them to our hard drive. -* **Toolbar.** Above the editor, you'll see a row of buttons that we refer to as the toolbar. See the picture below. Among them are some green triangles. The big one (marked by the red arrow) runs the whole program -- whatever we have in the editor. The smaller ones run sections of code. More on this later. +* **Toolbar.** Above the editor, you'll see a row of buttons that we refer to as the toolbar. See the picture below. Among them are some green triangles. The big one (marked by the red arrow) runs the whole program -- whatever we have in the editor. The smaller ones run sections of code. More on this later. ![Spyder toolbar](figs/spyder_toolbar.png "Spyder's toolbar") -* **IPython console.** This is on the right at the bottom -- look for the tab with this label. This is where output from our programs will show up. On startup it will display something like +* **IPython console.** This is on the right at the bottom -- look for the tab with this label. This is where output from our programs will show up. On startup it will display something like ```python Python 3.5.0 |Anaconda 2.4.0 (64-bit) - etc etc - - In [1]: + etc etc + + In [1]: ``` -* **Object inspector.** This is on the right at the top. 
We get Python documentation here, which is really useful. +* **Object inspector.** This is on the right at the top. We get Python documentation here, which is really useful. -We can move these windows around by dragging and dropping. If we mess up -- it happens to the best of us -- look for "View" at the top and click on "Reset window layout." +We can move these windows around by dragging and dropping. If we mess up -- it happens to the best of us -- look for "View" at the top and click on "Reset window layout." -**Jupyter.** Jupyter is another graphical environment, which we use to create and run **IPython notebooks**. These notebooks combine code, output, words, and graphics. It's a convenient format for presenting our work to others and can be used as a project report. We'll use IPython notebooks in class in a few weeks. In the meantime, here are [two](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/IPython/bootcamp_examples.ipynb) [examples](http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb). +**Jupyter.** Jupyter is another graphical environment, which we use to create and run **IPython notebooks**. These notebooks combine code, output, words, and graphics. It's a convenient format for presenting our work to others and can be used as a project report. We'll use IPython notebooks in class in a few weeks. In the meantime, here are [two](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/IPython/bootcamp_examples.ipynb) [examples](http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb). -To create or run an IPython notebook from Launcher, **click the blue Launch button** to the right of the ipython-notebook icon. It will open a tab in your default browser. (If you're not sure what that is, you'll soon find out.) 
In the browser tab, you'll see something like this: +To create or run an IPython notebook from Launcher, **click the blue Launch button** to the right of the ipython-notebook icon. It will open a tab in your default browser. (If you're not sure what that is, you'll soon find out.) In the browser tab, you'll see something like this: ![Jupyter environment](figs/jupyter_plain.png "Jupyter") +at the top the word "Jupyter." (It used to say IPython, but now the same environment handles code in Julia, R, and other languages, which called for a [name change](http://ipython.org/#jupyter-and-the-future-of-ipython).) Just below the word Jupyter you'll see the words "File," "Edit," "View," etc. Below that you'll see the directory (folder) structure of your computer. +--> -**Exercise.** Create a directory (folder) on your computer with the name `Data_Bootcamp` and store your programs there. If you're not sure how to do this, let us know. (And note well: There is an **underscore** `"_"` between "Data" and "Bootcamp", not a blank space.) +**Exercise.** Create a directory (folder) on your computer with the name `Data_Bootcamp` and store your programs there. If you're not sure how to do this, let us know. (And note well: There is an **underscore** `"_"` between "Data" and "Bootcamp", not a blank space.) -Let's repeat that last part: We use the acronym **mtwn** to indicate material that is "more than we need," meaning it's safe to ignore. ---> +## Run test programs -## Run test programs +Let's run a test program -- the same one -- in Spyder and IPython/Jupyter and make sure everything works. -Let's run a test program -- the same one -- in Spyder and IPython/Jupyter and make sure everything works. +**Spyder**. Start up Spyder. (If you're not sure how to do that, go back to the previous section.) Once you have Spyder up and running: -**Spyder**. Start up Spyder. (If you're not sure how to do that, go back to the previous section.)
Once you have Spyder up and running: +* Create a Python code file. Click on "File" (upper left), then "New file." That should give you a new mostly empty file with some junk at the top that you can ignore or delete. -* Create a Python code file. Click on "File" (upper left), then "New file." That should give you a new mostly empty file with some junk at the top that you can ignore or delete. +* Add code to file. Enter the following lines of code at the bottom of your file: -* Add code to file. Enter the following lines of code at the bottom of your file: + ```python + import sys # import system module (don't ask) - ```python - import sys # import system module (don't ask) - - print('\nWhat version of Python?\n', sys.version, '\n', sep='') + print('\nWhat version of Python?\n', sys.version, '\n', sep='') - if float(sys.version_info[0]) < 3.0: - raise Exception('Program halted, old version of Python. \n', + if float(sys.version_info[0]) < 3.0: + raise Exception('Program halted, old version of Python. \n', 'Sorry, you need to install Anaconda again.') else: print('Congratulations, you have Python 3!') ``` - [If you're feeling lazy, you can make do with the first two lines on their own, but you won't get the messages we describe below.] + [If you're feeling lazy, you can make do with the first two lines on their own, but you won't get the messages we describe below.] -* Save your code. Click on File at the top left, then Save As, and save in the `Data_Bootcamp` directory under the name `bootcamp_test.py` (Python programs use the extension `.py`). +* Save your code. Click on File at the top left, then Save As, and save in the `Data_Bootcamp` directory under the name `bootcamp_test.py` (Python programs use the extension `.py`). -* Run it. Click on the green triangle above the editor and run your program. +* Run it. Click on the green triangle above the editor and run your program. -The output appears in the IPython console in the lower right corner. 
If you get the message "Congratulations etc," you're all set. Pat yourself on the back and buy yourself a cold drink, you've earned it. If you get the message "Program halted, old version of Python, etc," you need to go back and install Anaconda again, this time **following the instructions exactly**! Yes, we know that's discouraging, but it's better to know that now than run into problems later. Have a cold drink anyway and catch your breath. +The output appears in the IPython console in the lower right corner. If you get the message "Congratulations etc," you're all set. Pat yourself on the back and buy yourself a cold drink, you've earned it. If you get the message "Program halted, old version of Python, etc," you need to go back and install Anaconda again, this time **following the instructions exactly**! Yes, we know that's discouraging, but it's better to know that now than run into problems later. Have a cold drink anyway and catch your breath. -**Comment (mtwn).** We use the acronym **mtwn** to indicate material that is "more than we need," meaning it's safe to ignore. +**Comment (mtwn).** We use the acronym **mtwn** to indicate material that is "more than we need," meaning it's safe to ignore. -**More comments.** All of these are mtwn, but we thought they would make the code we just entered less mysterious -- and give us a head start with Python programming. (i) Anything following a hash (#) is a comment and has no effect on what the program does. (ii) Blank lines are optional, but they make the code easier to read. (iii) The rest of the code checks the Python version (`sys.version_info`). If the version is less than 3.0, it prints an error message (`raise Exception`). Otherwise it prints the message "Congratulations, etc." (iv) The statements that begin with `raise` and `print` are indented exactly four spaces. That's a standard feature of Python. Anything else generates an error. 
+**More comments.** All of these are mtwn, but we thought they would make the code we just entered less mysterious -- and give us a head start with Python programming. (i) Anything following a hash (#) is a comment and has no effect on what the program does. (ii) Blank lines are optional, but they make the code easier to read. (iii) The rest of the code checks the Python version (`sys.version_info`). If the version is less than 3.0, it halts with an error message (`raise Exception`). Otherwise it prints the message "Congratulations, etc." (iv) The statements that begin with `raise` and `print` are indented exactly four spaces. Indentation is how Python marks blocks of code, and four spaces is the standard convention. Inconsistent indentation within a block generates an error. -**IPython**. We prefer to write code in an editor and will stick with Spyder for a few weeks. IPython notebooks are good for talks and reports, because they contain text and output, specifically graphical output. We can generally read through them more easily than naked code. Here are [three](http://savvastjortjoglou.com/nba-shot-sharts.html) [more](http://nbviewer.ipython.org/url/jakevdp.github.com/downloads/notebooks/XKCD_plots.ipynb) [examples](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/SQL/SQL_Intro.ipynb) to make the point. +**IPython**. We prefer to write code in an editor and will stick with Spyder for a few weeks. IPython notebooks are good for talks and reports, because they contain text and output, specifically graphical output. We can generally read through them more easily than naked code. Here are [three](http://savvastjortjoglou.com/nba-shot-sharts.html) [more](http://nbviewer.ipython.org/url/jakevdp.github.com/downloads/notebooks/XKCD_plots.ipynb) [examples](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/SQL/SQL_Intro.ipynb) to make the point. -To run the same code in an IPython notebook, start the IPython/Jupyter app in Launcher, the one labelled "ipython-notebook".
(If you're not sure what this means, go back to the previous section.) Once you have it up and running: +To run the same code in an IPython notebook, start the IPython/Jupyter app in Launcher, the one labelled "ipython-notebook". (If you're not sure what this means, go back to the previous section.) Once you have it up and running: -* Choose the directory. You should see the directory structure of your computer in Jupyter. Navigate to the `Data_Bootcamp` directory (folder) you created earlier. +* Choose the directory. You should see the directory structure of your computer in Jupyter. Navigate to the `Data_Bootcamp` directory (folder) you created earlier. -* Create an IPython notebook. Click on the "New" dropdown menu in the upper right corner and choose Python 3. This will create a blank notebook and an empty cell, where you can enter words or code. +* Create an IPython notebook. Click on the "New" dropdown menu in the upper right corner and choose Python 3. This will create a blank notebook and an empty cell, where you can enter words or code. -* Set the file name. To the right of the word Jupyter, you'll see "Untitled". Change it to `bootcamp_test`. +* Set the file name. To the right of the word Jupyter, you'll see "Untitled". Change it to `bootcamp_test`. -* Enter code. Click on the dropdown menu below the word "Help" and choose Code. Then enter the code listed above in the empty cell. +* Enter code. Click on the dropdown menu below the word "Help" and choose Code. Then enter the code listed above in the empty cell. -* Run the code. Click on "Cell" at the top and choose Run All. +* Run the code. Click on "Cell" at the top and choose Run All. -Output will appear in the same cell below your code. If it says "Congratulations etc." you're all set. +Output will appear in the same cell below your code. If it says "Congratulations etc." you're all set. -## Let's go! +## Let's go! -We're now ready to write and run Python programs in two environments. Take a bow. 
+We're now ready to write and run Python programs in two environments. Take a bow. -## Review +## Review **Exercise.** We have seen both **code files** and **environments** for working with them. With this in mind, fill in the blanks in the table below and explain your answers to your neighbor. -Environment | File -:---: | :---: -MS Word | Word document -MS Excel | Excel file -Spyder | - | IPython notebook +Environment | File +:---: | :---: +MS Word | Word document +MS Excel | Excel file +Spyder | + | IPython notebook -**Exercise.** What version of Python are we using? +**Exercise.** What version of Python are we using? -**Exercise.** Identify the editor, the IPython console, and the Object inspector in the Spyder picture above -- or your computer. +**Exercise.** Identify the editor, the IPython console, and the Object inspector in the Spyder picture above -- or your computer. -## Resources +## Resources More on the Anaconda distribution and its contents: -* Anaconda [download page](http://continuum.io/downloads) and [package list](http://docs.continuum.io/anaconda/pkg-docs). -* Spyder [documentation](https://pythonhosted.org/spyder/). +* Anaconda [download page](http://continuum.io/downloads) and [package list](http://docs.continuum.io/anaconda/pkg-docs). +* Spyder [documentation](https://pythonhosted.org/spyder/). * IPython [documentation](http://ipython.org/notebook.html). Look for the [Pybonacci demo](https://youtu.be/H6dLGQw9yFQ), it covers the basics in 5 minutes. You can also get help in Jupyter: click on "Help" at the top and choose "User Interface Tour." -* [Links](https://www.reddit.com/r/Python/comments/2trvyy/resource_or_tutorials_for_anacondaconda/) to other documentation and support. More than you'll ever want or need. +* [Links](https://www.reddit.com/r/Python/comments/2trvyy/resource_or_tutorials_for_anacondaconda/) to other documentation and support. More than you'll ever want or need. 
diff --git a/intro.md b/intro.md index 51bdcd1..5af032a 100644 --- a/intro.md +++ b/intro.md @@ -2,136 +2,136 @@ --- -**Overview.** Data management skills are enormously valuable in the modern world. We're going to give you those skills, show you how to apply them to economic and financial data, and maybe tell a few bad jokes along the way. Join us! +**Overview.** Data management skills are enormously valuable in the modern world. We're going to give you those skills, show you how to apply them to economic and financial data, and maybe tell a few bad jokes along the way. Join us! -**Buzzwords.** Python, code, Google fu. +**Buzzwords.** Python, code, Google fu. --- -This book -- and the course we developed it for -- is about **data**. It's also about **tools** for working with data, which in this case means **[Python][10]** and its data-related tools. Our focus is economic and financial data, which is what we know best, but the same tools can be applied to any data. By the end of the course, you will have a better idea where to find data that's useful to you, and you will have command over tools you can use to do something interesting with it. We think your life will be more interesting, too, but maybe that's just us. +This book -- and the course we developed it for -- is about **data**. It's also about **tools** for working with data, which in this case means **[Python][10]** and its data-related tools. Our focus is economic and financial data, which is what we know best, but the same tools can be applied to any data. By the end of the course, you will have a better idea where to find data that's useful to you, and you will have command over tools you can use to do something interesting with it. We think your life will be more interesting, too, but maybe that's just us. -## Answers to common questions +## Answers to common questions -**Why should I do this?** It’s an investment in your future. 
You will learn how to process data and communicate its content effectively and efficiently. You will have more fun. And you will be more valuable to current and future employers. +**Why should I do this?** It’s an investment in your future. You will learn how to process data and communicate its content effectively and efficiently. You will have more fun. And you will be more valuable to current and future employers. -**Can’t I do what I need in Excel?** Excel is a great program, but once you have a little programming experience it will remind you of doing arithmetic on your fingers. With Python, you will be able do routine tasks more efficiently (“[automate the boring stuff](https://automatetheboringstuff.com/),” as one guide suggests), handle larger data sets, rearrange datasets at will, and generally do things that spreadsheet programs can’t do. +**Can’t I do what I need in Excel?** Excel is a great program, but once you have a little programming experience it will remind you of doing arithmetic on your fingers. With Python, you will be able to do routine tasks more efficiently (“[automate the boring stuff](https://automatetheboringstuff.com/),” as one guide suggests), handle larger datasets, rearrange datasets at will, and generally do things that spreadsheet programs can’t do. -**What are the prerequisites?** There are none. We start at the very beginning and go from there.
What you will need is the **courage** to take on a challenge and the **patience** to debug [programs that don’t quite work](http://junkcharts.typepad.com/numbersruleyourworld/2015/06/the-day-after-the-half-day-in-the-life-of-a-data-scientist.html) -- and they never work the first time, and often not the second or third time either. Don't panic. Ask for help and remind yourself that patience is a virtue. -**What if my quant skills are weak or nonexistent?** Then this is the course for you! We do our best to make the material accessible. We’re looking beyond quants to marketing, management, and humanities majors. One of our design team was an English major. +**What if my quant skills are weak or nonexistent?** Then this is the course for you! We do our best to make the material accessible. We’re looking beyond quants to marketing, management, and humanities majors. One of our design team was an English major. -**Will this turn me into a programmer?** You will come out of the course somewhere between Brad Pitt and Jonah Hill in “[Moneyball](http://www.imdb.com/title/tt1210166/)," with a solid foundation for dealing with whatever data comes your way. You will not be ready for a career as a programmer, but you will be able to work effectively with people who know more and do things yourself that Excel users can only dream about. +**Will this turn me into a programmer?** You will come out of the course somewhere between Brad Pitt and Jonah Hill in “[Moneyball](http://www.imdb.com/title/tt1210166/)," with a solid foundation for dealing with whatever data comes your way. You will not be ready for a career as a programmer, but you will be able to work effectively with people who know more and do things yourself that Excel users can only dream about. -**Will this turn me into a data scientist?** Sadly, no. But you will have a solid foundation for pursuing the many technical topics that fall under the rubrics data science and machine learning. 
See, for example, the extensive collection of courses on **business analytics** and **data science** offered by our [IOMS](http://www.stern.nyu.edu/experience-stern/about/departments-centers-initiatives/academic-departments/ioms-dept/) and [CS](https://www.cs.nyu.edu/web/index.html) groups. +**Will this turn me into a data scientist?** Sadly, no. But you will have a solid foundation for pursuing the many technical topics that fall under the rubrics data science and machine learning. See, for example, the extensive collection of courses on **business analytics** and **data science** offered by our [IOMS](http://www.stern.nyu.edu/experience-stern/about/departments-centers-initiatives/academic-departments/ioms-dept/) and [CS](https://www.cs.nyu.edu/web/index.html) groups. -**Should I take this course if I already know how to code?** You’re welcome to, and you will learn a lot about data and the data components of Python. But please don’t scare the other students. +**Should I take this course if I already know how to code?** You’re welcome to, and you will learn a lot about data and the data components of Python. But please don’t scare the other students. -**Is there anything Python can't do?** Well, [it can't swallow a porcupine](http://www.telegraph.co.uk/news/worldnews/11697672/Python-chokes-to-death-after-eating-porcupine.html). Someone is working on pretty much everything else. +**Is there anything Python can't do?** Well, [it can't swallow a porcupine](http://www.telegraph.co.uk/news/worldnews/11697672/Python-chokes-to-death-after-eating-porcupine.html). Someone is working on pretty much everything else. - -## Why data? -We're living in a world of data: data about the economy, data about financial markets, data about your business. Data doesn't solve all our problems, but it's a valuable input to better decisions. For example, how to choose a [college major](http://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/). +## Why data? 
-Many of our former students tell us that data skills keep them in business. One of our alums analyzes television viewer data for a network. The datasets are too large for Excel, so he uses Python. Another manages attendence data for a major league baseball team. A third works for a quantitative hedge fund, where Python is the tool of choice. A fourth is worried that you won't need him after this course. Even students with non-technical backgrounds tell us that basic data and programming skills are, if not required, at least very useful in their jobs. One of our marketing majors, for example, needs to interface with her company's SQL database to get the data she needs to do her job. +We're living in a world of data: data about the economy, data about financial markets, data about your business. Data doesn't solve all our problems, but it's a valuable input to better decisions. For example, how to choose a [college major](http://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/). +Many of our former students tell us that data skills keep them in business. One of our alums analyzes television viewer data for a network. The datasets are too large for Excel, so he uses Python. Another manages attendance data for a major league baseball team. A third works for a quantitative hedge fund, where Python is the tool of choice. A fourth is worried that you won't need him after this course. Even students with non-technical backgrounds tell us that basic data and programming skills are, if not required, at least very useful in their jobs. One of our marketing majors, for example, needs to interface with her company's SQL database to get the data she needs to do her job. -## Why Python? -[Python][10] is a popular general-purpose programming language that has been used for a broad range of applications. Google uses it. So do Instagram and Netflix. Dropbox is written in Python.
+ +[Python][10] is a popular general-purpose programming language that has been used for a broad range of applications. Google uses it. So do Instagram and Netflix. Dropbox is written in Python. [10]: https://en.wikipedia.org/wiki/Python_(programming_language) -We think Python is the language of choice right now if you want a user-friendly introduction to programming and a useful tool for day-to-day data work. It's a high-level language, which means the language does a lot of the work. It has a broad range of applications and an enormous community of users. You'll come to appreciate both. And it's free and open source. Free means you pay nothing. Open source means you can look at the code if you want to see how something works. +We think Python is the language of choice right now if you want a user-friendly introduction to programming and a useful tool for day-to-day data work. It's a high-level language, which means the language does a lot of the work. It has a broad range of applications and an enormous community of users. You'll come to appreciate both. And it's free and open source. Free means you pay nothing. Open source means you can look at the code if you want to see how something works. - +That's our opinion anyway, but the larger point is that learning a programming language -- any language -- is better than not learning one. We'll use Python, but you could do similar things in R -- and many do. See also [this discussion](http://quant-econ.net/about_lectures.html#how-about-other-languages) from our friends Tom Sargent and John Stachurski. Or [this one](http://www.dataschool.io/python-or-r-for-data-science/) from Kevin Markham. Or this [exchange](http://www.quora.com/Which-is-better-for-data-analysis-R-or-Python) on Quora. +--> -## Everyone likes Python +## Everyone likes Python -Python isn't just a useful language, it's one people like. 
We're talking about programmers here, for the most part, but even us novices find that its casual simplicity makes coding fun. See, for example, this classic [xkcd cartoon](https://xkcd.com/353/). +Python isn't just a useful language, it's one people like. We're talking about programmers here, for the most part, but even us novices find that its casual simplicity makes coding fun. See, for example, this classic [xkcd cartoon](https://xkcd.com/353/). -Writer and programmer Paul Ford [puts it this way](http://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/): +Writer and programmer Paul Ford [puts it this way](http://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/): -> People love [Python] and want it to work everywhere and do everything. They’ve spent tens of thousands of hours making that possible and then given the fruit of their labor away. That’s a powerful indicator. A huge amount of effort has gone into making Python practical as well as pleasurable to use. +> People love [Python] and want it to work everywhere and do everything. They’ve spent tens of thousands of hours making that possible and then given the fruit of their labor away. That’s a powerful indicator. A huge amount of effort has gone into making Python practical as well as pleasurable to use. -He's alluding here to the vast community of users who are developing tools that allow Python to do all kinds of things. Python's data tools are an example: they're not part of the core langauage, they're add-ons developed by users. He adds: "Python people are pretty cool," so there's that, too. +He's alluding here to the vast community of users who are developing tools that allow Python to do all kinds of things. Python's data tools are an example: they're not part of the core language, they're add-ons developed by users. He adds: "Python people are pretty cool," so there's that, too. ## Work habits -There are no shortcuts in learning how to code. You simply need to spend hours doing it.
Progress will seem slow at first, but if you stick with it things will start to look familiar, and even make sense. You may even start to think of projects as fun, and revel in your new-found power over data. +There are no shortcuts in learning how to code. You simply need to spend hours doing it. Progress will seem slow at first, but if you stick with it things will start to look familiar, and even make sense. You may even start to think of projects as fun, and revel in your new-found power over data. -As you work your way up the learning curve, keep this advice in mind: +As you work your way up the learning curve, keep this advice in mind: -**Don't panic.** Learning a new language takes some time, it won't happen in a week. +**Don't panic.** Learning a new language takes some time, it won't happen in a week. -**Stick with it.** The secret is to keep working. Trust us, things will start to make sense in a couple weeks. Here's a [great example](https://medium.com/@meandvan/how-i-learned-to-stop-worrying-and-love-the-code-af1a809457c7). (How can you not love someone who writes: "How I learned to stop worrying and love the code"?) +**Stick with it.** The secret is to keep working. Trust us, things will start to make sense in a couple weeks. Here's a [great example](https://medium.com/@meandvan/how-i-learned-to-stop-worrying-and-love-the-code-af1a809457c7). (How can you not love someone who writes: "How I learned to stop worrying and love the code"?) -**Ask for help.** If you get stuck, ask for help -- from friends, from your Bootcamp classmates (post a problem), or from us (the teachers of the course). +**Ask for help.** If you get stuck, ask for help -- from friends, from your Bootcamp classmates (post a problem), or from us (the teachers of the course). -**Work on your [Google fu](http://english.stackexchange.com/questions/19967/what-does-google-fu-mean).** With a little help from Google, you will find that many of your questions have been asked before. 
Even better, they have been answered. One way to find them: Google something like "python [problem]." +**Work on your [Google fu](http://english.stackexchange.com/questions/19967/what-does-google-fu-mean).** With a little help from Google, you will find that many of your questions have been asked before. Even better, they have been answered. One way to find them: Google something like "python [problem]." - + -## Our approach +## Our approach -**Leap in.** We start quickly, which will seem like being dumped in a foreign country where you don't understand the language. We do that so we can get to the things that interest us: applications to data analysis. That means **the work load is heaviest at the start**. Don't panic, the pace will slow down after the first 4-6 chapters -- and you'll learn a lot in the meantime. +**Leap in.** We start quickly, which will seem like being dumped in a foreign country where you don't understand the language. We do that so we can get to the things that interest us: applications to data analysis. That means **the work load is heaviest at the start**. Don't panic, the pace will slow down after the first 4-6 chapters -- and you'll learn a lot in the meantime. -**Stress the basics, ignore the rest.** We think once you understand the basics, you'll be in a good position to work out special cases on your own. That allows us to strip out a bunch of confusing detail, which we think is good for everyone. +**Stress the basics, ignore the rest.** We think once you understand the basics, you'll be in a good position to work out special cases on your own. That allows us to strip out a bunch of confusing detail, which we think is good for everyone. -**Learn to teach yourself.** The best way to learn new things is to teach yourself: Google your problem and find the resources you need. We build that attitude in from the start, suggesting ways in which you can solve problems yourself. 
But remember: it also helps to be in a supportive environment, where you can ask for help when you need it. +**Learn to teach yourself.** The best way to learn new things is to teach yourself: Google your problem and find the resources you need. We build that attitude in from the start, suggesting ways in which you can solve problems yourself. But remember: it also helps to be in a supportive environment, where you can ask for help when you need it. - + -**Online book preferred.** We sometimes print out the pdf ourselves, but the online version comes with live links. We'll update it frequently as new ideas come to mind. We think it's a superior user experience. +**Online book preferred.** We sometimes print out the pdf ourselves, but the online version comes with live links. We'll update it frequently as new ideas come to mind. We think it's a superior user experience. ## Wordplay -Python is named for Monty Python, a group of comedians whose humor appeals to the tech crowd. Idle, a well-know Python editor, is a reference to Python-member Eric Idle. The [Python Package Index](https://pypi.python.org/pypi), a repository of Python packages, is commonly known as the [Cheese Shop](http://youtu.be/PPN3KTtrnZM), a reference to a famous Monty Python skit. The Anaconda distribution (next chapter) is a play on the word python. +Python is named for Monty Python, a group of comedians whose humor appeals to the tech crowd. Idle, a well-know Python editor, is a reference to Python-member Eric Idle. The [Python Package Index](https://pypi.python.org/pypi), a repository of Python packages, is commonly known as the [Cheese Shop](http://youtu.be/PPN3KTtrnZM), a reference to a famous Monty Python skit. The Anaconda distribution (next chapter) is a play on the word python. -## Resources +## Resources -The resources section at the end of each chapter is a collection of (mostly) links to things we've found useful. 
They're more than you need, but give you some recommended options if you want to follow up on a specific topic. +The resources section at the end of each chapter is a collection of (mostly) links to things we've found useful. They're more than you need, but give you some recommended options if you want to follow up on a specific topic. -Here we'll say simply that all of the materials for this book and the associated course are posted online: +Here we'll say simply that all of the materials for this book and the associated course are posted online: * Website. Everything is posted on our [class website](http://databootcamp.nyuecon.com/). -* Book. It's hosted by [GitBook](https://www.gitbook.com/book/davebackus/test/details). +* Book. It's hosted by [GitBook](https://www.gitbook.com/book/davebackus/test/details). * Code. We give links to the relevant code at the start of each chapter, but if you want them all, look in the [Code directory](https://github.com/DaveBackus/Data_Bootcamp/tree/master/Code) of the GitHub repo. If you save them, **remember to click on the Raw button** in the upper right. (This is an oddity of GitHub, which distinguishes between a display of the file and the file iself.) -* Other materials. Pretty much everything else is available on our [GitHub repository](https://github.com/DaveBackus/Data_Bootcamp). +* Other materials. Pretty much everything else is available on our [GitHub repository](https://github.com/DaveBackus/Data_Bootcamp). 
- +http://worrydream.com/EnlightenedImaginationForCitizens/ +--> diff --git a/more.md b/more.md index f089d7e..eae148c 100644 --- a/more.md +++ b/more.md @@ -1,56 +1,56 @@ -# More cool stuff +# More cool stuff --- **Overview.** -**Python tools.** +**Python tools.** -**Buzzwords.** +**Buzzwords.** -**Applications.** +**Applications.** --- ## Using APIs -https://www.codecademy.com/en/tracks/npr +https://www.codecademy.com/en/tracks/npr https://www.codecademy.com/en/tracks/nhtsa -## Maps +## Maps -http://maxberggren.github.io/2015/08/04/basemap/ -http://matplotlib.org/basemap/users/examples.html +http://maxberggren.github.io/2015/08/04/basemap/ +http://matplotlib.org/basemap/users/examples.html -http://geoffboeing.com/2014/08/visualizing-summer-travels-with-cartodb/ +http://geoffboeing.com/2014/08/visualizing-summer-travels-with-cartodb/ -## Animations +## Animations https://jakevdp.github.io/blog/2012/08/18/matplotlib-animation-tutorial/ http://www.christianmoscardi.com/blog/2015/08/12/embedding-d3-in-ipython-notebook.html -Savvas on NBA motion +Savvas on NBA motion -## Debugging +## Debugging http://blog.ionelmc.ro/2013/06/05/python-debugging-tools/ -## Dashboards +## Dashboards http://multithreaded.stitchfix.com/blog/2015/07/16/pyxley/ ## Scraping websites -Scrapy, Beautful Soup +Scrapy, Beautful Soup -http://pbpython.com/web-scraping-mn-budget.html +http://pbpython.com/web-scraping-mn-budget.html http://savvastjortjoglou.com/nba-shot-sharts.html -http://www.gregreda.com/tag/scraping.html +http://www.gregreda.com/tag/scraping.html -http://blog.webhose.io/2015/08/16/dead-simple-for-devs-python-crawler-script-for-extracting-structured-data-from-any-almost-website-into-csv/ +http://blog.webhose.io/2015/08/16/dead-simple-for-devs-python-crawler-script-for-extracting-structured-data-from-any-almost-website-into-csv/ https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Dashboard/Medicare-Drug-Spending/Drug_Spending_Dashboard.html @@ 
-58,23 +58,23 @@ https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-R http://www.r-bloggers.com/the-star-wars-grossing-war/ -## Bokeh +## Bokeh -http://rowanv.com/portfolio/oecd_unemployment/ -https://github.com/rowanv/giraffe_viz/blob/master/oecd_unemployment.py +http://rowanv.com/portfolio/oecd_unemployment/ +https://github.com/rowanv/giraffe_viz/blob/master/oecd_unemployment.py ## Dashboards -This uses flask: +This uses flask: http://dash.rowanv.com/ https://github.com/rowanv/giraffe_dash -## Natural language processing +## Natural language processing -Processing text... +Processing text... -http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb +http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb http://gaussiangeek.blogspot.com/2015/06/ever-since-i-heard-abbey-road-i.html @@ -89,22 +89,22 @@ http://ptrckprry.com/course/ssd/nltk-tutorial.pdf http://ptrckprry.com/ssd/ -## Fuzzy wuzzy +## Fuzzy wuzzy ## Large datasets From Itamar: I tried playing around with some of the examples below and others I found. -My key insight is that the data sets are extremely large (at least few GB each), and therefore the way to access the data is by running SQL-lite queries as part of the API request and load to memory only aggregated data. -As far as I remember we said SQL is not a focus of this course. For that reason I don't think this data set can be handy for us. -Let me know if you guys think otherwise. +My key insight is that the data sets are extremely large (at least few GB each), and therefore the way to access the data is by running SQL-lite queries as part of the API request and load to memory only aggregated data. +As far as I remember we said SQL is not a focus of this course. For that reason I don't think this data set can be handy for us. +Let me know if you guys think otherwise. 
On Thursday, November 19, 2015 at 4:21:05 PM UTC-5, David Backus wrote: -Data +Data https://data.cityofnewyork.us/ https://nycopendata.socrata.com/dashboard -Examples +Examples https://plot.ly/ipython-notebooks/big-data-analytics-with-pandas-and-sqlite/ http://iquantny.tumblr.com/ http://fivethirtyeight.com/features/uber-is-serving-new-yorks-outer-boroughs-more-than-taxis-are/ @@ -114,56 +114,56 @@ http://fivethirtyeight.com/features/how-data-made-me-a-believer-in-new-york-city Pokemon or Big Data? https://pixelastic.github.io/pokemonorbigdata/ -## Plot.ly +## Plot.ly -## Google App Engine +## Google App Engine -## Other coding enviroments +## Other coding enviroments -There are lots of coding environments out there. Spyder is the easiest, in our view, but there are lots of choices. +There are lots of coding environments out there. Spyder is the easiest, in our view, but there are lots of choices. -* Dave uses **Spyder** because he likes its old school Matlab look and feel. -* Chase and Spencer use **[Sublime Text](http://www.sublimetext.com/)**, an editor that can be customized to do almost anything. Paul does the same with **[Vim](http://www.vim.org/)**. Both are text editors only. You run Python from the command line, which is even more old school. -* Lots of people recommend **[Pycharm](https://www.jetbrains.com/pycharm/download/)**. Dave thinks this is the tool of choice for someone who wants to go to the next level: slightly harder than Spyder to get going, but way more powerful once you do. Among other things, it looks really cool. +* Dave uses **Spyder** because he likes its old school Matlab look and feel. +* Chase and Spencer use **[Sublime Text](http://www.sublimetext.com/)**, an editor that can be customized to do almost anything. Paul does the same with **[Vim](http://www.vim.org/)**. Both are text editors only. You run Python from the command line, which is even more old school. 
+* Lots of people recommend **[Pycharm](https://www.jetbrains.com/pycharm/download/)**. Dave thinks this is the tool of choice for someone who wants to go to the next level: slightly harder than Spyder to get going, but way more powerful once you do. Among other things, it looks really cool. -Here are [two](https://wiki.python.org/moin/IntegratedDevelopmentEnvironments) [lists](https://www.reddit.com/r/Python/comments/1keync/best_free_python_ide/) if you'd like to get a sense of what's out there and what others think about it. +Here are [two](https://wiki.python.org/moin/IntegratedDevelopmentEnvironments) [lists](https://www.reddit.com/r/Python/comments/1keync/best_free_python_ide/) if you'd like to get a sense of what's out there and what others think about it. ## R -**Install R.** If you decide you'd like to try R some time, [choose a "mirror"](https://cran.r-project.org/mirrors.html) and download the appropriate version. We recommend you run it in [RStudio](https://www.rstudio.com/products/rstudio/download/), a popular coding environment. Both are free. Once you've installed them, start up RStudio and it will access R as needed. +**Install R.** If you decide you'd like to try R some time, [choose a "mirror"](https://cran.r-project.org/mirrors.html) and download the appropriate version. We recommend you run it in [RStudio](https://www.rstudio.com/products/rstudio/download/), a popular coding environment. Both are free. Once you've installed them, start up RStudio and it will access R as needed. Things to check out 1. Conda. This could have promise: Continuum support of R. Two things: -* You can install from conda, which you have if you installed Anaconda. +* You can install from conda, which you have if you installed Anaconda. -* You can run in a Jupyter notebook, as Brian mentioned. +* You can run in a Jupyter notebook, as Brian mentioned. -The bad news is that it doesn't seem to have RStudio, which I like. 
+The bad news is that it doesn't seem to have RStudio, which I like. More at https://www.continuum.io/blog/developer/jupyter-and-conda-r -2. List of online resources. +2. List of online resources. http://www.r-bloggers.com/learning-r-index-of-online-r-courses-october-2015/ -3. Princeton's intro. +3. Princeton's intro. http://data.princeton.edu/R/ -**Learn R.** If you want to learn how to program in R, there's lots of good stuff online -- too much, really. We like these: +**Learn R.** If you want to learn how to program in R, there's lots of good stuff online -- too much, really. We like these: -* [Try R](http://tryr.codeschool.com/) is like Codecademy, you run code online. -* Princeton: http://data.princeton.edu/R/ +* [Try R](http://tryr.codeschool.com/) is like Codecademy, you run code online. +* Princeton: http://data.princeton.edu/R/ * List http://www.r-bloggers.com/learning-r-index-of-online-r-courses-october-2015/ -* Kelly Black has a [tutorial](http://www.cyclismo.org/tutorial/R/) that covers more advanced topics, including introductory statistics. -* [Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) combines R programming with an introduction to modern statistics and machine learning. +* Kelly Black has a [tutorial](http://www.cyclismo.org/tutorial/R/) that covers more advanced topics, including introductory statistics. +* [Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) combines R programming with an introduction to modern statistics and machine learning. -We also like the blog aggregator **[R-bloggers](http://www.r-bloggers.com/)**, which is filled with applications, including code. +We also like the blog aggregator **[R-bloggers](http://www.r-bloggers.com/)**, which is filled with applications, including code. 
-## Other languages +## Other languages http://www.curiousefficiency.org/posts/2015/10/languages-to-improve-your-python.html @@ -171,7 +171,7 @@ http://www.curiousefficiency.org/posts/2015/10/languages-to-improve-your-python. https://cloud.google.com/datalab/ -Also Wakari, AWS... +Also Wakari, AWS... -## SQLite +## SQLite diff --git a/pandas-input.md b/pandas-input.md index 835b3ed..c9612f7 100644 --- a/pandas-input.md +++ b/pandas-input.md @@ -1,104 +1,104 @@ -# Data input: Packages and Pandas +# Data input: Packages and Pandas --- -**Overview.** We introduce "packages" -- collections of tools that extend Python's capabilities -- and explore one of them: Pandas, the Python package devoted to data management. We use Pandas to read spreadsheet data into Python and describe the "dataframe" this produces. +**Overview.** We introduce "packages" -- collections of tools that extend Python's capabilities -- and explore one of them: Pandas, the Python package devoted to data management. We use Pandas to read spreadsheet data into Python and describe the "dataframe" this produces. -**Python tools.** Import, Pandas. +**Python tools.** Import, Pandas. -**Buzzwords.** Package, csv file, dataframe, series, index, API. +**Buzzwords.** Package, csv file, dataframe, series, index, API. -**Applications.** Income and output of countries, government debt, income by college major, old people, equity returns, George Clooney's movie roles. +**Applications.** Income and output of countries, government debt, income by college major, old people, equity returns, George Clooney's movie roles. -**Code.** [Link](https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/Python/bootcamp_pandas_1.py). +**Code.** [Link](https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/Python/bootcamp_pandas_1.py). --- -We're ready now to look at some data. Lots of data. You will need an **internet connection** for much of it. +We're ready now to look at some data. Lots of data. 
You will need an **internet connection** for much of it. -You may recall that our typical program consists of data input, data management, and graphics. We'll spend most of our time here on the first -- data input -- but touch briefly on the second and third. More concretely, we explain how to get spreadsheet data into Python. Along the way we describe how Python uses collections of tools or plug-ins (**packages**) to address a wide range of applications: data management (**Pandas**), graphics (Matplotlib), and many other things. +You may recall that our typical program consists of data input, data management, and graphics. We'll spend most of our time here on the first -- data input -- but touch briefly on the second and third. More concretely, we explain how to get spreadsheet data into Python. Along the way we describe how Python uses collections of tools or plug-ins (**packages**) to address a wide range of applications: data management (**Pandas**), graphics (Matplotlib), and many other things. ## Reminders -* Objects and methods. Recall that we apply the method `justdoit` to the object `x` with `x.justdoit`. +* Objects and methods. Recall that we apply the method `justdoit` to the object `x` with `x.justdoit`. -* Help. We get help in Spyder from both the IPython console and the Object inspector. For the hypothetical `x.justdoit`, we would type `x.justdoit?` in the IPython console or `x.justdoit` in the Object inspector. +* Help. We get help in Spyder from both the IPython console and the Object inspector. For the hypothetical `x.justdoit`, we would type `x.justdoit?` in the IPython console or `x.justdoit` in the Object inspector. -* Data structures. That's the term we use for specific organizations of data. Examples are strings, lists, and dictionaries. Each has a specific structure and a set of methods we can apply. List are collections of objects between square brackets: `numberlist = [1, -5, 2]`. 
Dictionaries are pairs of items between curly brackets: `namedict = {'Dave': 'Backus', 'Chase': 'Coleman'}`. The first item in each pair is the "key," the second is the "value."" +* Data structures. That's the term we use for specific organizations of data. Examples are strings, lists, and dictionaries. Each has a specific structure and a set of methods we can apply. List are collections of objects between square brackets: `numberlist = [1, -5, 2]`. Dictionaries are pairs of items between curly brackets: `namedict = {'Dave': 'Backus', 'Chase': 'Coleman'}`. The first item in each pair is the "key," the second is the "value."" -* Integers, floats, and strings. Three common types of data. +* Integers, floats, and strings. Three common types of data. -* Function returns. We refer to the output of a function as its **return**. We would say, for example, that the function `type(x)` returns the type of the input object `x`. We capture the return with an assignment: `xtype = type(x)`. +* Function returns. We refer to the output of a function as its **return**. We would say, for example, that the function `type(x)` returns the type of the input object `x`. We capture the return with an assignment: `xtype = type(x)`. ## Python packages -Python is not just a programming language, it's an open source collection of tools that includes both core Python and a large collection of packages written by different people. The word "package" here refers to plug-ins or extensions that expand Python's capabilities. Terminology varies. What we call a package others sometimes call a "library." The term "module" typically refers to a subset of core Python or one of its packages. You won't go far wrong to use the terms interchangeably. +Python is not just a programming language, it's an open source collection of tools that includes both core Python and a large collection of packages written by different people. 
The word "package" here refers to plug-ins or extensions that expand Python's capabilities. Terminology varies. What we call a package others sometimes call a "library." The term "module" typically refers to a subset of core Python or one of its packages. You won't go far wrong to use the terms interchangeably. -The standard Python packages are well written, well documented, and well supported. They have armies of users who spot and correct problems. Some of the others less so. We try to stick to the standard packages, specifically those that come with the Anaconda distribution. +The standard Python packages are well written, well documented, and well supported. They have armies of users who spot and correct problems. Some of the others less so. We try to stick to the standard packages, specifically those that come with the Anaconda distribution. -Some of the leading packages for numerical ("scientific") computation are +Some of the leading packages for numerical ("scientific") computation are -* **[Pandas](http://pandas.pydata.org/).** The leading package for managing data and our focus in this chapter. +* **[Pandas](http://pandas.pydata.org/).** The leading package for managing data and our focus in this chapter. -* **[Matplotlib](http://matplotlib.org/).** The leading graphics package. We'll use it extensively. +* **[Matplotlib](http://matplotlib.org/).** The leading graphics package. We'll use it extensively. -* **[NumPy](http://www.numpy.org/).** Tools for numerical computing. In Excel the basic unit is a cell, a single number. In NumPy the basic unit is a vector (a column) or matrix (a table or worksheet), which allows us to do things with an entire column or table in one line. This facility carries over to Pandas. +* **[NumPy](http://www.numpy.org/).** Tools for numerical computing. In Excel the basic unit is a cell, a single number. 
In NumPy the basic unit is a vector (a column) or matrix (a table or worksheet), which allows us to do things with an entire column or table in one line. This facility carries over to Pandas. -All of these packages come with the [Anaconda distribution](http://docs.continuum.io/anaconda/pkg-docs.html), which means we already have them installed and ready to use. +All of these packages come with the [Anaconda distribution](http://docs.continuum.io/anaconda/pkg-docs.html), which means we already have them installed and ready to use. -Pandas is an essential part of data work in Python. Its [authors describe it](http://pandas.pydata.org/) as "an open source library for high-performance, easy-to-use data structures and data analysis tools in Python." That's a mouthful. Suffice it to say that we can do pretty much everything in Pandas that we can do in Excel -- and more. We can compute sums of rows and columns, generate new rows or columns, construct pivot tables, and lots of other things. And we can do all this with much larger files than Excel can handle. +Pandas is an essential part of data work in Python. Its [authors describe it](http://pandas.pydata.org/) as "an open source library for high-performance, easy-to-use data structures and data analysis tools in Python." That's a mouthful. Suffice it to say that we can do pretty much everything in Pandas that we can do in Excel -- and more. We can compute sums of rows and columns, generate new rows or columns, construct pivot tables, and lots of other things. And we can do all this with much larger files than Excel can handle. - +We won't use them, but they're in Anaconda, too. Feel free to give them a try. If you do, please report back on your experience. +--> -## Importing packages +## Importing packages -In Python we need to tell our program which packages we plan to use. We do that with an `import` statement. +In Python we need to tell our program which packages we plan to use. We do that with an `import` statement. 
-Here are some examples applied to a mythical package `xyz` and mythical function `foo`: +Here are some examples applied to a mythical package `xyz` and mythical function `foo`: -* `import xyz as x`. This imports the package `xyz` under the abbreviation `x`. A function `foo` in package `xyz` is then executed with `x.foo`. This is the most common syntax and the one we'll generally use. With Pandas, for example, the standard import statement is `import pandas as pd`. +* `import xyz as x`. This imports the package `xyz` under the abbreviation `x`. A function `foo` in package `xyz` is then executed with `x.foo`. This is the most common syntax and the one we'll generally use. With Pandas, for example, the standard import statement is `import pandas as pd`. -* `import xyz`. This imports the whole thing as well, but here the function `foo` is executed with the more verbose `xyz.foo`. +* `import xyz`. This imports the whole thing as well, but here the function `foo` is executed with the more verbose `xyz.foo`. -* `from xyz import *`. This imports all the functions and methods from the package `xyz`, but here the function `foo` is executed simply by typing `foo`. We don't usually do this, because it opens up the possibility that the same function exists in more than one package, which is virtually guaranteed to create confusion. If `foo` is the only function we care about, we can use `from xyz import foo` instead. +* `from xyz import *`. This imports all the functions and methods from the package `xyz`, but here the function `foo` is executed simply by typing `foo`. We don't usually do this, because it opens up the possibility that the same function exists in more than one package, which is virtually guaranteed to create confusion. If `foo` is the only function we care about, we can use `from xyz import foo` instead. 
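Here's a minimal sketch of the three import styles in action. Since `xyz` and `foo` are mythical, we substitute the standard `math` module and its `sqrt` function. That substitution is our assumption, chosen only because it runs on any Python installation.

```python
# `math` stands in for the mythical package `xyz`, and `sqrt` for `foo`

import math as m           # style 1: import under an abbreviation
root1 = m.sqrt(16)         # the function is called as m.sqrt

import math                # style 2: import under the full name
root2 = math.sqrt(16)      # the function is called as math.sqrt

from math import sqrt      # style 3: import one function by name
root3 = sqrt(16)           # the function is called simply as sqrt

print(root1, root2, root3) # all three calls reach the same function
```

All three variables hold the same number, because each style is just a different route to the same function. The only thing that changes is how much we have to type and how easy it is to tell where `sqrt` came from.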
We'll see these examples repeatedly: -```python +```python import pandas as pd # data package -import matplotlib.pyplot as plt # graphics package +import matplotlib.pyplot as plt # graphics package ``` -You might also go through earlier chapters and identify the `import` statements you find. By convention, they are placed at the top of the program. What packages or modules have we used? What do they do? +You might also go through earlier chapters and identify the `import` statements you find. By convention, they are placed at the top of the program. What packages or modules have we used? What do they do? -Some fine points: +Some fine points: -* Redundancy. What happens if we issue an import statement twice? Answer: Nothing, no harm done. +* Redundancy. What happens if we issue an import statement twice? Answer: Nothing, no harm done. -* Jokes. These are programmer jokes, which might be a contradiction in terms, but try them and see what happens: +* Jokes. These are programmer jokes, which might be a contradiction in terms, but try them and see what happens: - ```python + ```python import this - import antigravity + import antigravity ``` * Versions. We can check the version number of a package with `package_name.__version__`. To check the version of Pandas, try @@ -107,50 +107,50 @@ Some fine points: import pandas as pd print('Pandas version ', pd.__version__) # these are double underscores ``` - This can be helpful if we're trying to track down an error. + This can be helpful if we're trying to track down an error. -**Exercise.** Import Pandas. What version do you have? +**Exercise.** Import Pandas. What version do you have? -**Exercise.** What happens if we import Pandas twice under different names, once with `import pandas as pd` and once with `import pandas as pa`? Write a short program that tests your conjecture. *Hint:* Use what have we done with Pandas so far. 
+**Exercise.** What happens if we import Pandas twice under different names, once with `import pandas as pd` and once with `import pandas as pa`? Write a short program that tests your conjecture. *Hint:* Use what have we done with Pandas so far. -## Data input 1: reading internet files +## Data input 1: reading internet files -The easiest way to get data into a Python program is to read it from a file -- a spreadsheet file, for example. The word "read" here means take what's in the file and somehow get it into Python so we can do things with it. Pandas can read lots of kinds of files: csv, xls, xlsx, and so on. The files can be on our computer or on the internet. We'll start with the internet -- there's less ambiguity about the location of the file -- but the same approach works with files on your computer. +The easiest way to get data into a Python program is to read it from a file -- a spreadsheet file, for example. The word "read" here means take what's in the file and somehow get it into Python so we can do things with it. Pandas can read lots of kinds of files: csv, xls, xlsx, and so on. The files can be on our computer or on the internet. We'll start with the internet -- there's less ambiguity about the location of the file -- but the same approach works with files on your computer. -We prefer **csv files** ("comma separated values"), a common data format for serious data people. Their simple structure (entries separated by commas) allows easy and rapid input. They also avoid some of [the problems](http://www.win-vector.com/blog/2014/11/excel-spreadsheets-are-hard-to-get-right/) with translating Excel files. If we have an Excel spreadsheet, we can always save it as a "CSV (Comma delimited) (*.csv)" file. Excel will warn us that some features are incompatible with the csv format, but we're generally happy to do it anyway. 
Here's an example of a [raw csv file](https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/Python/test.csv) (pretty basic, eh?) and [this is](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/Python/test.csv) (roughly) how it's displayed in Excel. +We prefer **csv files** ("comma separated values"), a common data format for serious data people. Their simple structure (entries separated by commas) allows easy and rapid input. They also avoid some of [the problems](http://www.win-vector.com/blog/2014/11/excel-spreadsheets-are-hard-to-get-right/) with translating Excel files. If we have an Excel spreadsheet, we can always save it as a "CSV (Comma delimited) (*.csv)" file. Excel will warn us that some features are incompatible with the csv format, but we're generally happy to do it anyway. Here's an example of a [raw csv file](https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/Python/test.csv) (pretty basic, eh?) and [this is](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/Python/test.csv) (roughly) how it's displayed in Excel. **Reading csv files.** It's easy to read csv files with Pandas. We'll read one from our GitHub repository to show how it works. We like to read data from internet sources like this, especially when the data is automatically updated at the source. That's not the case here, but we'll see how easy it is to get this kind of data into Python. 
We read the cleverly-named `test.csv` with the equally clever `read_csv` function in Pandas: -```python -import pandas as pd +```python +import pandas as pd url1 = 'https://raw.githubusercontent.com/DaveBackus' url2 = '/Data_Bootcamp/master/Code/Python/test.csv' -url = url1 + url2 # location of file -df = pd.read_csv(url) # read file and assign it to df +url = url1 + url2 # location of file +df = pd.read_csv(url) # read file and assign it to df ``` -The syntax works like this: +The syntax works like this: -* `url` is a string that tells Python where to look for the file. We break it in two because it's too long to fit on one line. -* `read_csv` is a Pandas function that reads csv files. The `pd.` before it tells Python it's a Pandas function; we established the `pd` abbreviation in the `import` statement. -* The `df` on the left makes this an assignment: We assign what we read to the variable `df`. +* `url` is a string that tells Python where to look for the file. We break it in two because it's too long to fit on one line. +* `read_csv` is a Pandas function that reads csv files. The `pd.` before it tells Python it's a Pandas function; we established the `pd` abbreviation in the `import` statement. +* The `df` on the left makes this an assignment: We assign what we read to the variable `df`. -**Digression.** We won't do this often, but if the internet is down, or we want a simple example to experiment with, we can create data like this from a dictionary. Each pair in this dictionary consists of a variable name (the "key") and a list containing data (the "value"). This code reproduces the output of the `read_csv` function above: +**Digression.** We won't do this often, but if the internet is down, or we want a simple example to experiment with, we can create data like this from a dictionary. Each pair in this dictionary consists of a variable name (the "key") and a list containing data (the "value"). 
This code reproduces the output of the `read_csv` function above: -```python -df = pd.DataFrame({'name': ['Dave', 'Chase', 'Spencer'], - 'x1': [1, 4, 5], 'x2': [2, 3, 6], 'x3': [3.5, 4.3, 7.8]}) +```python +df = pd.DataFrame({'name': ['Dave', 'Chase', 'Spencer'], + 'x1': [1, 4, 5], 'x2': [2, 3, 6], 'x3': [3.5, 4.3, 7.8]}) ``` -This constructs a dataframe from a dictionary. In the dictionary, each "key" is a variable name expressed as a string and each "value" is a list that produces a column of data. **End digression.** +This constructs a dataframe from a dictionary. In the dictionary, each "key" is a variable name expressed as a string and each "value" is a list that produces a column of data. **End digression.** -So what does our read statement give us? What's in `df`? We can check its contents by adding the statement `print('\n', df)`. (The `'\n'` tells the print function to start printing on a new line, which makes the output look better.) The result is +So what does our read statement give us? What's in `df`? We can check its contents by adding the statement `print('\n', df)`. (The `'\n'` tells the print function to start printing on a new line, which makes the output look better.) The result is ```python name x1 x2 x3 @@ -159,34 +159,34 @@ So what does our read statement give us? What's in `df`? We can check its cont 2 Spencer 5 6 7 ``` -What we have is a table, much like what we'd see in a spreadsheet. If we compare it to the [source](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/Python/test.csv) we see that the first column is new, added somehow by the program, but the others are just as they look in the source. +What we have is a table, much like what we'd see in a spreadsheet. If we compare it to the [source](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/Python/test.csv) we see that the first column is new, added somehow by the program, but the others are just as they look in the source. 
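If you'd like to see the csv route and the dictionary route side by side without an internet connection, here is a minimal sketch. The text in `csv_text` is our assumption about what a file like `test.csv` contains; we chose it to mirror the dictionary in the digression, so treat it as a stand-in and not as the actual file. `io.StringIO` (not used elsewhere in this chapter) lets `read_csv` treat a string as if it were a file.

```python
import io
import pandas as pd

# Assumed stand-in for the contents of test.csv (mirrors the dictionary above)
csv_text = """name,x1,x2,x3
Dave,1,2,3.5
Chase,4,3,4.3
Spencer,5,6,7.8
"""

# read_csv accepts a file-like object as well as a url or path
df_csv = pd.read_csv(io.StringIO(csv_text))

# the same data built from a dictionary of lists
df_dict = pd.DataFrame({'name': ['Dave', 'Chase', 'Spencer'],
                        'x1': [1, 4, 5], 'x2': [2, 3, 6], 'x3': [3.5, 4.3, 7.8]})

print('\n', df_csv)
print('\nSame contents?', df_csv.equals(df_dict))
```

In both cases Pandas adds the counter in the first column on its own; it is not part of the data we supplied.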
-The documentation for `read_csv` in the Object inspector gives us an overwhelming amount of information. Starting at the bottom, we see that it returns a **dataframe**. More on that shortly. We also see a long list of optional inputs that change how we read the file. We will ignore them unless forced to do otherwise. +The documentation for `read_csv` in the Object inspector gives us an overwhelming amount of information. Starting at the bottom, we see that it returns a **dataframe**. More on that shortly. We also see a long list of optional inputs that change how we read the file. We will ignore them unless forced to do otherwise. -**Exercise.** Run the code +**Exercise.** Run the code -```python +```python url1 = 'https://raw.githubusercontent.com/DaveBackus' -url2 = '/Data_Bootcamp/master/Code/Python/test0.csv' # note the added 0 +url2 = '/Data_Bootcamp/master/Code/Python/test0.csv' # note the added 0 url = url1 + url2 df = pd.read_csv(url) ``` -What happens? +What happens? -**Exercise.** Change the last line of the earlier code to +**Exercise.** Change the last line of the earlier code to ```python dfalt = pd.read_csv(url, nrows=2) -print('\n', dfalt) +print('\n', dfalt) ``` -What does the argument `nrows=2` do? +What does the argument `nrows=2` do? -**Example.** We can identify specific values in a csv file as missing or NA (not available). We see in the documentation that the parameter `na_values` takes a list of strings as input. To treat the number 1 as missing we change the read statement to `read_csv(url, na_values=[1])`. The result is +**Example.** We can identify specific values in a csv file as missing or NA (not available). We see in the documentation that the parameter `na_values` takes a list of strings as input. To treat the number 1 as missing we change the read statement to `read_csv(url, na_values=[1])`. The result is ```python name x1 x2 x3 @@ -195,51 +195,51 @@ What does the argument `nrows=2` do? 
2 Spencer 5 6 7.8 ``` -We see that the number 1 that was formerly at the top of the `x1` column has been replaced by `NaN` -- "not a number". +We see that the number 1 that was formerly at the top of the `x1` column has been replaced by `NaN` -- "not a number". -**Exercise.** Adapt the code to treat the numbers 1 and 6 as missing. +**Exercise.** Adapt the code to treat the numbers 1 and 6 as missing. -**Reading Excel files.** We can also read Excel files (xls and xlsx) with Pandas: use the `read_excel` function. The syntax is almost identical: +**Reading Excel files.** We can also read Excel files (xls and xlsx) with Pandas: use the `read_excel` function. The syntax is almost identical: -```python -import pandas as pd +```python +import pandas as pd url1 = 'https://raw.githubusercontent.com/DaveBackus' url2 = '/Data_Bootcamp/master/Code/Python/test.xls' url = url1 + url2 dfx = pd.read_excel(url) -print('\n', dfx) +print('\n', dfx) ``` -If all goes well, the modified code produces a dataframe `dfx` that's identical to `df`. +If all goes well, the modified code produces a dataframe `dfx` that's identical to `df`. -**Exercise.** Change the file extension at the end of `url2` from `.xls` to `.xlsx`. What does the new code produce? +**Exercise.** Change the file extension at the end of `url2` from `.xls` to `.xlsx`. What does the new code produce? -## Properties of dataframes +## Properties of dataframes -Ok, so we read a file and assigned its contents to `df`. But what is `df`? What kinds of things can we do with it? +Ok, so we read a file and assigned its contents to `df`. But what is `df`? What kinds of things can we do with it? -We start by finding its type. We use the handy `type()` function and enter `type(df)` in the IPython console. It responds: `pandas.core.frame.DataFrame`. More simply, it's a **dataframe**. A dataframe is another example of a data structure, like a list or a dictionary, that organizes data in a specific way. 
A dataframe has three components: a table of data, column labels, and row labels. +We start by finding its type. We use the handy `type()` function and enter `type(df)` in the IPython console. It responds: `pandas.core.frame.DataFrame`. More simply, it's a **dataframe**. A dataframe is another example of a data structure, like a list or a dictionary, that organizes data in a specific way. A dataframe has three components: a table of data, column labels, and row labels. -Typically columns are variables and the column labels give us their names. In our example, the second column has the name `x1` and its values follow below it. The rows are then observations, and the row labels give us their names. This is a standard setup and we'll do our best to conform to it. If the data come in some other form, we'll try to convert it. +Typically columns are variables and the column labels give us their names. In our example, the second column has the name `x1` and its values follow below it. The rows are then observations, and the row labels give us their names. This is a standard setup and we'll do our best to conform to it. If the data come in some other form, we'll try to convert it. -**Dimensions.** We access a dataframe's dimensions -- the numbers of rows and columns -- with the `shape` method: `df.shape`. Here the answer is `(3,4)`, so we have 3 rows (observations) and 4 columns (variables). +**Dimensions.** We access a dataframe's dimensions -- the numbers of rows and columns -- with the `shape` method: `df.shape`. Here the answer is `(3,4)`, so we have 3 rows (observations) and 4 columns (variables). -**Columns and indexes.** We access the column and row labels directly. For the dataframe `df` we read in earlier, we extract column labels with the `columns` method: `df.columns`. That gives us the verbose output `Index(['name', 'x1', 'x2', 'x3'], dtype='object')`. If we prefer to have them as a list, we use `list(df)`. 
That gives us the column names as a list: `['name', 'x1', 'x2', 'x3']`. If we check the [source](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/Python/test.csv), we see that the column labels come from the first row of the file. +**Columns and indexes.** We access the column and row labels directly. For the dataframe `df` we read in earlier, we extract column labels with the `columns` method: `df.columns`. That gives us the verbose output `Index(['name', 'x1', 'x2', 'x3'], dtype='object')`. If we prefer to have them as a list, we use `list(df)`. That gives us the column names as a list: `['name', 'x1', 'x2', 'x3']`. If we check the [source](https://github.com/DaveBackus/Data_Bootcamp/blob/master/Code/Python/test.csv), we see that the column labels come from the first row of the file. -The row labels are referred to as the **index**. We extract them with the `index` method: `df.index`. That gives us the verbose output `Int64Index([0, 1, 2], dtype='int64')`. We can convert it to a list by adding another method, `df.index.tolist()`, which gives us `[0, 1, 2]`. (Cool! Two methods strung together!) In this case, the index is not part of the original file; Pandas inserted a counter. As usual in Python, the counter starts at zero. +The row labels are referred to as the **index**. We extract them with the `index` method: `df.index`. That gives us the verbose output `Int64Index([0, 1, 2], dtype='int64')`. We can convert it to a list by adding another method, `df.index.tolist()`, which gives us `[0, 1, 2]`. (Cool! Two methods strung together!) In this case, the index is not part of the original file; Pandas inserted a counter. As usual in Python, the counter starts at zero. -**Exercise.** What does `df.columns.tolist()` do? How does it compare to `list(df)`? +**Exercise.** What does `df.columns.tolist()` do? How does it compare to `list(df)`? 
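The properties described above can be collected in one short program. This is a sketch that rebuilds `df` from the dictionary in the earlier digression, so it runs without reading a file. Note that recent versions of Pandas typically describe the default index as a `RangeIndex` and not an `Int64Index`; the list of labels is the same either way.

```python
import pandas as pd

# rebuild the example dataframe from the dictionary used earlier
df = pd.DataFrame({'name': ['Dave', 'Chase', 'Spencer'],
                   'x1': [1, 4, 5], 'x2': [2, 3, 6], 'x3': [3.5, 4.3, 7.8]})

print(type(df))             # the dataframe type
print(df.shape)             # (3, 4): 3 rows, 4 columns
print(df.columns)           # column labels as an Index object
print(list(df))             # ['name', 'x1', 'x2', 'x3']
print(df.index)             # row labels; exact display depends on the Pandas version
print(df.index.tolist())    # [0, 1, 2]
```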
**Column data types.** Pandas allows every column (typically a variable) to have a different data type, but the type must be the same within a column. With our dataframe `df`, we get the types with the `dtypes` method; that is, with `df.dtypes`: @@ -251,16 +251,16 @@ x2 int64 x3 float64 ``` -Evidently `x1` and `x2` are integers and `x3` is a float. They're no different from the types of numbers we came across in the previous chapter. The first column, `name`, is different. Object is the name Pandas gives to things it can't turn into numbers -- in our case, strings. Sometimes, as here, that makes sense: names like `Dave` and `Spencer` are naturally strings. But in many cases we've run across, numbers are given the dtype object because there was something in the data that didn't look like a number. We'll see more of that later on. +Evidently `x1` and `x2` are integers and `x3` is a float. They're no different from the types of numbers we came across in the previous chapter. The first column, `name`, is different. Object is the name Pandas gives to things it can't turn into numbers -- in our case, strings. Sometimes, as here, that makes sense: names like `Dave` and `Spencer` are naturally strings. But in many cases we've run across, numbers are given the dtype object because there was something in the data that didn't look like a number. We'll see more of that later on. -**Transpose columns and rows.** If we want to rotate the dataframe, exchanging columns and rows, we use the `transpose` method: `df.transpose` or (more succinctly) `df.T`. +**Transpose columns and rows.** If we want to rotate the dataframe, exchanging columns and rows, we use the `transpose` method: `df.transpose()` or (more succinctly) `df.T`.
Let's do that with the dataframe `df` we read in earlier: ```python dft = df.T print('\n', dft) ``` -The result is +The result is ``` 0 1 2 @@ -270,26 +270,26 @@ x2 2 3 6 x3 3.5 4.3 7.8 ``` -In this case it doesn't make much sense, but in others we'll find it helpful. +In this case it doesn't make much sense, but in others we'll find it helpful. ## Working with variables -So we have a dataframe `df` whose columns are variables. One of the great things about Pandas is that we can do things with every observation of a variable in one statement. +So we have a dataframe `df` whose columns are variables. One of the great things about Pandas is that we can do things with every observation of a variable in one statement. -**Variables = series.** If we want to refer to the variable `x1`, we write `df['x1']`. If we ask what type this is, with +**Variables = series.** If we want to refer to the variable `x1`, we write `df['x1']`. If we ask what type this is, with ```python print(type(df['x1'])) ``` -we find that it's a `pandas.core.series.Series` -- **series** for short. A series is essentially a dataframe with a single variable or column, which simplifies the bookkeeping a bit. +we find that it's a `pandas.core.series.Series` -- **series** for short. A series is essentially a dataframe with a single variable or column, which simplifies the bookkeeping a bit. -**Extracting a list of variables.** We just saw that `df['x1']` "extracts" the variable/series `x1` from the dataframe `df`. In other cases, we may want to extract a set of variables and create a smaller dataframe. This happens a lot when the data we read has more variables than we need. +**Extracting a list of variables.** We just saw that `df['x1']` "extracts" the variable/series `x1` from the dataframe `df`. In other cases, we may want to extract a set of variables and create a smaller dataframe. This happens a lot when the data we read has more variables than we need. -We can extract variables by name or number. 
If by number, we count (as usual) starting with zero. This code gives us two ways to extract `x1` and `x3` from `df`: +We can extract variables by name or number. If by number, we count (as usual) starting with zero. This code gives us two ways to extract `x1` and `x3` from `df`: ```python namelist = ['x1', 'x3'] @@ -298,66 +298,66 @@ df_v1 = df[namelist] df_v2 = df[numlist] ``` -**Exercise.** Run this code and verify that the two dataframes are the same. Verify that the statement `df_v3 = df[[0,2]]` does the same. +**Exercise.** Run this code and verify that the two dataframes are the same. Verify that the statement `df_v3 = df[[0,2]]` does the same. **Exercise.** How would you extract the first two variables, `x1` and `x2`? -**Constructing new variables from old ones.** Now that we know how to refer to a variable, we can construct others from them. We construct two with +**Constructing new variables from old ones.** Now that we know how to refer to a variable, we can construct others from them. We construct two with -```python +```python df['y1'] = df['x1']/df['x2'] df['y2'] = df['x2'] + df['x3'] ``` -The first line computes the new variable `y1` as the ratio of `x1` to `x2`. The second computes `y2` as the sum of `x2` and `x3`. +The first line computes the new variable `y1` as the ratio of `x1` to `x2`. The second computes `y2` as the sum of `x2` and `x3`. -The dataframe now includes both of these variables. The statement `print('\n', df)` gives us +The dataframe now includes both of these variables. The statement `print('\n', df)` gives us -```python +```python name x1 x2 x3 y1 y2 0 Dave 1 2 3.5 0.500000 5.5 1 Chase 4 3 4.3 1.333333 7.3 2 Spencer 5 6 7.8 0.833333 13.8 ``` -Let's think about what we've done here. In Excel, we would compute the first observation of `y1`, then copy the formula to the other observations. Here one line of code computes all the observations of a new variable `y1`. +Let's think about what we've done here. 
In Excel, we would compute the first observation of `y1`, then copy the formula to the other observations. Here one line of code computes all the observations of a new variable `y1`. -**Exercise.** Create a variable `z` equal to the sum of `x1`, `x2`, and `x3`. +**Exercise.** Create a variable `z` equal to the sum of `x1`, `x2`, and `x3`. -**Rename variables.** Suppose we want to give `x1` the more intuitive name `sales`. We can do that with the statement +**Rename variables.** Suppose we want to give `x1` the more intuitive name `sales`. We can do that with the statement -```python +```python df.rename(columns={'x1': 'sales'}) ``` -Note the use of a dictionary that associates the "key" `x1` with the "value" `sales`. +Note the use of a dictionary that associates the "key" `x1` with the "value" `sales`. +**Exercise.** Drop the initial observation from `df`, the one with the index `'Dave'`. +--> - + -## Dataframe methods +## Dataframe methods -One of the great things about dataframes is that they have lots of methods ready to go. We'll survey some of the most useful ones at high speed and come back to them when we have more interesting data. +One of the great things about dataframes is that they have lots of methods ready to go. We'll survey some of the most useful ones at high speed and come back to them when we have more interesting data. -**Data output.** To save a dataframe to a local file on our computer we use the `df.to_*` family of methods. For example, the methods `df.to_csv()` and `df.to_excel()` produce csv and Excel files, respectively. Both require a file name as input. We'll hold off on them until we've addressed files on our computer. +**Data output.** To save a dataframe to a local file on our computer we use the `df.to_*` family of methods. For example, the methods `df.to_csv()` and `df.to_excel()` produce csv and Excel files, respectively. Both require a file name as input. We'll hold off on them until we've addressed files on our computer. 
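If you want a preview of `to_csv()` before we get to files on our computer, here is a hedged sketch that never touches the hard drive. It relies on the fact that `to_csv()` returns the csv contents as a string when we give it no file name; `io.StringIO` and the `index=False` option are our additions, not something the chapter has introduced. We rebuild `df` from the dictionary used earlier so the example stands on its own.

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Dave', 'Chase', 'Spencer'],
                   'x1': [1, 4, 5], 'x2': [2, 3, 6], 'x3': [3.5, 4.3, 7.8]})

# with no file name, to_csv returns the csv text instead of writing a file
# index=False leaves out the row labels (the counter)
csv_text = df.to_csv(index=False)
print(csv_text)

# read the text back in to check that nothing was lost along the way
df_back = pd.read_csv(io.StringIO(csv_text))
print('Same contents?', df_back.equals(df))
```

With a file name, as in `df.to_csv('test_out.csv')` (a hypothetical name), the same contents would go to a file in the current working directory.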
**Clipboard methods.** We can read from the clipboard and write to it. Suppose we open a spreadsheet and copy a section of it into the clipboard. We can paste it into a dataframe with the statement @@ -366,13 +366,13 @@ One of the great things about dataframes is that they have lots of methods ready df_clip = pd.read_clipboard() ``` -Going the other way, we can copy the dataframe `df` to the clipboard with `df.to_clipboard()`. From the clipboard, we can paste it into Excel or other applications. We're not fans of this -- it makes replication hard if we need to do this again -- but it's awful convenient. We heard about it from one of our former students. +Going the other way, we can copy the dataframe `df` to the clipboard with `df.to_clipboard()`. From the clipboard, we can paste it into Excel or other applications. We're not fans of this -- it makes replication hard if we need to do this again -- but it's awful convenient. We heard about it from one of our former students. -**Exercise.** Copy the dataframe `df` into an empty spreadsheet on your computer using the `to_clipboard()` method. +**Exercise.** Copy the dataframe `df` into an empty spreadsheet on your computer using the `to_clipboard()` method. -**The top and bottom of a dataframe.** We commonly work with much larger dataframes in which it's unwieldy, and perhaps impossible, to print the whole thing. So we often look at either the top or bottom: the first few few or last few observations. The statement `df.head(n)` extracts the top `n` observations and `df.head()` (with no input) extracts the top 5. +**The top and bottom of a dataframe.** We commonly work with much larger dataframes in which it's unwieldy, and perhaps impossible, to print the whole thing. So we often look at either the top or bottom: the first few or last few observations. The statement `df.head(n)` extracts the top `n` observations and `df.head()` (with no input) extracts the top 5.
This creates a new dataframe, as we see here: ```python h = df.head(2) @@ -380,39 +380,39 @@ print(type(h)) print(h) ``` -The second print statement gives us the first 2 observations, which is what we requested. +The second print statement gives us the first 2 observations, which is what we requested. -`df.tail(2)` does the same for the bottom of the dataframe `df`: the last 2 observations. +`df.tail(2)` does the same for the bottom of the dataframe `df`: the last 2 observations. -**Setting the index.** We're not stuck with the index in our dataframe, we can make it whatever we want. If we want to use `name` as the index, associating observations with the `name` variable, we use the `set_index()` method: +**Setting the index.** We're not stuck with the index in our dataframe, we can make it whatever we want. If we want to use `name` as the index, associating observations with the `name` variable, we use the `set_index()` method: ```python -df = df.set_index(['name']) +df = df.set_index(['name']) ``` -That gives us +That gives us ```python x1 x2 x3 y1 y2 -name +name Dave 1 2 3.5 0.500000 5.5 Chase 4 3 4.3 1.333333 7.3 Spencer 5 6 7.8 0.833333 13.8 ``` -with `name` now used as the index. +with `name` now used as the index. -We did something else here that's important: We assigned the result back to `df`. That keeps what we've done in the dataframe `df`. If we hadn't done this, `df` would remain unchanged with a counter as its index and our effort to set the index would be lost. +We did something else here that's important: We assigned the result back to `df`. That keeps what we've done in the dataframe `df`. If we hadn't done this, `df` would remain unchanged with a counter as its index and our effort to set the index would be lost. -**Exercise.** Set `name` as the index as just described. Use the `index` method to extract it and verify that `name` is, in fact, the index. +**Exercise.** Set `name` as the index as just described. 
Use the `index` method to extract it and verify that `name` is, in fact, the index. -**Exercise.** Apply the `reset_index()` method to our new dataframe. What does it do? What is the index of the new dataframe? +**Exercise.** Apply the `reset_index()` method to our new dataframe. What does it do? What is the index of the new dataframe? -**Statistics.** We can compute the mean, the standard deviation, and other statistics for all the variables at once with +**Statistics.** We can compute the mean, the standard deviation, and other statistics for all the variables at once with ```python df.mean() @@ -420,46 +420,46 @@ df.std() df.describe() ``` -The first line gives us the means, the second the standard deviations. The third line gives us a collection of statistics, including the mean, the standard deviation, the min, and the max. +The first line gives us the means, the second the standard deviations. The third line gives us a collection of statistics, including the mean, the standard deviation, the min, and the max. -Note that we've done them all at once. `df.mean()`, for example, computes the means of all the variables in one line. Ditto the others. +Note that we've done them all at once. `df.mean()`, for example, computes the means of all the variables in one line. Ditto the others. -**Exercise.** The statement `print(df.mean())` gives us the means of each variable in a column. How would we produce the same output as a row? +**Exercise.** The statement `print(df.mean())` gives us the means of each variable in a column. How would we produce the same output as a row? -**Exercise.** What kind of object does `df.mean()` produce? +**Exercise.** What kind of object does `df.mean()` produce? -**Plotting.** We have a number of methods available that plot dataframes. The most basic is the `plot()` method, which plots all of the variables against the index. Try this and see what it looks like: +**Plotting.** We have a number of methods available that plot dataframes. 
The most basic is the `plot()` method, which plots all of the variables against the index. Try this and see what it looks like: ```python df.plot() ``` -You should see lines for each of the variables plotted against the index `name`. +You should see lines for each of the variables plotted against the index `name`. -**Exercise.** Produce a bar chart of the same data with the statement +**Exercise.** Produce a bar chart of the same data with the statement ```python df.plot(kind='bar') ``` -What happens if we change `df` to `df['x1']`? Change `bar` to `barh`? +What happens if we change `df` to `df['x1']`? Change `bar` to `barh`? -## Data input 2: Reading files from your computer +## Data input 2: Reading files from your computer -Next up: reading files in Python from your computer's hard drive. This is really useful, but there's a catch: we need to tell Python where to find the file. +Next up: reading files in Python from your computer's hard drive. This is really useful, but there's a catch: we need to tell Python where to find the file. - +--> -**Prepare test data.** We start with the easy part. Open a blank spreadsheet in Excel and enter the data: +**Prepare test data.** We start with the easy part. Open a blank spreadsheet in Excel and enter the data: ```python name x1 x2 x3 @@ -468,55 +468,55 @@ Chase 4 3 4 Spencer 5 6 7 ``` -That is: four rows with four entries in each one. +That is: four rows with four entries in each one. Now save the contents in your `Data_Bootcamp` directory. Do this three times in different formats: -* Excel file. Save the file as `test.xlsx`. -* Old-style Excel file. Save as an "Excel 97-2003 (*.xls)" file under the name `test.xls`. -* CSV file. Save as a "CSV (Comma delimited) (*.csv)" file under the name `test.csv`. +* Excel file. Save the file as `test.xlsx`. +* Old-style Excel file. Save as an "Excel 97-2003 (*.xls)" file under the name `test.xls`. +* CSV file. Save as a "CSV (Comma delimited) (*.csv)" file under the name `test.csv`. 
-Each of these options shows up in Excel when we choose "Save As." +Each of these options shows up in Excel when we choose "Save As." -**Find the file.** Ok, now where is the file? We know, it's in the `Data_Bootcamp` directory, but where is that? We need the complete path so we can tell Python where to find it. +**Find the file.** Ok, now where is the file? We know, it's in the `Data_Bootcamp` directory, but where is that? We need the complete path so we can tell Python where to find it. -Let's introduce some terms so we can be clear what we're talking about. The "file name" is something like `test.csv`. The format of the "directory" or folder address depends on the operating system. On a Windows computer, it's something like +Let's introduce some terms so we can be clear what we're talking about. The "file name" is something like `test.csv`. The format of the "directory" or folder address depends on the operating system. On a Windows computer, it's something like ``` -C:\Users\userid\Documents\Data_Bootcamp +C:\Users\userid\Documents\Data_Bootcamp ``` -On a Mac, it looks like +On a Mac, it looks like -```python -/Users/userid/Data_Bootcamp +```python +/Users/userid/Data_Bootcamp ``` -The complete path to the file `test.csv` is a combination of the path to the directory and the file name, with a slash in between. In Windows: +The complete path to the file `test.csv` is a combination of the path to the directory and the file name, with a slash in between. In Windows: ``` C:\Users\userid\Documents\Data_Bootcamp\test.csv ``` -In Mac OS: +In Mac OS: -```python -/Users/userid/Data_Bootcamp/test.csv +```python +/Users/userid/Data_Bootcamp/test.csv ``` -Note that they use different kinds of slashes. +Note that they use different kinds of slashes. -How did we find these addresses or paths? +How did we find these addresses or paths? -* Windows. Type the name of the file in the Windows search box or use Windows Explorer. -* Mac OS. 
We select (but not open) the file and do "command-i" to get the file's information window. From the window, we copy the path that follows "Where:." It looks like we're copying arrows, but they turn into slashes when we paste the path. +* Windows. Type the name of the file in the Windows search box or use Windows Explorer. +* Mac OS. We select (but not open) the file and do "command-i" to get the file's information window. From the window, we copy the path that follows "Where:." It looks like we're copying arrows, but they turn into slashes when we paste the path. -**Exercise.** Look for the complete path to `test.csv` on your computer. Let us know if you can't find it. If you did, let us know what method you used. Did we miss any? +**Exercise.** Look for the complete path to `test.csv` on your computer. Let us know if you can't find it. If you did, let us know what method you used. Did we miss any? -**Reading data with the complete path.** Once we have the complete path, we simply tell Python to read the file at that location. Again, this varies with the operating system. +**Reading data with the complete path.** Once we have the complete path, we simply tell Python to read the file at that location. Again, this varies with the operating system. * In Windows, we take the complete path and -- **this is important** -- change all the backslashes `\` to either double backslashes `\\` or forward slashes `/`. (Don't ask.) Then we read the file from the path, just as we read it from a url earlier: @@ -525,203 +525,203 @@ path = 'C:\\Users\\userid\\Documents\\Data_Bootcamp\\test.csv' df = pd.read_csv(path) ``` -* In Mac OS we don't need to change the slashes: +* In Mac OS we don't need to change the slashes: ```python path = '/Users/userid/Data_Bootcamp/test.csv' df = pd.read_csv(path) ``` -* In both: Open the file in Excel, click on File, and read the path from the Info tab.
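For the curious, here is a short sketch of why the backslashes need attention. Inside an ordinary Python string, a backslash starts an "escape sequence": `'\t'` is a tab, for example, and a path typed as `'C:\Users\...'` won't even run in Python 3 because `\U` is read as the start of a special character code. Doubling the backslashes or using forward slashes sidesteps the problem. The raw-string prefix `r` and the `pathlib` module shown below are alternatives we haven't used in this chapter; the `userid` path is the same placeholder as above, so nothing here reads an actual file.

```python
from pathlib import PureWindowsPath

# three ways to write the same Windows path
escaped = 'C:\\Users\\userid\\Documents\\Data_Bootcamp\\test.csv'   # double backslashes
raw     = r'C:\Users\userid\Documents\Data_Bootcamp\test.csv'        # raw string: backslashes taken literally
forward = 'C:/Users/userid/Documents/Data_Bootcamp/test.csv'         # forward slashes

print(escaped == raw)                                         # True: same characters
print(PureWindowsPath(escaped) == PureWindowsPath(forward))   # True: same location on Windows

# what an escape sequence does: '\t' is one character (a tab), '\\t' is two
print(len('\t'), len('\\t'))                                  # 1 2
```

Any of the three strings can be handed to `pd.read_csv()` on a Windows computer; which one to use is a matter of taste.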
-**Exercise.** Try the appropriate one on your computer to make sure it works. Let us know if it doesn't. +**Exercise.** Try the appropriate one on your computer to make sure it works. Let us know if it doesn't. -**Reading from the current working directory.** An alternative is to set the current working directory (cwd), which is where Python will look for files. We can set that with Python's [os module](https://docs.python.org/3.5/library/os.html). What you'll need is the location of the `Data_Bootcamp` directory. +**Reading from the current working directory.** An alternative is to set the current working directory (cwd), which is where Python will look for files. We can set that with Python's [os module](https://docs.python.org/3.5/library/os.html). What you'll need is the location of the `Data_Bootcamp` directory. -Once we know the directory path, we can use it in Python. +Once we know the directory path, we can use it in Python. -* In Windows, we use +* In Windows, we use + + ```python + import os - ```python - import os - file = 'test.csv' cwd = 'C:/Users/userid/Data_Bootcamp' - - os.chdir(cwd) # set current working directory + + os.chdir(cwd) # set current working directory print('Current working directory is', os.getcwd()) - print('File exists?', os.path.isfile(file)) # check to see if file is there - + print('File exists?', os.path.isfile(file)) # check to see if file is there + df = pd.read_csv(file) ``` - Note that we used forward slashes here. We could also use double back slashes if we prefer. + Note that we used forward slashes here. We could also use double back slashes if we prefer. 
+ +* In Mac OS, the only difference is the format of the path: -* In Mac OS, the only difference is the format of the path: + ```python + import os - ```python - import os - file = 'test.csv' cwd = '/Users/userid/Data_Bootcamp' - - os.chdir(cwd) # set current working directory + + os.chdir(cwd) # set current working directory print('Current working directory is', os.getcwd()) - print('File exists?', os.path.isfile(file)) # check to see if file is there - + print('File exists?', os.path.isfile(file)) # check to see if file is there + df = pd.read_csv(file) ``` -Once we've set the path, we read the csv file as before. `read_excel()` works the same way with Excel files. +Once we've set the path, we read the csv file as before. `read_excel()` works the same way with Excel files. +If this doesn't work, go back to the complete path. +--> -**Report problems.** If you have difficulty, or find that this works differently on your computer, let us know. +**Report problems.** If you have difficulty, or find that this works differently on your computer, let us know. -## Data input: Examples +## Data input: Examples -Here are some spreadsheet datasets we find interesting. In each one, we describe the data using the `shape`, `columns`, and `head` methods. Where we can, we also produce a simple plot. +Here are some spreadsheet datasets we find interesting. In each one, we describe the data using the `shape`, `columns`, and `head` methods. Where we can, we also produce a simple plot. -**Penn World Table.** The [PWT](http://www.rug.nl/research/ggdc/data/pwt/?lang=en), as we call it, is a standard database for comparing the incomes of countries. It includes annual data for GDP, GDP per person, employment, hours worked, capital, and many other things. The variables are measured on a comparable basis, with GDP measured in 2005 US dollars. 
+**Penn World Table.** The [PWT](http://www.rug.nl/research/ggdc/data/pwt/?lang=en), as we call it, is a standard database for comparing the incomes of countries. It includes annual data for GDP, GDP per person, employment, hours worked, capital, and many other things. The variables are measured on a comparable basis, with GDP measured in 2005 US dollars. -The data is in an [Excel spreadsheet](http://www.rug.nl/research/ggdc/data/pwt/v81/pwt81.xlsx). If we open it, we see that it has three sheets. The third one is the data and is named `Data`. We read it in with the code: +The data is in an [Excel spreadsheet](http://www.rug.nl/research/ggdc/data/pwt/v81/pwt81.xlsx). If we open it, we see that it has three sheets. The third one is the data and is named `Data`. We read it in with the code: ```python -import pandas as pd +import pandas as pd url = 'http://www.rug.nl/research/ggdc/data/pwt/v81/pwt81.xlsx' pwt = pd.read_excel(url, sheetname='Data') ``` -So what does that give us? +So what does that give us? -* `pwt.shape` returns `(10357, 47)`: the dataframe `pwt` contains 10,357 observations of 47 variables. -* `list(pwt)` gives us the variable names, which include `countrycode`, `country`, `year`, `rgdpo` (real GDP), and `pop` (population). -* `pwt.head()` shows us the first 5 observations, which refer to Angola for the years 1950 to 1954. If we look further down, we see that countries are stacked on top of each other in alphabetical order. +* `pwt.shape` returns `(10357, 47)`: the dataframe `pwt` contains 10,357 observations of 47 variables. +* `list(pwt)` gives us the variable names, which include `countrycode`, `country`, `year`, `rgdpo` (real GDP), and `pop` (population). +* `pwt.head()` shows us the first 5 observations, which refer to Angola for the years 1950 to 1954. If we look further down, we see that countries are stacked on top of each other in alphabetical order. -In this dataset, each column is a variable and each row is an observation. 
But if we were to plot one of the variables, it wouldn't make much sense. The observations string together countries, one after the other. What we'd like to do is compare countries, which this isn't set up to do -- yet. +In this dataset, each column is a variable and each row is an observation. But if we were to plot one of the variables, it wouldn't make much sense. The observations string together countries, one after the other. What we'd like to do is compare countries, which this isn't set up to do -- yet. **Exercise.** Download the spreadsheet and open it in Excel. What does it look like? (You can use your Google fu here: Google "penn world table 8.1", go to the first link, and look for the Excel link.) -**Exercise.** Change the input in the last line of code to `sheetname=2`. Why does this work? +**Exercise.** Change the input in the last line of code to `sheetname=2`. Why does this work? -**World Economic Outlook.** Another good source of macroeconomic data for countries is the IMF's [World Economic Outlook](https://www.imf.org/external/ns/cs.aspx?id=28) or WEO. It comes out twice a year and includes annual data from 1980 to roughly 5 years in the future (forecasts, evidently). It includes the usual GDP, but also government debt and deficits, interest rates, and exchange rates. +**World Economic Outlook.** Another good source of macroeconomic data for countries is the IMF's [World Economic Outlook](https://www.imf.org/external/ns/cs.aspx?id=28) or WEO. It comes out twice a year and includes annual data from 1980 to roughly 5 years in the future (forecasts, evidently). It includes the usual GDP, but also government debt and deficits, interest rates, and exchange rates. This one gives us some idea of the challenges we face dealing with what looks like ordinary spreadsheet data. The file extension is `xls`, which suggests it's an Excel spreadsheet, but that's a lie. 
In fact it's a "tab-delimited" file: essentially a csv, but with tabs rather than commas separating entries. We read it with -```python +```python import pandas as pd url1 = 'https://www.imf.org/external/pubs/ft/weo/' url2 = '2015/02/weodata/WEOOct2015all.xls' -weo = pd.read_csv(url1+url2, - sep='\t', # \t = tab - thousands=',', # kill commas - na_values=['n/a', '--']) # missing values +weo = pd.read_csv(url1+url2, + sep='\t', # \t = tab + thousands=',', # kill commas + na_values=['n/a', '--']) # missing values ``` -This has several features we need to deal with: +This has several features we need to deal with: -* Use `read_csv()` rather than `read_excel()`: it's not an Excel file. -* Identify tabs as the separator between entries with the argument `sep='\t'`. +* Use `read_csv()` rather than `read_excel()`: it's not an Excel file. +* Identify tabs as the separator between entries with the argument `sep='\t'`. * Eliminate commas from numbers -- things like `12,345.6`, which Python will treat as strings. (What were they thinking of?) -* Identify missing values. +* Identify missing values. -Keep in mind that it took us an hour or two to figure all this out. Sometimes we find that others have done this for us. +Keep in mind that it took us an hour or two to figure all this out. Sometimes we find that others have done this for us. -**Exercise.** Download the WEO file. What happens when you open it in Excel? (You can use the link in the code. Or Google "IMF WEO", look for the most recent link, and choose Entire Dataset.) +**Exercise.** Download the WEO file. What happens when you open it in Excel? (You can use the link in the code. Or Google "IMF WEO", look for the most recent link, and choose Entire Dataset.) -**Exercise.** Why were we able to spread the `read_csv()` statement over several lines? +**Exercise.** Why were we able to spread the `read_csv()` statement over several lines? 
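
If you'd like to see what those arguments do without downloading the whole file, here's a small sketch. The sample text is made up to mimic the WEO quirks (tabs, commas inside numbers, `n/a` and `--` for missing values); it is not real WEO data, and the country names are placeholders.

```python
import io
import pandas as pd

# Made-up, tab-delimited sample (NOT real WEO data).
sample = (
    "Country\t2014\t2015\n"
    "Aland\t12,345.6\tn/a\n"
    "Bland\t--\t1,000\n"
)

demo = pd.read_csv(io.StringIO(sample),      # treat the string as if it were a file
                   sep='\t',                 # \t = tab
                   thousands=',',            # kill commas
                   na_values=['n/a', '--'])  # missing values

print(demo.dtypes)   # the year columns should be numbers, not strings
print(demo)
```

Try dropping the `thousands` or `na_values` argument and compare the `dtypes`.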
-**Exercise.** Google "python pandas weo" to see if someone else has figured out how to read this file. +**Exercise.** Google "python pandas weo" to see if someone else has figured out how to read this file. -**Exercise.** How big is the dataframe `weo`? What variables does it include? Use the statement `weo[[0, 1, 2, 3, 4]].head()` to see what the first five columns contain. +**Exercise.** How big is the dataframe `weo`? What variables does it include? Use the statement `weo[[0, 1, 2, 3, 4]].head()` to see what the first five columns contain. -This dataset doesn't come in the standard format, with columns as variables and rows as observations. Instead, each row contains observations for all years for some variable and country combination. If we want to work with it, we'll have to change the structure. Which we'll do, but not now. +This dataset doesn't come in the standard format, with columns as variables and rows as observations. Instead, each row contains observations for all years for some variable and country combination. If we want to work with it, we'll have to change the structure. Which we'll do, but not now. - -**PISA education data.** PISA stands for [Program for International Student Assessment](http://www.oecd.org/pisa/). It's an international effort to collect information about educational performance that's comparable across countries. We read about it every few years when newspapers print stories about how poorly American students are doing. PISA collects data on student performance, teacher quality, and many other things, and posts both summaries and individual test results. +**PISA education data.** PISA stands for [Program for International Student Assessment](http://www.oecd.org/pisa/). It's an international effort to collect information about educational performance that's comparable across countries. We read about it every few years when newspapers print stories about how poorly American students are doing. 
PISA collects data on student performance, teacher quality, and many other things, and posts both summaries and individual test results. -We use data from a summary table in an [OECD report](http://www.oecd.org/pisa/keyfindings/pisa-2012-results-volume-I.pdf); note the data link at the bottom of Table 1.A. This codes reads the data from the link: +We use data from a summary table in an [OECD report](http://www.oecd.org/pisa/keyfindings/pisa-2012-results-volume-I.pdf); note the data link at the bottom of Table 1.A. This codes reads the data from the link: -```python +```python import pandas as pd url = 'http://dx.doi.org/10.1787/888932937035' -pisa = pd.read_excel(url, - skiprows=18, # skip the first 18 rows - skipfooter=7, # skip the last 7 - parse_cols=[0,1,9,13], # select columns of interest +pisa = pd.read_excel(url, + skiprows=18, # skip the first 18 rows + skipfooter=7, # skip the last 7 + parse_cols=[0,1,9,13], # select columns of interest index_col=0, # set the index as the first column - header=[0,1] # set the variable names + header=[0,1] # set the variable names ) ``` -There are a number of new things in the read statement: +There are a number of new things in the read statement: -* We've spread the read statement over several lines to make it easier to read. Python understands that the line doesn't end until we reach the right paren `)`. That's common Python syntax. -* We skip rows at the top and bottom that do not contain data. -* We choose specific columns to read using the `parse_cols` parameter. Column numbering starts at zero, as we have come to expect. -* We set the index and column labels. +* We've spread the read statement over several lines to make it easier to read. Python understands that the line doesn't end until we reach the right paren `)`. That's common Python syntax. +* We skip rows at the top and bottom that do not contain data. +* We choose specific columns to read using the `parse_cols` parameter. 
Column numbering starts at zero, as we have come to expect. +* We set the index and column labels. -We can clean this up further if we drop blank lines and simplify the variable names: +We can clean this up further if we drop blank lines and simplify the variable names: -```python -pisa = pisa.dropna() # drop blank lines -pisa.columns = ['Math', 'Reading', 'Science'] # simplify variable names +```python +pisa = pisa.dropna() # drop blank lines +pisa.columns = ['Math', 'Reading', 'Science'] # simplify variable names pisa['Math'].plot(kind='barh') ``` -The plot we produce in the last line is virtually impossible to read, but we'll work on that later. +The plot we produce in the last line is virtually impossible to read, but we'll work on that later. -**UN population data.** We tend to have pretty good demographic data. We keep track of how many people we have, their ages, how many children they have, what they die of, and so on. A good international source is the [United Nations' Population Division](http://esa.un.org/unpd/wpp/Download/Standard/Population/). +**UN population data.** We tend to have pretty good demographic data. We keep track of how many people we have, their ages, how many children they have, what they die of, and so on. A good international source is the [United Nations' Population Division](http://esa.un.org/unpd/wpp/Download/Standard/Population/). -This code reads in estimates of population by age for many countries: +This code reads in estimates of population by age for many countries: ```python url1 = 'http://esa.un.org/unpd/wpp/DVD/Files/' url2 = '1_Indicators%20(Standard)/EXCEL_FILES/1_Population/' url3 = 'WPP2015_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.XLS' -url = url1 + url2 + url3 +url = url1 + url2 + url3 cols = [2, 4, 5] + list(range(6,28)) est = pd.read_excel(url, sheetname=0, skiprows=16, parse_cols=cols) ``` -The columns contain population number for 5-year age groups. 
when we're up to it, we'll use this data to illustrate the dramatic aging of the population in most countries. It's one of the striking facts of modern times: people are living longer, a lot longer. +The columns contain population number for 5-year age groups. when we're up to it, we'll use this data to illustrate the dramatic aging of the population in most countries. It's one of the striking facts of modern times: people are living longer, a lot longer. -**Exercise.** What does `list(range(6,28))` do? Why? +**Exercise.** What does `list(range(6,28))` do? Why? -**Incomes by college major.** Nate Silver's [538 blog](http://fivethirtyeight.com/) does a lot of good data journalism and often posts its data online. This one comes from their analysis of [income by college major](http://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/). The data comes from the American Community Survey but they've done the work of organizing it for us. +**Incomes by college major.** Nate Silver's [538 blog](http://fivethirtyeight.com/) does a lot of good data journalism and often posts its data online. This one comes from their analysis of [income by college major](http://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/). The data comes from the American Community Survey but they've done the work of organizing it for us. -Here's the code: +Here's the code: ```python url1 = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/' @@ -730,170 +730,170 @@ url = url1 + url2 df538 = pd.read_csv(url) ``` -**Exercise.** What variables does this data contain? +**Exercise.** What variables does this data contain? -**Exercise.** Set the index as `Major`. (Ask yourself: What method should I use?) +**Exercise.** Set the index as `Major`. (Ask yourself: What method should I use?) -**Exercise.** Create a horizontal bar chart with the variable `Median` (median salary) using the `plot()` method. 
+**Exercise.** Create a horizontal bar chart with the variable `Median` (median salary) using the `plot()` method. -**Internet Movie Database (IMDb).** We love this one, a list of roles in IMDb's movie database we got from [Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial). We read it with this code: +**Internet Movie Database (IMDb).** We love this one, a list of roles in IMDb's movie database we got from [Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial). We read it with this code: ```python url = 'http://pages.stern.nyu.edu/~dbackus/csv/cast.csv' cast = pd.read_csv(url, encoding='utf-8') ``` -Don't panic if nothing happens. It's a big file, and takes a minute or two to read. +Don't panic if nothing happens. It's a big file, and takes a minute or two to read. -**Exercise.** Since we're all experts by now, we'll leave this one to you: +**Exercise.** Since we're all experts by now, we'll leave this one to you: -* How large is the dataframe? -* What variables does it include? -* Try these statements: +* How large is the dataframe? +* What variables does it include? +* Try these statements: ```python ah = cast[cast['title'] == 'Annie Hall'] gc = cast[cast['name'] == 'George Clooney'] ``` -This goes beyond what we've done so far, but what do you think they do? What do the dataframes `ah` and `gc` contain? +This goes beyond what we've done so far, but what do you think they do? What do the dataframes `ah` and `gc` contain? - +--> +--> +--> ## Data input 3: APIs -APIs are "application program interfaces". That's a mouthful. A dataset with an API allows access through some method other than a spreadsheet. The API is the set of rules for accessing the data. The bad news is the jargon. The good news is that people have written easy-to-use code to access the APIs. We don't need to understand the API, we just use the code and say thank you. +APIs are "application program interfaces". That's a mouthful. 
A dataset with an API allows access through some method other than a spreadsheet. The API is the set of rules for accessing the data. The bad news is the jargon. The good news is that people have written easy-to-use code to access the APIs. We don't need to understand the API, we just use the code and say thank you. -The Pandas package has what they call a set of [Remote Data Access tools](http://pandas.pydata.org/pandas-docs/stable/remote_data.html). They break now and then, typically when the underlying data changes, but when they work they're great. This part of Pandas is undergoing a transition, but for now this is how it works. +The Pandas package has what they call a set of [Remote Data Access tools](http://pandas.pydata.org/pandas-docs/stable/remote_data.html). They break now and then, typically when the underlying data changes, but when they work they're great. This part of Pandas is undergoing a transition, but for now this is how it works. -**FRED.** The St Louis Fed has put together a large collection of time series data that they refer to as [FRED](https://research.stlouisfed.org/fred2/): Federal Reserve Economic Data. They started with the US, but now include data for many countries. +**FRED.** The St Louis Fed has put together a large collection of time series data that they refer to as [FRED](https://research.stlouisfed.org/fred2/): Federal Reserve Economic Data. They started with the US, but now include data for many countries. -The Pandas docs describe how to access FRED. Here's an example that reads in quarterly data for US real GDP and real consumption and produces a simply plot: +The Pandas docs describe how to access FRED. 
Here's an example that reads in quarterly data for US real GDP and real consumption and produces a simply plot: ```python -import pandas as pd -import pandas.io.data as web # package to access FRED -import datetime # package to handle dates +import pandas as pd +import pandas.io.data as web # package to access FRED +import datetime # package to handle dates -start = datetime.datetime(2010, 1, 1) # start date -codes = ['GDPC1', 'PCECC96'] # real GDP, real consumption -fred = web.DataReader(codes, 'fred', start) +start = datetime.datetime(2010, 1, 1) # start date +codes = ['GDPC1', 'PCECC96'] # real GDP, real consumption +fred = web.DataReader(codes, 'fred', start) fred = fred/1000 # convert billions to trillions fred.plot() ``` -We copied most of this from the Pandas documentation. Which is a good idea: Start with something that's supposed to work and change one thing at a time until you have what you want. +We copied most of this from the Pandas documentation. Which is a good idea: Start with something that's supposed to work and change one thing at a time until you have what you want. The variable `start` contains a date in (year, month, day) format. Pandas knows a lot about how to work with dates, especially when we construct them using `datetime.datetime` as we did above. -The variable `codes` -- not to be confused with "code" -- comes from FRED. Go to [FRED](https://research.stlouisfed.org/fred2/), use the search box to find the series you want, and look for the variable code at the end of the url in your browser. +The variable `codes` -- not to be confused with "code" -- comes from FRED. Go to [FRED](https://research.stlouisfed.org/fred2/), use the search box to find the series you want, and look for the variable code at the end of the url in your browser. +click on the "Cite" tab below the figure, and look for the code in square brackets. +--> -**Exercise.** Run the same code with a start date of 2005. What do you see? 
+**Exercise.** Run the same code with a start date of 2005. What do you see? -**World Bank.** The World Bank's [databank](http://data.worldbank.org/) covers economic and social statistics for most countries in the world. Variables include GDP, population, education, and infrastructure. Here's an example: +**World Bank.** The World Bank's [databank](http://data.worldbank.org/) covers economic and social statistics for most countries in the world. Variables include GDP, population, education, and infrastructure. Here's an example: ```python -import pandas as pd +import pandas as pd from pandas.io import wb # World Bank api -var = ['NY.GDP.PCAP.PP.KD'] # GDP per capita -iso = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX'] # country codes +var = ['NY.GDP.PCAP.PP.KD'] # GDP per capita +iso = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX'] # country codes year = 2013 wb = wb.download(indicator=var, country=iso, start=year, end=year) ``` -If we look at the dataframe `wb`, we see that it has a double index, `country` and `year`. By design, all of the data is for 2013, so we kill off that index with the `reset_index` method and plot what's left as a horizontal bar chart: +If we look at the dataframe `wb`, we see that it has a double index, `country` and `year`. By design, all of the data is for 2013, so we kill off that index with the `reset_index` method and plot what's left as a horizontal bar chart: ```python wb = wb.reset_index(level='year', drop=True) -wb.plot(kind='barh') +wb.plot(kind='barh') ``` (Trust us on the `drop=True`. We'll come back to it in a couple weeks.) -We use codes here for countries and variables. We find country codes in [this list](http://www.countryareacode.net/) -- or just Google "country codes". Pandas accepts both 2- and 3-letter versions. We find variable codes with the search tool in the Remote Data Access module or by looking through the World Bank's [data portal](http://databank.worldbank.org/data/home.aspx). We prefer the latter. 
Click on a variable of interest and read the code from the end of the url. +We use codes here for countries and variables. We find country codes in [this list](http://www.countryareacode.net/) -- or just Google "country codes". Pandas accepts both 2- and 3-letter versions. We find variable codes with the search tool in the Remote Data Access module or by looking through the World Bank's [data portal](http://databank.worldbank.org/data/home.aspx). We prefer the latter. Click on a variable of interest and read the code from the end of the url. -**Exercise.** What would you like to change in this graph? Keep a list for next time we run into this one. +**Exercise.** What would you like to change in this graph? Keep a list for next time we run into this one. -**Fama-French.** Gene Fama and Ken French post lots of data on equity returns on [Ken French’s website](http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html). The data are zipped text files, which we can easily read into Excel. The Pandas tool is even better. Here's an example: +**Fama-French.** Gene Fama and Ken French post lots of data on equity returns on [Ken French’s website](http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html). The data are zipped text files, which we can easily read into Excel. The Pandas tool is even better. Here's an example: ```python import pandas.io.data as web ff = web.DataReader('F-F_Research_Data_factors', 'famafrench')[0] -ff.columns = ['xsm', 'smb', 'hml', 'rf'] # rename variables +ff.columns = ['xsm', 'smb', 'hml', 'rf'] # rename variables ff.describe() ``` @@ -905,101 +905,101 @@ The data is monthly, 1926 to present. Returns are expressed as percentages; mul * `hml`: the return on value firms minus the return on growth firms * `rf`: the riskfree rate -We use the `describe()` method to compute statistics. Evidently `xsm` has the largest mean. It also has the largest standard deviation. 
A couple plot methods show us more about the distribution: +We use the `describe()` method to compute statistics. Evidently `xsm` has the largest mean. It also has the largest standard deviation. A couple plot methods show us more about the distribution: -```python +```python ff.boxplot() ff.plot() ff.plot(ff['xsm'], ff['smb'], kind='scatter') ``` -What do you see? What more would you like to know? +What do you see? What more would you like to know? -## Review +## Review -Run this code to create a dataframe of technology indicators from the World Bank for four African countries: +Run this code to create a dataframe of technology indicators from the World Bank for four African countries: ```python -import pandas as pd -data = {'EG.ELC.ACCS.ZS': [53.2, 47.3, 85.4, 22.1], # access to elec (%) - 'IT.CEL.SETS.P2': [153.8, 95.0, 130.6, 74.8], # cell contracts per 100 - 'IT.NET.USER.P2': [11.5, 12.9, 41.0, 13.5], # internet access (%) - 'Country': ['Botswana', 'Namibia', 'South Africa', 'Zambia']} +import pandas as pd +data = {'EG.ELC.ACCS.ZS': [53.2, 47.3, 85.4, 22.1], # access to elec (%) + 'IT.CEL.SETS.P2': [153.8, 95.0, 130.6, 74.8], # cell contracts per 100 + 'IT.NET.USER.P2': [11.5, 12.9, 41.0, 13.5], # internet access (%) + 'Country': ['Botswana', 'Namibia', 'South Africa', 'Zambia']} wb = pd.DataFrame(data) ``` (You can cut and paste this from the bottom of this chapter's code file.) -**Exercise.** What type of object is `wb`? What are its dimensions? +**Exercise.** What type of object is `wb`? What are its dimensions? **Exercise.** What are the variable names? -**Exercise.** What is the index? *Bonus points:* Change the index to the country names. +**Exercise.** What is the index? *Bonus points:* Change the index to the country names. -**Exercise.** Create a horizontal bar chart with this dataframe. What does it tell us? Which country has the most access to electricity? Cell phones? +**Exercise.** Create a horizontal bar chart with this dataframe. 
What does it tell us? Which country has the most access to electricity? Cell phones? -**Exercise (challenging).** Change the variable names to something more informative. +**Exercise (challenging).** Change the variable names to something more informative. -## Resources +## Resources -We've covered a lot of ground, but if you're looking for more we suggest: +We've covered a lot of ground, but if you're looking for more we suggest: -* On Pandas: Chris Moffitt's [Practical Business Python blog](http://pbpython.com/archives.html) has a good series on Pandas from the perspective of an Excel user. +* On Pandas: Chris Moffitt's [Practical Business Python blog](http://pbpython.com/archives.html) has a good series on Pandas from the perspective of an Excel user. * On data: See our list of common [data sources](http://databootcamp.nyuecon.com/bootcamp_data/). - +--> diff --git a/pandas-merging.md b/pandas-merging.md index 90df6ca..e76d860 100644 --- a/pandas-merging.md +++ b/pandas-merging.md @@ -1,15 +1,15 @@ -# More Pandas: Combining dataframes +# More Pandas: Combining dataframes --- -**Overview.** +**Overview.** -**Python tools.** +**Python tools.** -**Buzzwords.** +**Buzzwords.** -**Applications.** +**Applications.** -**Code.** Link. +**Code.** Link. --- @@ -18,7 +18,7 @@ ## Reminders -DataFrames: index, columns +DataFrames: index, columns diff --git a/pandas-munging.md b/pandas-munging.md index 04b1830..d6293e4 100644 --- a/pandas-munging.md +++ b/pandas-munging.md @@ -1,66 +1,66 @@ -# More Pandas: ?? revisited +# More Pandas: ?? revisited --- -**Overview.** We often get data in one form and want to change it to another. Pandas has an exceptional collection of tools for doing this, but it takes us out of our Excel mindset. +**Overview.** We often get data in one form and want to change it to another. Pandas has an exceptional collection of tools for doing this, but it takes us out of our Excel mindset. 
-**Python tools.** Pandas, data frames, index, columns, transpose... +**Python tools.** Pandas, data frames, index, columns, transpose... -**Buzzwords.** Want operator, selection (filtering), +**Buzzwords.** Want operator, selection (filtering), -**Applications.** +**Applications.** -**Code.** Link. +**Code.** Link. --- **UNDER CONSTRUCTION** -The idea is to get going, cover details later. +The idea is to get going, cover details later. -The idea is to start with what we want the end product to be: to apply, in the words of a colleague, the **want operator**. +The idea is to start with what we want the end product to be: to apply, in the words of a colleague, the **want operator**. ## Reminders -DataFrames: index, columns +DataFrames: index, columns -## Test cases +## Test cases -Chipotle +Chipotle Teaching? CAS? -ATUS -MEPS -Class poll -Movie titles +ATUS +MEPS +Class poll +Movie titles -## Setting the index +## Setting the index -## Selecting variables +## Selecting variables -## Selecting observations and/or slicing +## Selecting observations and/or slicing -Index, Boolean +Index, Boolean -## String methods +## String methods -## Grouping data +## Grouping data -Groupby, value counts +Groupby, value counts -## Multi-indexes +## Multi-indexes -WEO +WEO -## Renaming variables +## Renaming variables ## Selecting variables (columns) @@ -71,25 +71,25 @@ WEO http://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation -## Grouping data +## Grouping data -## Combining dataframes +## Combining dataframes -## SQL commands +## SQL commands -A lot of this reproduces the functionality of a SQL database. Pandas has added some commands that make this explicit. +A lot of this reproduces the functionality of a SQL database. Pandas has added some commands that make this explicit. - +--> -## References +## References -Brandon Rhodes. This is great. +Brandon Rhodes. This is great. 
https://youtu.be/5JnMutdy6Fw https://github.com/brandon-rhodes/pycon-pandas-tutorial/ https://en.wikipedia.org/wiki/Pivot_table -Other +Other * Groupby: http://pandas.pydata.org/pandas-docs/stable/groupby.html * stack and unstack: http://pandas.pydata.org/pandas-docs/stable/reshaping.html -Kaggle example: http://blog.kaggle.com/2013/01/17/getting-started-with-pandas-predicting-sat-scores-for-new-york-city-schools/ +Kaggle example: http://blog.kaggle.com/2013/01/17/getting-started-with-pandas-predicting-sat-scores-for-new-york-city-schools/ -Lots of examples: +Lots of examples: http://tomaugspurger.github.io/ http://nbviewer.ipython.org/github/TomAugspurger/PyDataSeattle/tree/master/notebooks/ -SQL intro https://www.khanacademy.org/computing/hour-of-code/hour-of-sql/v/welcome-to-sql +SQL intro https://www.khanacademy.org/computing/hour-of-code/hour-of-sql/v/welcome-to-sql -https://www.reddit.com/r/Python/comments/3wa22v/120gb_csv_is_this_something_i_can_handle_in_python/ +https://www.reddit.com/r/Python/comments/3wa22v/120gb_csv_is_this_something_i_can_handle_in_python/ -SQL and Pandas: https://www.youtube.com/watch?v=1uVWjdAbgBg +SQL and Pandas: https://www.youtube.com/watch?v=1uVWjdAbgBg -http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/ +http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/ http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/ -http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/ +http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/ http://markthegraph.blogspot.com/2014/01/pandas-dataframe-cheat-sheet-and-python.html -http://nicolas.kruchten.com/content/2015/09/jupyter_pivottablejs/ \ No newline at end of file +http://nicolas.kruchten.com/content/2015/09/jupyter_pivottablejs/ diff --git a/practice.md b/practice.md index 39b9832..cfaef89 100644 --- a/practice.md +++ b/practice.md @@ -1,4 +1,4 @@ -# Image practice +# Image practice diff --git a/py-fun1.md b/py-fun1.md 
index c7e8c72..a96e3ba 100644 --- a/py-fun1.md +++ b/py-fun1.md @@ -1,13 +1,13 @@ # Python fundamentals 1 --- -**Overview.** Time to start programming! We work our way through some of the essentials of Python's core language. We do this in the Spyder coding environment. Part 1 of 2. +**Overview.** Time to start programming! We work our way through some of the essentials of Python's core language. We do this in the Spyder coding environment. Part 1 of 2. -**Python tools.** Syntax, Spyder, calculations, assignments, strings, lists, built-in functions, objects, methods, tab completion, object inspector. +**Python tools.** Syntax, Spyder, calculations, assignments, strings, lists, built-in functions, objects, methods, tab completion, object inspector. -**Buzzwords.** Isn't that enough? +**Buzzwords.** Isn't that enough? -**Trigger warning.** Technical content, cannot be mastered without effort. +**Trigger warning.** Technical content, cannot be mastered without effort. **Code.** [Link](https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/Python/bootcamp_fundamentals_1.py). @@ -15,292 +15,292 @@ We're now ready to explore the rudiments of Python. We're going to **jump right in** to the deep end of the pool. For a couple weeks, you may feel like you've been dropped in a foreign country where you don't speak the language. You'll hear terms like "strings", "floats", "objects", "methods", and "tab completion". Don't panic, it's just jargon. If you put some effort into this over the next 2-4 weeks, you'll be fine. And ask questions. Really. **Ask lots of questions.** - + -The challenge and beauty of writing computer programs is that we need to be precise. If we mistype anything, the program won't work. Or it might seem to work, but the output won't be what we expect. In formal terms, the **syntax** -- the set of rules governing the language -- is less flexible than natural language (English, for example). 
We mix Python concepts with an introduction to **Spyder**, the Python coding environment we described earlier. +The challenge and beauty of writing computer programs is that we need to be precise. If we mistype anything, the program won't work. Or it might seem to work, but the output won't be what we expect. In formal terms, the **syntax** -- the set of rules governing the language -- is less flexible than natural language (English, for example). We mix Python concepts with an introduction to **Spyder**, the Python coding environment we described earlier. - ## Reminders -Remind yourself about the following: +Remind yourself about the following: -* Spyder. An environment for writing and running Python programs. Its components include an editor, an IPython console, and the Object explorer. +* Spyder. An environment for writing and running Python programs. Its components include an editor, an IPython console, and the Object inspector. -* `Data_Bootcamp` directory. The place in your computer where you store files for this course. +* `Data_Bootcamp` directory. The place in your computer where you store files for this course. -**Exercises.** +**Exercises.** -* Start Spyder. If you're not sure how, return to the prevous chapter. -* In Spyder, point to the editor, IPython console, and Object inspector. -* Download the code file for this chapter and save it in your `Data_Bootcamp` directory. +* Start Spyder. If you're not sure how, return to the previous chapter. +* In Spyder, point to the editor, IPython console, and Object inspector. +* Download the code file for this chapter and save it in your `Data_Bootcamp` directory. ## The logic of Python programs -In a spreadsheet program such as Excel, we can connect cells to other cells. Then when we change one cell, any other cells connected to it update automatically. +In a spreadsheet program such as Excel, we can connect cells to other cells. Then when we change one cell, any other cells connected to it update automatically. 
-Most computer programs, including Python programs, don't work that way. They run one line at a time, starting at the top of the program and working through the list of instructions until they reach the end or stop for some other reason. A program is just a detailed list of things we want the computer to do. +Most computer programs, including Python programs, don't work that way. They run one line at a time, starting at the top of the program and working through the list of instructions until they reach the end or stop for some other reason. A program is just a detailed list of things we want the computer to do. -Most of the programs in this course have the structure: +Most of the programs in this course have the structure: -* Input data. -* Manipulate the data until it's in the form we want. -* Produce some graphics that summarize the data in a compelling way. +* Input data. +* Manipulate the data until it's in the form we want. +* Produce some graphics that summarize the data in a compelling way. Each of these bullet points is typically associated with a number of lines of code, possibly a large number, but that's the general idea. - -## Calculations in Spyder's IPython console +## Calculations in Spyder's IPython console We'll do lots of numerical calculations. That's mostly what managing data is about: adding things up, dividing one thing by another, and so on. We'll do this initially in Spyder's **IPython console**, typically located in the lower right corner (look for a tab with this label). To see how calculations work in Python, type these expressions in Spyder's IPython console **one at a time**: -```python +```python 2*3 2 * 3 2/3 -2^3 +2^3 2**3 -log(3) +log(3) ``` Type each one into the console, hit return, and look to see what happens. The first one multiplies 2 times 3, and (hopefully) gives us 6 as the answer. The input and output look like this in the console: -```python +```python In [1]: 2*3 Out[1]: 6 ``` -The first line is our input, we typed it. 
The number in brackets `[1]` is a line number. We don't type it, it's there in the console to begin with. As we proceed the number [1] increases to [2], [3], and so on. The second line -- the one that starts `Out[1]` -- is the response or output Python produces. +The first line is our input, we typed it. The number in brackets `[1]` is a line number. We don't type it, it's there in the console to begin with. As we proceed the number [1] increases to [2], [3], and so on. The second line -- the one that starts `Out[1]` -- is the response or output Python produces. -The second calculation, `2 * 3`, does the same thing. The spaces around the * don't change the output. As a general rule, we can put spaces wherever we think they make the code more readable. +The second calculation, `2 * 3`, does the same thing. The spaces around the * don't change the output. As a general rule, we can put spaces wherever we think they make the code more readable. -The third calculation is division. The input and output are -```python +The third calculation is division. The input and output are +```python In [3]: 2/3 Out[3]: 0.6666666666666666 ``` -The fourth calculation, `2^3`, gives us -```python +The fourth calculation, `2^3`, gives us +```python In [4]: 2^3 Out[4]: 1 ``` -Hmmmm. What just happened? We expected the answer to be 8 (2 to the power 3), but evidently it's not. The short answer is that the hat symbol `^` doesn't do exponents in Python, as it does in Excel. It does something else, which we won't go into. +Hmmmm. What just happened? We expected the answer to be 8 (2 to the power 3), but evidently it's not. The short answer is that the hat symbol `^` doesn't do exponents in Python, as it does in Excel. It does something else, which we won't go into. -That makes this is a good time to practice our **Google fu**: +That makes this a good time to practice our **Google fu**: **Exercise.** Use Google to search for "python exponents." Use what you find to compute 2 to the power 3. 
(Don't look below, that's cheating.) -We should find, after wading through the links, that exponents in Python are done this way: -```python +We should find, after wading through the links, that exponents in Python are done this way: +```python In [5]: 2**3 Out[5]: 8 ``` -**Exercise.** What does the calculation `2 ** 3` produce? +**Exercise.** What does the calculation `2 ** 3` produce? -Our last calculation is the log function. Entering `log(3)` generates the message: `NameError: name 'log' is not defined`. This is an example of a **syntax error**: we have used language that Python doesn't understand. Here the message is pretty clear: it doesn't know what `log` means. In other cases, the error message may be more mysterious. We can use functions like `log` and `sqrt` in Python, just as we do in Excel, but we need to import them specially. (And we will, but not yet.) +Our last calculation is the log function. Entering `log(3)` generates the message: `NameError: name 'log' is not defined`. This is an example of an **error**: we have used a name that Python doesn't recognize. Here the message is pretty clear: it doesn't know what `log` means. In other cases, the error message may be more mysterious. We can use functions like `log` and `sqrt` in Python, just as we do in Excel, but we need to import them specially. (And we will, but not yet.) -**Exercise.** What happens if you try to calculate the square root of 2 with `sqrt(2)`, as you would in Excel? How would you do it? +**Exercise.** What happens if you try to calculate the square root of 2 with `sqrt(2)`, as you would in Excel? How would you do it? -## Assigning values to variables +## Assigning values to variables -Or maybe we should use scare quotes: "Assigning" "values" to "variables." +Or maybe we should use scare quotes: "Assigning" "values" to "variables." We'll start with examples and explain what they do. Type these two lines into the IPython console one at a time. 
-```python -x = 2 -y = 3 +```python +x = 2 +y = 3 ``` -In each of these lines: +In each of these lines: -* The thing on the left is called a **variable**. In the first line, `x` is the variable. In the second, `y` is the variable. -* The thing on the right is a **value**. In the first line, `2` is the value. In the second, `3` is the value. -* The equals sign `=` **assigns** the value on the right to the variable on the left. Thus the first line assigns the value `2` to the variable `x`. The second assigns the value `3` to the variable `y`. +* The thing on the left is called a **variable**. In the first line, `x` is the variable. In the second, `y` is the variable. +* The thing on the right is a **value**. In the first line, `2` is the value. In the second, `3` is the value. +* The equals sign `=` **assigns** the value on the right to the variable on the left. Thus the first line assigns the value `2` to the variable `x`. The second assigns the value `3` to the variable `y`. -We call statements like these **assignments**: We assign a value to a variable. +We call statements like these **assignments**: We assign a value to a variable. - + -We can see the results of these assignments by checking the contents of the variables `x` and `y`. In the IPython console, typing a variable and hitting return gives us its value. If we type `x` and `y`, one at a time, we get -```python +We can see the results of these assignments by checking the contents of the variables `x` and `y`. In the IPython console, typing a variable and hitting return gives us its value. If we type `x` and `y`, one at a time, we get +```python In [7]: x Out[7]: 2 In [8]: y Out[8]: 3 ``` -So we see that the variables now contain the values we assigned them. +So we see that the variables now contain the values we assigned them. -Variables are handy ways of storing values. We can use them in future calculations simply by using their names, just as we would use a cell address in Excel. Here's an example. 
Type this into the IPython console: -```python +Variables are handy ways of storing values. We can use them in future calculations simply by using their names, just as we would use a cell address in Excel. Here's an example. Type this into the IPython console: +```python z = x/y -``` -If we type `z` in the console and hit return, we get -```python +``` +If we type `z` in the console and hit return, we get +```python In [9]: z Out[9]: 0.6666666666666666 ``` -What's going on here? We take `x` (which now has a value of 2) and divide it by `y` (which now has the value of 3) and assigns it to the variable `z`. The result is a computer's version of two-thirds. +What's going on here? We take `x` (which now has a value of 2), divide it by `y` (which now has the value of 3), and assign the result to the variable `z`. The result is a computer's version of two-thirds. -**Exercise.** Type `w = 7` in the IPython console. What does the code `w = w + 2` do? Why is this not a violation of basic mathematics? +**Exercise.** Type `w = 7` in the IPython console. What does the code `w = w + 2` do? Why is this not a violation of basic mathematics? -**Exercise.** This one will take a little thought. Type `x = 6` in the IPython console. We've reassigned `x` so that its value is now 6, not 2. If we type and submit `z`, we see -```python +**Exercise.** This one will take a little thought. Type `x = 6` in the IPython console. We've reassigned `x` so that its value is now 6, not 2. If we type and submit `z`, we see +```python In [10]: z Out[10]: 0.6666666666666666 ``` -But wait, if `z` is supposed to be `x/y`, and `x` now equals 6, then shouldn't `z` be 2? What do you think is going on? How can you fix it so that `z` returns the value 2? +But wait, if `z` is supposed to be `x/y`, and `x` now equals 6, then shouldn't `z` be 2? What do you think is going on? How can you fix it so that `z` returns the value 2? -**Exercise.** Suppose we borrow 200 for one year at an interest rate of 5 percent. 
If we pay interest plus principal at the end of the year, what is our total payment? Compute this using the variables `principal = 200` and `i = 0.05`. +**Exercise.** Suppose we borrow 200 for one year at an interest rate of 5 percent. If we pay interest plus principal at the end of the year, what is our total payment? Compute this using the variables `principal = 200` and `i = 0.05`. -**Exercise.** Real GDP in the US (the total value of things produced) was 15.58 trillion in 2013 and 15.96 trillion in 2014. What was the growth rate? +**Exercise.** Real GDP in the US (the total value of things produced) was 15.58 trillion in 2013 and 15.96 trillion in 2014. What was the growth rate? -**Exercise.** Suppose we have two variables, `x` and `y`. How would you switch their values, so that `x` takes on `y`'s value and `y` takes on `x`'s? +**Exercise.** Suppose we have two variables, `x` and `y`. How would you switch their values, so that `x` takes on `y`'s value and `y` takes on `x`'s? ## Displaying results with the `print()` function We saw that when we performed a calculation, such as `z = x/y`, we had to ask to see the result. The `print()` function gives us another way to do that. If we type `print(z)` in the IPython console, we get -```python +```python In [11]: print(z) 0.6666666666666666 ``` -Evidently this displays the value of `z`, namely `0.6666666666666666`. We'll use print statements a lot to track the progress of our code. +Evidently this displays the value of `z`, namely `0.6666666666666666`. We'll use print statements a lot to track the progress of our code. - -The print function displays whatever we include in parentheses after the word print: for example, `print(x)`. If we want to print more than one thing, we separate them with commas; for example, `print(x, y)`. That's the **general structure of functions** in Python: a function name (in this case `print`) followed by inputs (known as "arguments") in parentheses that are separated by commas. 
We usually refer to the `print()` function, with explicit parentheses, to remind ourselves that it requires input of some kind. +The print function displays whatever we include in parentheses after the word print: for example, `print(x)`. If we want to print more than one thing, we separate them with commas; for example, `print(x, y)`. That's the **general structure of functions** in Python: a function name (in this case `print`) followed by inputs (known as "arguments") in parentheses that are separated by commas. We usually refer to the `print()` function, with explicit parentheses, to remind ourselves that it requires input of some kind. -So if we want to verify the calculation of `z`, we can type `print(z)` in the IPython console. If we want to print all the calculations from the previous section, we can type `print(x, y, z)`: +So if we want to verify the calculation of `z`, we can type `print(z)` in the IPython console. If we want to print all the calculations from the previous section, we can type `print(x, y, z)`: ```python In [12]: print(x, y, z) 2 3 0.6666666666666666 ``` -By default, the output is separated by spaces. +By default, the output is separated by spaces. -We'll use more complicated print statements than this, which we'll explain as we go. But if you see something you don't recognize, remember to **ask questions**. +We'll use more complicated print statements than this, which we'll explain as we go. But if you see something you don't recognize, remember to **ask questions**. +--> -**Getting help in Spyder.** If you want to know more about the print function, here are two good ways to do it in Spyder: +**Getting help in Spyder.** If you want to know more about the print function, here are two good ways to do it in Spyder: -* Type `print?` in the IPython console. -* Type `print` in the Object inspector. +* Type `print?` in the IPython console. +* Type `print` in the Object inspector. -The same approaches work for other functions. 
We use them both a lot. If they fail, either because there's no help or the help is incomprehensible, we fall back on Google fu. +The same approaches work for other functions. We use them both a lot. If they fail, either because there's no help or the help is incomprehensible, we fall back on Google fu. +**Exercise.** What does the `end` argument do? What does `end='\n'` do? Try some examples to verify your guess. +--> -## Strings +## Strings -We often work with non-numerical data, collections of characters that might include letters, numbers, or other symbols. Such things show up in a lot in data work, as variable names (GDP, income, volatility) and even as data (country or customer names, for example). We refer these as **strings**. No, not the stuff we tie up packages with, but a "string" of characters like letters or numbers. It's one of many mysterious uses of ordinary words we'll run across as we learn to code. For more on this one, see [here](http://stackoverflow.com/questions/880195/the-history-behind-the-definition-of-a-string). +We often work with non-numerical data, collections of characters that might include letters, numbers, or other symbols. Such things show up a lot in data work, as variable names (GDP, income, volatility) and even as data (country or customer names, for example). We refer to these as **strings**. No, not the stuff we tie up packages with, but a "string" of characters like letters or numbers. It's one of many mysterious uses of ordinary words we'll run across as we learn to code. For more on this one, see [here](http://stackoverflow.com/questions/880195/the-history-behind-the-definition-of-a-string). -We create strings with quotation marks: 'Chase', "Spencer", 'Sarah', "apple", and even '12' are all strings. Single and double quotes both work. The last example is a confusing one, because it looks like a number. It's not. If we try to use it as a number, it doesn't work. Try, for example, `'12'/3`. 
This generates the error: `TypeError: unsupported operand type(s) for /: 'str' and 'int'`. What this means is that we tried to divide a string (`'12'`) by an integer (`3`). That's no different to Python than trying to divide your name by three, it can't make sense of it. +We create strings with quotation marks: 'Chase', "Spencer", 'Sarah', "apple", and even '12' are all strings. Single and double quotes both work. The last example is a confusing one, because it looks like a number. It's not. If we try to use it as a number, it doesn't work. Try, for example, `'12'/3`. This generates the error: `TypeError: unsupported operand type(s) for /: 'str' and 'int'`. What this means is that we tried to divide a string (`'12'`) by an integer (`3`). That's no different to Python than trying to divide your name by three, it can't make sense of it. -We repeat: **a string is a collection of characters between quotes**. The characters can be pretty much anything. Therefore `12` is a number (no quotes), but `'12'` is a string. +We repeat: **a string is a collection of characters between quotes**. The characters can be pretty much anything. Therefore `12` is a number (no quotes), but `'12'` is a string. -Here are some other examples, which we assign to variable names for later use. Type them into Spyder's IPython console **one at a time**: -```python +Here are some other examples, which we assign to variable names for later use. Type them into Spyder's IPython console **one at a time**: +```python a = 'some' -b = 'thing' -c = a + b +b = 'thing' +c = a + b d = '11.32' ``` -What do you see? The first two are probably obvious: we assign the characters in single quotes on the right to the variables on the left. +What do you see? The first two are probably obvious: we assign the characters in single quotes on the right to the variables on the left. -The third line is something new: we add the string `some` to the string `thing`. What would you expect to get? Try `print(c)` to find out. 
That gives us the answer: `c = 'something'`. We've simply stitched the two strings together, one after the other. +The third line is something new: we add the string `some` to the string `thing`. What would you expect to get? Try `print(c)` to find out. That gives us the answer: `c = 'something'`. We've simply stitched the two strings together, one after the other. Strings also allow us to produce better-looking output. In the previous section, for example, we can change the statement `print(z)` to `print('The value of z is ', z)`. The first argument (or input), `'The value of z is '`, is a string. The second argument, `z`, is a variable. Together they produce the output `The value of z is 0.6666666666666666`, which is clearer than the number `0.6666666666666666` on its own. Or we could spread this over two lines: ```python message = 'The value of z is' print(message, z) ``` -Here we've taken the components of the previous print statement and expressed them in two statements to make it more readable. (You might ask yourself: Which do you prefer? Why?) +Here we've taken the components of the previous print statement and expressed them in two statements to make it more readable. (You might ask yourself: Which do you prefer? Why?) -**Exercise.** What is a string? How would you explain it to a friend? +**Exercise.** What is a string? How would you explain it to a friend? -**Exercise.** This one's a little harder. Assign your first name as a string to the variable `firstname` and your last name to the variable `lastname`. Use them to construct a new variable equal to your first name, a space, then your last name. Hint: Think about how you would express a space as a string. +**Exercise.** This one's a little harder. Assign your first name as a string to the variable `firstname` and your last name to the variable `lastname`. Use them to construct a new variable equal to your first name, a space, then your last name. 
Hint: Think about how you would express a space as a string. -**Exercise.** What happens when you type `a * 2` into the console? What about `d * 2`? What is going on here? +**Exercise.** What happens when you type `a * 2` into the console? What about `d * 2`? What is going on here? ## Single, double, and triple quotes -We typically define strings by putting characters between single quotes, as in `a = 'some'`. That will be our standard practice, but **Python treats single and double quotes the same**. We could have typed `a = "some"` (that is, with double quotes) with the same effect. The main reason for using single quotes is laziness: we don't have to hit the shift key. We're not ones to disparage laziness, but the point is that there's no difference between the two. +We typically define strings by putting characters between single quotes, as in `a = 'some'`. That will be our standard practice, but **Python treats single and double quotes the same**. We could have typed `a = "some"` (that is, with double quotes) with the same effect. The main reason for using single quotes is laziness: we don't have to hit the shift key. We're not ones to disparage laziness, but the point is that there's no difference between the two. -Triple quotes are similar, but they can be used to define strings that go over several lines: +Triple quotes are similar, but they can be used to define strings that go over several lines: ```python longstring = """ -Four score and seven years ago +Four score and seven years ago Our fathers brought forth. """ print(longstring) ``` -This produces the output +This produces the output ```python Four score and seven years ago -Our fathers brought forth ... +Our fathers brought forth ... ``` The blank line comes from the empty space to the right of the first triple quote. And yes: we can make triple quotes from single quotes -- and this is more than we need. 
**Exercise.** Try the exact same code as above, but replace the triple quotes with single quotes. What happens? Why do you think that happened? -**Exercise.** Type in the following. Figure out what's going wrong. Fix it. +**Exercise.** Type in the following. Figure out what's going wrong. Fix it. ```python bad_string = 'Sarah's code' ``` -**Exercise.** Which of these are strings? Which aren't? +**Exercise.** Which of these are strings? Which aren't? ```python apple @@ -313,73 +313,73 @@ apple ``` -## Add comments to your code +## Add comments to your code -One of the rules of good code is that **we explain what we've done -- in the code**. In this class, we might think about writing code that one of our classmates can understand without help. These explanations are referred to as comments. +One of the rules of good code is that **we explain what we've done -- in the code**. In this class, we might think about writing code that one of our classmates can understand without help. These explanations are referred to as comments. -Add a comment with the hash character (#). Anything in a line after a hash is a comment, meaning it's ignored by Python. Here are some examples: +Add a comment with the hash character (#). Anything in a line after a hash is a comment, meaning it's ignored by Python. Here are some examples: ```python # everything that appears after this symbol is a comment! # comments help PEOPLE understand the code, but PYTHON ignores them! -# we're going to add 4 and 5 -4 + 5 # here we're doing it -print(4+5) # here we're printing it +# we're going to add 4 and 5 +4 + 5 # here we're doing it +print(4+5) # here we're printing it ``` -We often put comments like this in our code. Usually not quite this basic, but close. +We often put comments like this in our code. Usually not quite this basic, but close. -If we have a long comment, there's another method (one of our favorites): use triple quotes. 
Officially triple quotes define strings just as single and double quotes do. Unofficially they're often used for longer comments. Here's an example from the start of our test program: +If we have a long comment, there's another method (one of our favorites): use triple quotes. Officially triple quotes define strings just as single and double quotes do. Unofficially they're often used for longer comments. Here's an example from the start of our test program: -```python +```python """ -Data Bootcamp test program checks to see that we're running Python 3. -Written by Dave Backus, March 2015 -Created with Python 3.4 +Data Bootcamp test program checks to see that we're running Python 3. +Written by Dave Backus, March 2015 +Created with Python 3.4 """ ``` -We recommend putting something like this at the top of every program you write. You'll thank us later, when you go back and try to figure out what it is you did a few weeks ago. +We recommend putting something like this at the top of every program you write. You'll thank us later, when you go back and try to figure out what it is you did a few weeks ago. **Exercise moving forward.** Practice writing comments **all the time**. Whenever you learn something new, write a comment explaining it in your code. It feels tedious, but the best coders always explain their work. It's a good habit to develop. -## Running programs in Spyder +## Running programs in Spyder -If we're writing longer programs, it's generally easier to type them into an editor where we can correct any mistakes we make, just as we do in a word processing program. +If we're writing longer programs, it's generally easier to type them into an editor where we can correct any mistakes we make, just as we do in a word processing program. -Let's give it a try. Type or copy these commands into a new file in the Spyder editor: +Let's give it a try. 
Type or copy these commands into a new file in the Spyder editor: -```python +```python a = 'some' -b = 'thing' -c = a + b +b = 'thing' +c = a + b print('c =', c) ``` -(Hint: Click on File at the top, then New file.) +(Hint: Click on File at the top, then New file.) -To run this code, we need to save it in a file. In Spyder's editor, click on "File" in the upper left corner and choose "Save." To set the file name (the default `Untitled0.py` isn't all that informative), we choose "Save as" and pick a file name like `somename.py`. The part after the period -- the "extension" -- is important, it identifies the file as a Python program. Make sure to save it in the `Data_Bootcamp` directory so we can find it later. +To run this code, we need to save it in a file. In Spyder's editor, click on "File" in the upper left corner and choose "Save." To set the file name (the default `Untitled0.py` isn't all that informative), we choose "Save as" and pick a file name like `somename.py`. The part after the period -- the "extension" -- is important, it identifies the file as a Python program. Make sure to save it in the `Data_Bootcamp` directory so we can find it later. -Once we've saved the file, we can run it in Spyder by clicking on the green arrow at the top of the editor window. The first three lines produce no output. The last one produces the output `c = something` in the IPython console. +Once we've saved the file, we can run it in Spyder by clicking on the green arrow at the top of the editor window. The first three lines produce no output. The last one produces the output `c = something` in the IPython console. -**Spyder's toolbar.** The red arrow below points to the run button, which runs the whole file. +**Spyder's toolbar.** The red arrow below points to the run button, which runs the whole file. 
![Spyder toolbar](figs/spyder_toolbar.png "Spyder's toolbar") -## Code cells in Spyder +## Code cells in Spyder -Spyder has another cool feature we use a lot: we can carve out blocks of code ("cells") and run them separately. That way we can try out small pieces of code one at a time. +Spyder has another cool feature we use a lot: we can carve out blocks of code ("cells") and run them separately. That way we can try out small pieces of code one at a time. -The idea is to put the separator `#%%` (hash, percent, percent) between blocks of code, called **cells**, so that we can run them separately. Consider the code: +The idea is to put the separator `#%%` (hash, percent, percent) between blocks of code, called **cells**, so that we can run them separately. Consider the code: -```python +```python x = 2 y = 3 z = x/y @@ -387,109 +387,109 @@ print('z =', z) #%% a = 'some' b = 'thing' -c = a + b +c = a + b print('c =', c) ``` -The separator `#%%` in the middle divides the file into two cells that we can run one at a time. That allows us to run and test blocks of code without running the whole program. It doesn't make much difference with code this simple, but in longer programs it can be a real time saver. +The separator `#%%` in the middle divides the file into two cells that we can run one at a time. That allows us to run and test blocks of code without running the whole program. It doesn't make much difference with code this simple, but in longer programs it can be a real time saver. -Here's how it works: +Here's how it works: -* In the Spyder editor, click on a code cell. The cell will indicate its selection with a darker background. -* Now go to the toolbar above the editor. The large green triangle runs the whole program. The one to its right displays the text "Run current cell" if you move the cursor to it. Click on it to run the selected cell. +* In the Spyder editor, click on a code cell. The cell will indicate its selection with a darker background. 
+* Now go to the toolbar above the editor. The large green triangle runs the whole program. The one to its right displays the text "Run current cell" if you move the cursor to it. Click on it to run the selected cell. -**Exercise.** Copy or type the code above into your Python program. Save it. Run each cell, one at a time. Check the output to make sure it worked. +**Exercise.** Copy or type the code above into your Python program. Save it. Run each cell, one at a time. Check the output to make sure it worked. -**Exercise.** Add comments to the code you just wrote. Do it now, while you're still thinking about it. +**Exercise.** Add comments to the code you just wrote. Do it now, while you're still thinking about it. ## Lists - -A Python list is what it sounds like: an ordered collection of items. The items can be lots of things: numbers, strings, variables, or even other lists. +A Python list is what it sounds like: an ordered collection of items. The items can be lots of things: numbers, strings, variables, or even other lists. -**Creating lists.** We create lists by putting **square brackets around a collection of items** separated by commas. Here are some examples. Type each line of code into Spyder's IPython console and run them. +**Creating lists.** We create lists by putting **square brackets around a collection of items** separated by commas. Here are some examples. Type each line of code into Spyder's IPython console and run them. -```python +```python numberlist = [1, 5, -3] -stringlist = ['hi', 'hello', 'hey'] +stringlist = ['hi', 'hello', 'hey'] ``` -These are, of course, assignments: the lists on the right are assigned to the variables on the left. +These are, of course, assignments: the lists on the right are assigned to the variables on the left. 
We can also make lists of variables: -```python +```python a = 'some' b = 'thing' -c = a + b -variablelist = [a, b, c] +c = a + b +variablelist = [a, b, c] ``` -Or we can combine variables, numbers, and strings: +Or we can combine variables, numbers, and strings: -```python -randomlist = [1, "hello", a] +```python +randomlist = [1, "hello", a] ``` **Exercise.** Add `print(numberlist)` and `print(variablelist)` to your code in Spyder and hit the run button. Note the format of the output. What do the square brackets tell us? The single quotes around some entries? -**Combining lists.** We can combine lists (literally) by adding them. The statement `biglist = numberlist + stringlist` produces a list containing all the elements of `numberlist` plus all the elements of `stringlist`, giving us six items altogether. Type `print(biglist)` to make sure that's the case. +**Combining lists.** We can combine lists (literally) by adding them. The statement `biglist = numberlist + stringlist` produces a list containing all the elements of `numberlist` plus all the elements of `stringlist`, giving us six items altogether. Type `print(biglist)` to make sure that's the case. -In contrast, the statement `biglist2 = [numberlist, stringlist]` produces a new list with two items: the lists `numberlist` and `stringlist`. It's what we might call a "list of lists." That's not something we're likely to do, to be honest. (We did it once, but that was an accident.) The point is simply that lists are flexible objects. +In contrast, the statement `biglist2 = [numberlist, stringlist]` produces a new list with two items: the lists `numberlist` and `stringlist`. It's what we might call a "list of lists." That's not something we're likely to do, to be honest. (We did it once, but that was an accident.) The point is simply that lists are flexible objects. -**Exercise.** How would you explain a list to a classmate? +**Exercise.** How would you explain a list to a classmate? 
-**Exercise.** Run the statements +**Exercise.** Run the statements -```python +```python mixedlist = [a, b, c, numberlist] print(mixedlist) ``` -What is the output? How would you explain it to a classmate? +What is the output? How would you explain it to a classmate? -**Exercise.** Suppose `x = [1, 2, 3]` is a list. What is `x + x`? `2*x`? Try them and see. +**Exercise.** Suppose `x = [1, 2, 3]` is a list. What is `x + x`? `2*x`? Try them and see. ## Python's built-in functions -We now have several kinds of **objects** to work with: numbers, strings, and lists. There are more on the way, but that's a good start. And yes, the formal term is really **objects**. But what can we do with them? Python has two basic ways to express things we do with objects: **functions** and **methods**. We'll talk about functions in this section and methods in the next one. +We now have several kinds of **objects** to work with: numbers, strings, and lists. There are more on the way, but that's a good start. And yes, the formal term is really **objects**. But what can we do with them? Python has two basic ways to express things we do with objects: **functions** and **methods**. We'll talk about functions in this section and methods in the next one. -Python has a lot of basic "built-in" functions. We've already seen the `print()` function. Here are some others we've found useful. +Python has a lot of basic "built-in" functions. We've already seen the `print()` function. Here are some others we've found useful. -**The `type()` function.** This tells us what kind of object we have. +**The `type()` function.** This tells us what kind of object we have. 
To see how it works, type the following into the IPython console **one line at a time**: -```python +```python type(2) type(2.5) c = 'something' -type(c) +type(c) stringlist = ['a', 'b', c] type(stringlist) type('12') ``` -Think about this on your own for a minute. What do you think you'll get? How does it compare to the real output? +Think about this on your own for a minute. What do you think you'll get? How does it compare to the real output? -Not to kill the suspense, but here's what we should see: +Not to kill the suspense, but here's what we should see: -* `type(2)` gives us the output `int`, which stands for "integer," a whole number like 1, 2, 3, and so on. Just to clarify: 2 is an integer. 2.5 is not. -* `type(2.5)` gives us `float`, a so-called "floating point number" like most of those we run across in Excel -- not a whole number. -* `type(c)` gives us `str`, which tells us that `c` is a string. -* `type(stringlist)` tells us that `stringlist` is a list. -* What is `type('12')`? That's a trick question: it's a string, too, even though it looks like a number. Remember: anything in quotes is a string. +* `type(2)` gives us the output `int`, which stands for "integer," a whole number like 1, 2, 3, and so on. Just to clarify: 2 is an integer. 2.5 is not. +* `type(2.5)` gives us `float`, a so-called "floating point number" like most of those we run across in Excel -- not a whole number. +* `type(c)` gives us `str`, which tells us that `c` is a string. +* `type(stringlist)` tells us that `stringlist` is a list. +* What is `type('12')`? That's a trick question: it's a string, too, even though it looks like a number. Remember: anything in quotes is a string. -The type function is more helpful than you might guess. A lot of what we do in programming is deal with objects of different types and, when necessary, convert one type to another. The first step is to identify the type of the object of interest. +The type function is more helpful than you might guess. 
A lot of what we do in programming is deal with objects of different types and, when necessary, convert one type to another. The first step is to identify the type of the object of interest. **Exercise.** Try each of these, one at a time, in the IPython console and explain the output: @@ -503,57 +503,57 @@ type('1') type('1.0') ``` -**Exercise.** Try `type(zoo)`. Why does it generate an error? What does the error mean? +**Exercise.** Try `type(zoo)`. Why does it generate an error? What does the error mean? -**Exercise.** Set `zoo = ['lions', 'bears']` and try `type(zoo)` again. What do you get this time? +**Exercise.** Set `zoo = ['lions', 'bears']` and try `type(zoo)` again. What do you get this time? -**The `len()` (length) function.** This tells us the length of an object. To see how it works, type the following in the IPython console one at a time: -```python +**The `len()` (length) function.** This tells us the length of an object. To see how it works, type the following in the IPython console one at a time: +```python len('hello') len(a) -len(c) -len(stringlist) +len(c) +len(stringlist) ``` -The first one gives us the number of characters in the string `'hello'` (namely 5). The next two do the same for the strings `a = 'some'` (4) and `c = 'something'` (9). The last one tells us the length of the list `stringlist` (3). Note that for strings, `len` gives us the number of characters. For lists, it gives us the number of items. +The first one gives us the number of characters in the string `'hello'` (namely 5). The next two do the same for the strings `a = 'some'` (4) and `c = 'something'` (9). The last one tells us the length of the list `stringlist` (3). Note that for strings, `len` gives us the number of characters. For lists, it gives us the number of items. **Exercise.** Try the code below. What's going on? 
-```python -len(4) +```python +len(4) len('4') ``` **Converting strings to numbers.** Suppose we have an object of one type (the string `'11.32'`) and want to use it as another (the number `11.32`). We need to convert it from one type to another, from a string to a floating point number. We can use the function `float()`, then use `type()` to check: -```python +```python f = float('11.32') type(f) ``` -The result `f` is the floating point number `11.32`. +The result `f` is the floating point number `11.32`. The function `int()` lets us do the same for integers. Convert the string `'11'` to an integer, then check with `type()` again: -```python +```python i = int('11') type(i) ``` -The result `i` is the integer 11. +The result `i` is the integer 11. **Converting numbers to strings.** Similarly, we can convert a number back to a string with `str()`: -```python +```python s = str(11) print('s has type', type(s)) t = str(f) # recall that f = float('11.32') print('t has type', type(t)) ``` -**Exercise.** What is the length of the string `'11.32'`? +**Exercise.** What is the length of the string `'11.32'`? -**Exercise.** What happens if we apply the function `float` to the string `'some'`? +**Exercise.** What happens if we apply the function `float` to the string `'some'`? -**Exercise.** This one is tricky, but it came up in some work we were doing. Suppose `year` is a string containing the year of a particular piece of data; for example, `year = '2013'`. How would we construct a string for the following year? Hint: Start by converting year to an integer. +**Exercise.** This one is tricky, but it came up in some work we were doing. Suppose `year` is a string containing the year of a particular piece of data; for example, `year = '2013'`. How would we construct a string for the following year? Hint: Start by converting year to an integer. **Converting strings to lists.** One more type conversion: We can convert a string to a list of its characters. 
For example, we convert the string `x = 'abc'` to the list `['a', 'b', 'c']` with `list(x)`. Run this code to see how it works: @@ -564,30 +564,30 @@ y = list(x) print(y) ``` -**Exercise.** What is the result of the statement `list('123')`? +**Exercise.** What is the result of the statement `list('123')`? ## Objects and methods -As we noted, lots of things in Python are **objects**. **Methods** are ready-to-go things we can do with these objects. The available methods depend on the object. A lot of Python is "object-oriented," which means we apply methods to objects to accomplish what you might think you need a function for. Trust us, the jargon is harder than just doing it. +As we noted, lots of things in Python are **objects**. **Methods** are ready-to-go things we can do with these objects. The available methods depend on the object. A lot of Python is "object-oriented," which means we apply methods to objects to accomplish what you might think you need a function for. Trust us, the jargon is harder than just doing it. + (Experts might say at this point: an object is an "instance" of a "class." Ignore them.) +--> Functions and methods differ primarily in their syntax: * Syntax of a **function**: `function(object)` * Syntax of a **method**: `object.method` -We used the former in the previous section and consider the latter here. +We used the former in the previous section and consider the latter here. -What methods are available to work with a given object? Take, for example, the list `numberlist = [1, 5, -3]`. To get the list of available methods, we use the IPython console and type: +What methods are available to work with a given object? Take, for example, the list `numberlist = [1, 5, -3]`. To get the list of available methods, we use the IPython console and type: ```python numberlist.[tab] # here you hit the tab key, don't type in the word "tab" ``` -This wonderful piece of technology is referred to as **tab completion**. 
The ingredients here are the object (here `numberlist`), the period or dot, and the tab key. When you hit tab, a window will pop up with a list of methods in alphabetical order. In our example, the list starts like this: +This wonderful piece of technology is referred to as **tab completion**. The ingredients here are the object (here `numberlist`), the period or dot, and the tab key. When you hit tab, a window will pop up with a list of methods in alphabetical order. In our example, the list starts like this: ```python numberlist.append @@ -598,9 +598,9 @@ numberlist.count If we want more information about a method, we can type `object.method?` in the IPython console or `object.method` in the object inspector. For the method `numberlist.append`, we get the description -```python +```python Definition: append(object) -Type: Function of None module +Type: Function of None module L.append(object) -> None -- append object to end. ``` @@ -610,72 +610,72 @@ Well, that's pretty opaque, maybe we oversold this approach. What `append` does numberlist.append(7) print(numberlist) ``` - + That's another way to get information about a method: try it and see what happens. -**Example.** Set `firstname = 'Chase'`. The method `lower` converts it to lower case. If we type `firstname.lower` into the object inspector, we see that it comes with parentheses for additional inputs. So we type `firstname.lower()` into the IPython console. The response is `'chase'`. The parentheses are there to provide additional inputs -- arguments, we call them. Without the parentheses, it doesn't work. +**Example.** Set `firstname = 'Chase'`. The method `lower` converts it to lower case. If we type `firstname.lower` into the object inspector, we see that it comes with parentheses for additional inputs. So we type `firstname.lower()` into the IPython console. The response is `'chase'`. The parentheses are there to provide additional inputs -- arguments, we call them. 
Without the parentheses, it doesn't work. -**Exercise.** Find a method to convert `firstname` to all upper case letters. +**Exercise.** Find a method to convert `firstname` to all upper case letters. -**Exercise.** This one also came up in our work. Suppose we have a variable `z = '12,345.6'`. What is its type? Convert it to a floating point number without the comma. Hint: Use tab completion to find a method to get rid of the comma. +**Exercise.** This one also came up in our work. Suppose we have a variable `z = '12,345.6'`. What is its type? Convert it to a floating point number without the comma. Hint: Use tab completion to find a method to get rid of the comma. -**Exercise.** Run the code +**Exercise.** Run the code -```python +```python firstname = 'John' lastname = 'Lennon' -firstlast = firstname + ' ' + lastname +firstlast = firstname + ' ' + lastname ``` -Find a method to replace the n's in `firstlast` with asterisks. +Find a method to replace the n's in `firstlast` with asterisks. ## Python 2 and 3 -There's a lot of code around written in earlier versions of Python, most commonly Python 2.7. It's there because the people who wrote it started before Python 3 was up and running. Since we're starting from scratch, we are planting ourselves firmly in Python 3 territory. Still, you're likely to run across examples of Python 2 on the internet. The easiest way to tell the difference is the print command: `print(x)` in Python 3 was `print x` (no parentheses) in Python 2. There are lots of other differences, which is why it's essential we all use Python 3. +There's a lot of code around written in earlier versions of Python, most commonly Python 2.7. It's there because the people who wrote it started before Python 3 was up and running. Since we're starting from scratch, we are planting ourselves firmly in Python 3 territory. Still, you're likely to run across examples of Python 2 on the internet. 
The easiest way to tell the difference is the print command: `print(x)` in Python 3 was `print x` (no parentheses) in Python 2. There are lots of other differences, which is why it's essential we all use Python 3. -## Review +## Review -Work with your neighbor on these review exercises: +Work with your neighbor on these review exercises: -**Exercise.** What should you do if you don't follow what we're doing in class? +**Exercise.** What should you do if you don't follow what we're doing in class? -**Exercise.** Assign the value `12.34` to the variable `xyz`. What "type" is this variable? How would you find out? +**Exercise.** Assign the value `12.34` to the variable `xyz`. What "type" is this variable? How would you find out? -**Exercise.** Create a list that contains the first names of three friends. +**Exercise.** Create a list that contains the first names of three friends. +**Exercise.** Set `first = 'Hersh'` and `last = 'Iyer'`. Construct a string `bothnames` that consists of the first name, a space, and the last name. *Bonus:* Do this with the last name in upper-case (capital) letters. +--> -**Exercise.** Set `name = 'Jones'`. Use (a) tab completion to find a method that converts `name` to upper case (capital) letters and (b) the Object inspector to find out how to use that method. *Bonus:* How else can you get help in Spyder for methods and functions? +**Exercise.** Set `name = 'Jones'`. Use (a) tab completion to find a method that converts `name` to upper case (capital) letters and (b) the Object inspector to find out how to use that method. *Bonus:* How else can you get help in Spyder for methods and functions? -**Exercise (challenging).** Use tab completion and the Object inspector to find and apply a method to the string `name` that counts the number of appearances of the letter s. Use `name = 'Ulysses'` as a test case. 
+**Exercise (challenging).** Use tab completion and the Object inspector to find and apply a method to the string `name` that counts the number of appearances of the letter s. Use `name = 'Ulysses'` as a test case. -## Resources +## Resources -If you'd like another source for comparison, here are some good introductions to basic Python and related topics: +If you'd like another source for comparison, here are some good introductions to basic Python and related topics: -* Codecademy has an excellent [Introduction to Python](http://www.codecademy.com/en/tracks/python). You run Python in their online environment, which is really helpful when you're starting out. It uses Python 2, so the print statement has the form `print x` rather than `print(x)`. If we were to recommend one outside resource, this would be it. You should think seriously of working your way through it in parallel with this course. If you do, you can stop (as far as this course is concerned) when you get to Advanced Topics. -* Here's a [list of free tutorials](http://noeticforce.com/best-free-tutorials-to-learn-python-pdfs-ebooks-online-interactive), but we think you can stop with the first one, Codecademy. -* The official [Python tutorial](https://docs.python.org/3.4/tutorial/introduction.html) is very good. It's also a good idea to get used to reading official documentation like this. There are times when it's unavoidable. -* Mark Lutz's [Learning Python](http://www.amazon.com/Learning-Python-5th-Mark-Lutz/dp/1449355730/) is a 1600-page monster that covers lots of things we won't use. But it's clear and thorough, and comes with the elusive Glenn Okun stamp of approval. He tells us the Kindle version comes with free updates. +* Codecademy has an excellent [Introduction to Python](http://www.codecademy.com/en/tracks/python). You run Python in their online environment, which is really helpful when you're starting out. 
It uses Python 2, so the print statement has the form `print x` rather than `print(x)`. If we were to recommend one outside resource, this would be it. You should think seriously of working your way through it in parallel with this course. If you do, you can stop (as far as this course is concerned) when you get to Advanced Topics. +* Here's a [list of free tutorials](http://noeticforce.com/best-free-tutorials-to-learn-python-pdfs-ebooks-online-interactive), but we think you can stop with the first one, Codecademy. +* The official [Python tutorial](https://docs.python.org/3.4/tutorial/introduction.html) is very good. It's also a good idea to get used to reading official documentation like this. There are times when it's unavoidable. +* Mark Lutz's [Learning Python](http://www.amazon.com/Learning-Python-5th-Mark-Lutz/dp/1449355730/) is a 1600-page monster that covers lots of things we won't use. But it's clear and thorough, and comes with the elusive Glenn Okun stamp of approval. He tells us the Kindle version comes with free updates. - + These sources go well beyond what we do in this chapter, but we'll catch up with some of it later on. diff --git a/py-fun2.md b/py-fun2.md index ab5abea..e56bcce 100644 --- a/py-fun2.md +++ b/py-fun2.md @@ -1,64 +1,64 @@ -# Python fundamentals 2 +# Python fundamentals 2 --- -**Overview.** More core Python. Part 2 of 2. +**Overview.** More core Python. Part 2 of 2. -**Python tools.** Boolean variables, comparisons, conditionals (if, else), slicing, loops (for), function definitions. +**Python tools.** Boolean variables, comparisons, conditionals (if, else), slicing, loops (for), function definitions. -**Buzzwords.** Code block, data structures, list comprehension, gotcha, PEP8. +**Buzzwords.** Code block, data structures, list comprehension, gotcha, PEP8. **Code.** [Link](https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/Python/bootcamp_fundamentals_2.py). 
--- -We continue our overview of Python's core language, which lays a foundation for the rest of the course. We go through the material quickly, since we're more interested in the general ideas than the details. You will feel like you're drinking from a fire hose, but it will sink in if you **stick with it**. +We continue our overview of Python's core language, which lays a foundation for the rest of the course. We go through the material quickly, since we're more interested in the general ideas than the details. You will feel like you're drinking from a fire hose, but it will sink in if you **stick with it**. -## Reminders +## Reminders Some things from previous chapters that we'll use a lot: -* Assignments and variables. We say we assign what's on the right to the thing on the left: `x = 17.4` assigns the number `17.4` to the variable `x`. +* Assignments and variables. We say we assign what's on the right to the thing on the left: `x = 17.4` assigns the number `17.4` to the variable `x`. -* Strings. Strings are collections of characters in quotes: `'this is a string'`. +* Strings. Strings are collections of characters in quotes: `'this is a string'`. -* Lists. Lists are collections of things in square brackets: `[1, 'help', 3.14159]`. +* Lists. Lists are collections of things in square brackets: `[1, 'help', 3.14159]`. * Number types: integers vs. floats. Examples of integers include -1, 2, 5, 42. They cannot involve fractions. Floats use decimal points: `11.32`. * The `print()` function. Use `print(‘something’, x)` to display the value(s) of the object(s) in parentheses. -* The `type()` function. The command `type(x)` tells us what kind of object `x` is. Past examples include integers, floating point numbers, strings, and lists. +* The `type()` function. The command `type(x)` tells us what kind of object `x` is. Past examples include integers, floating point numbers, strings, and lists. -* Number and string conversions. 
Use `str()` to convert a float or integer to a string. Use `float()` or `int()` to convert a string into a float or integer. +* Number and string conversions. Use `str()` to convert a float or integer to a string. Use `float()` or `int()` to convert a string into a float or integer. -* Methods and objects. It's common in Python to work with objects using methods. We apply the method `justdoit` to the object `x` by typing `x.justdoit`. +* Methods and objects. It's common in Python to work with objects using methods. We apply the method `justdoit` to the object `x` by typing `x.justdoit`. -* Spyder. An environment for writing Python programs. The various windows include an editor, an IPython console, and the Object explorer. +* Spyder. An environment for writing Python programs. The various windows include an editor, an IPython console, and the Object explorer. * Comments. Use the hash symbol `#` to add comments to your code and explain what you’re doing. -* Tab completion. To find the list of methods available for a hypothetical object `x`, type `x.[tab]` in Spyder's IPython console -- or in an IPython notebook. We call that "tab completion." +* Tab completion. To find the list of methods available for a hypothetical object `x`, type `x.[tab]` in Spyder's IPython console -- or in an IPython notebook. We call that "tab completion." -* Help. We can get help for a function or method `foo` by typing `foo?` in the IPython console or `foo` in the Object explorer. Try each of them with the `type()` function to remind yourself how this works. +* Help. We can get help for a function or method `foo` by typing `foo?` in the IPython console or `foo` in the Object explorer. Try each of them with the `type()` function to remind yourself how this works. -And while we're reviewing: Save the code file for this chapter in your `Data_Bootcamp` directory and open it in Spyder. 
+And while we're reviewing: Save the code file for this chapter in your `Data_Bootcamp` directory and open it in Spyder. ## Logical expressions (comparisons) -Sometimes we want to do one thing if a condition is true, and another if it's false. For example, we might want to use observations for which the date is after January 1980, the country is India, or the population is greater than 5 million -- and not otherwise. +Sometimes we want to do one thing if a condition is true, and another if it's false. For example, we might want to use observations for which the date is after January 1980, the country is India, or the population is greater than 5 million -- and not otherwise. -Python does this with **comparisons**, so called because they involve the comparison of one thing with another. For example, the date of an observation with the date January 1980. The result of a comparison is either `True` or `False`. We refer to true/false variables like this as **Boolean**, a name derived from the 19th century mathematician and logician [George](https://espresso.economist.com/a3e8029408056a0791626262beb1e74d) [Boole](https://en.wikipedia.org/wiki/George_Boole). +Python does this with **comparisons**, so called because they involve the comparison of one thing with another. For example, the date of an observation with the date January 1980. The result of a comparison is either `True` or `False`. We refer to true/false variables like this as **Boolean**, a name derived from the 19th century mathematician and logician [George](https://espresso.economist.com/a3e8029408056a0791626262beb1e74d) [Boole](https://en.wikipedia.org/wiki/George_Boole). Let's try some simple examples to see what we're dealing with. Suppose we enter `1 > 0` in the IPython console. What does this mean? The input and output look like this: -```python +```python In [1]: 1 > 0 Out[1]: True ``` -The comparison `1 > 0` is interpreted as a question: Is 1 greater than 0? The answer is `True`. 
If we enter `1 < 0` instead, the answer is `False`. +The comparison `1 > 0` is interpreted as a question: Is 1 greater than 0? The answer is `True`. If we enter `1 < 0` instead, the answer is `False`. A comparison is a Python object, but what kind of object is it? We can check with the `type()` function: @@ -70,10 +70,10 @@ The answer in this case is `bool` (that is, Boolean), the name we give to expres Python comes with a list of "operators" we can use in comparisons. You can find the complete set in the [Python documentation](https://docs.python.org/3.4/library/stdtypes.html), but common ones include: -* Equals: `==` -* Greater than: `>` -* Greater than or equals: `>=` -* Does not equal: `!=` (not equals). +* Equals: `==` +* Greater than: `>` +* Greater than or equals: `>=` +* Does not equal: `!=` (not equals). We can reverse comparisons with the word `not`. For example: ```python @@ -83,40 +83,40 @@ Out[2]: False Think about that for a minute. And remind yourself that spaces don't matter in Python expressions. -**Exercise.** What is `2 >= 1`? `2 >= 2`? `not 2 >= 1`? If you're not sure, try them in the IPython console and see what you get. +**Exercise.** What is `2 >= 1`? `2 >= 2`? `not 2 >= 1`? If you're not sure, try them in the IPython console and see what you get. **Exercise.** What is `2 + 2 == 4`? How about `1 + 3 != 4`? -**Exercise.** What is `"Sarah" == 'Sarah'`? Can you explain why? +**Exercise.** What is `"Sarah" == 'Sarah'`? Can you explain why? -We can do the same thing with variables. Suppose we want to compare the values of variables `x` and `y`. Which one is bigger? To see how this works, we run the code +We can do the same thing with variables. Suppose we want to compare the values of variables `x` and `y`. Which one is bigger? To see how this works, we run the code ```python x = 2*3 -y = 2**3 +y = 2**3 print('x greater than y is', x > y) ``` -Here `x = 6` and `y = 8`, so the expression `x > y` (is `x` greater than `y`?) is false. 
+Here `x = 6` and `y = 8`, so the expression `x > y` (is `x` greater than `y`?) is false. -**Exercise.** What do you think this code produces? +**Exercise.** What do you think this code produces? ```python -name1 = 'Chase' -name2 = 'Spencer' -check = name1 > name2 +name1 = 'Chase' +name2 = 'Spencer' +check = name1 > name2 print(check) ``` Run it and see if you're right. What type of variable is `check`? What is its value? Is Chase greater than Spencer? @@ -126,38 +126,38 @@ What are the values of `test1` and `test2`? The expression `conditiona and cond ## Conditionals (`if` and `else`) -Now that we know how to tell whether a comparison is true or false, we can build that into our code. "Conditional" statements allow us to do different things depending on the result of a comparison or Boolean variable, which we refer to as a **condition**. The logic looks like this: +Now that we know how to tell whether a comparison is true or false, we can build that into our code. "Conditional" statements allow us to do different things depending on the result of a comparison or Boolean variable, which we refer to as a **condition**. The logic looks like this: - if a condition is true, then do something. - if a condition is false, do something else (or do nothing). + if a condition is true, then do something. + if a condition is false, do something else (or do nothing). -To repeat: a condition here is a comparison or Boolean variable and is either true or false. +To repeat: a condition here is a comparison or Boolean variable and is either true or false. - + **`if` statements** tell the program what to do if the condition is true: ```python if 1 > 0: # read this like "if 1>0 IS TRUE, then do the thing on the next line" print('1 is greater than 0') -``` -The syntax here is precise: +``` +The syntax here is precise: -* The `if` statement **ends with a colon**. That's standard Python syntax, we'll see it again. It's not optional. 
-* The code that follows is **indented exactly four spaces**. Also not optional. Spyder does it automatically. +* The `if` statement **ends with a colon**. That's standard Python syntax, we'll see it again. It's not optional. +* The code that follows is **indented exactly four spaces**. Also not optional. Spyder does it automatically. -Both of these features -- a colon at the end of the first line, indent the rest four spaces -- show up in lots of Python code. It's very compact, and the indentation makes the code easy to read. +Both of these features -- a colon at the end of the first line, indent the rest four spaces -- show up in lots of Python code. It's very compact, and the indentation makes the code easy to read. -**Exercise.** Change the code to +**Exercise.** Change the code to ```python if 1 < 0: - print('1 is less than 0') -``` -What do you think happens? Try it and see. + print('1 is less than 0') +``` +What do you think happens? Try it and see. -Here's another example. Again, we do something if the condition is true, nothing if the condition is false. In this example, the condition is `x > 6`. If it's true, we print the number. If it's false, we do nothing. The code is +Here's another example. Again, we do something if the condition is true, nothing if the condition is false. In this example, the condition is `x > 6`. If it's true, we print the number. If it's false, we do nothing. The code is ```python x = 7 # we can change this later and see what happens @@ -165,93 +165,93 @@ if x > 6: print('x =', x) print('Done!') -``` -Here we've set `x = 7`, which makes the condition `x > 6` true. The `if` statement then directs the program to print `x`. The blank lines are optional; they make the code easier to read, which is generally a good thing. The statement `print('Done!')` is just there to tell us that the program finished. +``` +Here we've set `x = 7`, which makes the condition `x > 6` true. The `if` statement then directs the program to print `x`. 
The blank lines are optional; they make the code easier to read, which is generally a good thing. The statement `print('Done!')` is just there to tell us that the program finished. -**Exercise.** What happens if we set `x = 4` at the top? How do we know? +**Exercise.** What happens if we set `x = 4` at the top? How do we know? -**`else` statements** tell the program what to do if the condition is false. If we want to do one thing if a condition is true and another if it is false, we would use `if` for the first and `else` for the second. The second part has been missing so far. Here's an example: -```python +**`else` statements** tell the program what to do if the condition is false. If we want to do one thing if a condition is true and another if it is false, we would use `if` for the first and `else` for the second. The second part has been missing so far. Here's an example: +```python x = 7 -condition = x > 6 +condition = x > 6 if condition: - print('if branch') # do if true + print('if branch') # do if true print(condition) else: - print('else branch') # do if false + print('else branch') # do if false print(condition) ``` -The `else` statement adds the second branch to the decision tree: what to do if the condition is false. Try this with `x = 4` and `x = 7` to see both branches in action. +The `else` statement adds the second branch to the decision tree: what to do if the condition is false. Try this with `x = 4` and `x = 7` to see both branches in action. **Exercise.** Start with the assignments -```python +```python name1 = 'Dave' name2 = 'Glenn' ``` -(The names on the right can be anything, but let's start with these.) Write a program using `if` and `else` that prints the name that comes first in alphabetical order. +(The names on the right can be anything, but let's start with these.) Write a program using `if` and `else` that prints the name that comes first in alphabetical order. 
-## Slicing strings and lists +## Slicing strings and lists -We can access the elements of strings and lists by specifying the item number in square brackets. This operation is referred to as **slicing**, probably because we're slicing off pieces, like a cake. The only tricky part of this is remembering that **Python starts numbering at zero**. +We can access the elements of strings and lists by specifying the item number in square brackets. This operation is referred to as **slicing**, probably because we're slicing off pieces, like a cake. The only tricky part of this is remembering that **Python starts numbering at zero**. -**Exercise.** Take the string `a = 'some'`. What is `a[1]`? +**Exercise.** Take the string `a = 'some'`. What is `a[1]`? -What just happened? Python starts numbering at zero. If we want the first item/letter, we use `a[0]`. If we want the second, we use `a[1]`. And so on. We can summarize the numbering convention by writing the word `some` on a piece of paper. Below it, write the numbers, in order: 0, 1, 2, 3. Label this row "counting forward." +What just happened? Python starts numbering at zero. If we want the first item/letter, we use `a[0]`. If we want the second, we use `a[1]`. And so on. We can summarize the numbering convention by writing the word `some` on a piece of paper. Below it, write the numbers, in order: 0, 1, 2, 3. Label this row "counting forward." -We can also count backward, but again Python has its own numbering convention. If we want the last letter, we use `a[-1]`. And if we want the one before the last one, we type `a[-2]`. In this case we get the same answer if we type `a[2]`. Both give us `'m'`. +We can also count backward, but again Python has its own numbering convention. If we want the last letter, we use `a[-1]`. And if we want the one before the last one, we type `a[-2]`. In this case we get the same answer if we type `a[2]`. Both give us `'m'`. -Let's track this "backward" numbering system in our example. 
Below the "counting forward" numbers, start another row. Below the letter `e` write -1. As we move to the left, we write -2, -3, -4. Label this row "counting backward." +Let's track this "backward" numbering system in our example. Below the "counting forward" numbers, start another row. Below the letter `e` write -1. As we move to the left, we write -2, -3, -4. Label this row "counting backward." -**Exercise.** Take the string `firstname = 'Monty'` and write below it the forward and backward counting conventions. What is the third letter -- `n` -- in each system? +**Exercise.** Take the string `firstname = 'Monty'` and write below it the forward and backward counting conventions. What is the third letter -- `n` -- in each system? -**Exercise.** Find the last letter of the string `lastname = 'Python'`. Find the second to last letter using both the forward and backward counting conventions. +**Exercise.** Find the last letter of the string `lastname = 'Python'`. Find the second to last letter using both the forward and backward counting conventions. -We can do the same thing with lists, but the items here are the elements of a list rather than the characters in a string. The counting works the same way. Let's see if we can teach ourselves. +We can do the same thing with lists, but the items here are the elements of a list rather than the characters in a string. The counting works the same way. Let's see if we can teach ourselves. -**Exercise.** Take the list `numberlist = [1, 5, -3]`. Use slicing to set a variable `first` equal to the first item. Set another variable `last` equal to the last item. Set a third variable `middle` equal to the middle item. +**Exercise.** Take the list `numberlist = [1, 5, -3]`. Use slicing to set a variable `first` equal to the first item. Set another variable `last` equal to the last item. Set a third variable `middle` equal to the middle item. 
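The forward and backward numbering conventions described above can be checked directly in the console. This is a minimal sketch, using the string `a = 'some'` from the text; the list `stuff` is a hypothetical example of ours, not part of the chapter.

```python
a = 'some'

# counting forward: positions 0, 1, 2, 3
print(a[0], a[3])       # first and last letters

# counting backward: positions -4, -3, -2, -1
print(a[-1], a[-4])     # last and first letters

# both conventions can point at the same item
print(a[2] == a[-2])

# lists are numbered the same way (hypothetical example list)
stuff = ['x', 'y', 'z']
print(stuff[0], stuff[-1])
```

If the two rows of numbers you wrote on paper are right, the third `print` reports that `a[2]` and `a[-2]` are the same letter.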
-## More slicing +## More slicing -We've seen how to "slice" (extract) an item from a string or list. Here we'll show how to slice a range of items. For example, slice the last five characters from the string `c = 'something'`. +We've seen how to "slice" (extract) an item from a string or list. Here we'll show how to slice a range of items. For example, slice the last five characters from the string `c = 'something'`. -Recall that in Python we start counting at zero. If we want the first letter in `c`, we use `c[0]`. If we want the second, we use `c[1]`. +Recall that in Python we start counting at zero. If we want the first letter in `c`, we use `c[0]`. If we want the second, we use `c[1]`. -If we want more than a single letter, we need to specify both the start and the end. Let's try some examples and see what they do: -```python +If we want more than a single letter, we need to specify both the start and the end. Let's try some examples and see what they do: +```python c = 'something' print('c[1] is', c[1]) print('c[1:2] is', c[1:2]) print('c[1:3] is', c[1:3]) print('c[1:] is', c[1:]) ``` -Let's go through this line by line: +Let's go through this line by line: -* The first print statement gives us `o`, the second letter of `something`. It's element 1 because we start numbering at zero. -* The next one does the same. Why not two letters? Let's try another one and see. -* The following line gives us `om`, the second and third letters. Why? Perhaps you figured it out. If not, this is the logic: the second number in `1:3`, namely `3`, is **one more than the end**. So the range `1:3` gives us the second and third letters. Confusing, for sure, but that's how it works. -* The last line has no second number. By convention it goes all the way to the end. The slice `c[1:]` goes from the second letter (the first number 1) to the end, giving us `omething`. +* The first print statement gives us `o`, the second letter of `something`. 
It's element 1 because we start numbering at zero. +* The next one does the same. Why not two letters? Let's try another one and see. +* The following line gives us `om`, the second and third letters. Why? Perhaps you figured it out. If not, this is the logic: the second number in `1:3`, namely `3`, is **one more than the end**. So the range `1:3` gives us the second and third letters. Confusing, for sure, but that's how it works. +* The last line has no second number. By convention it goes all the way to the end. The slice `c[1:]` goes from the second letter (the first number 1) to the end, giving us `omething`. -Some practice: +Some practice: -**Exercise.** Set `lastname = 'Python'`. Extract the string `'thon'`. +**Exercise.** Set `lastname = 'Python'`. Extract the string `'thon'`. -**Exercise.** Set `numlist = [1, 7, 4, 3]`. Extract the middle two items and assign them to the variable `middle`. Extract all but the first item and assign them to the variable `allbutfirst`. Extract all but the last item and assign them to the variable `allbutlast`. +**Exercise.** Set `numlist = [1, 7, 4, 3]`. Extract the middle two items and assign them to the variable `middle`. Extract all but the first item and assign them to the variable `allbutfirst`. Extract all but the last item and assign them to the variable `allbutlast`. -**Exercise.** Take the string `c = 'something'`. What is `c[:3] + c[3:]`? +**Exercise.** Take the string `c = 'something'`. What is `c[:3] + c[3:]`? @@ -259,74 +259,74 @@ Some practice: There are lots of times we want to do the same thing many times, either on one object or on many similar objects. An example of the latter is to print out a list of names, one at a time. An example of the former is to find an answer to progressively higher degrees of accuracy. We repeat an operation as many times as we need to get a desired degree of accuracy. Both situations come up a lot. 
-Here's an example in which we print all the items in a list, one at a time: +Here's an example in which we print all the items in a list, one at a time: ```python namelist = ['Chase', 'Dave', 'Sarah', 'Spencer'] # creates the list "namelist" # below, the word "item" is arbitrary. End the line with a colon. for item in namelist: # goes through the items in the list one at a time print(item) # indent this line exactly 4 spaces -# if there is code after this, we'd typically leave a blank line in-between +# if there is code after this, we'd typically leave a blank line in-between ``` -This produces the output -```python +This produces the output +```python Chase Dave Sarah Spencer ``` - +* The first line creates the list `namelist`. Nothing new here. +* The `for` statement goes through the items in the list one at a time. As with `if` statements, it ends with a colon. The variable name `item` is arbitrary. +* The line that follows is indented exactly four spaces. +* If there's code after this, we would typically leave a blank line in between. That's convention, not necessity. --> -Note that `item` changes value as we go through the loop. It's a variable whose value actually varies. +Note that `item` changes value as we go through the loop. It's a variable whose value actually varies. -We say here that we **iterate** over the items in the list and refer to the list as an **iterable**: that is, something we can iterate over. The terminology isn't important, but that's what it means if you run across it. +We say here that we **iterate** over the items in the list and refer to the list as an **iterable**: that is, something we can iterate over. The terminology isn't important, but that's what it means if you run across it. -**Exercise.** What happens if we replace `item` with `banana` in the code above? +**Exercise.** What happens if we replace `item` with `banana` in the code above? 
**Example.** We use a loop to compute the sum of the elements of a list of numbers: ```python -numlist = [4, -2, 5] +numlist = [4, -2, 5] sum = 0 for num in numlist: - sum = sum + num - -print(sum) -``` -The answer (of course) is 7. + sum = sum + num -**Exercise.** Adapt the example to compute the average of the elements of `numlist`. +print(sum) +``` +The answer (of course) is 7. +**Exercise.** Adapt the example to compute the average of the elements of `numlist`. -We can also run loops over the characters in a string. This one prints the letters in a word on separate lines: -```python + +We can also run loops over the characters in a string. This one prints the letters in a word on separate lines: +```python word = 'anything' for letter in word: - print(letter) + print(letter) ``` -(You might think we could come up with a more interesting example than this. Sadly no, but we welcome suggestions.) +(You might think we could come up with a more interesting example than this. Sadly no, but we welcome suggestions.) -**Example.** Here's one that combines a `for` loop with an `if` statement to identify and print the vowels in a word: -```python +**Example.** Here's one that combines a `for` loop with an `if` statement to identify and print the vowels in a word: +```python vowels = 'aeiouy' word = 'anything' for letter in word: if letter in vowels: print(letter) ``` -(Adapted from [SciPy lecture 1.2](https://scipy-lectures.github.io/intro/language/control_flow.html#advanced-iteration).) Describe what each line does as well as the overall result. +(Adapted from [SciPy lecture 1.2](https://scipy-lectures.github.io/intro/language/control_flow.html#advanced-iteration).) Describe what each line does as well as the overall result. **Example.** What about the consonants? 
Note the word `not` below: -```python +```python vowels = 'aeiouy' word = 'anything' for letter in word: @@ -335,213 +335,213 @@ for letter in word: ``` -**Exercise.** Take the list `stuff = ['cat', 3.7, 5, 'dog']`. This is somewhat demanding, but give it a try. +**Exercise.** Take the list `stuff = ['cat', 3.7, 5, 'dog']`. This is somewhat demanding, but give it a try. * Write a program that prints the elements of `stuff`. -* Write a program that tells us the `type` of each element of `stuff`. +* Write a program that tells us the `type` of each element of `stuff`. * Write a program that goes through the elements of `stuff` and prints only the elements that are strings; that is, the function `type` returns the value `str`. * Create another list `more_stuff = ['Sarah', 75, 42.5]`. Remember `append()`? Write a program to add the items from stuff to list `more_stuff` using `append()`. - - -## Loops over counters +## Loops over counters -We now know how loops work. There's one small variant that's worth explaining: loops that do something a fixed number of times. For example, we might want to sum or average the values of a variable. Or value a bond with a fixed number of coupon payments. Or something. +We now know how loops work. There's one small variant that's worth explaining: loops that do something a fixed number of times. For example, we might want to sum or average the values of a variable. Or value a bond with a fixed number of coupon payments. Or something. -The new ingredient is the `range()` function. `range(n)` gives us all the integers (whole numbers) from `0` to `n-1`. (If that sounds strange, remind yourself how slicing works.) And `range(n1, n2)` gives us all the whole numbers from `n1` to `n2-1`. We can use it in lots of ways, but loops are a prime example. +The new ingredient is the `range()` function. `range(n)` gives us all the integers (whole numbers) from `0` to `n-1`. (If that sounds strange, remind yourself how slicing works.) 
And `range(n1, n2)` gives us all the whole numbers from `n1` to `n2-1`. We can use it in lots of ways, but loops are a prime example. -Some examples illustrate how this works: +Some examples illustrate how this works: -**Example.** This is one of the simplest uses of `range()` in a loop: +**Example.** This is one of the simplest uses of `range()` in a loop: ```python -for number in range(5): # the variable "number" can be anything +for number in range(5): # the variable "number" can be anything print(number) ``` -It prints out the numbers 0, 1, 2, 3, and 4. (Why doesn't it go to 5?) This is like our earlier loops, but `range(5)` has replaced a list or string as the "iterable." +It prints out the numbers 0, 1, 2, 3, and 4. (Why doesn't it go to 5?) This is like our earlier loops, but `range(5)` has replaced a list or string as the "iterable." -Here's a minor variant: +Here's a minor variant: ```python for number in range(2,5): print(number) ``` -It prints out the numbers 2, 3, and 4. +It prints out the numbers 2, 3, and 4. -**Example.** We compute and print the squares of integers up to ten. ([Paul Ford](http://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/) notes: "just the sort of practical, useful program that always appears in programming tutorials to address the needs of people who urgently require a list of squares.") We do that with a `for` loop and the `range()` function: -```python +**Example.** We compute and print the squares of integers up to ten. ([Paul Ford](http://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/) notes: "just the sort of practical, useful program that always appears in programming tutorials to address the needs of people who urgently require a list of squares.") We do that with a `for` loop and the `range()` function: +```python for number in range(5): square = number**2 print('Number and its square:', number, square) ``` -Again we start at zero and work our way up to four. 
+Again we start at zero and work our way up to four. -**Example.** Here we compute the sum of integers from one to ten: +**Example.** Here we compute the sum of integers from one to ten: ```python -sum = 0 +sum = 0 for num in range(1,11): - sum = sum + num + sum = sum + num print(sum) ``` -The answer is 55. +The answer is 55. **Example.** Here's one that combines a loop and an `if` statement: ```python for num in range(10): if num > 5: - print(num) + print(num) ``` -**Exercise.** Write a loop that computes the first five powers of two. +**Exercise.** Write a loop that computes the first five powers of two. -**Example.** Consider a bond that pays annual coupons for a given number of years (the maturity) and a principal of 100 at the end. The yield-to-maturity is the rate at which these payments are discounted. Given values for the coupon and the yield, the price of the bond is +**Example.** Consider a bond that pays annual coupons for a given number of years (the maturity) and a principal of 100 at the end. The yield-to-maturity is the rate at which these payments are discounted. Given values for the coupon and the yield, the price of the bond is ```python -maturity = 10 -coupon = 5 -ytm = 0.05 # yield to maturity +maturity = 10 +coupon = 5 +ytm = 0.05 # yield to maturity -price = 0 +price = 0 for year in range(1, maturity+1): price = price + coupon/(1+ytm)**year -price = price + 100/(1+ytm)**maturity +price = price + 100/(1+ytm)**maturity print('The price of the bond is', price) ``` -The answer is 100, which we might know because the coupon and yield are the same once we convert the latter to a percentage. Python gives us `99.99999999999997`, which is the computer's version of 100. +The answer is 100, which we might know because the coupon and yield are the same once we convert the latter to a percentage. Python gives us `99.99999999999997`, which is the computer's version of 100. 
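One way to check the loop is to compare it with the standard closed-form (annuity) formula for the same bond. This is only a sketch: it repeats the calculation from the example with the yield stored in `ytm`, and the names `loop_price`, `discount`, and `formula_price` are ours, not the book's.

```python
maturity = 10
coupon = 5
ytm = 0.05              # yield to maturity

# price from the loop: discount each coupon, then add the principal
loop_price = 0
for year in range(1, maturity+1):
    loop_price = loop_price + coupon/(1+ytm)**year
loop_price = loop_price + 100/(1+ytm)**maturity

# price from the closed-form annuity formula
discount = (1+ytm)**(-maturity)
formula_price = coupon*(1 - discount)/ytm + 100*discount

print('Loop price:', loop_price)
print('Formula price:', formula_price)
print('Difference:', abs(loop_price - formula_price))
```

The two numbers should agree up to tiny rounding differences, which is a useful sanity check before changing the coupon or the maturity.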
-**Digression.** When we wrote this code, we used the variable name `yield` instead of `ytm`. Spyder marked this as `invalid syntax` with a warning sign to the left of the text. Evidently the name `yield` is reserved for something else. As a general rule, it's a good idea to pay attention to hints like this. +**Digression.** When we wrote this code, we used the variable name `yield` instead of `ytm`. Spyder marked this as `invalid syntax` with a warning sign to the left of the text. Evidently the name `yield` is reserved for something else. As a general rule, it's a good idea to pay attention to hints like this. -**Exercise.** In Portugal and Greece, policymakers have suggested reducing their debt by cutting the coupon payments and extending the maturity. How much do we reduce the value of the debt if we reduce the coupons to 2 and increase the maturity to 20? +**Exercise.** In Portugal and Greece, policymakers have suggested reducing their debt by cutting the coupon payments and extending the maturity. How much do we reduce the value of the debt if we reduce the coupons to 2 and increase the maturity to 20? - ## Defining our own functions -It's easy to create our own functions -- experienced programmers do it all the time. A common view is that you should never copy lines of your code. If you're copying, you're repeating yourself. What you should do instead is **write a function once and use it twice**. More than that, breaking a long program into a small number of functions makes the code easier for others to read, which is always a good thing. As we become more comfortable with Python we'll use functions more and more. +It's easy to create our own functions -- experienced programmers do it all the time. A common view is that you should never copy lines of your code. If you're copying, you're repeating yourself. What you should do instead is **write a function once and use it twice**. 
More than that, breaking a long program into a small number of functions makes the code easier for others to read, which is always a good thing. As we become more comfortable with Python we'll use functions more and more. -The simplest functions have two components: a **name** (what we call it) and a list of **input arguments**. Here's an example: -```python +The simplest functions have two components: a **name** (what we call it) and a list of **input arguments**. Here's an example: +```python def hello(firstname): # define the function - print('Hello,', firstname) + print('Hello,', firstname) -hello('Chase') # use the function -``` +hello('Chase') # use the function +``` Let's go through this line by line: -* The initial `def` statement defines the function, names it `hello`, identifies the input as `firstname`, and ends with a colon (:). -* The following statement(s) are indented the usual four spaces and specify what the function does. In this case, it prints `Hello,` followed by whatever `firstname` happens to be. Python understands that the function ends when the indentation ends. -* The last line "calls" the function with input `Chase`. Note that the name in the function's definition and its use need not be the same. +* The initial `def` statement defines the function, names it `hello`, identifies the input as `firstname`, and ends with a colon (:). +* The following statement(s) are indented the usual four spaces and specify what the function does. In this case, it prints `Hello,` followed by whatever `firstname` happens to be. Python understands that the function ends when the indentation ends. +* The last line "calls" the function with input `Chase`. Note that the name in the function's definition and its use need not be the same. -By convention, Python aficionados put two blank lines before and after function definitions to make them stand out more clearly. We use one here to save space. 
+By convention, Python aficionados put two blank lines before and after function definitions to make them stand out more clearly. We use one here to save space. -Our function `hello` has a name (`hello`) and an input argument (`firstname`), but returns no output. Output would create a new value that Python could call later in the code, like when you set `x = 2` then used `x` later on. Here we print something but produce no other output. +Our function `hello` has a name (`hello`) and an input argument (`firstname`), but returns no output. Output would create a new value that Python could call later in the code, like when you set `x = 2` then used `x` later on. Here we print something but produce no other output. -In other cases, we might want to send output back to the main program. We do that with a **return** statement, a third component of a function definition. Here's an example: -```python -def combine(first, last): +In other cases, we might want to send output back to the main program. We do that with a **return** statement, a third component of a function definition. Here's an example: +```python +def combine(first, last): """ Function takes strings 'first' and 'last' and returns new string 'last, first' """ - lastfirst = last + ', ' + first - return lastfirst # this is what the function sends back + lastfirst = last + ', ' + first + return lastfirst # this is what the function sends back -both = combine('Chase', 'Coleman') # assign the "return" to both +both = combine('Chase', 'Coleman') # assign the "return" to both print(both) -``` -In our example, we "return" the output `'Coleman, Chase'` and assign it to the variable `both`. Note, too, the comment in triple quotes at the top of the function. That's standard procedure, we recommend it. +``` +In our example, we "return" the output `'Coleman, Chase'` and assign it to the variable `both`. Note, too, the comment in triple quotes at the top of the function. That's standard procedure, we recommend it. 
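The difference between printing and returning is easy to see side by side. Here is a small sketch that reuses `hello` from the text; the function `square` and the variables `result1` and `result2` are hypothetical names we made up for illustration.

```python
def hello(firstname):
    print('Hello,', firstname)      # prints, but returns nothing


def square(number):
    """
    Function takes a number and returns its square
    """
    return number**2                # this is what the function sends back


result1 = hello('Chase')    # prints the greeting
result2 = square(3)         # nothing is printed here

print(result1)      # a function without a return statement gives us None
print(result2)
```

Only `square` hands a value back to the main program, so only `result2` holds something we can use later in the code.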
-The return is an essential component of many functions. Typically when we read the documentation for a function or method, one of the first things we look for is what it returns. +The return is an essential component of many functions. Typically when we read the documentation for a function or method, one of the first things we look for is what it returns. -**Exercise.** Create and test a function `nextyear` that takes an integer year (say 2015) and returns the following year (2016). +**Exercise.** Create and test a function `nextyear` that takes an integer year (say 2015) and returns the following year (2016). -**Exercise.** Create and test a function that takes an integer year (say, 2015) and returns a string of the next year (say, `'2016'`). +**Exercise.** Create and test a function that takes an integer year (say, 2015) and returns a string of the next year (say, `'2016'`). -**Exercise.** Use the Object inspector to get the documentation for the built-in function `max`. If the input is a list of two or more numbers, what does `max()` return? +**Exercise.** Use the Object inspector to get the documentation for the built-in function `max`. If the input is a list of two or more numbers, what does `max()` return? ## More data structures -This whole section is **mtwn** (more than we need). We recommend you skim it and not worry about the details. We'll review it later as needed. +This whole section is **mtwn** (more than we need). We recommend you skim it and not worry about the details. We'll review it later as needed. -The term **[data structure][5]** refers to the organization of a collection of data. Strings and lists are data structures, and we'll run across **dataframes** when we start working with data. Here we look at a couple more: dictionaries and tuples. +The term **[data structure][5]** refers to the organization of a collection of data. Strings and lists are data structures, and we'll run across **dataframes** when we start working with data. 
Here we look at a couple more: dictionaries and tuples. [5]: http://en.wikipedia.org/wiki/Data_structure -**Dictionaries.** Dictionaries are (unordered) pairs of things defined by curly brackets `{}`, separated by commas, with the items in each pair separated by a colon. For example, a dictionary of first and last names: +**Dictionaries.** Dictionaries are (unordered) pairs of things defined by curly brackets `{}`, separated by commas, with the items in each pair separated by a colon. For example, a dictionary of first and last names: ```python names = {'Dave': 'Backus', 'Chase': 'Coleman', 'Spencer': 'Lyon', 'Glenn': 'Okun'} ``` -If we try `type(names)`, the reply is `dict`, meaning dictionary. The components of each pair are referred to as the "key" (the first part) and the "value" (the second). +If we try `type(names)`, the reply is `dict`, meaning dictionary. The components of each pair are referred to as the "key" (the first part) and the "value" (the second). We access the value from the key with syntax of the form: `dict[key]`. In the example above, we get Glenn's last name by typing `names['Glenn']`. (Try it and see.) -**Exercise.** Construct a dictionary whose keys are the integers 1, 2, and 3 and whose values are the same numbers as words: one, two, three. How would you get the word associated with the key `2`? +**Exercise.** Construct a dictionary whose keys are the integers 1, 2, and 3 and whose values are the same numbers as words: one, two, three. How would you get the word associated with the key `2`? -**Tuples.** Tuples are collections of things in parentheses separated by commas. They're like lists but the syntax is different (parentheses rather than square brackets) and **they can't be changed**. (Experts would say they're immutable.) +**Tuples.** Tuples are collections of things in parentheses separated by commas. They're like lists but the syntax is different (parentheses rather than square brackets) and **they can't be changed**. 
(Experts would say they're immutable.) -Our primary (only?) use will be dates. In the datetime module (more coming), the date April 1, 2012 is expressed by the tuple `date = (2012, 4, 1)` (year, month, day). +Our primary (only?) use will be dates. In the datetime module (more coming), the date April 1, 2012 is expressed by the tuple `date = (2012, 4, 1)` (year, month, day). -**Exercise** Suppose the date is expressed as `(2015, 12, 13)`. What date does that represent? How would you extract the month? +**Exercise** Suppose the date is expressed as `(2015, 12, 13)`. What date does that represent? How would you extract the month? +Here `y` hasn't changed, it's not connected to `x`. +--> -## Programming style +## Programming style -Yes, style counts. We're not only trying to get something done, we're also communicating with others who may look at our code and possibly use it. A clear style makes that communication more effective. +Yes, style counts. We're not only trying to get something done, we're also communicating with others who may look at our code and possibly use it. A clear style makes that communication more effective. -With that in mind, here are some guidelines we've found useful: +With that in mind, here are some guidelines we've found useful: -* Put an overall summary of your program at the top in triple quotes. This should include both the purpose of the program and your name. Your email address is optional. +* Put an overall summary of your program at the top in triple quotes. This should include both the purpose of the program and your name. Your email address is optional. * Lines should be no longer than 79 characters. -* Skip two lines before and after a function definition. -* Skip lines here and there where you think it makes sense. -* Use comments whenever something isn't immediately obvious. +* Skip two lines before and after a function definition. +* Skip lines here and there where you think it makes sense. 
+* Use comments whenever something isn't immediately obvious. -You can find more along these lines in the classic "[PEP8](https://www.python.org/dev/peps/pep-0008/)" and Google's [style guide](https://google-styleguide.googlecode.com/svn/trunk/pyguide.html). +You can find more along these lines in the classic "[PEP8](https://www.python.org/dev/peps/pep-0008/)" and Google's [style guide](https://google-styleguide.googlecode.com/svn/trunk/pyguide.html). -Some programmers are religious about this. We'd say simply that we want to make our code readable by others. +Some programmers are religious about this. We'd say simply that we want to make our code readable by others. - +--> -## Review +## Review -**Exercise.** Which of the following are `True` and which are `False`? +**Exercise.** Which of the following are `True` and which are `False`? * `2 >= 1` -* `2 >= 2` +* `2 >= 2` * `'this' == "this"` * `'Chase' < 'Spencer'` -**Exercise.** Extract the third letter from the string `name`. Use `name = 'Glenn'` as your test case. +**Exercise.** Extract the third letter from the string `name`. Use `name = 'Glenn'` as your test case. -**Exercise.** Write a short program that prints the last letter of each item in the list `names = ['Chase', Dave', 'Sarah', 'Spencer']`. **Bonus:** Print the third letter only if it's a vowel. +**Exercise.** Write a short program that prints the last letter of each item in the list `names = ['Chase', Dave', 'Sarah', 'Spencer']`. **Bonus:** Print the third letter only if it's a vowel. -**Exercise.** Run the statement `output = list(range(3))` and describe the output. What does it do? +**Exercise.** Run the statement `output = list(range(3))` and describe the output. What does it do? -**Exercise.** Write a function that takes a string as input and returns the last element. For example, if the input string is `Dave`, the function returns `e`. +**Exercise.** Write a function that takes a string as input and returns the last element. 
For example, if the input string is `Dave`, the function returns `e`. -## Resources +## Resources -See the resources in the previous chapter, especially the link to [Codecademy](https://www.codecademy.com/tracks/python). If you work your way up to Advanced Topics, you'll be in good shape for anything that follows. +See the resources in the previous chapter, especially the link to [Codecademy](https://www.codecademy.com/tracks/python). If you work your way up to Advanced Topics, you'll be in good shape for anything that follows. Additional resources for specific topics: -* The official [Python Tutorial](https://docs.python.org/3.4/tutorial/controlflow.html) has a nice introduction to "control flow language" that includes comparisons, conditional statements, and loops. -* [CodingBat](http://codingbat.com/python) has a great collection of exercises. Runs online. +* The official [Python Tutorial](https://docs.python.org/3.4/tutorial/controlflow.html) has a nice introduction to "control flow language" that includes comparisons, conditional statements, and loops. +* [CodingBat](http://codingbat.com/python) has a great collection of exercises. Runs online. +* The [Python Challenge](http://www.pythonchallenge.com/) is for people who like puzzles as well as coding. Not for the faint of heart. +--> +Obscure but cool: https://github.com/cosmologicon/pywat#python-wats +--> +Some of us find this kind of syntax obscure and do our best to avoid it. But similar ideas show up in lots of places. +--> diff --git a/py-fun3.md b/py-fun3.md index a405821..32766a2 100644 --- a/py-fun3.md +++ b/py-fun3.md @@ -2,17 +2,17 @@ --- -**Overview.** Advanced core Python. +**Overview.** Advanced core Python. -**Python tools.** iterables, generators, classes... +**Python tools.** iterables, generators, classes... --- -?? Add another chapter on classes, iterators, generators, ... +?? Add another chapter on classes, iterators, generators, ... 
-http://blog.lerner.co.il/pythons-objects-and-classes-a-visual-guide/ +http://blog.lerner.co.il/pythons-objects-and-classes-a-visual-guide/ http://blog.lerner.co.il/want-to-understand-pythons-comprehensions-think-like-an-accountant/ http://blog.lerner.co.il/quick-introduction-implementing-python-iterators/ diff --git a/random.md b/random.md index 6793d10..aada934 100644 --- a/random.md +++ b/random.md @@ -8,30 +8,30 @@ **Applications.** -**Code.** +**Code.** --- -The world is filled with differences. People are large and small, old and young. Countries are large and small, rich and poor. Companies, too. +The world is filled with differences. People are large and small, old and young. Countries are large and small, rich and poor. Companies, too. -What do we mean when we say something is "random"? Well, we might mean [really crazy](http://www.urbandictionary.com/define.php?term=Random). But here we mean a number of different things could happen, but we're not sure ahead of time which one. The Steelers might win or lose. The stock market might go up or down? The economy might grow quickly or slowly. Your income might go up or down. You get the idea. +What do we mean when we say something is "random"? Well, we might mean [really crazy](http://www.urbandictionary.com/define.php?term=Random). But here we mean a number of different things could happen, but we're not sure ahead of time which one. The Steelers might win or lose. The stock market might go up or down? The economy might grow quickly or slowly. Your income might go up or down. You get the idea. -Examples: -* bar chart of equity returns -* boxplots -* distribution of one-day currency changes: euro, rmb, swf +Examples: +* bar chart of equity returns +* boxplots +* distribution of one-day currency changes: euro, rmb, swf * Distribution of ages -* income +* income * medical spending (MEPS) -* Kevin Williams long-tail data... 
-* BDS firm size and age distributions: http://www.census.gov/ces/dataproducts/bds/ +* Kevin Williams long-tail data... +* BDS firm size and age distributions: http://www.census.gov/ces/dataproducts/bds/ -**See figs 2-4:** http://public.econ.duke.edu/~psarcidi/aa.pdf +**See figs 2-4:** http://public.econ.duke.edu/~psarcidi/aa.pdf * http://ec2-52-21-49-3.compute-1.amazonaws.com:8000/user/PzkeYvpCRSF5/notebooks/nikkei_returns_data.ipynb -* http://www.mglerner.com/blog/?p=28 +* http://www.mglerner.com/blog/?p=28 * https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/ * http://stackoverflow.com/questions/4150171/how-to-create-a-density-plot-in-matplotlib @@ -39,48 +39,48 @@ Examples: http://pandas.pydata.org/pandas-docs/stable/visualization.html#density-plot -Dynamics +Dynamics https://www.nact.org/resources/2014_SP_Global_Corporate_Default_Study.pdf ## Describing randomness -bar charts, pdfs, kde... +bar charts, pdfs, kde... http://pandas.pydata.org/pandas-docs/stable/visualization.html#other-plots -Compare two distributions of movie ratings +Compare two distributions of movie ratings http://fivethirtyeight.com/features/fandango-movies-ratings/ ## Scipy and Numpy -Show tools, generate random data +Show tools, generate random data -## Normal and other distributions +## Normal and other distributions -What's a black swan? How big was the drop in the Chinese market? +What's a black swan? How big was the drop in the Chinese market? -## Equity returns +## Equity returns -## More distributions +## More distributions -* CPS data? -* Long-tail sales data (music, movies?) -* Births by age of mother -* Age distribution +* CPS data? +* Long-tail sales data (music, movies?) 
+* Births by age of mother +* Age distribution -## Pareto etc +## Pareto etc https://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/ -# References +# References diff --git a/stats.md b/stats.md index 380d6aa..481d82f 100644 --- a/stats.md +++ b/stats.md @@ -3,11 +3,11 @@ --- **Overview.** -**Python tools.** +**Python tools.** **Applications.** -**Code.** +**Code.** --- @@ -16,17 +16,17 @@ Describing multivariate data: scatterplots, multivariate regression ## Data science -Two paths, stats and cs. ... +Two paths, stats and cs. ... * Stats. Start with a model, use data to estimate its parameters (numbers)... -* CS. Start with data, look for patterns. +* CS. Start with data, look for patterns. -Complementary... +Complementary... Claudia's hospital example -http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/ +http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/ Pokemon: https://pixelastic.github.io/pokemonorbigdata/ @@ -34,6 +34,6 @@ Pokemon: https://pixelastic.github.io/pokemonorbigdata/ **Simpson's paradox** -# References +# References -http://sebastianraschka.com/faq/index.html \ No newline at end of file +http://sebastianraschka.com/faq/index.html diff --git a/test.md b/test.md index 0e9e9ab..0cdb891 100644 --- a/test.md +++ b/test.md @@ -2,7 +2,7 @@ --- -**Overview.** Some ideas for thinking about data. +**Overview.** Some ideas for thinking about data. --- @@ -11,35 +11,35 @@ Data analysis starts with a question. Generally, we want to learn something. I * What emerging market countries offer the best business environments? -* How is the US economy doing right now? +* How is the US economy doing right now? -* How do returns on US and European stocks compare? +* How do returns on US and European stocks compare? -* How does average income vary across countries? Across states? Across zip codes? +* How does average income vary across countries? Across states? 
Across zip codes? -* Where are our best customers? +* Where are our best customers? -You get the idea. The starting point is a question, something you'd like to know. +You get the idea. The starting point is a question, something you'd like to know. -Once we have a question, we can start looking for data that might help us come up with an answer. We might ask ourselves: What data would be helpful in answering our question? Where can we find it? What should we do with it once we have it? +Once we have a question, we can start looking for data that might help us come up with an answer. We might ask ourselves: What data would be helpful in answering our question? Where can we find it? What should we do with it once we have it? -The question comes from you. What we'll provide in this course is a mentality for thinking about data and a toolset to work with it effectively. +The question comes from you. What we'll provide in this course is a mentality for thinking about data and a toolset to work with it effectively. -## Examples +## Examples -It's not that we have no lives or anything, but we think about data all the time. If we read The Economist -- or the Wall Street Journal, or a blog post -- and see an interesting graphic, we look immediately at the source. Is it one we know? Can we get it ourselves? +It's not that we have no lives or anything, but we think about data all the time. If we read The Economist -- or the Wall Street Journal, or a blog post -- and see an interesting graphic, we look immediately at the source. Is it one we know? Can we get it ourselves? -Examples are all around us. Here are a few that caught our attention: +Examples are all around us. Here are a few that caught our attention: -* [FRED](https://research.stlouisfed.org/fred2/series/GDP). Our go-to source for macroeconomic data. Note the "Notes" tab, it gives us the original source if we want to dig deeper. +* [FRED](https://research.stlouisfed.org/fred2/series/GDP). 
Our go-to source for macroeconomic data. Note the "Notes" tab, it gives us the original source if we want to dig deeper. -* [Gapminder world](http://www.gapminder.org/world/). Great interactive graphic. The [data page](http://www.gapminder.org/data/) gives sources. +* [Gapminder world](http://www.gapminder.org/world/). Great interactive graphic. The [data page](http://www.gapminder.org/data/) gives sources. -* [Market caps of tech firms](http://www.economist.com/techfirms). Interesting to see how quickly tech firms come and go. +* [Market caps of tech firms](http://www.economist.com/techfirms). Interesting to see how quickly tech firms come and go. -* [Economic mobility by region](Inequality: http://www.nytimes.com/2013/07/22/business/in-climbing-income-ladder-location-matters.html). We love maps. This one shows that kids do better (relative to their parents) in some places than others. +* [Economic mobility by region](Inequality: http://www.nytimes.com/2013/07/22/business/in-climbing-income-ladder-location-matters.html). We love maps. This one shows that kids do better (relative to their parents) in some places than others. -* [NBA shot charts](http://savvastjortjoglou.com/nba-shot-sharts.html). If you're into that kind of thing. +* [NBA shot charts](http://savvastjortjoglou.com/nba-shot-sharts.html). If you're into that kind of thing. diff --git a/yields.md b/yields.md index 57281a9..10f5f75 100644 --- a/yields.md +++ b/yields.md @@ -1,4 +1,4 @@ -# Bond yields +# Bond yields --- @@ -11,13 +11,13 @@ --- -Bond yields +Bond yields Breakeven inflation -# Resources +# Resources -Animation: https://jakevdp.github.io/blog/2012/08/18/matplotlib-animation-tutorial/ \ No newline at end of file +Animation: https://jakevdp.github.io/blog/2012/08/18/matplotlib-animation-tutorial/
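Note on the dictionary, tuple, and function passages in the hunks above: a minimal sketch of what that text describes, assuming only core Python. The `names` dictionary and the `(2015, 12, 13)` date come from the diffed text. The names `numbers`, `changed`, and `nextyear_str` are hypothetical, introduced here for illustration, and the exercise answers shown are one possible reading rather than the authors' solutions.

```python
# Dictionary from the text: first names (keys) mapped to last names (values).
names = {'Dave': 'Backus', 'Chase': 'Coleman', 'Spencer': 'Lyon', 'Glenn': 'Okun'}
glenn = names['Glenn']          # access a value with dict[key]

# One possible answer to the dictionary exercise (hypothetical name `numbers`).
numbers = {1: 'one', 2: 'two', 3: 'three'}
two = numbers[2]

# Tuple from the text, read as (year, month, day).
date = (2015, 12, 13)
month = date[1]                 # indexing starts at 0, so the month is item 1

# Tuples can't be changed: item assignment raises TypeError.
try:
    date[1] = 11
except TypeError:
    changed = False
else:
    changed = True


def nextyear(year):
    """Take an integer year and return the following year."""
    return year + 1


def nextyear_str(year):
    """Take an integer year and return the next year as a string."""
    return str(year + 1)
```

Running `nextyear(2015)` and `nextyear_str(2015)` side by side shows the difference between returning an integer and returning a string, which is the point of the two exercises.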