| 1 | +--- |
| 2 | +title: "Introduction to Python for Research" |
| 3 | +subtitle: "0a: Introduction" |
| 4 | +author: "Jason T. Kiley" |
| 5 | +format: |
| 6 | + revealjs: |
| 7 | + theme: dark |
| 8 | + css: _style.css |
| 9 | + slide-number: true |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## {.center} |
| 14 | + |
| 15 | +::: r-fit-text |
| 16 | +[github.com/jtkiley](https://github.com/jtkiley/) |
| 17 | +::: |
| 18 | + |
| 19 | +## Related |
| 20 | + |
| 21 | +- CARMA 2020 (overlaps with this course): Introduction to Python and Content Analysis of Text. ([GitHub](https://github.com/jtkiley/2020_carma_python)) |
| 22 | +- Seminar materials (overlaps with this course): Text Analysis: Planning to Publication. ([GitHub](https://github.com/jtkiley/text_seminar)) |
| 23 | +- Text analysis and machine learning workshop at WU (Oct. 2018) and RSM (Oct. 2019). |
| 24 | +- AOM Big Data workshop with Tim Hannigan, Hovig Tchalian, and Laura Nelson. ([GitHub](https://github.com/jtkiley/curation_workshop)) |
| 25 | + |
| 26 | +## Course Agenda |
| 27 | + |
| 28 | +- Tools: Python, packages and environments |
| 29 | +- Basics: Python syntax and conventions, Jupyter Notebooks |
| 30 | +- Data handling and project planning |
| 31 | +- Data gathering and assembly |
| 32 | + |
| 33 | +# Overview |
| 34 | + |
| 35 | +## Overview |
| 36 | + |
| 37 | +- What you really need to know about Python. |
| 38 | +- Resources for learning. |
| 39 | +- A brief R comparison. |
| 40 | + |
| 41 | +# What do I really need to know about Python? |
| 42 | + |
| 43 | +## Why Python? |
| 44 | + |
| 45 | +- Approachability: well-designed modern programming language that handles a lot for you. |
| 46 | +- Features: many things have been built already, and you simply "glue" them together. |
| 47 | +- Learning resources: wide popularity in academia and practice means that there are extensive resources. |
| 48 | +- Scalability: from your computer, to the cloud, to a computing cluster, you can use largely the same tools. |
| 49 | + |
| 50 | +## Python Fluency |
| 51 | + |
| 52 | +- [Basics]{style="color:lightblue;"} |
| 53 | +- [Data Preparation]{style="color:lightgreen;"} |
| 54 | +- [Good-enough Programming]{style="color:lightyellow;"} |
| 55 | +- [Software Engineering]{style="color:pink;"} |
| 56 | + |
| 57 | +## [Basics]{style="color:lightblue;"} |
| 58 | + |
| 59 | +- Skills |
| 60 | + - Software: Python interpreter, Jupyter Notebooks, VS Code |
| 61 | + - Variable types: strings, ints, floats |
| 62 | + - Objects and methods: lists, dictionaries |
| 63 | + - Packages: importing and installing |
| 64 | + - Documentation: official and community |
| 65 | +- Time: 2-4 hours |
| 66 | +- Necessity: Largely unavoidable |
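The Basics skills above fit in a few lines. This is a minimal sketch, not part of the original slides; every variable name and value is invented for illustration.

```python
import statistics  # Packages: importing a module from the standard library

# Variable types (illustrative values only)
title = "Introduction to Python for Research"  # str
n_sessions = 4                                  # int
hours = 2.5                                     # float

# Lists are ordered and mutable; append() is a method on the list object.
topics = ["tools", "basics"]
topics.append("data handling")

# Dictionaries map keys to values.
word_counts = {"python": 3, "pandas": 5}
word_counts["jupyter"] = 2

# Call a function from the imported package on the dictionary's values.
mean_count = statistics.mean(word_counts.values())
```

Installing packages happens outside the script (for example, with `pip` or `conda`); only the `import` line appears in the code.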
| 67 | + |
| 68 | + |
| 69 | +## [Data Preparation]{style="color:lightgreen;"} |
| 70 | + |
| 71 | +- Skills |
| 72 | + - Software: pandas |
| 73 | + - Reading data formats (built-in) |
| 74 | + - Slicing, views, `df.loc[]` |
| 75 | + - Operations on columns and rows |
| 76 | + - Reshaping |
| 77 | + - Merging and querying |
| 78 | +- Time: 1-2 days and ongoing |
| 79 | +- Necessity: Needed and high ROI |
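As a rough illustration of the pandas skills listed above, the sketch below selects rows with `df.loc[]`, operates on a column, reshapes, and queries. The firm-year data and column names are hypothetical, made up for this example.

```python
import pandas as pd

# Hypothetical firm-year data, invented for illustration.
df = pd.DataFrame(
    {
        "firm": ["A", "A", "B", "B"],
        "year": [2019, 2020, 2019, 2020],
        "sales": [10.0, 12.0, 7.0, 9.0],
    }
)

# Label-based selection with df.loc[]: rows where year is 2020, two columns.
recent = df.loc[df["year"] == 2020, ["firm", "sales"]]

# Column operation: year-over-year sales growth within each firm.
df["growth"] = df.groupby("firm")["sales"].pct_change()

# Reshaping: long to wide, one row per firm and one column per year.
wide = df.pivot(index="firm", columns="year", values="sales")

# Querying: filter rows with a boolean expression.
big = df.query("sales > 9")
```

Each step returns a new object (or adds a column), so intermediate results can be inspected in a notebook cell before moving on.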
| 80 | + |
| 81 | + |
| 82 | +## [Good-enough Programming]{style="color:lightyellow;"} |
| 83 | + |
| 84 | +- Skills |
| 85 | + - Loops |
| 86 | + - Writing functions |
| 87 | + - Reading and writing files (the hard way) |
| 88 | + - Raising and handling exceptions |
| 89 | + - Using additional packages |
| 90 | + - End point: working, reusable script |
| 91 | +- Time: 1 week and ongoing; divisible |
| 92 | +- Necessity: Helpful and good ROI |
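A small script can combine most of the skills above: a loop, functions, reading and writing files by hand, and raising and handling exceptions. This is a sketch under assumptions; the function names `count_words` and `count_words_in_files` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def count_words(path):
    """Return the number of whitespace-separated words in a text file."""
    with open(path, encoding="utf-8") as f:
        return len(f.read().split())


def count_words_in_files(paths):
    """Map each file name to its word count, skipping files that are missing."""
    if not paths:
        # Raise an exception when the input cannot be processed at all.
        raise ValueError("paths must not be empty")
    counts = {}
    for path in paths:
        try:
            counts[Path(path).name] = count_words(path)
        except FileNotFoundError:
            # Handle the exception and keep going rather than crash.
            print(f"Skipping missing file: {path}")
    return counts


# Usage: write one small file, then count it alongside a missing one.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "note.txt"
    existing.write_text("reproducible research in python", encoding="utf-8")
    result = count_words_in_files([existing, Path(tmp) / "missing.txt"])
```

Wrapping the logic in functions is what makes the script reusable: the same code runs on any list of paths.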
| 93 | + |
| 94 | + |
| 95 | +## [Software Engineering]{style="color:pink;"} |
| 96 | + |
| 97 | +- Skills |
| 98 | + - Classes and inheritance\* |
| 99 | + - Package development\* |
| 100 | + - Version control\* |
| 101 | + - Unit testing and continuous integration |
| 102 | + - Cross-version support |
| 103 | + - Open source contributions |
| 104 | +- Time: A lot |
| 105 | +- Necessity: Not at all; good for the field |
| 106 | + |
| 107 | + |
| 108 | +# What resources are available for learning? |
| 109 | + |
| 110 | +## Pandas documentation |
| 111 | + |
| 112 | +::: columns |
| 113 | +::: {.column width="50%" #vcenter} |
| 114 | +Comparison with Stata |
| 115 | +::: |
| 116 | + |
| 117 | +::: {.column width="50%" #vcenter} |
| 118 | + |
| 119 | +::: |
| 120 | +::: |
| 121 | + |
| 122 | +::: footer |
| 123 | +See more: [pandas documentation](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_stata.html) |
| 124 | +::: |
| 125 | + |
| 126 | +## Stack Overflow |
| 127 | + |
| 128 | +::: columns |
| 129 | +::: {.column width="50%" #vcenter} |
| 130 | +Search for what you are trying to do; in this case, merging on multiple columns with different names. |
| 131 | +::: |
| 132 | + |
| 133 | +::: {.column width="50%" #vcenter} |
| 134 | + |
| 135 | +::: |
| 136 | +::: |
| 137 | + |
| 138 | +::: footer |
| 139 | +See more: [Stack Overflow](https://stackoverflow.com/questions/41815079/pandas-merge-join-two-data-frames-on-multiple-columns) |
| 140 | +::: |
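For reference, merging on multiple columns with different names might look like the sketch below. The two tables and their column names are hypothetical; `left_on`, `right_on`, `validate`, and `indicator` are parameters of pandas `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
returns = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA", "BBB"],
        "fyear": [2019, 2020, 2020],
        "ret": [0.05, -0.02, 0.11],
    }
)
ceos = pd.DataFrame(
    {
        "symbol": ["AAA", "BBB", "BBB"],
        "year": [2020, 2019, 2020],
        "ceo": ["Lee", "Cruz", "Cruz"],
    }
)

# Pair up the key columns by position: ticker with symbol, fyear with year.
merged = returns.merge(
    ceos,
    how="left",
    left_on=["ticker", "fyear"],
    right_on=["symbol", "year"],
    validate="one_to_one",  # raise an error if the keys are not unique on both sides
    indicator=True,         # add a _merge column showing where each row matched
)
```

The `_merge` column makes unmatched rows easy to spot, much like the `_merge` variable Stata creates after a merge.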
| 141 | + |
| 142 | +## Python for Data Analysis |
| 143 | + |
| 144 | +::: columns |
| 145 | +::: {.column width="50%" #vcenter} |
| 146 | +Wes McKinney is the creator of pandas and other open source projects. |
| 147 | +::: |
| 148 | + |
| 149 | +::: {.column width="50%" #vcenter} |
| 150 | + |
| 151 | +::: |
| 152 | +::: |
| 153 | + |
| 154 | +::: footer |
| 155 | +For more: [Wes McKinney](https://wesmckinney.com/book/) |
| 156 | +::: |
| 157 | + |
| 158 | + |
| 159 | +## Other Resources |
| 160 | + |
| 161 | +- edX. Offers many free courses relevant to data work that use or teach Python. |
| 162 | +- Self-study tracks from my seminar. Includes resources for data handling, data retrieval, machine learning. |
| 163 | +- YouTube. Has many content creators, covering Python, data science, and software development. |
| 164 | + |
| 165 | +# Python and R |
| 166 | + |
| 167 | +## What About R? |
| 168 | + |
| 169 | +- R is great overall, especially compared to a lot of commercial stats software. |
| 170 | +- Compared to Python, it is less general purpose, so some useful packages may not have analogues. |
| 171 | +- The syntax (from S in the 1970s) is sometimes quite arcane. |
| 172 | +- Best of both worlds: |
| 173 | + - Gather and prep data in Python. |
| 174 | + - If needed, use R for analyses. |
| 175 | + |
| 176 | +## Stack Overflow - Most Popular |
| 177 | + |
| 178 | + |
| 179 | + |
| 180 | +::: footer |
| 181 | +See more: [Stack Overflow](https://survey.stackoverflow.co/2023/#programming-scripting-and-markup-languages) |
| 182 | +::: |
| 183 | + |
| 184 | +## Stack Overflow - Most Desired |
| 185 | + |
| 186 | + |
| 187 | + |
| 188 | +::: footer |
| 189 | +See more: [Stack Overflow](https://survey.stackoverflow.co/2023/#section-admired-and-desired-programming-scripting-and-markup-languages) |
| 190 | +::: |
| 191 | + |
| 192 | +# Getting Started |
| 193 | + |
| 194 | +## Getting Started |
| 195 | + |
| 196 | +- Two approaches (choose one): |
| 197 | + - GitHub Codespaces (cloud) |
| 198 | + - VS Code, Docker, local container |
| 199 | +- You'll see me use both. |
| 200 | + |
| 201 | +# Hands on |
| 202 | + |
| 203 | +## Summary |
| 204 | + |
| 205 | +- Using Python for data analysis is not exactly programming, and you already have much of the knowledge you need. |
| 206 | +- Capturing all of our work in code that runs is a best practice that promotes reproducibility, and reproducibility helps us most of all. |
| 207 | +- We will start using our container in the next segment, so make sure it is set up and ready (or ask for help). |
| 208 | + |
| 209 | +# Break |