Get hands-on experience with:
- Unit testing
- Data profiling
- Data quality checks
- Aidan responded to survey feedback
- We saw requests on the GitHub Organization, but GitHub doesn't say whom they came from.
- Custom repository roles: Not sure what this was for. Please post on Ed if you need something about your repositories adjusted.
- Copilot: How to get it
All together
We're going to be creating a parse_dollars() function. It should accept a Series as an argument and return a new Series.
- Come up with test cases.
- Example inputs → expected outputs
- Convert those to pytest.
- Don't write parse_dollars() yet!
- Run the tests, confirm they fail.
- Write parse_dollars(), making tests pass.
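The steps above can be sketched as follows. This is only an illustration, not the assignment's answer: the specific test cases and cleaning rules (stripping "$" and ",") are assumptions; your group's test cases come first and drive the implementation.

```python
import pandas as pd

def parse_dollars(series: pd.Series) -> pd.Series:
    """Accept a Series of dollar strings, return a new Series of floats.

    Written AFTER the tests below, and only to make them pass.
    The "$"/"," stripping is a guess at the cleaning rules.
    """
    return (
        series.str.replace("$", "", regex=False)
              .str.replace(",", "", regex=False)
              .astype(float)
    )

# Test cases: example inputs -> expected outputs, converted to pytest.
def test_parse_dollars_basic():
    result = parse_dollars(pd.Series(["$1,234.56", "$0.99"]))
    expected = pd.Series([1234.56, 0.99])
    pd.testing.assert_series_equal(result, expected)
```

Running `pytest` before `parse_dollars()` exists should fail with a `NameError` or `ImportError`; that failing run is the confirmation step.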
Code needs to be testable. This encourages good habits, like:
- Ensuring your code can change without unexpected breakage
- Making small, reusable functions with well-defined behavior
- Organizing code into modules
- Allowing a module to be loaded without running all of its code
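A minimal sketch of that last habit, with a hypothetical module (the function and file names are made up): top-level work goes behind a `main()` guard, so importing the module for tests only defines functions.

```python
# analysis.py (hypothetical module name)

def parse_row(row: str) -> list[str]:
    """Small, reusable function with well-defined behavior."""
    return [part.strip() for part in row.split(",")]

def main() -> None:
    # Top-level work lives here, not at module scope.
    print(parse_row("a, b, c"))

if __name__ == "__main__":
    # Runs only when executed as a script (`python analysis.py`),
    # not when imported by pytest or another module.
    main()
```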
We've been seeing projects with <name>2.py. Splitting code up into smaller files will help you work in parallel without stepping on each other's toes.
You'll pair in your Lab group. Work on branches and submit pull requests for the chunks of work — you decide what the "chunks" are.
Install the Data Wrangler VSCode extension.
- Unit tests for data
- Things to check for when cleaning data
- Can be flexible, like checking for:
- Standard deviation being in a certain range
- X% of values matching certain criteria
- There are commercial tools that help with this, but we're going to write the code ourselves.
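The two flexible checks above might look like this in pytest. The column, sample values, and thresholds are placeholders; in practice you'd pick them from your profiling report.

```python
import pandas as pd

# Hypothetical "price" column; in your project this would be loaded
# from your actual dataset.
prices = pd.Series([9.99, 12.50, 11.25, 10.00, 13.75])

def test_std_in_expected_range():
    # Standard deviation falls within a plausible band
    # (bounds are placeholder assumptions).
    assert 0.5 < prices.std() < 5.0

def test_most_values_match_criteria():
    # At least 95% of values satisfy a criterion, here: positive.
    assert (prices > 0).mean() >= 0.95
```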
Look around your data profiling report. Write takeaways of five findings that seem relevant to your analysis. This can be done in a Markdown file or a Jupyter notebook in your repository.
Using the profile information, write data quality checks in pytest for three of the things to check for. The pandas documentation on comparing whether objects are equivalent will be helpful.
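For the equivalence comparisons, pandas ships testing helpers. A small sketch, with made-up Series standing in for your cleaned and expected data:

```python
import pandas as pd

# Hypothetical cleaned output and hand-written expectation.
cleaned = pd.Series([1200.0, 950.5, 80.0], name="price")
expected = pd.Series([1200.0, 950.5, 80.0], name="price")

# Raises an AssertionError with a readable diff if they differ;
# passes silently when the Series are equivalent.
pd.testing.assert_series_equal(cleaned, expected)

# Series.equals() is the boolean alternative, handy inside asserts.
assert cleaned.equals(expected)
```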
Submit the links to the pull request(s) via CourseWorks.