As stated in the course description:
Over the semester, students will build a complex end-to-end data system.
You'll be building a live dashboard, with all the infrastructure behind it:
- Automated data ingestion
- A database
- Web-based interactive data visualization
All of this will be in the cloud.
- Centers for Disease Control and Prevention (CDC) dashboards
- Chicago Region Transit Dashboard
- Chicago Transit Authority Historical Bus Crowding
- Colorado Behavioral Health Administration (BHA) Performance Hub
- Congestion Pricing Tracker
- Johns Hopkins COVID map
- New York Flu Tracker
- New York Traffic Data Viewer (TDV)
- NYPD TrafficStat
- TransitMatters
- United States of Health
- All code is peer-reviewed, through pair programming and/or pull requests.
- All team members are contributing equal amounts.
- The Project leverages at least one dataset that's regularly updated.
- The code, documentation, and repository are clean, following good coding style and other best practices.
- The site doesn't need to read like a blog post necessarily, but it should explain what's going on.
- The site + codebase should be a polished portfolio piece.
- Data is being automatically updated.
Your group will pick an initial:
- Problem space
- Dataset
Part of this project is getting experience with automated data ingestion. Doing so is more interesting with data that changes regularly. You can incorporate additional datasets in the future.
Do the following as a group:
- Discuss what you'd like your project to focus on. You don't need to get too specific yet.
- Explore datasets that are updated weekly (the more often, the better) and pick one.
- Create a new notebook in Google Colab.
- Ensure you can load the data.
- Narrow down on 1-3 research questions.
- In other words, at the end of this project, what do you want to be able to show?
- Draw an example visualization that you'd like to produce.
- You can do so digitally or on a piece of paper.
- Include a title, legend, and axes labels (where appropriate).
- This is just a sketch; don't worry about the specific values.
You will then submit the following to the Discussion on Ed:
- What dataset are you going to use?
- Please include a link.
- What are your research question(s)?
- Each question should be specific and objectively answerable through the data available.
- What's the link to your notebook?
- Go to Share -> General access -> LionMail -> Commenter.
- What's your target visualization?
- Include a picture.
- What are your known unknowns?
- What challenges do you anticipate?
Only one person from your group needs to submit. None of this is set in stone; it is just a starting place, and it can all be changed later.
Goal: Get experience with an application development framework
- Using your dataset from Part 1:
- Create a Streamlit app.
- Deploy the app.
- Add a visualization.
- You can get fancy, but don't have to at this stage. Get something simple working first.
- Bring in a second relevant dataset. (This one doesn't need to be regularly updated.)
- This can be shown on a separate page of your Streamlit app, or combined in a single visualization.
- Add the names of the people on your team to your Streamlit app homepage.
- Turn in the link to your live app via CourseWorks.
- You can load data:
- from a URL (preferred), either:
- An API
- A link to a CSV
- from a file, checked into the repository
- Note the Streamlit app resource limits.
- At this stage, feel free to make the dataset small to get it working.
- If the app is slow to reload, experiment with caching.
Goal: Get experience with unit testing
Work on branches and submit pull requests for the chunks of work — you decide what the "chunks" are.
- Without writing any code:
- Review your existing code.
- What can be refactored into functions?
- Where can we make our code DRY?
- Decide what function you're going to create.
- Come up with test cases (inputs) and expected outputs.
- This can be in a text file, doc, piece of paper, etc.
- Then, as code:
- Write tests.
- Confirm they fail.
- Refactor your code into the function.
- Make the tests pass.
- Repeat until you feel your code is well-organized and well-tested.
- Submit the links to the pull requests via CourseWorks.
As a result, your:
- "main" scripts (for Streamlit pages or otherwise)
- Functions
should be relatively short and easy to read.
This isn't a one-time thing; continue testing and refactoring as you continue with the Project.
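The test-first loop described above might look something like the following sketch. The function `clean_count` and its test cases are hypothetical, chosen only to illustrate the pattern; in a real repository the tests would usually live in their own file (for example under `tests/`) and be run with a test runner such as pytest, which collects functions whose names start with `test_`.

```python
from typing import Optional


def clean_count(raw: str) -> Optional[int]:
    """Convert a raw count such as '1,234' to an int; blank values become None."""
    text = raw.strip().replace(",", "")
    if not text:
        return None
    return int(text)


# Test cases decided on before writing the function: inputs and expected outputs.
def test_plain_number():
    assert clean_count("42") == 42


def test_thousands_separator():
    assert clean_count("1,234") == 1234


def test_blank_is_missing():
    assert clean_count("   ") is None
```

Writing the three tests first, with `clean_count` not yet defined, makes them fail. Moving the existing cleaning logic out of the "main" script into the function then makes them pass.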
You will hold a team retrospective, with the goal of improving how your team works together. Since the groups are small, it can be fairly informal.
- Schedule 45 minutes for the retro.
- The retro needs to be done live/synchronous, not asynchronous.
- Read about retros.
- Decide who will be the Facilitator.
- Optional: Get someone from outside the team.
- Facilitator: Set up EasyRetro. Instructions.
- In the actual retro:
- Read the Agile Prime Directive out loud.
- 5 minutes: Individually write down "what went well" and "what could be better".
- 10-15 minutes: Discuss what has gone well.
- 20-25 minutes: Discuss what could be better.
- 5 minutes: Document takeaways / action items.
- Move your Proposal to the Streamlit app as is.
- Revisit the Proposal.
- Any new insights?
- Anything you want to adjust?
- Document any changes to the Proposal on the Streamlit page.
- Proceed with the analysis.
- If the majority of your code (to call APIs, etc.) is in modules/functions, it can be imported from a Jupyter notebook. You can do exploratory analysis there, moving things to modules/Streamlit as you go.
- You might not be able to fully answer the question(s) yet, but get as close as you can.
At this point, your project should be looking more like one of the examples. Looking through the Streamlit data elements may be helpful.
Submit links to:
- The EasyRetro board
- Jupyter notebook(s), if any
- The (updated) Streamlit app
Goal: Understand how to work with a cloud-based database
- A service account has been created in your Project for you. It has been given read-only access to BigQuery.
- There are various things that can go wrong in these steps. Don't wait until the last minute.
Do the following for your regularly-updated data source. Only do one for now — we'll do the rest in Lab 10.
- Install pandas-gbq.
- Load data.
- Create a Python script that:
- Creates the table, if it doesn't exist
- Pulls data from your data source
- Copies the data to BigQuery using the appropriate technique
- Since you'll be running the script locally, authenticate with a user account.
- How to write tables with pandas-gbq
- How will you know it worked as intended?
- Have your app use BigQuery.
- Each team member will need to:
- Create a service account key as JSON. The service account is streamlit@[project].iam.gserviceaccount.com.
- Set up secrets management locally.
- Make sure to add secrets.toml to your .gitignore so that you don't accidentally commit it to Git.
- Copy the key information to your secrets.toml file.
- Modify your app to read data from BigQuery.
- Copy the secrets to your deployed app.
- Re-deploy.
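The loading script from the "Load data" step could be sketched roughly as follows. This is one possible approach under stated assumptions, not the required one: it assumes `pandas` and `pandas-gbq` are installed, that the source is a CSV at a URL, and that a full overwrite (`if_exists="replace"`) is acceptable for your data size. Appending only new rows may be the more appropriate technique for larger datasets. `PROJECT_ID`, `TABLE`, `SOURCE_URL`, and both function names are hypothetical placeholders.

```python
import re

import pandas as pd

# Hypothetical placeholders: replace with your own project, table, and source.
PROJECT_ID = "your-gcp-project"
TABLE = "your_dataset.your_table"
SOURCE_URL = "https://example.com/your-dataset.csv"


def bigquery_safe_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns to lowercase letters, digits, and underscores."""
    renamed = {
        col: re.sub(r"\W+", "_", str(col).strip()).strip("_").lower()
        for col in df.columns
    }
    return df.rename(columns=renamed)


def load_to_bigquery(
    source_url: str = SOURCE_URL,
    table: str = TABLE,
    project_id: str = PROJECT_ID,
) -> int:
    """Pull the source data, write it to BigQuery, and return the loaded row count."""
    # Imported here so the helper above can be unit tested without pandas-gbq.
    import pandas_gbq

    df = bigquery_safe_columns(pd.read_csv(source_url))

    # to_gbq creates the table when it is missing; "replace" overwrites it otherwise.
    pandas_gbq.to_gbq(df, table, project_id=project_id, if_exists="replace")

    # Read the row count back as a basic check that the load worked as intended.
    result = pandas_gbq.read_gbq(
        f"SELECT COUNT(*) AS n FROM `{table}`", project_id=project_id
    )
    loaded = int(result["n"].iloc[0])
    if loaded != len(df):
        raise RuntimeError(f"Expected {len(df)} rows in {table}, found {loaded}")
    return loaded
```

Calling `load_to_bigquery()` from a local script should trigger pandas-gbq's credential lookup, which can fall back to a browser-based user account login when no other credentials are found. Comparing row counts is only a first check; spot-checking a few values in the BigQuery console is a reasonable complement.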
- Submit the links via CourseWorks for:
- The pull request(s)
- The link to your live Streamlit app
- We'll do this once now, once at the end.
- This will be factored into an individual score.
Visually map your data flow, end to end.
- What happens at each step?
- What can go wrong?
- Get granular
- Go all the way upstream. How does the data get collected/generated?
- You can use:
- Paper
- Google Drawings
- A fancier diagramming tool
- Don't over-complicate this
- An image of / a link to your map
- Link(s) to:
- Your pull request(s)
- A successful run of the GitHub Action
Goal: Determine and prioritize TODOs
You'll do this prioritization exercise as a group.
- This must be done synchronously.
- Look back at the expectations.
- The Prep can be done in the meeting itself.
- You can use paper/stickies or a digital template like Miro's.
Submit a photo/link to the matrix via CourseWorks.
Goal: Meet the expectations
Do tasks you came up with in the prioritization exercise in order of priority.
Goal: Force clarity of the project and code by having to show and explain them to others
Each group will do a presentation on their Project in class.
- 10-ish minutes
- Slides optional
- Everyone in the group should speak.
- Explain the initial proposal and how it's evolved.
- Show the live app.
- Walk through the code.
- Talk through your findings.