Fall 2017
Lecture: Wednesdays 6:10-8pm (but see weekly schedule)
Location: 207 Union Theological Seminary
Instructor: Thomas Brambor
tb2729@columbia.edu
IAB 509E Thurs 11-12pm
TA1: Shriya Balaji Palsamudram
sbp2148@columbia.edu
IAB 270 Time TBA
TA2: Sahil Manocha
sahil.manocha@columbia.edu
IAB 270 Time TBA
This course provides a detailed tour of how to access, clean, “munge,” and organize data, both big and small. (It should also give students a flavor of what would be expected of them in a typical data science interview.) Each week will have simple, moderate, and complex examples in class, with code to follow. Students will then practice additional exercises at home. The end point of each project is to get the data organized and cleaned enough that it sits in a data frame, ready for subsequent analysis and graphing. Therefore, no analysis or visualization will be taught beyond the basic tables and plots needed to check that everything was organized correctly; this frees up substantial time for the “nitty-gritty” of data wrangling.
All lecture materials, exercises, and (links to) readings will be made available in the GitHub course repository.
This is a new course. The materials and topics indicated below are a provisional roadmap that will be adjusted to the needs of the students. I will let you know well ahead of time of any changes.
For all questions to the members of the teaching team, we will be using the discussion forum that is integrated into Columbia's Canvas. The forum will be used to exchange questions about lectures, assignments, software etc. Students are encouraged to help each other!
Students are asked to customize their Canvas notification preferences to receive immediate (ASAP) notifications of messages and announcements through the third-party provider of their choice (e.g. email, SMS/text). Students are also asked to log into the course regularly (more than twice a week) and to check Announcements and the Canvas Inbox immediately upon logging in, to stay on top of developments in the course as they occur.
Please send emails and messages to the instructor and teaching assistants through Canvas. Messages sent through the Canvas Inbox (Send a Message) feature will be answered within 24 hours during the week and within 48 hours on weekends. Please consider these response times when asking about assignments etc.
There are no required books for the course. All required readings will be provided as PDFs or links. However, here are some books that you may find useful in addition to the lectures and course readings.
- Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (1st ed.). O’Reilly Media. A great introduction to using R. From the creator of many R packages that we use in the course, this will help with the usual tasks of data import and management, modeling, and some visualization. Book is available for free online.
- Wickham, H. (2014). Advanced R (1st ed.). Boca Raton, FL: Chapman and Hall/CRC. Book is available for free online.
- Boehmke, B. C. (2016). Data Wrangling with R (1st ed.). New York, NY: Springer.
- IDRE at UCLA has lots of tutorials and code examples for R and other statistical packages.
- Try R. In-browser, interactive online tutorial. Particularly useful if you have not used R (much) before.
- Cheat sheets for data wrangling, data visualization, general use of R, RStudio, R Markdown etc.
- RStudio resources for R Markdown. Get started here with markdown.
- Awesome-R: a curated list of great R packages and tools.
- Git
  - Clients
  - Tutorials
    - Setting up git
    - Try.github
    - Hello World: GitHub for the non-programming beginner.
    - Guides at GitHub
    - Git and GitHub guide from plot.ly: an extensive, screenshot-guided intro to Git, GitHub, Git in RStudio, and GitHub Pages.
    - Pro Git: a full book with lots of details.
- http://stackoverflow.com/ Programming Q&A site. Excellent first stop if you have questions on coding. Search for keywords, and restrict your queries by adding tags for the coding language or package in square brackets, e.g. [R], [ggplot], or [shiny].
- http://stats.stackexchange.com/ A Stack Overflow offshoot with a bit more focus on conceptual questions in statistics.
- http://rseek.org/ Search engine for R-related stuff, including tutorials and code.
This course will guide you through the data wrangling process using the software package R for most exercises. The program R itself can be downloaded for free at http://cran.r-project.org/.
Some familiarity with the software, in particular with regard to the base functions in R, is assumed. Knowledge of specific packages and other software tools will be built throughout the course. If you have extensive experience with other similar programming tools, say Python or Matlab, you will be fine. However, if you are completely new to R and do not have compensatory experience in other coding languages, please consider taking the QMSS course "Data Mining" instead.
You will need to have access to your own computer to install software and packages, do your assignments etc. I highly recommend bringing your laptop to class to follow along with the coding tutorials and examples.
Homework
Homework problems will be assigned on a weekly basis, and students are expected to work on them alone.
Exams
We will have an in-class final, which will require the students to generate code during class to perform common operations, just as they would find in a data science interview.
Grade Distribution
The distribution of the parts for your grade is as follows:
- Final Exam = 30%
- Homework Assignments = 60%
- Attendance and Participation = 10%
Attendance and Class Participation
Your attendance and participation are necessary at every meeting. This class will work best when students ask a lot of questions.
Academic Integrity
This course is based on the principles of academic integrity established by Columbia University and agreed to by each student. The same rules hold in this course. Academic dishonesty will not be tolerated. All submitted work must be your own work and properly cited.
The full guidelines on academic integrity, as well as a review of how and what to cite, can be found here: http://gsas.columbia.edu/academic-integrity
Students found guilty of plagiarism or academic dishonesty will be subject to appropriate disciplinary action, which may include reduction of grade, a failure in the course, suspension or expulsion. This includes lab reports – if they are copied from another student, severe penalties may be applied. Note that plagiarism is also possible when writing code, so be careful to write your own code.
Late Assignment Policy
Students will lose points for handing in late assignments, at the discretion of the instructor and teaching assistant.
Other
Turn off or silence your cell phones prior to the beginning of class. I reserve the right to answer all calls (yours, not mine) received during class time and let your friends know what you are learning that day.
Feel free to use laptops in class - in fact, I encourage it. Out of respect for your classmates and me, please refrain from using Facebook, shopping sites or other random distractions during class.
Changes
There may be adjustments of readings, assignments, exams, and classrooms. Changes will be posted on Courseworks along with other announcements.
Slides
Lecture slides will be made available on the course website. However, I believe that learning and understanding are better served when you need to aggregate and structure your notes yourself, so I suggest you do so as well.
- On your own: Install R and RStudio on your own computer. Try out R Markdown (use the tutorial to get familiar).
- On your own:
- Sign up for a GitHub account.
- Install GitHub Desktop (if you are confident in using command-line Git or have a different software preference, feel free to skip this step.)
- Claim your private repository connected with this class.
- Reading:
- On your own: Install the tidyverse package.
- Reading:
- Why R is Hard to Learn, by Robert A. Muenchen
- Reading:
- Functions in Advanced R by Hadley Wickham
- Some basics of code styling. Style Guide in R packages, by Hadley Wickham
- Reading:
- Swirl – R Programming – Lesson 9 – Functions, by Johnny Chan
- Writing An R Package From Scratch, by Hilary Parker
- Instructions for Creating Your Own R Package, by Song Kim, Phil Martin and Nina McMurry
- Further reading (not required): Wickham, H. (2015). R Packages: Organize, Test, Document, and Share Your Code (1st ed.). Sebastopol, CA: O’Reilly Media. Available online for free.
- On your own: Install the stringr package.
- Reading:
- Handling and Processing Strings in R, by Gaston Sanchez
- Strings in R for Data Science, by Hadley Wickham
- Reading:
- Using Data.gov APIs in R, University of Virginia Library
- Scraping via APIs, by Bradley Boehmke
- Reading:
- Using R to download and parse JSON: an example using data from an open data portal, by Zev Ross
- Better handling of JSON data in R?, by Rolf Fredheim
- Introduction to tidyjson, by Jeremy Stanley
- Reading:
- Using rvest to Scrape an HTML Table, by Cory Nissen
- How To Screen-scrape, by Chris Bail
- Reading:
- Practice using SQLZOO
- Reading:
- Basic Introduction into Algorithms and Data Structures, by Frauke Liers
- Introduction to Pseudocode, by Carnegie Mellon’s Robotics Academy
- Reading:
- A comprehensive beginner’s guide to start ML with Amazon Web Services (AWS), by Aarshay Jain
- Analyzing Your Data on the AWS Cloud (with R), by Tal Galili
- Five ways to handle Big Data in R, by Oliver Bracht