|
| 1 | +# Foreword {-} |
| 2 | + |
| 3 | +*Roger D. Peng* |
| 4 | + |
| 5 | +*Johns Hopkins Bloomberg School of Public Health* |
| 6 | + |
| 7 | +*2022-01-04* |
| 8 | + |
| 9 | +The field of data science has grown significantly in recent years, |
| 10 | +attracting excitement and interest from many different directions. The demand for introductory |
| 11 | +educational materials has risen along with the field itself, leading to |
| 12 | +a proliferation of textbooks, courses, blog posts, and tutorials. This book is an important |
| 13 | +contribution to this fast-growing literature, but given the wide availability of materials, a |
| 14 | +reader should be inclined to ask, "What is the unique contribution of *this* book?" In order |
| 15 | +to answer that question it is useful to step back for a moment and consider the development |
| 16 | +of the field of data science over the past few years. |
| 17 | + |
| 18 | +When thinking about data science, it is important to consider two questions: "What is |
| 19 | +data science?" and "How should one do data science?" The former question is under active |
| 20 | +discussion amongst a broad community of researchers and practitioners and there does |
| 21 | +not appear to be much consensus to date. However, there seems to be a general understanding |
| 22 | +that data science focuses on the more "active" elements—data wrangling, cleaning, and |
| 23 | +analysis—of answering questions with data. These elements are often highly |
| 24 | +problem-specific and may seem difficult to generalize across applications. Nevertheless, over time we |
| 25 | +have seen some core elements emerge that appear to repeat themselves as useful concepts |
| 26 | +across different problems. Given the lack of clear agreement over the definition of data |
| 27 | +science, there is a strong need for a book like this one to propose a vision for what the field |
| 28 | +is and what the implications are for the activities in which members of the field engage. |
| 29 | + |
| 30 | +The first important concept addressed by this book is tidy data, which is a format for |
| 31 | +tabular data formally introduced to the statistical community in a 2014 paper by Hadley |
| 32 | +Wickham. The tidy data organization strategy has proven a powerful abstract concept for |
| 33 | +conducting data analysis, in large part because of the vast toolchain implemented in the |
| 34 | +Tidyverse collection of R packages. The second key concept is the development of workflows |
| 35 | +for reproducible and auditable data analyses. Modern data analyses have only grown in |
| 36 | +complexity due to the availability of data and the ease with which we can implement complex |
| 37 | +data analysis procedures. Furthermore, these data analyses are often part of |
| 38 | +decision-making processes that may have significant impacts on people and communities. Therefore, |
| 39 | +there is a critical need to build reproducible analyses that can be studied and repeated by |
| 40 | +others in a reliable manner. Third, statistical methods represent an important element |
| 41 | +of data science for building prediction and classification models and for making inferences |
| 42 | +about unobserved populations. Finally, because a field can succeed only if it fosters an |
| 43 | +active and collaborative community, it has become clear that being fluent in the tools of |
| 44 | +collaboration is a core element of data science. |
| 45 | + |
| 46 | +This book takes these core concepts and focuses on how one can apply them to *do* data |
| 47 | +science in a rigorous manner. Students who learn from this book will be well-versed in |
| 48 | +the techniques and principles behind producing reliable evidence from data. This book is |
| 49 | +centered around the use of the R programming language within the tidy data framework, |
| 50 | +and as such employs the most recent advances in data analysis coding. The use of Jupyter |
| 51 | +notebooks for exercises immediately places the student in an environment that encourages |
| 52 | +auditability and reproducibility of analyses. The integration of git and GitHub into the |
| 53 | +course is a key tool for teaching about collaboration and community, concepts that are |
| 54 | +critical to data science. |
| 55 | + |
| 56 | +The demand for training in data science continues to increase. The availability of large |
| 57 | +quantities of data to answer a variety of questions, the computational power available to |
| 58 | +many more people than ever before, and the public awareness of the importance of data for |
| 59 | +decision-making have all contributed to the need for high-quality data science work. This |
| 60 | +book provides a sophisticated first introduction to the field of data science and offers |
| 61 | +a balanced mix of practical skills along with generalizable principles. As we continue to |
| 62 | +introduce students to data science and train them to confront an expanding array of data |
| 63 | +science problems, they will be well-served by the ideas presented here. |