---
title: "Introduction"
teaching: 10
exercises: 0
---


:::::: questions
- What is the point of these exercises?
- How do I find the data I want to work with?

::::::::::::::::::::::

:::::: objectives
- To understand why we start with the Open Data Portal
- To understand the basics of how the datasets are divided up

::::::::::::::::::::::


:::::: callout
## Ready to go?
The first part of this lesson is done entirely in the browser.

Then, [Episode 4: How to access metadata on the command line?](04-cli-through-cernopendata-client.md) requires the use of a command-line tool. It is available as a Docker container or it can be installed with `pip`. Make sure you have completed the [Docker pre-exercise](https://cms-opendata-workshop.github.io/workshopqcd-2024-lesson-docker/) and have Docker installed.

[Episode 5: What is in the data?](05-what-is-in-the-data.md) is done in the ROOT or python tools container. Make sure you have [these containers available](https://cms-opendata-workshop.github.io/workshopqcd-2024-lesson-docker/03-docker-for-cms-opendata.html).

::::::
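As a setup sketch, the two ways of obtaining the command-line tool look roughly like this (the package and image names are assumptions based on the workshop materials; check the pre-exercise pages for the current instructions):

```shell
# Sketch only: package and image names assumed; verify against the
# pre-exercise pages before relying on them.

# Option 1: install the client into a Python environment with pip
python -m pip install cernopendata-client

# Option 2: run the client from its Docker image instead
docker run -i -t --rm cernopendata/cernopendata-client --help
```

Either route gives you the same tool; the Docker option avoids changing your local Python environment.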


:::::: checklist

## You've got a great idea! What's next?

Suppose you have a great idea that you want to test out with real data! You're going to want
to know:

- [x] From what **year** of data taking would the data work best for you?
- [x] In which **primary dataset** were the data of your interest stored?
- [x] What **Monte Carlo datasets** are available and appropriate for your studies?

  This may mean
  - finding simulated physics processes that are **background** to your signal
  - finding simulated physics processes for your **signal**, if they exist
  - possibly just finding simulated datasets where you *know* the answer, allowing you to test your new analysis techniques.

::::::

In this lesson, we'll walk through the process of finding out what data and
Monte Carlo are available to you, how to find them, and how to examine what
is in the individual data files.

:::::: callout
## What is a Monte Carlo dataset?
You'll often hear about physicists using *Monte Carlo data* or sometimes just referring to
*Monte Carlo*. In both cases, we are referring to *simulations* of different physics processes,
which mimic how the particles would move through the CMS detector and how they
might be affected (on average) as they pass through the material. These data are then
processed just like the "real" data (referred to as *collider data*).

In this way, analysts can see how the detector records different physics processes
and best understand how these processes "look" when we analyze the data. This
helps us better understand both the signal that we are looking for and the backgrounds
which might hide what we are looking for.
::::::


First of all, let's understand how the data are stored and why we need certain
tools to access them.


## The CERN Open Data Portal

In some of the earliest discussions about making HEP data publicly available, there were many concerns about
people using and analyzing "other people's" data. The concern centered on well-meaning scientists improperly
analyzing data and coming up with incorrect conclusions.

While no system is perfect, one way to guard against this is to release only well-understood, well-calibrated
datasets and to make sure open data analysts *only* use these datasets. These datasets are given
a [Digital Object Identifier (DOI)](https://www.doi.org/) code for tracking. If there
are ever questions about the validity of the data, this allows us to check the
[data provenance](https://en.wikipedia.org/wiki/Data_lineage#:~:text=Data%20provenance%20refers%20to%20records,the%20data%20and%20its%20origins.).

## DOI

The [Digital Object Identifier (DOI)](https://www.doi.org/) system allows people to assign a unique
ID to any piece of digital media: a book, a piece of music, a software package, or a dataset. If you want to
learn more about the DOI process, see their [FAQ](https://www.doi.org/faq.html).

:::::: challenge
## Challenge!
You will find that all the datasets have their DOI listed at the top of their page on the portal.
Can you locate where the DOI is shown for this dataset, Record 30521,
[DoubleEG primary dataset in NANOAOD format from RunG of 2016 (/DoubleEG/Run2016G-UL2016_MiniAODv2_NanoAODv9-v1/NANOAOD)](https://opendata.cern.ch/record/30521)?

![Screenshot of a portal record page with the DOI shown near the top](assets/img/portal_screenshot_DOI_example.png)

With a DOI, you can create citations to any of these records, for example using a tool like [doi2bib](https://www.doi2bib.org).
::::::
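Besides browsing, each record can also be reached programmatically: the portal serves record metadata as JSON. As a minimal sketch (the `/api/records/<id>` endpoint pattern and the `metadata.doi` field name are assumptions; check the portal's API documentation), this is how one might look up a record's DOI:

```python
# Sketch only: the endpoint pattern and JSON field names below are
# assumptions about the portal's JSON API; verify before relying on them.
import json
from urllib.request import urlopen


def record_url(record_id: int) -> str:
    """Build the (assumed) JSON API URL for a portal record number."""
    return f"https://opendata.cern.ch/api/records/{record_id}"


def fetch_doi(record_id: int) -> str:
    """Fetch a record's JSON metadata and return its DOI string."""
    with urlopen(record_url(record_id)) as resp:
        metadata = json.load(resp)
    return metadata["metadata"]["doi"]


# Example usage (requires network access):
# print(fetch_doi(30521))
```

The DOI string returned this way is the same one shown on the record page, so it can be fed straight into a citation tool.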

## Provenance

You will hear experimentalists refer to the "*provenance*" of a dataset. From the
[Cambridge dictionary](https://dictionary.cambridge.org/us/dictionary/english/provenance), provenance
refers to "*the place of origin of something*".
As we use the term, it refers to how we keep track of the history of how a dataset was
processed: what version of the software was used for reconstruction, what period of calibrations
was used during that processing, etc. In this way, we are documenting the
[data lineage](https://en.wikipedia.org/wiki/Data_lineage#:~:text=Data%20provenance%20refers%20to%20records,the%20data%20and%20its%20origins.)
of our datasets.

:::::: testimonial
## From [Wikipedia](https://en.wikipedia.org/wiki/Data_lineage#:~:text=Data%20provenance%20refers%20to%20records,the%20data%20and%20its%20origins.)
Data lineage includes the data origin, what happens to it, and where it moves over time.
Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.
::::::

Provenance is an important part of our data quality checks
and another reason we want to make sure you are using only vetted and calibrated data.


## This lesson

For all the reasons given above, we encourage you to familiarize yourself with the search features and options
on the portal. With your feedback, we can also work to create better search tools/options and landing
points.

This exercise will guide you through the current approach to finding data and Monte Carlo. Let's go!

:::::: keypoints
- Finding the data is non-trivial, but all the information is on the portal
- A careful understanding of the search options can help with finding what you need

::::::::::::::::::::::