Commit 593f8ba

authored
Episodes from Kati
1 parent 1edd511 commit 593f8ba

5 files changed

+776
-0
lines changed

episodes/01-introduction.md

Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
1+
---
2+
title: "Introduction"
3+
teaching: 10
4+
exercises: 0
5+
---
6+
7+
8+
:::::: questions
9+
- What is the point of these exercises?
10+
- How do I find the data I want to work with?
11+
12+
::::::::::::::::::::::
13+
14+
:::::: objectives
15+
- To understand why we start with the Open Data Portal
16+
- To understand the basics of how the datasets are divided up
17+
18+
::::::::::::::::::::::
19+
20+
21+
:::::: callout
22+
## Ready to go?
23+
The first part of this lesson is done entirely in the browser.
24+
25+
Then, [Episode 4: How to access metadata on the command line?](04-cli-through-cernopendata-client.md) requires a command-line tool, which is available as a Docker container or can be installed with `pip`. Make sure you have completed the [Docker pre-exercise](https://cms-opendata-workshop.github.io/workshopqcd-2024-lesson-docker/) and have Docker installed.
26+
27+
[Episode 5: What is in the data?](05-what-is-in-the-data.md) is done in the ROOT or Python tools container. Make sure you have [these containers available](https://cms-opendata-workshop.github.io/workshopqcd-2024-lesson-docker/03-docker-for-cms-opendata.html).
28+
29+
::::::
30+
31+
32+
:::::: checklist
33+
34+
## You've got a great idea! What's next?
35+
36+
Suppose you have a great idea that you want to test out with real data! You're going to want
37+
to know:
38+
39+
- [x] Which **year** of data taking would work best for you?
40+
- [x] In which **primary dataset** are the data you are interested in stored?
41+
- [x] What **Monte Carlo datasets** are available and appropriate for your studies?
42+
43+
This may mean
44+
- finding simulated physics processes that are **background** to your signal
45+
- finding simulated physics processes for your **signal**, if they exist
46+
- possibly just finding simulated datasets where you *know* the answer, allowing you to test your new analysis techniques.
47+
48+
::::::
49+
50+
In this lesson, we'll walk through the process of finding out what data and
51+
Monte Carlo are available to you, how to find them, and how to examine what
52+
data are in the individual data files.
53+
54+
:::::: callout
55+
## What is a Monte Carlo dataset?
56+
You'll often hear about physicists using *Monte Carlo data* or sometimes just referring to
57+
*Monte Carlo*. In both cases, we are referring to *simulations* of different physics processes,
58+
which then mimic how the particles would move through the CMS detector and how they
59+
might be affected (on average) as they pass through the material. These data are then
60+
processed just like the "real" data (referred to as *collider data*).
61+
62+
In this way, the analysts can see how the detector records different physics processes
63+
and best understand how these processes "look" when we analyze the data. This
64+
helps us better understand both the signal that we are looking for and the backgrounds
65+
which might hide what we are looking for.
66+
::::::
67+
68+
69+
First of all, let's understand how the data are stored and why we need certain
70+
tools to access them.
71+
72+
73+
## The CERN Open Data Portal
74+
75+
In some of the earliest discussions about making high-energy physics (HEP) data publicly available, there were many concerns about
76+
people using and analyzing "other people's" data. The concern centered around well-meaning scientists improperly
77+
analyzing data and coming up with incorrect conclusions.
78+
79+
While no system is perfect, one way to guard against this is to only release well-understood, well-calibrated
80+
datasets and to make sure open data analysts *only* use these datasets. These datasets are given
81+
a [Digital Object Identifier (DOI)](https://www.doi.org/) code for tracking. And if there
82+
are ever questions about the validity of the data, it allows us to check the
83+
[data provenance](https://en.wikipedia.org/wiki/Data_lineage#:~:text=Data%20provenance%20refers%20to%20records,the%20data%20and%20its%20origins.).
84+
85+
## DOI
86+
87+
The [Digital Object Identifier (DOI)](https://www.doi.org/) system allows people to assign a unique
88+
ID to any piece of digital media: a book, a piece of music, a software package, or a dataset. If you want to learn
89+
more about the DOI process, see the [FAQ](https://www.doi.org/faq.html).
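
Because a DOI is a plain string, it is easy to handle in a script. The following is a minimal sketch, not part of the original lesson, that turns a DOI into a link through the `https://doi.org/` resolver. The function name `doi_to_url` and the example DOI are hypothetical placeholders; copy the real DOI from the record page on the portal.

```python
def doi_to_url(doi: str) -> str:
    """Return the https://doi.org resolver URL for a DOI string."""
    cleaned = doi.strip()
    # Accept a few common ways of writing a DOI and reduce them to the bare form.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    # DOIs start with the directory indicator "10." followed by prefix/suffix.
    if not cleaned.startswith("10.") or "/" not in cleaned:
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


# Placeholder DOI for illustration only; it does not point to a real record.
EXAMPLE_DOI = "10.1234/EXAMPLE.ABCD.5678"
print(doi_to_url("doi:" + EXAMPLE_DOI))
```
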
90+
91+
:::::: challenge
92+
## Challenge!
93+
You will find that all the datasets have their DOI listed at the top of their page on the portal.
94+
Can you locate where the DOI is shown for this dataset, Record 30521,
95+
[DoubleEG primary dataset in NANOAOD format from RunG of 2016 (/DoubleEG/Run2016G-UL2016_MiniAODv2_NanoAODv9-v1/NANOAOD)](https://opendata.cern.ch/record/30521)?
96+
97+
Image here from assets/img/portal_screenshot_DOI_example.png
98+
99+
With a DOI, you can create citations to any of these records, for example using a tool like [doi2bib](https://www.doi2bib.org).
100+
::::::
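
The dataset name in the challenge above has three slash-separated parts: the primary dataset, a string describing the run period and processing, and the data format (often called the data tier). As a rough sketch, assuming names always follow this three-part pattern, they can be split in Python. The names `DatasetName` and `parse_dataset_name` are made up for this illustration.

```python
from typing import NamedTuple


class DatasetName(NamedTuple):
    primary_dataset: str  # e.g. "DoubleEG"
    processing: str       # run period and processing version
    data_tier: str        # data format, e.g. "NANOAOD"


def parse_dataset_name(name: str) -> DatasetName:
    """Split a '/primary/processing/tier' dataset name into its parts."""
    parts = name.strip().split("/")
    # A leading slash yields an empty first element, followed by three fields.
    if len(parts) != 4 or parts[0] != "" or not all(parts[1:]):
        raise ValueError(f"unexpected dataset name: {name!r}")
    return DatasetName(*parts[1:])


record = parse_dataset_name(
    "/DoubleEG/Run2016G-UL2016_MiniAODv2_NanoAODv9-v1/NANOAOD"
)
print(record.primary_dataset, record.data_tier)
```

For the record in the challenge, this gives `DoubleEG` as the primary dataset and `NANOAOD` as the format.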
101+
102+
## Provenance
103+
104+
You will hear experimentalists refer to the "*provenance*" of a dataset. From the
105+
[Cambridge dictionary](https://dictionary.cambridge.org/us/dictionary/english/provenance), provenance
106+
refers to "*the place of origin of something*".
107+
As we use it, provenance means the record of how a dataset was
108+
processed: what version of the software was used for reconstruction, what period of calibrations
109+
was used during that processing, etc. In this way, we are documenting the
110+
[data lineage](https://en.wikipedia.org/wiki/Data_lineage#:~:text=Data%20provenance%20refers%20to%20records,the%20data%20and%20its%20origins.)
111+
of our datasets.
112+
113+
:::::: testimonial
114+
## From [Wikipedia](https://en.wikipedia.org/wiki/Data_lineage#:~:text=Data%20provenance%20refers%20to%20records,the%20data%20and%20its%20origins.)
115+
Data lineage includes the data origin, what happens to it and where it moves over time.
116+
Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.
117+
::::::
118+
119+
Provenance is an important part of our data quality checks
120+
and another reason we want to make sure you are using only vetted and calibrated data.
121+
122+
123+
## This lesson
124+
125+
For all the reasons given above, we encourage you to familiarize yourself with the search features and options
126+
on the portal. With your feedback, we can also work to create better search tools/options and landing
127+
points.
128+
129+
This exercise will guide you through the current approach to finding data and Monte Carlo. Let's go!
130+
131+
:::::: keypoints
132+
- Finding the data is non-trivial, but all the information is on the portal
133+
- A careful understanding of the search options can help with finding what you need
134+
135+
::::::::::::::::::::::
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
1+
---
2+
title: "Where are the datasets?"
3+
teaching: 5
4+
exercises: 5
5+
---
6+
7+
8+
:::::: questions
9+
- Where do I find datasets for data and Monte Carlo?
10+
11+
::::::::::::::::::::::
12+
13+
:::::: objectives
14+
- Be able to find the data and Monte Carlo datasets
15+
16+
::::::::::::::::::::::
17+
18+
## CERN Open Data Portal
19+
20+
Our starting point is the landing page for the [CERN Open Data Portal](http://opendata.cern.ch/).
21+
You should definitely take some time to explore it. But for now we will select the
22+
CMS data.
23+
24+
:::::: callout
25+
## CERN Open Data Portal
26+
The landing page for the [CERN Open Data Portal](http://opendata.cern.ch/).
27+
![](fig/portal_screenshot_landing_page.png)
28+
29+
::::::::::::::::::::::
30+
31+
:::::: prereq
32+
## Make a selection!
33+
Find the CMS link under **Focus on** and click on it.
34+
::::::::::::::::::::::
35+
36+
## CMS-specific datasets
37+
38+
The figure below shows the website after we have chosen the CMS data. Note the left-hand
39+
sidebar that allows us to filter our selections. Let's see what's there.
40+
41+
42+
:::::: callout
43+
## CERN Open Data Portal - CMS data
44+
The first pass to filter on CMS data
45+
![](fig/portal_screenshot_cms_selected.png)
46+
::::::::::::::::::::::
47+
48+
At first glance we can see a few things. First, there is an option to select only **Dataset** rather
49+
than documentation or software or similar materials. Great! Going forward we'll select **Dataset**.
50+
51+
Next, scrolling down to see the search options in the left bar, we see that there are a lot of entries for data from **2010**, **2011**, **2012**, **2015**, and **2016**, which cover the 7 TeV, 8 TeV, and 13 TeV running periods.
52+
We'll be working with 2016 data for these exercises.
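
The years and collision energies listed above can be kept together in a small lookup table. This is only an illustrative sketch: the pairing of each year with an energy reflects our understanding of the LHC running periods rather than anything read from the portal, and the names `ENERGY_TEV_BY_YEAR` and `years_at_energy` are made up for this example.

```python
# Assumed pairing of data-taking year and proton-proton collision energy (TeV).
ENERGY_TEV_BY_YEAR = {
    2010: 7,
    2011: 7,
    2012: 8,
    2015: 13,
    2016: 13,
}


def years_at_energy(energy_tev: int) -> list[int]:
    """Return the data-taking years recorded at the given collision energy."""
    return sorted(
        year for year, energy in ENERGY_TEV_BY_YEAR.items() if energy == energy_tev
    )


# The 2016 data used in these exercises belong to the 13 TeV period.
print(ENERGY_TEV_BY_YEAR[2016], years_at_energy(13))
```
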
53+
54+
:::::: prereq
55+
## Make a selection!
56+
For the next episode, let's select **Dataset** and **2016**.
57+
::::::::::::::::::::::
58+
59+
:::::: keypoints
60+
- Use the filter selections in the left-hand sidebar of the CERN Open Data Portal to find datasets.
61+
62+
::::::::::::::::::::::
