
Commit 6d67336

Merge pull request #4362 from ovh/data_processing_apache_spark_notebook_data_cleaning_tuto

Data processing apache spark notebook data cleaning tuto

2 parents 6ff31db + 6181aed · commit 6d67336

18 files changed: +1205 −0 lines changed

pages/index.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -852,6 +852,7 @@
 + [Tutorials](public-cloud-data-analytics-data-processing-tutorials)
 + [Python - Calculating π number with Apache Spark](platform/data-processing/40_TUTORIAL_calculate_pi)
 + [Python - Analyzing most used words in lyrics with Apache Spark](platform/data-processing/41_TUTORIAL_wordcount)
++ [Notebooks for Apache Spark - Data cleaning](platform/data-processing/42_TUTORIAL_notebook-data-cleaning)
 + [Data Platforms](bare-metal-cloud-data-platforms)
 + [Logs Data Platform](public-cloud-data-platforms-logs-data-platform)
 + [Getting started](public-cloud-data-platforms-logs-data-platform-getting-started)
```
Lines changed: 82 additions & 0 deletions

---
title: Notebooks for Apache Spark - Data cleaning
slug: notebook-spark-data-cleaning
excerpt: Data cleaning of 2 CSV datasets with aggregation into a single clean Parquet file
section: Tutorials
order: 3
routes:
    canonical: 'https://docs.ovh.com/de/data-processing/notebook-spark-data-cleaning/'
updated: 2023-03-14
---

**Last updated March 14th, 2023**

> [!primary]
>
> The Notebooks for Apache Spark feature is in `alpha`. During the alpha-testing phase, the infrastructure's availability and data longevity are not guaranteed. Please do not use this service for production applications while this phase is ongoing.
>

## Objective

The purpose of this tutorial is to show how to clean data with [Apache Spark](https://spark.apache.org/) inside a [Jupyter Notebook](https://jupyter.org/).

*Data Cleaning* or *Data Cleansing* is the preparation of raw data by detecting and correcting corrupt or inaccurate records within a dataset.

This tutorial presents a simple data cleaning workflow with `Notebooks for Apache Spark`.
## Requirements

- A [Public Cloud project](https://www.ovhcloud.com/de/public-cloud/) in your OVHcloud account
- Access to the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.de/&ovhSubsidiary=de)

## Instructions

### Upload data

First, download these two CSV dataset files locally:

- [books.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/books.csv)
- [ratings.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/ratings.csv)

Then, from the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.de/&ovhSubsidiary=de), go to the Object Storage section, locate your S3 bucket and upload your data by clicking `Add object`{.action}.

Select both files from your computer and add them to the root (`/`) of your bucket.
![image](images/object-storage-datasets.png){.thumbnail}

### Retrieve bucket credentials

> [!warning]
>
> Please be aware that notebooks are only available in `public access` during the `alpha` of the Notebooks for Apache Spark feature. As such, be careful about the **data** and the **credentials** you may expose in these notebooks.

The notebook will need a few pieces of information as inputs.

First, while on the container page of the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.de/&ovhSubsidiary=de), copy the `Endpoint` value and save it.

Go back to the Object Storage home page, open the `S3 users` tab, then copy the user's `access key` and save it.

Finally, click the `...`{.action} button at the end of the user row, click `View the secret key`{.action}, then copy the value and save it.
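The three values saved above are what Spark needs in order to reach the bucket. One way to pass them is through the Hadoop `s3a` connector settings, sketched below; the endpoint URL and key values are placeholders, not real credentials:

```python
from pyspark.sql import SparkSession

# Placeholders: substitute the Endpoint, access key and secret key saved
# in the previous steps (the endpoint shown here is illustrative).
S3_ENDPOINT = "https://s3.<region>.io.cloud.ovh.net"
S3_ACCESS_KEY = "<access-key>"
S3_SECRET_KEY = "<secret-key>"

spark = (
    SparkSession.builder.appName("data-cleaning")
    # s3a is the Hadoop connector Spark uses for S3-compatible storage.
    .config("spark.hadoop.fs.s3a.endpoint", S3_ENDPOINT)
    .config("spark.hadoop.fs.s3a.access.key", S3_ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.secret.key", S3_SECRET_KEY)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```

With this configuration in place, `s3a://<bucket>/...` paths become readable and writable from the notebook.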
### Launch and access a Notebook for Apache Spark

From the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.de/&ovhSubsidiary=de), go to the Data Processing section and create a new notebook by clicking `Data Processing`{.action} and then `Create notebook`{.action}.

You can then reach the `JupyterLab` URL directly from the notebooks list or from the notebook page.

### Experiment with the notebook

Now that your initial datasets are in Object Storage and a notebook is running, you can start cleaning the data.
A preview of this notebook can be found on [GitHub](https://github.com/ovh/data-processing-samples/blob/master/apache_spark_notebook_data_cleaning/apache_spark_notebook_data_cleaning_tutorial.ipynb).

### Go further

- Do you want to turn your notebook into a replayable data cleaning job? [Please refer to this guide](https://docs.ovh.com/de/data-processing/submit-python/).

## Feedback

Please send us your questions, feedback and suggestions to improve the service:

- On the OVHcloud [Discord server](https://discord.com/invite/vXVurFfwe9)
Lines changed: 79 additions & 0 deletions

---
title: Notebooks for Apache Spark - Data cleaning
slug: notebook-spark-data-cleaning
excerpt: Data cleaning of 2 CSV datasets with aggregation into a single clean Parquet file
section: Tutorials
order: 3
updated: 2023-03-14
---

**Last updated March 14th, 2023**

> [!primary]
>
> The Notebooks for Apache Spark feature is in `alpha`. During the alpha-testing phase, the infrastructure's availability and data longevity are not guaranteed. Please do not use this service for production applications while this phase is ongoing.
>

## Objective

The purpose of this tutorial is to show how to clean data with [Apache Spark](https://spark.apache.org/) inside a [Jupyter Notebook](https://jupyter.org/).

*Data Cleaning* or *Data Cleansing* is the preparation of raw data by detecting and correcting corrupt or inaccurate records within a dataset.

This tutorial presents a simple data cleaning workflow with `Notebooks for Apache Spark`.

## Requirements

- A [Public Cloud project](https://www.ovhcloud.com/asia/public-cloud/) in your OVHcloud account
- Access to the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/asia/&ovhSubsidiary=asia)

## Instructions

### Upload data

First, download these two CSV dataset files locally:

- [books.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/books.csv)
- [ratings.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/ratings.csv)

Then, from the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/asia/&ovhSubsidiary=asia), go to the Object Storage section, locate your S3 bucket and upload your data by clicking `Add object`{.action}.

Select both files from your computer and add them to the root (`/`) of your bucket.

![image](images/object-storage-datasets.png){.thumbnail}

### Retrieve bucket credentials

> [!warning]
>
> Please be aware that notebooks are only available in `public access` during the `alpha` of the Notebooks for Apache Spark feature. As such, be careful about the **data** and the **credentials** you may expose in these notebooks.

The notebook will need a few pieces of information as inputs.

First, while on the container page of the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/asia/&ovhSubsidiary=asia), copy the `Endpoint` value and save it.

Go back to the Object Storage home page, open the `S3 users` tab, then copy the user's `access key` and save it.

Finally, click the `...`{.action} button at the end of the user row, click `View the secret key`{.action}, then copy the value and save it.

### Launch and access a Notebook for Apache Spark

From the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/asia/&ovhSubsidiary=asia), go to the Data Processing section and create a new notebook by clicking `Data Processing`{.action} and then `Create notebook`{.action}.

You can then reach the `JupyterLab` URL directly from the notebooks list or from the notebook page.

### Experiment with the notebook

Now that your initial datasets are in Object Storage and a notebook is running, you can start cleaning the data.

A preview of this notebook can be found on [GitHub](https://github.com/ovh/data-processing-samples/blob/master/apache_spark_notebook_data_cleaning/apache_spark_notebook_data_cleaning_tutorial.ipynb).

### Go further

- Do you want to turn your notebook into a replayable data cleaning job? [Please refer to this guide](https://docs.ovh.com/asia/en/data-processing/submit-python/).

## Feedback

Please send us your questions, feedback and suggestions to improve the service:

- On the OVHcloud [Discord server](https://discord.com/invite/vXVurFfwe9)
Lines changed: 79 additions & 0 deletions

---
title: Notebooks for Apache Spark - Data cleaning
slug: notebook-spark-data-cleaning
excerpt: Data cleaning of 2 CSV datasets with aggregation into a single clean Parquet file
section: Tutorials
order: 3
updated: 2023-03-14
---

**Last updated March 14th, 2023**

> [!primary]
>
> The Notebooks for Apache Spark feature is in `alpha`. During the alpha-testing phase, the infrastructure's availability and data longevity are not guaranteed. Please do not use this service for production applications while this phase is ongoing.
>

## Objective

The purpose of this tutorial is to show how to clean data with [Apache Spark](https://spark.apache.org/) inside a [Jupyter Notebook](https://jupyter.org/).

*Data Cleaning* or *Data Cleansing* is the preparation of raw data by detecting and correcting corrupt or inaccurate records within a dataset.

This tutorial presents a simple data cleaning workflow with `Notebooks for Apache Spark`.

## Requirements

- A [Public Cloud project](https://www.ovhcloud.com/en-au/public-cloud/) in your OVHcloud account
- Access to the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com.au/&ovhSubsidiary=au)

## Instructions

### Upload data

First, download these two CSV dataset files locally:

- [books.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/books.csv)
- [ratings.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/ratings.csv)

Then, from the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com.au/&ovhSubsidiary=au), go to the Object Storage section, locate your S3 bucket and upload your data by clicking `Add object`{.action}.

Select both files from your computer and add them to the root (`/`) of your bucket.

![image](images/object-storage-datasets.png){.thumbnail}

### Retrieve bucket credentials

> [!warning]
>
> Please be aware that notebooks are only available in `public access` during the `alpha` of the Notebooks for Apache Spark feature. As such, be careful about the **data** and the **credentials** you may expose in these notebooks.

The notebook will need a few pieces of information as inputs.

First, while on the container page of the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com.au/&ovhSubsidiary=au), copy the `Endpoint` value and save it.

Go back to the Object Storage home page, open the `S3 users` tab, then copy the user's `access key` and save it.

Finally, click the `...`{.action} button at the end of the user row, click `View the secret key`{.action}, then copy the value and save it.

### Launch and access a Notebook for Apache Spark

From the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com.au/&ovhSubsidiary=au), go to the Data Processing section and create a new notebook by clicking `Data Processing`{.action} and then `Create notebook`{.action}.

You can then reach the `JupyterLab` URL directly from the notebooks list or from the notebook page.

### Experiment with the notebook

Now that your initial datasets are in Object Storage and a notebook is running, you can start cleaning the data.

A preview of this notebook can be found on [GitHub](https://github.com/ovh/data-processing-samples/blob/master/apache_spark_notebook_data_cleaning/apache_spark_notebook_data_cleaning_tutorial.ipynb).

### Go further

- Do you want to turn your notebook into a replayable data cleaning job? [Please refer to this guide](https://docs.ovh.com/au/en/data-processing/submit-python/).

## Feedback

Please send us your questions, feedback and suggestions to improve the service:

- On the OVHcloud [Discord server](https://discord.com/invite/vXVurFfwe9)
Lines changed: 79 additions & 0 deletions

---
title: Notebooks for Apache Spark - Data cleaning
slug: notebook-spark-data-cleaning
excerpt: Data cleaning of 2 CSV datasets with aggregation into a single clean Parquet file
section: Tutorials
order: 3
updated: 2023-03-14
---

**Last updated March 14th, 2023**

> [!primary]
>
> The Notebooks for Apache Spark feature is in `alpha`. During the alpha-testing phase, the infrastructure's availability and data longevity are not guaranteed. Please do not use this service for production applications while this phase is ongoing.
>

## Objective

The purpose of this tutorial is to show how to clean data with [Apache Spark](https://spark.apache.org/) inside a [Jupyter Notebook](https://jupyter.org/).

*Data Cleaning* or *Data Cleansing* is the preparation of raw data by detecting and correcting corrupt or inaccurate records within a dataset.

This tutorial presents a simple data cleaning workflow with `Notebooks for Apache Spark`.

## Requirements

- A [Public Cloud project](https://www.ovhcloud.com/en-ca/public-cloud/) in your OVHcloud account
- Access to the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/ca/en/&ovhSubsidiary=ca)

## Instructions

### Upload data

First, download these two CSV dataset files locally:

- [books.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/books.csv)
- [ratings.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/ratings.csv)

Then, from the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/ca/en/&ovhSubsidiary=ca), go to the Object Storage section, locate your S3 bucket and upload your data by clicking `Add object`{.action}.

Select both files from your computer and add them to the root (`/`) of your bucket.

![image](images/object-storage-datasets.png){.thumbnail}

### Retrieve bucket credentials

> [!warning]
>
> Please be aware that notebooks are only available in `public access` during the `alpha` of the Notebooks for Apache Spark feature. As such, be careful about the **data** and the **credentials** you may expose in these notebooks.

The notebook will need a few pieces of information as inputs.

First, while on the container page of the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/ca/en/&ovhSubsidiary=ca), copy the `Endpoint` value and save it.

Go back to the Object Storage home page, open the `S3 users` tab, then copy the user's `access key` and save it.

Finally, click the `...`{.action} button at the end of the user row, click `View the secret key`{.action}, then copy the value and save it.

### Launch and access a Notebook for Apache Spark

From the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/ca/en/&ovhSubsidiary=ca), go to the Data Processing section and create a new notebook by clicking `Data Processing`{.action} and then `Create notebook`{.action}.

You can then reach the `JupyterLab` URL directly from the notebooks list or from the notebook page.

### Experiment with the notebook

Now that your initial datasets are in Object Storage and a notebook is running, you can start cleaning the data.

A preview of this notebook can be found on [GitHub](https://github.com/ovh/data-processing-samples/blob/master/apache_spark_notebook_data_cleaning/apache_spark_notebook_data_cleaning_tutorial.ipynb).

### Go further

- Do you want to turn your notebook into a replayable data cleaning job? [Please refer to this guide](https://docs.ovh.com/ca/en/data-processing/submit-python/).

## Feedback

Please send us your questions, feedback and suggestions to improve the service:

- On the OVHcloud [Discord server](https://discord.com/invite/vXVurFfwe9)
