
Commit bbb63cd

Add guide for all languages.
1 parent ef17aca commit bbb63cd

File tree

15 files changed: +1001 −5 lines changed
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
---
title: Notebooks for Apache Spark - Data cleaning
slug: notebook-spark-data-cleaning
excerpt: Data cleaning of two CSV datasets with aggregation into a single clean Parquet file.
section: Tutorials
order: 3
routes:
  canonical: 'https://docs.ovh.com/de/data-processing/notebook-spark-data-cleaning/'
updated: 2023-03-14
---

**Last updated March 14th, 2023**

## Objective

The purpose of this tutorial is to show how to clean data with [Apache Spark](https://spark.apache.org/) inside a [Jupyter Notebook](https://jupyter.org/).

*Data Cleaning* or *Data Cleansing* is the preparation of raw data by detecting and correcting corrupt or inaccurate records within a dataset.
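As a rough illustration of what "detecting and correcting records" means in practice, here is a minimal sketch using only the Python standard library (the column names and defects are made up for illustration; the notebook itself does this with Spark):

```python
import csv
from io import StringIO

# A tiny, made-up ratings extract with typical defects:
# a missing rating, an exact duplicate row, and stray whitespace.
raw = """book_id,rating
1, 4
2,
1, 4
3,5
"""

cleaned, seen = [], set()
for row in csv.DictReader(StringIO(raw)):
    book_id = row["book_id"].strip()
    rating = row["rating"].strip()
    if not rating:            # detect: missing value -> drop the record
        continue
    key = (book_id, rating)
    if key in seen:           # detect: exact duplicate -> keep the first only
        continue
    seen.add(key)
    cleaned.append({"book_id": book_id, "rating": int(rating)})

print(cleaned)  # 2 of the 4 input rows survive
```

The same three operations (trim, drop nulls, deduplicate) map directly onto Spark DataFrame transformations in the notebook.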
This tutorial presents a simple data cleaning workflow with `Notebooks for Apache Spark`.

## Requirements

- A [Public Cloud project](https://www.ovhcloud.com/de/public-cloud/) in your OVHcloud account
- Access to the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.de/&ovhSubsidiary=de)

## Instructions

### Upload data

First, download these two CSV dataset files locally:

* [books.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/books.csv)
* [ratings.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/ratings.csv)

Then, from the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.de/&ovhSubsidiary=de), go to the Object Storage section, locate your S3 bucket and upload your data by clicking `Add object`{.action}.

Select both files from your computer and add them to the root `/` of your bucket.

![image](images/object-storage-datasets.png){.thumbnail}

### Retrieve bucket credentials

There are a few pieces of information that we will need as inputs for the notebook.

First, while on the container page of the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.de/&ovhSubsidiary=de), copy the `Endpoint` value and save it.

Go back to the Object Storage home page, open the S3 users tab, then copy the user's `access key` and save it.

Finally, click the `...`{.action} menu at the end of the user row, select `View the secret key`{.action}, then copy the value and save it.
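Inside the notebook, these three values are typically wired into Spark's S3A connector. A hedged sketch of the corresponding configuration (the endpoint and keys are placeholders, and the property names assume the standard Hadoop `s3a` filesystem rather than anything confirmed by this guide):

```
spark.hadoop.fs.s3a.endpoint            <your-endpoint>
spark.hadoop.fs.s3a.access.key          <your-access-key>
spark.hadoop.fs.s3a.secret.key          <your-secret-key>
spark.hadoop.fs.s3a.path.style.access   true
```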
50+
51+
### Launch and access a Notebook for Apache Spark
52+
53+
From the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.de/&ovhSubsidiary=de), go to the Data Processing section and create a new notebook by clicking `Data Processing`{.action} > `Create notebook`{.action}.
54+
55+
You can then reach the `JupyterLab` URL directly from the notebooks list or from the notebook page.
56+
57+
### Experiment with notebook
58+
59+
Now that you have your initial datasets ready on an Object Storage and a notebook running, you could start cleaning this data!
60+
61+
A preview of this notebook can be found on [GitHub](https://github.com/ovh/data-processing-samples/blob/master/apache_spark_notebook_data_cleaning/apache_spark_notebook_data_cleaning_tutorial.ipynb).
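The notebook itself relies on Spark DataFrames, but the overall flow — load both CSVs, join them on the book identifier, aggregate the ratings, and keep one clean output table — can be sketched in plain Python (the column names and sample values are made up for illustration):

```python
import csv
from collections import defaultdict
from io import StringIO

# Stand-ins for the two downloaded CSV files.
books_csv = "book_id,title\n1,Dune\n2,Emma\n"
ratings_csv = "book_id,rating\n1,4\n1,5\n2,3\n"

# Load both datasets.
books = {r["book_id"]: r["title"] for r in csv.DictReader(StringIO(books_csv))}
scores = defaultdict(list)
for r in csv.DictReader(StringIO(ratings_csv)):
    scores[r["book_id"]].append(float(r["rating"]))

# Join on book_id and aggregate ratings into a single clean table
# (the notebook would write this result out as one Parquet file).
clean = [
    {"title": books[bid], "avg_rating": sum(vals) / len(vals)}
    for bid, vals in sorted(scores.items())
]
print(clean)
```

In Spark the same steps become `spark.read.csv`, a `join` on the key column, a `groupBy().avg()` aggregation, and `write.parquet`.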
### Go further

- Do you want to create a replayable data cleaning job based on your notebook? [Here it is](https://docs.ovh.com/de/data-processing/submit-python/).

## Feedback

Please send us your questions, feedback and suggestions to improve the service:

- On the OVHcloud [Discord server](https://discord.com/invite/vXVurFfwe9)
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
---
title: Notebooks for Apache Spark - Data cleaning
slug: notebook-spark-data-cleaning
excerpt: Data cleaning of two CSV datasets with aggregation into a single clean Parquet file.
section: Tutorials
order: 3
updated: 2023-03-14
---

**Last updated March 14th, 2023**

## Objective

The purpose of this tutorial is to show how to clean data with [Apache Spark](https://spark.apache.org/) inside a [Jupyter Notebook](https://jupyter.org/).

*Data Cleaning* or *Data Cleansing* is the preparation of raw data by detecting and correcting corrupt or inaccurate records within a dataset.

This tutorial presents a simple data cleaning workflow with `Notebooks for Apache Spark`.

## Requirements

- A [Public Cloud project](https://www.ovhcloud.com/asia/public-cloud/) in your OVHcloud account
- Access to the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/asia/&ovhSubsidiary=asia)

## Instructions

### Upload data

First, download these two CSV dataset files locally:

* [books.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/books.csv)
* [ratings.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/ratings.csv)

Then, from the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/asia/&ovhSubsidiary=asia), go to the Object Storage section, locate your S3 bucket and upload your data by clicking `Add object`{.action}.

Select both files from your computer and add them to the root `/` of your bucket.

![image](images/object-storage-datasets.png){.thumbnail}

### Retrieve bucket credentials

There are a few pieces of information that we will need as inputs for the notebook.

First, while on the container page of the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/asia/&ovhSubsidiary=asia), copy the `Endpoint` value and save it.

Go back to the Object Storage home page, open the S3 users tab, then copy the user's `access key` and save it.

Finally, click the `...`{.action} menu at the end of the user row, select `View the secret key`{.action}, then copy the value and save it.

### Launch and access a Notebook for Apache Spark

From the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/asia/&ovhSubsidiary=asia), go to the Data Processing section and create a new notebook by clicking `Data Processing`{.action} > `Create notebook`{.action}.

You can then reach the `JupyterLab` URL directly from the notebooks list or from the notebook page.

### Experiment with notebook

Now that your initial datasets are ready in Object Storage and a notebook is running, you can start cleaning the data!

A preview of this notebook can be found on [GitHub](https://github.com/ovh/data-processing-samples/blob/master/apache_spark_notebook_data_cleaning/apache_spark_notebook_data_cleaning_tutorial.ipynb).

### Go further

- Do you want to create a replayable data cleaning job based on your notebook? [Here it is](https://docs.ovh.com/asia/en/data-processing/submit-python/).

## Feedback

Please send us your questions, feedback and suggestions to improve the service:

- On the OVHcloud [Discord server](https://discord.com/invite/vXVurFfwe9)
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
---
title: Notebooks for Apache Spark - Data cleaning
slug: notebook-spark-data-cleaning
excerpt: Data cleaning of two CSV datasets with aggregation into a single clean Parquet file.
section: Tutorials
order: 3
updated: 2023-03-14
---

**Last updated March 14th, 2023**

## Objective

The purpose of this tutorial is to show how to clean data with [Apache Spark](https://spark.apache.org/) inside a [Jupyter Notebook](https://jupyter.org/).

*Data Cleaning* or *Data Cleansing* is the preparation of raw data by detecting and correcting corrupt or inaccurate records within a dataset.

This tutorial presents a simple data cleaning workflow with `Notebooks for Apache Spark`.

## Requirements

- A [Public Cloud project](https://www.ovhcloud.com/en-au/public-cloud/) in your OVHcloud account
- Access to the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com.au/&ovhSubsidiary=au)

## Instructions

### Upload data

First, download these two CSV dataset files locally:

* [books.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/books.csv)
* [ratings.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/ratings.csv)

Then, from the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com.au/&ovhSubsidiary=au), go to the Object Storage section, locate your S3 bucket and upload your data by clicking `Add object`{.action}.

Select both files from your computer and add them to the root `/` of your bucket.

![image](images/object-storage-datasets.png){.thumbnail}

### Retrieve bucket credentials

There are a few pieces of information that we will need as inputs for the notebook.

First, while on the container page of the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com.au/&ovhSubsidiary=au), copy the `Endpoint` value and save it.

Go back to the Object Storage home page, open the S3 users tab, then copy the user's `access key` and save it.

Finally, click the `...`{.action} menu at the end of the user row, select `View the secret key`{.action}, then copy the value and save it.

### Launch and access a Notebook for Apache Spark

From the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com.au/&ovhSubsidiary=au), go to the Data Processing section and create a new notebook by clicking `Data Processing`{.action} > `Create notebook`{.action}.

You can then reach the `JupyterLab` URL directly from the notebooks list or from the notebook page.

### Experiment with notebook

Now that your initial datasets are ready in Object Storage and a notebook is running, you can start cleaning the data!

A preview of this notebook can be found on [GitHub](https://github.com/ovh/data-processing-samples/blob/master/apache_spark_notebook_data_cleaning/apache_spark_notebook_data_cleaning_tutorial.ipynb).

### Go further

- Do you want to create a replayable data cleaning job based on your notebook? [Here it is](https://docs.ovh.com/au/en/data-processing/submit-python/).

## Feedback

Please send us your questions, feedback and suggestions to improve the service:

- On the OVHcloud [Discord server](https://discord.com/invite/vXVurFfwe9)
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
---
title: Notebooks for Apache Spark - Data cleaning
slug: notebook-spark-data-cleaning
excerpt: Data cleaning of two CSV datasets with aggregation into a single clean Parquet file.
section: Tutorials
order: 3
updated: 2023-03-14
---

**Last updated March 14th, 2023**

## Objective

The purpose of this tutorial is to show how to clean data with [Apache Spark](https://spark.apache.org/) inside a [Jupyter Notebook](https://jupyter.org/).

*Data Cleaning* or *Data Cleansing* is the preparation of raw data by detecting and correcting corrupt or inaccurate records within a dataset.

This tutorial presents a simple data cleaning workflow with `Notebooks for Apache Spark`.

## Requirements

- A [Public Cloud project](https://www.ovhcloud.com/en-ca/public-cloud/) in your OVHcloud account
- Access to the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/ca/en/&ovhSubsidiary=ca)

## Instructions

### Upload data

First, download these two CSV dataset files locally:

* [books.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/books.csv)
* [ratings.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/ratings.csv)

Then, from the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/ca/en/&ovhSubsidiary=ca), go to the Object Storage section, locate your S3 bucket and upload your data by clicking `Add object`{.action}.

Select both files from your computer and add them to the root `/` of your bucket.

![image](images/object-storage-datasets.png){.thumbnail}

### Retrieve bucket credentials

There are a few pieces of information that we will need as inputs for the notebook.

First, while on the container page of the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/ca/en/&ovhSubsidiary=ca), copy the `Endpoint` value and save it.

Go back to the Object Storage home page, open the S3 users tab, then copy the user's `access key` and save it.

Finally, click the `...`{.action} menu at the end of the user row, select `View the secret key`{.action}, then copy the value and save it.

### Launch and access a Notebook for Apache Spark

From the [OVHcloud Control Panel](https://ca.ovh.com/auth/?action=gotomanager&from=https://www.ovh.com/ca/en/&ovhSubsidiary=ca), go to the Data Processing section and create a new notebook by clicking `Data Processing`{.action} > `Create notebook`{.action}.

You can then reach the `JupyterLab` URL directly from the notebooks list or from the notebook page.

### Experiment with notebook

Now that your initial datasets are ready in Object Storage and a notebook is running, you can start cleaning the data!

A preview of this notebook can be found on [GitHub](https://github.com/ovh/data-processing-samples/blob/master/apache_spark_notebook_data_cleaning/apache_spark_notebook_data_cleaning_tutorial.ipynb).

### Go further

- Do you want to create a replayable data cleaning job based on your notebook? [Here it is](https://docs.ovh.com/ca/en/data-processing/submit-python/).

## Feedback

Please send us your questions, feedback and suggestions to improve the service:

- On the OVHcloud [Discord server](https://discord.com/invite/vXVurFfwe9)

pages/platform/data-processing/42_TUTORIAL_notebook-data-cleaning/guide.en-gb.md

Lines changed: 5 additions & 5 deletions
@@ -4,10 +4,10 @@ slug: notebook-spark-data-cleaning
 excerpt: Data cleaning of 2 CSV dataset with aggregation into a single clean Parquet file.
 section: Tutorials
 order: 3
-updated: 2023-03-10
+updated: 2023-03-14
 ---

-**Last updated March 10th, 2023**
+**Last updated March 14th, 2023**

 ## Objective

@@ -19,8 +19,8 @@ The tutorial presents a simple data cleaning with `Notebooks for Apache Spark`.

 ## Requirements

-- Access to the [OVHcloud control panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.co.uk/&ovhSubsidiary=GB){.external}
-- A Public Cloud project
+- A [Public Cloud project](https://www.ovhcloud.com/en-gb/public-cloud/) in your OVHcloud account
+- Access to the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.co.uk/&ovhSubsidiary=GB)

 ## Instructions

@@ -60,7 +60,7 @@ A preview of this notebook can be found on [GitHub](https://github.com/ovh/data-

 ### Go further

-- Do you want to create a data cleaning job you could replay based on your notebook? [Here it is](https://docs.ovh.com/fr/data-processing/submit-python/).
+- Do you want to create a data cleaning job you could replay based on your notebook? [Here it is](https://docs.ovh.com/gb/en/data-processing/submit-python/).

 ## Feedback

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
---
title: Notebooks for Apache Spark - Data cleaning
slug: notebook-spark-data-cleaning
excerpt: Data cleaning of two CSV datasets with aggregation into a single clean Parquet file.
section: Tutorials
order: 3
updated: 2023-03-14
---

**Last updated March 14th, 2023**

## Objective

The purpose of this tutorial is to show how to clean data with [Apache Spark](https://spark.apache.org/) inside a [Jupyter Notebook](https://jupyter.org/).

*Data Cleaning* or *Data Cleansing* is the preparation of raw data by detecting and correcting corrupt or inaccurate records within a dataset.

This tutorial presents a simple data cleaning workflow with `Notebooks for Apache Spark`.

## Requirements

- A [Public Cloud project](https://www.ovhcloud.com/en-ie/public-cloud/) in your OVHcloud account
- Access to the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.ie/&ovhSubsidiary=ie)

## Instructions

### Upload data

First, download these two CSV dataset files locally:

* [books.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/books.csv)
* [ratings.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/ratings.csv)

Then, from the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.ie/&ovhSubsidiary=ie), go to the Object Storage section, locate your S3 bucket and upload your data by clicking `Add object`{.action}.

Select both files from your computer and add them to the root `/` of your bucket.

![image](images/object-storage-datasets.png){.thumbnail}

### Retrieve bucket credentials

There are a few pieces of information that we will need as inputs for the notebook.

First, while on the container page of the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.ie/&ovhSubsidiary=ie), copy the `Endpoint` value and save it.

Go back to the Object Storage home page, open the S3 users tab, then copy the user's `access key` and save it.

Finally, click the `...`{.action} menu at the end of the user row, select `View the secret key`{.action}, then copy the value and save it.

### Launch and access a Notebook for Apache Spark

From the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.ie/&ovhSubsidiary=ie), go to the Data Processing section and create a new notebook by clicking `Data Processing`{.action} > `Create notebook`{.action}.

You can then reach the `JupyterLab` URL directly from the notebooks list or from the notebook page.

### Experiment with notebook

Now that your initial datasets are ready in Object Storage and a notebook is running, you can start cleaning the data!

A preview of this notebook can be found on [GitHub](https://github.com/ovh/data-processing-samples/blob/master/apache_spark_notebook_data_cleaning/apache_spark_notebook_data_cleaning_tutorial.ipynb).

### Go further

- Do you want to create a replayable data cleaning job based on your notebook? [Here it is](https://docs.ovh.com/ie/en/data-processing/submit-python/).

## Feedback

Please send us your questions, feedback and suggestions to improve the service:

- On the OVHcloud [Discord server](https://discord.com/invite/vXVurFfwe9)

0 commit comments
