---
title: Notebooks for Apache Spark - Data cleaning
slug: notebook-spark-data-cleaning
excerpt: Data cleaning of two CSV datasets with aggregation into a single clean Parquet file.
section: Tutorials
order: 3
updated: 2023-03-10
---

**Last updated March 10th, 2023**

## Objective

The purpose of this tutorial is to show how to clean data with [Apache Spark](https://spark.apache.org/) inside a [Jupyter Notebook](https://jupyter.org/).

*Data cleaning* (or *data cleansing*) is the preparation of raw data by detecting and correcting incomplete, duplicate or inaccurate records within a dataset.

This tutorial presents a simple data cleaning workflow with `Notebooks for Apache Spark`.

## Requirements

- Access to the [OVHcloud control panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.co.uk/&ovhSubsidiary=GB){.external}
- A Public Cloud project

## Instructions

### Upload data

First, download these two CSV dataset files locally:
* [books.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/books.csv)
* [ratings.csv](https://raw.githubusercontent.com/ovh/data-processing-samples/master/apache_spark_notebook_data_cleaning/ratings.csv)

Then, from the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.co.uk/&ovhSubsidiary=GB), go to the Object Storage section, locate your S3 bucket and upload your data by clicking `Add object`{.action}.

Select both files from your computer and add them to the root `/` of your bucket.

{.thumbnail}

### Retrieve bucket credentials

We will need a few pieces of information as inputs for the notebook.

First, while still on the container page of the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.co.uk/&ovhSubsidiary=GB), copy the `Endpoint` value and save it.

Go back to the Object Storage home page, open the S3 users tab, then copy the user's `access key` and save it.

Finally, click the "hamburger" action menu at the end of the user row `(...)`{.action} > `View the secret key`{.action}, copy the value and save it.

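These three saved values are typically passed to the Spark session so it can read your bucket over the S3A connector. The snippet below is a minimal, hypothetical sketch, not the notebook's exact setup: the endpoint URL, `<ACCESS_KEY>` and `<SECRET_KEY>` are placeholders to replace with the values you just saved.

```python
from pyspark.sql import SparkSession

# Placeholders: substitute the Endpoint, access key and secret key
# saved from the Control Panel in the previous steps.
spark = (
    SparkSession.builder
    .appName("notebook-data-cleaning")
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.<region>.io.cloud.ovh.net")
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)
```

With this configuration in place, `s3a://<bucket>/...` paths become readable and writable from the notebook.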
### Launch and access a Notebook for Apache Spark

From the [OVHcloud Control Panel](https://www.ovh.com/auth/?action=gotomanager&from=https://www.ovh.co.uk/&ovhSubsidiary=GB), go to the Data Processing section and create a new notebook by clicking `Data Processing`{.action} > `Create notebook`{.action}.

You can then reach the `JupyterLab` URL directly from the notebooks list or from the notebook page.

### Experiment with notebook

Now that your initial datasets are ready on Object Storage and a notebook is running, you can start cleaning this data!

A preview of this notebook can be found on [GitHub](https://github.com/ovh/data-processing-samples/blob/master/apache_spark_notebook_data_cleaning/apache_spark_notebook_data_cleaning_tutorial.ipynb).

### Go further

- Do you want to create a replayable data cleaning job based on your notebook? [Here is how](https://docs.ovh.com/fr/data-processing/submit-python/).

## Feedback

Please send us your questions, feedback and suggestions to improve the service:

- On the OVHcloud [Discord server](https://discord.com/invite/vXVurFfwe9)
