
Commit 22ed43d

SamyOubouaziz authored and RoRoJ committed

docs(dlb): doc update MTA-5604 (#4423)

* docs(dlb): doc update MTA-5604
* docs(dlb): update
* docs(dlb): update
* Update faq/data-lab.mdx
  Co-authored-by: Rowena Jones <[email protected]>
* Update faq/data-lab.mdx
  Co-authored-by: Rowena Jones <[email protected]>
* docs(dlb): update
* docs(dlb): update

---------

Co-authored-by: Rowena Jones <[email protected]>

1 parent e6a8d1c commit 22ed43d

File tree

4 files changed: +55 −14 lines changed

faq/data-lab.mdx

Lines changed: 45 additions & 7 deletions

```diff
@@ -5,27 +5,65 @@ meta:
   content:
     h1: Distributed Data Lab FAQ
 dates:
-  validation: 2025-02-06
+  validation: 2025-02-18
 category: managed-services
 productIcon: DistributedDataLabProductIcon
 ---
 
-## What is Apache Spark?
+## General
+
+### What workloads is Distributed Data Lab suited for?
+
+Distributed Data Lab supports a range of workloads, including:
+
+- Complex analytics.
+- Machine learning tasks.
+- High-speed operations on large datasets.
+
+It offers scalable CPU and GPU instances with flexible node limits, and robust Apache Spark library support.
+
+### What is Apache Spark?
 
 Apache Spark is an open-source unified analytics engine designed for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark offers high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.
 
-## How does Apache Spark work?
+### How does Apache Spark work?
 
 Apache Spark processes data in memory, which allows it to perform tasks up to 100 times faster than traditional disk-based processing frameworks like [Hadoop MapReduce](https://fr.wikipedia.org/wiki/MapReduce). It uses Resilient Distributed Datasets (RDDs) to store data across multiple nodes in a cluster and perform parallel operations on this data.
 
-## How am I billed for Distributed Data Lab?
+### How am I billed for Distributed Data Lab?
 
 Distributed Data Lab is billed based on two factors:
 - the main node configuration selected
 - the worker node configuration selected, and the number of worker nodes in the cluster
 
-## Can I upscale or downscale a Distributed Data Lab?
+## Clusters
+
+### Can I upscale or downscale a Distributed Data Lab?
+
+Yes, you can upscale a Data Lab cluster to distribute your workloads across more worker nodes for faster processing. You can also scale it down to zero to reduce costs, while retaining your configuration and context.
+
+You can still access the notebook of a Data Lab cluster with zero worker nodes, but you cannot perform any calculations. You can resume the activity of your cluster by provisioning at least one worker node.
+
+### Can I run a Distributed Data Lab using GPUs?
+
+Yes, you can run your cluster on either CPUs or GPUs. Scaleway leverages Nvidia's [RAPIDS Accelerator For Apache Spark](https://www.nvidia.com/en-gb/deep-learning-ai/software/rapids/), an open-source suite of software libraries and APIs to execute end-to-end data science and analytics pipelines entirely on GPUs. This technology allows for significant acceleration of data processing tasks compared to CPU-based processing.
+
+## Storage
+
+### What data source options are available?
+
+Data Lab natively integrates with Scaleway Object Storage for reading and writing data, making it easy to process data directly from your buckets. Your buckets are accessible using the Scaleway console, or any other Amazon S3-compatible CLI tool.
+
+### Can I connect to S3 buckets from other cloud providers?
+
+Currently, connections are limited to Scaleway's Object Storage environment.
+
+## Notebook
+
+### What notebook is included with Dedicated Data Labs?
+
+The service provides a JupyterLab notebook running on a dedicated CPU instance, fully integrated with the Apache Spark cluster for seamless data processing and calculations.
 
-Yes, you can upscale a Data Lab cluster to distribute your workloads across a greater number of worker nodes for faster processing. You can also scale it down to zero to reduce costs, while retaining your configuration and context.
+### Can I connect my local JupyterLab to the Data Lab?
 
-You can still access the notebook of a Data Lab cluster with zero worker nodes, but you cannot perform any calculation. You can resume the activity of your cluster by provisioning at least one worker node.
+Remote connections to a Data Lab cluster are currently not supported.
```
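The FAQ answer above describes Spark's model: data is split into partitions (RDDs) held in memory across worker nodes, and operations run on each partition in parallel before results are combined. As a rough conceptual illustration of that partition/map/reduce flow — plain Python, not Spark's actual API — a minimal sketch might look like this:

```python
from functools import reduce
from operator import add

def partition(data, n):
    """Split data into n roughly equal chunks, analogous to RDD partitions."""
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def map_partitions(parts, fn):
    """Apply fn to every element of every partition (Spark would run this on workers)."""
    return [[fn(x) for x in part] for part in parts]

def reduce_partitions(parts, fn):
    """Reduce within each partition, then combine the per-partition partial results."""
    partials = [reduce(fn, part) for part in parts if part]
    return reduce(fn, partials)

# Toy job: sum of squares of 1..100, computed over 4 "partitions".
data = list(range(1, 101))
parts = partition(data, 4)
squared = map_partitions(parts, lambda x: x * x)
total = reduce_partitions(squared, add)
print(total)  # 338350
```

In real Spark the equivalent would be a chain like `rdd.map(...).reduce(...)`, with the partitions distributed across cluster nodes rather than held in one process.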

pages/data-lab/concepts.mdx

Lines changed: 4 additions & 0 deletions

```diff
@@ -24,6 +24,10 @@ A Distributed Data Lab is a data lab that is distributed across multiple worker
 
 A fixture is a set of data forming a request used for testing purposes.
 
+## GPU
+
+GPUs (Graphics Processing Units) allow Apache Spark to accelerate computations for tasks that involve large-scale parallel processing, such as machine learning and specific data analytics, significantly reducing the processing time for massive datasets and preparation for AI models.
+
 ## JupyterLab
 
 JupyterLab is a web-based platform for interactive computing, letting you work with notebooks, code, and data all in one place. It builds on the classic Jupyter Notebook by offering a more flexible and integrated user interface, making it easier to handle various file formats and interactive components.
```

pages/data-lab/how-to/connect-to-data-lab.mdx

Lines changed: 2 additions & 2 deletions

```diff
@@ -30,7 +30,7 @@ categories:
 
 4. Enter your [API secret key](/iam/concepts/#api-key) when prompted for a password, then click **Log in**. You are directed to the lab's home screen.
 
-5. In the files list on the left, double-click the `quickstart.ipynb` file to open it.
+5. In the files list on the left, double-click the `DatalabDemo.ipynb` file to open it.
 
 6. Update the first cell of the file with your API access key and secret key, as shown below:
 
@@ -41,4 +41,4 @@ categories:
 
 Your notebook environment is now ready to be used.
 
-7. Optionally, follow the instructions contained in the `quickstart.ipynb` file to process a test batch of data.
+7. Optionally, follow the instructions contained in the `DatalabDemo.ipynb` file to process a test batch of data.
```

pages/data-lab/quickstart.mdx

Lines changed: 4 additions & 5 deletions

```diff
@@ -22,7 +22,7 @@ It is composed of the following:
 
 - Notebook: A JupyterLab service operating on a dedicated node type.
 
-Scaleway provides dedicated node types for both the notebook and the cluster. The cluster nodes are high-end machines built for intensive computations, featuring numerous CPUs and substantial RAM.
+Scaleway provides dedicated node types for both the notebook and the cluster. The cluster nodes are high-end machines built for intensive computations, featuring powerful CPUs/GPUs and substantial RAM.
 
 The notebook, although capable of performing some local computations, primarily serves as a web interface for interacting with the Apache Spark cluster.
 
@@ -41,12 +41,11 @@ The notebook, although capable of performing some local computations, primarily
 
 3. Complete the following steps in the wizard:
     - Choose an Apache Spark version from the drop-down menu.
-    - Select a worker node configuration.
+    - Select a worker node configuration. For this procedure, we recommend selecting a CPU rather than a GPU.
     - Enter the desired number of worker nodes.
     <Message type="note">
       Provisioning zero worker nodes lets you retain and access your cluster and notebook configurations, but will not allow you to run calculations.
     </Message>
-    - Optionally, choose an Object Storage bucket as your source of data and the place to store the output of your operations.
     - Enter a name for your Data Lab.
     - Optionally, add a description and/or tags for your Data Lab.
     - Verify the estimated cost.
@@ -65,7 +64,7 @@ The notebook, although capable of performing some local computations, primarily
 
 ## How to run the demo file
 
-Each Distributed Data Lab comes with a default `quickstart.ipynb` demo file for testing purposes. This file contains a preconfigured notebook environment that requires no modification to run.
+Each Distributed Data Lab comes with a default `DatalabDemo.ipynb` demonstration file for testing purposes. This file contains a preconfigured notebook environment that requires no modification to run.
 
 Execute the cells in order to perform pre-determined operations on a dummy data set.
 
@@ -81,7 +80,7 @@ Execute the cells in order to perform pre-determined operations on a dummy data
   "name": "My Spark",
   "conf": {
     "spark.hadoop.fs.s3a.access.key": "your-api-access-key",
-    "spark.hadoop.fs.s3a.secret.key": "your-api-access-key",
+    "spark.hadoop.fs.s3a.secret.key": "your-api-secret-key",
     "spark.hadoop.fs.s3a.endpoint": "your-bucket-endpoint"
   }
 }
```
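The `secret.key` fix in this hunk matters because the S3A connector reads these exact property names; with the access key duplicated into the secret-key slot, authentication against Object Storage would fail. A small Python sketch of building the same session configuration (all values are placeholders, and the sanity check mirrors the bug this commit fixes):

```python
import json

# Placeholder credentials -- substitute your real Scaleway API keys and bucket endpoint.
spark_request = {
    "name": "My Spark",
    "conf": {
        "spark.hadoop.fs.s3a.access.key": "your-api-access-key",
        "spark.hadoop.fs.s3a.secret.key": "your-api-secret-key",  # secret key, not the access key
        "spark.hadoop.fs.s3a.endpoint": "your-bucket-endpoint",
    },
}

# Sanity check: the access key and secret key must differ --
# pasting the access key into both slots was the typo corrected above.
conf = spark_request["conf"]
assert conf["spark.hadoop.fs.s3a.access.key"] != conf["spark.hadoop.fs.s3a.secret.key"]

print(json.dumps(spark_request, indent=2))
```

In the notebook, this dictionary is what the first cell of the demo file expects you to fill in before the Spark session is created.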
