Commit 5516f23

docs: add Getting Started with Databricks guide (#7050)
1 parent e4d8c16 commit 5516f23

File tree: 8 files changed (+537 −1 lines)

Lines changed: 2 additions & 1 deletion

@@ -1,5 +1,6 @@
 module.exports = {
   "core": "Cube Core",
   "cloud": "Cube Cloud",
+  "databricks": "Cube Cloud and Databricks",
   "migrate-from-core": "Migrate from Cube Core"
-}
+}

Lines changed: 15 additions & 0 deletions

# Getting started with Cube Cloud and Databricks

This getting started guide will show you how to use Cube Cloud with Databricks.
You will learn how to:

- Load sample data into your Databricks account
- Connect Cube Cloud to Databricks
- Create your first Cube data model
- Connect to a BI tool to explore this model
- Create a React application with the Cube REST API

## Prerequisites

- [Cube Cloud account](https://cubecloud.dev/auth/signup)
- [Databricks account](https://www.databricks.com/try-databricks)

Lines changed: 7 additions & 0 deletions

module.exports = {
  "load-data": "Load data",
  "connect-to-databricks": "Connect to Databricks",
  "create-data-model": "Create data model",
  "query-from-bi": "Query from BI",
  "query-from-react-app": "Query from React"
}

Lines changed: 81 additions & 0 deletions

# Connect to Databricks

In this section, we’ll create a Cube Cloud deployment and connect it to
Databricks. A deployment represents a data model, configuration, and managed
infrastructure.

To continue with this guide, you'll need to have a Cube Cloud account. If you
don't have one yet, [click here to sign up][cube-cloud-signup] for free.

First, [sign in to your Cube Cloud account][cube-cloud-signin]. Then,
click <Btn>Create Deployment</Btn>.

Give the deployment a name, select the cloud provider and region of your choice,
and click <Btn>Next</Btn>:

<Screenshot
  alt="Cube Cloud Create Deployment Screen"
  src="https://ucarecdn.com/2338323e-0db8-4224-8e7a-3b4daf9c60ec/"
/>

<SuccessBox>

Microsoft Azure is available in Cube Cloud on the
[Premium](https://cube.dev/pricing) tier. [Contact us](https://cube.dev/contact)
for details.

</SuccessBox>

## Set up a Cube project

Next, click <Btn>Create</Btn> to create a new project from scratch:

<Screenshot
  alt="Cube Cloud Upload Project Screen"
  src="https://ucarecdn.com/46b72b61-b650-4271-808d-55203f1c8d8b/"
/>

## Connect to your Databricks

The last step is to connect Cube Cloud to Databricks. First, select it from the
grid:

<Screenshot
  alt="Cube Cloud Setup Database Screen"
  src="https://ucarecdn.com/1d656ba9-dd83-4ff4-a59e-8b5f97a9ddcc/"
/>

Then enter your Databricks credentials:

- **Access Token:** A personal access token for your Databricks account. [You
  can generate one][databricks-docs-pat] in your Databricks account settings.
- **Databricks JDBC URL:** The JDBC URL for your Databricks SQL warehouse. [You
  can find it][databricks-docs-jdbc-url] in the SQL warehouse settings screen.
- **Databricks Catalog:** This should match the catalog where you uploaded the
  files in the last section. If left unspecified, the `default` catalog is
  used.

[databricks-docs-pat]:
  https://docs.databricks.com/en/dev-tools/auth.html#databricks-personal-access-tokens-for-workspace-users
[databricks-docs-jdbc-url]:
  https://docs.databricks.com/en/integrations/jdbc-odbc-bi.html#get-connection-details-for-a-sql-warehouse
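
For reference, a SQL warehouse JDBC URL generally has the shape shown below. The
hostname and warehouse ID are placeholders, so copy the exact value from your
warehouse's connection details in Databricks rather than assembling it by hand:

```
jdbc:databricks://<server-hostname>:443;httpPath=/sql/1.0/warehouses/<warehouse-id>
```
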
Click <Btn>Apply</Btn>. Cube Cloud will test the connection and proceed to the
next step.

## Generate data model from your Databricks schema

Cube can now generate a basic data model from your data warehouse, which helps
you get started with data modeling faster. Select all four tables in your
catalog and click through the data model generation wizard. We'll inspect these
generated files in the next section and start making changes to them.
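
For illustration, the cube generated for the `line_items` table might look
roughly like the sketch below. The exact schema, column names, and types depend
on how the files were uploaded, so treat this as a hypothetical example rather
than the wizard's exact output:

```yaml
cubes:
  - name: line_items
    # Hypothetical schema/table name; yours may differ depending on the upload
    sql_table: ECOM.LINE_ITEMS

    dimensions:
      - name: id
        sql: ID
        type: number
        primary_key: true

      - name: created_at
        sql: CREATED_AT
        type: time

    measures:
      - name: count
        type: count
```
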
[aws-docs-sec-group]:
  https://docs.aws.amazon.com/vpc/latest/userguide/security-groups.html
[aws-docs-sec-group-rule]:
  https://docs.aws.amazon.com/vpc/latest/userguide/security-group-rules.html
[cube-cloud-signin]: https://cubecloud.dev/auth
[cube-cloud-signup]: https://cubecloud.dev/auth/signup
[ref-conf-db]: /product/configuration/data-sources
[ref-getting-started-cloud-generate-models]:
  /getting-started/cloud/generate-models

Lines changed: 213 additions & 0 deletions

# Create your first data model

Cube follows a dataset-oriented data modeling approach, which is inspired by and
expands upon dimensional modeling. Cube incorporates this approach and provides
a practical framework for implementing dataset-oriented data modeling.

When building a data model in Cube, you work with two dataset-centric objects:
**cubes** and **views**. **Cubes** usually represent business entities such as
customers, line items, and orders. In cubes, you define all the calculations
within the measures and dimensions of these entities. Additionally, you define
relationships between cubes, such as "an order has many line items" or "a user
may place multiple orders."

**Views** sit on top of a data graph of cubes and create a facade of your entire
data model, with which data consumers can interact. You can think of views as
the final data products for your data consumers: BI users, data apps, AI
agents, etc. When building views, you select measures and dimensions from
different connected cubes and present them as a single dataset to BI or data
apps.

<Diagram
  alt="Architecture diagram of queries being sent to cubes and views"
  src="https://ucarecdn.com/bfc3e04a-b690-40bc-a6f8-14a9175fb4fd/"
/>

## Working with cubes

To begin building your data model, click on <Btn>Enter Development Mode</Btn> in
Cube Cloud. This will take you to your personal developer space, where you can
safely make changes to your data model without affecting the production
environment.

In the previous section, we generated four cubes. To see the data graph of these
four cubes and how they are connected to each other, click the <Btn>Show
Graph</Btn> button on the Data Model page.

Let's review the `orders` cube first and update it with additional dimensions
and measures.

Once you are in development mode, navigate to the <Btn>Data Model</Btn> page and
click on the `orders.yml` file in the left sidebar inside the `model/cubes`
directory to open it.

You should see the following content in the `model/cubes/orders.yml` file:

```yaml
cubes:
  - name: orders
    sql_table: ECOM.ORDERS

    joins:
      - name: users
        sql: "{CUBE}.USER_ID = {users}.USER_ID"
        relationship: many_to_one

    dimensions:
      - name: status
        sql: STATUS
        type: string

      - name: id
        sql: ID
        type: number
        primary_key: true

      - name: created_at
        sql: CREATED_AT
        type: time

      - name: completed_at
        sql: COMPLETED_AT
        type: time

    measures:
      - name: count
        type: count
```

As you can see, we already have a `count` measure that we can use to calculate
the total count of our orders.

Let's add another measure to the `orders` cube to count only **completed
orders**. The `status` dimension in the `orders` cube reflects the three
possible statuses: **processing**, **shipped**, or **completed**. We will create
a new measure, `completed_count`, by applying a filter on that dimension. To do
this, we will use the
[filter parameter](/product/data-modeling/reference/measures#filters) of the
measure and
[refer](/product/data-modeling/fundamentals/syntax#referring-to-objects) to the
existing dimension.

Add the following measure definition to your `model/cubes/orders.yml` file. It
should be included within the `measures` block.

```yaml
- name: completed_count
  type: count
  filters:
    - sql: "{CUBE}.status = 'completed'"
```

With these two measures in place, `count` and `completed_count`, we can create a
**derived measure**. Derived measures are measures that you create based on
existing measures. Let's create the `completed_percentage` derived measure.

Add the following measure definition to your `model/cubes/orders.yml` file
within the `measures` block.

```yaml
- name: completed_percentage
  type: number
  sql: "({completed_count} / NULLIF({count}, 0)) * 100.0"
  format: percent
```

Below you can see what your updated `orders` cube should look like with the two
new measures. Feel free to copy this code and paste it into your
`model/cubes/orders.yml` file.

```yaml
cubes:
  - name: orders
    sql_table: ECOM.ORDERS

    joins:
      - name: users
        sql: "{CUBE}.USER_ID = {users}.USER_ID"
        relationship: many_to_one

    dimensions:
      - name: status
        sql: STATUS
        type: string

      - name: id
        sql: ID
        type: number
        primary_key: true

      - name: created_at
        sql: CREATED_AT
        type: time

      - name: completed_at
        sql: COMPLETED_AT
        type: time

    measures:
      - name: count
        type: count

      - name: completed_count
        type: count
        filters:
          - sql: "{CUBE}.status = 'completed'"

      - name: completed_percentage
        type: number
        sql: "({completed_count} / NULLIF({count}, 0)) * 100.0"
        format: percent
```

Click <Btn>Save All</Btn> in the upper corner to save changes to the data model.
Now, you can navigate to Cube’s Playground. The Playground is a web-based tool
that allows you to query your data without connecting any tools or writing any
code. It's the fastest way to explore and test your data model.

You can select measures and dimensions from different cubes in the Playground,
including your newly created `completed_percentage` measure.
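
Behind the scenes, a Playground selection corresponds to a Cube query. A minimal
sketch of such a query, assuming you pick the `completed_percentage` measure and
the `status` dimension from the `orders` cube, would look like this:

```json
{
  "measures": ["orders.completed_percentage"],
  "dimensions": ["orders.status"]
}
```

The same query shape is what you would later send to the REST API, for example
from a React application.
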
## Working with views

When building views, we recommend following entity-oriented design and
structuring your views around your business entities. Usually, cubes tend to be
normalized entities without duplicated or redundant members, while views are
denormalized entities where you pick as many measures and dimensions from
multiple cubes as needed to describe a business entity.

Let's create our first view, which will provide all necessary measures and
dimensions to explore orders. Views are usually located in the `views` folder
and have a `_view` postfix.

Create `model/views/orders_view.yml` with the following content:

```yaml
views:
  - name: orders_view

    cubes:
      - join_path: orders
        includes:
          - status
          - created_at
          - count
          - completed_count
          - completed_percentage

      - join_path: orders.users
        prefix: true
        includes:
          - city
          - age
          - state
```

When building views, you can leverage the `cubes` parameter, which enables you
to include measures and dimensions from other cubes in the view. You can build
your view by combining multiple joined cubes and specifying the path by which
they should be joined for that particular view.
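
Note that `prefix: true` on the `orders.users` join path causes the included
`users` members to be exposed with a `users_` prefix inside the view. A
hypothetical query against the view could therefore reference them like this:

```json
{
  "measures": ["orders_view.completed_percentage"],
  "dimensions": ["orders_view.users_city"]
}
```
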
After saving, you can experiment with your newly created view in the Playground.
In the next section, we will learn how to query our `orders_view` using a BI
tool.

Lines changed: 34 additions & 0 deletions

# Load data

The following steps will guide you through setting up a Databricks account and
uploading the demo dataset, which is stored as CSV files in a public S3 bucket.

First, download the following files to your local machine:

- [`line_items.csv`](https://cube-tutorial.s3.us-east-2.amazonaws.com/line_items.csv)
- [`orders.csv`](https://cube-tutorial.s3.us-east-2.amazonaws.com/orders.csv)
- [`users.csv`](https://cube-tutorial.s3.us-east-2.amazonaws.com/users.csv)
- [`products.csv`](https://cube-tutorial.s3.us-east-2.amazonaws.com/products.csv)

Next, let's ensure we have a SQL warehouse that is active. Log in to your
Databricks account, then from the sidebar, click on <Btn>SQL → SQL
Warehouses</Btn>:

<Screenshot
  alt="Databricks SQL Warehouses screen"
  src="https://ucarecdn.com/92e82ca3-0ca4-4064-8ed6-394e5a66e869/"
/>

<InfoBox>

Ensure the warehouse is active by checking its status; if it is inactive, click
<Btn>▶️</Btn> to start it.

</InfoBox>

Next, click <Btn>New → File upload</Btn> from the sidebar, and upload
`line_items.csv`. The UI will show a preview of the data within the file; when
ready, click <Btn>Create table</Btn>.

Repeat the above steps for the three other files.
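
Once all four uploads have finished, you can optionally sanity-check the new
tables from the Databricks SQL editor. The catalog and schema below
(`main.default`) are placeholders; substitute the catalog and schema you
selected during the upload:

```sql
-- List the tables created by the file uploads (adjust catalog and schema to match yours)
SHOW TABLES IN main.default;

-- Spot-check the row count of one of the uploaded tables
SELECT COUNT(*) AS row_count FROM main.default.orders;
```
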
