Commit 55e8b33

Merge pull request #234731 from TheovanKraay/cassandra-lucene-index
lucene index preview
2 parents 76feca1 + 2bbbc77 commit 55e8b33

7 files changed

+195
-0
lines changed

articles/managed-instance-apache-cassandra/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,8 @@
   href: configure-hybrid-cluster.md
 - name: Deploy Spark Cluster with Databricks
   href: deploy-cluster-databricks.md
+- name: Search using Lucene Index
+  href: search-lucene-index.md
 - name: Tutorials
   items:
   - name: Migration

articles/managed-instance-apache-cassandra/create-cluster-portal.md

Lines changed: 11 additions & 0 deletions
@@ -36,6 +36,8 @@ If you don't have an Azure subscription, create a [free account](https://azure.m
 * **Resource Group** - Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. For more information, see [Azure Resource Group](../azure-resource-manager/management/overview.md) overview article.
 * **Cluster name** - Enter a name for your cluster.
 * **Location** - Location where your cluster will be deployed to.
+* **Cassandra version** - Version of Apache Cassandra that will be deployed.
+* **Extension** - Extensions that will be added, including [Cassandra Lucene Index](search-lucene-index.md).
 * **Initial Cassandra admin password** - Password that is used to create the cluster.
 * **Confirm Cassandra admin password** - Reenter your password.
 * **Virtual Network** - Select an existing Virtual Network and Subnet, or create a new one.
@@ -176,6 +178,15 @@ The service allows update to Cassandra YAML configuration on a datacenter via th
 > - cdc_raw_directory
 > - saved_caches_directory

+## De-allocate cluster
+
+1. For non-production environments, you can pause/de-allocate resources in the cluster to avoid being charged for them (you will continue to be charged for storage). First change the cluster type to `NonProduction`, then `deallocate`.
+
+> [!WARNING]
+> Do not execute any schema or write operations during de-allocation. This can lead to data loss and, in rare cases, schema corruption requiring manual intervention from the support team.
+
+:::image type="content" source="./media/create-cluster-portal/pause-cluster.png" alt-text="Screenshot of pausing a cluster." lightbox="./media/create-cluster-portal/pause-cluster.png" border="true":::
+
 ## Troubleshooting

 If you encounter an error when applying permissions to your Virtual Network using Azure CLI, such as *Cannot find user or service principal in graph database for 'e5007d2c-4b13-4a74-9b6a-605d99f03501'*, you can apply the same permission manually from the Azure portal. Learn how to do this [here](add-service-principal.md).

articles/managed-instance-apache-cassandra/index.yml

Lines changed: 4 additions & 0 deletions
@@ -25,6 +25,8 @@ landingContent:
   links:
     - text: What is Azure Managed Instance for Apache Cassandra?
       url: introduction.md
+    - text: Management Operations
+      url: management-operations.md
 - linkListType: get-started
   links:
     - text: Differences between Azure Managed Instance for Apache Cassandra and Azure Cosmos DB for Apache Cassandra
@@ -49,6 +51,8 @@ landingContent:
       url: create-multi-region-cluster.md
     - text: Deploy a Spark cluster with Azure Databricks
       url: deploy-cluster-databricks.md
+    - text: Search using Lucene Index
+      url: search-lucene-index.md
 - linkListType: concept
   links:
     - text: Security overview
Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
---
title: Quickstart - Search Azure Managed Instance for Apache Cassandra using Stratio's Cassandra Lucene Index.
description: This quickstart shows how to search Azure Managed Instance for Apache Cassandra cluster using Stratio's Cassandra Lucene Index.
author: TheovanKraay
ms.author: thvankra
ms.service: managed-instance-apache-cassandra
ms.topic: quickstart
ms.date: 04/17/2023
---
# Quickstart: Search Azure Managed Instance for Apache Cassandra using Lucene Index (Preview)

Cassandra Lucene Index, derived from Stratio Cassandra, is a plugin for Apache Cassandra that extends its index functionality to provide full text search capabilities and free multivariable, geospatial and bitemporal search. It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. This quickstart demonstrates how to search Azure Managed Instance for Apache Cassandra using Lucene Index.

> [!IMPORTANT]
> Lucene Index is in public preview.
> This feature is provided without a service level agreement, and it's not recommended for production workloads.
> For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

> [!WARNING]
> A limitation of the Lucene index plugin is that cross-partition searches cannot be executed solely in the index: Cassandra needs to send the query to each node. For cross-partition searches, this can increase memory and CPU load enough to affect steady-state workloads.
>
> Where search requirements are significant, we recommend deploying a dedicated secondary data center to be used only for searches, with a minimal number of nodes, each having a high number of cores (minimum 16). The keyspaces in your primary (operational) data center should then be configured to replicate data to your secondary (search) data center.
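
As a rough illustration of the replication step described in the warning above, the following Python sketch renders an `ALTER KEYSPACE` statement that lists both data centers. The function name `replication_cql`, the data center name `search-dc`, and the replica counts are placeholders chosen for this example, not values defined by the service; substitute the names of your own data centers.

```python
from typing import Dict


def replication_cql(keyspace: str, replicas_by_dc: Dict[str, int]) -> str:
    """Render an ALTER KEYSPACE statement that replicates to every listed data center."""
    if not replicas_by_dc:
        raise ValueError("at least one data center is required")
    options = ["'class': 'NetworkTopologyStrategy'"]
    for dc_name, replica_count in replicas_by_dc.items():
        options.append(f"'{dc_name}': {replica_count}")
    return f"ALTER KEYSPACE {keyspace} WITH REPLICATION = {{{', '.join(options)}}};"


# 'datacenter-1' is the operational data center used in this quickstart;
# 'search-dc' is a hypothetical name for the dedicated search data center.
statement = replication_cql("demo", {"datacenter-1": 3, "search-dc": 3})
print(statement)
```

The printed statement can be run from `CQLSH` once the secondary data center exists.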

## Prerequisites

- If you don't have an Azure subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin.
- Deploy an Azure Managed Instance for Apache Cassandra cluster. You can do this via the [portal](create-cluster-portal.md). Lucene indexes are enabled by default when clusters are deployed from the portal. To add Lucene indexes to an existing cluster, click `Update` in the portal overview blade, select `Cassandra Lucene Index`, and click `Update` to deploy.

:::image type="content" source="./media/search-lucene-index/update-cluster.png" alt-text="Screenshot of Update Cassandra Cluster Properties." lightbox="./media/search-lucene-index/update-cluster.png" border="true":::

- Connect to your cluster from [CQLSH](create-cluster-portal.md#connecting-from-cqlsh).

## Create data with Lucene Index

1. In your `CQLSH` command window, create a keyspace and table as below:

    ```SQL
    CREATE KEYSPACE demo
    WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'datacenter-1': 3};
    USE demo;
    CREATE TABLE tweets (
       id INT PRIMARY KEY,
       user TEXT,
       body TEXT,
       time TIMESTAMP,
       latitude FLOAT,
       longitude FLOAT
    );
    ```

1. Now create a custom secondary index on the table using Lucene Index:

    ```SQL
    CREATE CUSTOM INDEX tweets_index ON tweets ()
    USING 'com.stratio.cassandra.lucene.Index'
    WITH OPTIONS = {
       'refresh_seconds': '1',
       'schema': '{
          fields: {
             id: {type: "integer"},
             user: {type: "string"},
             body: {type: "text", analyzer: "english"},
             time: {type: "date", pattern: "yyyy/MM/dd"},
             place: {type: "geo_point", latitude: "latitude", longitude: "longitude"}
          }
       }'
    };
    ```

1. Insert the following sample tweets:

    ```SQL
    INSERT INTO tweets (id,user,body,time,latitude,longitude) VALUES (1,'theo','Make money fast, 5 easy tips', '2023-04-01T11:21:59.001+0000', 0.0, 0.0);
    INSERT INTO tweets (id,user,body,time,latitude,longitude) VALUES (2,'theo','Click my link, like my stuff!', '2023-04-01T11:21:59.001+0000', 0.0, 0.0);
    INSERT INTO tweets (id,user,body,time,latitude,longitude) VALUES (3,'quetzal','Click my link, like my stuff!', '2023-04-02T11:21:59.001+0000', 0.0, 0.0);
    INSERT INTO tweets (id,user,body,time,latitude,longitude) VALUES (4,'quetzal','Click my link, like my stuff!', '2023-04-01T11:21:59.001+0000', 40.3930, -3.7328);
    INSERT INTO tweets (id,user,body,time,latitude,longitude) VALUES (5,'quetzal','Click my link, like my stuff!', '2023-04-01T11:21:59.001+0000', 40.3930, -3.7329);
    ```

## Control read consistency

1. The index you created earlier indexes the columns listed in its schema with the specified types, and the read index used for searching is refreshed once per second (`refresh_seconds`). Alternatively, you can explicitly refresh all the index shards with an empty search at consistency `ALL`:

    ```SQL
    CONSISTENCY ALL
    SELECT * FROM tweets WHERE expr(tweets_index, '{refresh:true}');
    CONSISTENCY QUORUM
    ```

1. Now, you can search for tweets within a certain date range:

    ```SQL
    SELECT * FROM tweets WHERE expr(tweets_index, '{filter: {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"}}');
    ```

1. This search can also be performed by forcing an explicit refresh of the involved index shards:

    ```SQL
    SELECT * FROM tweets WHERE expr(tweets_index, '{
       filter: {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
       refresh: true
    }') LIMIT 100;
    ```
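
When the date-range searches above are issued from application code rather than typed into `CQLSH`, the bounds have to be formatted with the same `yyyy/MM/dd` pattern that the index schema declares for the `time` field. The following is a minimal Python sketch of one way to do that. The helper names `range_filter` and `search_cql` are hypothetical, and the sketch assumes the plugin accepts strict JSON (quoted keys) as well as the relaxed form used in this quickstart.

```python
import json
from datetime import date

# Python equivalent of the "yyyy/MM/dd" pattern declared in the index schema.
LUCENE_DATE_FORMAT = "%Y/%m/%d"


def range_filter(field, lower, upper):
    """Build a range filter whose bounds follow the index's date pattern."""
    return {
        "type": "range",
        "field": field,
        "lower": lower.strftime(LUCENE_DATE_FORMAT),
        "upper": upper.strftime(LUCENE_DATE_FORMAT),
    }


def search_cql(table, index, search, limit=None):
    """Render a SELECT statement that passes the search as a CQL string literal."""
    # Single quotes inside a CQL string literal are escaped by doubling them.
    expression = json.dumps(search).replace("'", "''")
    statement = f"SELECT * FROM {table} WHERE expr({index}, '{expression}')"
    if limit is not None:
        statement += f" LIMIT {limit}"
    return statement + ";"


statement = search_cql(
    "tweets",
    "tweets_index",
    {"filter": range_filter("time", date(2023, 3, 1), date(2023, 5, 1))},
)
print(statement)
```

Serializing the search with `json.dumps` avoids hand-written quoting mistakes; the table and index names are still interpolated directly, so they should come from trusted code rather than user input.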

## Search data

1. To find the 100 most relevant tweets whose `body` field contains the phrase “Click my link” within a particular date range:

    ```SQL
    SELECT * FROM tweets WHERE expr(tweets_index, '{
       filter: {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
       query: {type: "phrase", field: "body", value: "Click my link", slop: 1}
    }') LIMIT 100;
    ```

1. To refine the search to get only the tweets written by users whose names start with "q":

    ```SQL
    SELECT * FROM tweets WHERE expr(tweets_index, '{
       filter: [
          {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
          {type: "prefix", field: "user", value: "q"}
       ],
       query: {type: "phrase", field: "body", value: "Click my link", slop: 1}
    }') LIMIT 100;
    ```

1. To get the 100 most recent filtered results, use the `sort` option:

    ```SQL
    SELECT * FROM tweets WHERE expr(tweets_index, '{
       filter: [
          {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
          {type: "prefix", field: "user", value: "q"}
       ],
       query: {type: "phrase", field: "body", value: "Click my link", slop: 1},
       sort: {field: "time", reverse: true}
    }') LIMIT 100;
    ```

1. The previous search can be restricted to tweets created close to a geographical position:

    ```SQL
    SELECT * FROM tweets WHERE expr(tweets_index, '{
       filter: [
          {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
          {type: "prefix", field: "user", value: "q"},
          {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "1km"}
       ],
       query: {type: "phrase", field: "body", value: "Click my link", slop: 1},
       sort: {field: "time", reverse: true}
    }') LIMIT 100;
    ```

1. It is also possible to sort the results by distance to a geographical position:

    ```SQL
    SELECT * FROM tweets WHERE expr(tweets_index, '{
       filter: [
          {type: "range", field: "time", lower: "2023/03/01", upper: "2023/05/01"},
          {type: "prefix", field: "user", value: "q"},
          {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "1km"}
       ],
       query: {type: "phrase", field: "body", value: "Click my link", slop: 1},
       sort: [
          {field: "time", reverse: true},
          {field: "place", type: "geo_distance", latitude: 40.3930, longitude: -3.7328}
       ]
    }') LIMIT 100;
    ```
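
To build intuition for which sample tweets the `geo_distance` condition in the last two searches can match, the following Python sketch computes great-circle distances from the search position to each sample row. It is only an approximation for illustration: the haversine formula and the mean Earth radius used here are assumptions of this example, and the plugin's own distance computation may differ slightly.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# (latitude, longitude) of the sample tweets inserted earlier, keyed by id.
SAMPLE_POSITIONS = {
    1: (0.0, 0.0),
    2: (0.0, 0.0),
    3: (0.0, 0.0),
    4: (40.3930, -3.7328),
    5: (40.3930, -3.7329),
}
CENTER = (40.3930, -3.7328)  # position used in the geo_distance filter


def ids_within(center, max_km, positions):
    """Return the ids whose position lies within max_km of center, in ascending order."""
    return sorted(
        tweet_id
        for tweet_id, (lat, lon) in positions.items()
        if haversine_km(center[0], center[1], lat, lon) <= max_km
    )


nearby = ids_within(CENTER, 1.0, SAMPLE_POSITIONS)
print(nearby)  # [4, 5]
```

Only tweets 4 and 5 lie within 1 km of the search position, so they are the only sample rows that can satisfy the `geo_distance` condition; the other filters in the query still apply on top of it.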


## Next steps

In this quickstart, you learned how to search an Azure Managed Instance for Apache Cassandra cluster using Lucene Index. You can now start working with the cluster:

> [!div class="nextstepaction"]
> [Deploy a Managed Apache Spark Cluster with Azure Databricks](deploy-cluster-databricks.md)
