Commit af5428c

feat(presto-clp): Add Docker compose setup for Presto cluster that can connect to clp-json. (#1132)
1 parent a84ce14 commit af5428c

24 files changed (+601, -1 lines changed)

docs/src/user-guide/guides-overview.md

Lines changed: 7 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -12,6 +12,13 @@ Using object storage
1212
Using CLP to ingest logs from object storage and store archives on object storage.
1313
:::
1414

15+
:::{grid-item-card}
16+
:link: guides-using-presto
17+
Using Presto with CLP
18+
^^^
19+
How to use Presto to query compressed logs in CLP.
20+
:::
21+
1522
:::{grid-item-card}
1623
:link: guides-multi-node
1724
Multi-node deployment
Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
1+
# Using Presto with CLP
2+
3+
[Presto] is a distributed SQL query engine that can be used to query data stored in CLP.
4+
This guide describes how to set up and use Presto with CLP.
5+
6+
:::{warning}
7+
Currently, only the [clp-json](quick-start/clp-json.md) flavor of CLP supports queries through
8+
Presto.
9+
:::
10+
11+
:::{note}
12+
This integration with Presto is under development and may change in the future. It is also being
13+
maintained in a [fork][yscope-presto] of the Presto project. Eventually, these changes will be
14+
merged into the main Presto repository so that you can use official Presto releases with CLP.
15+
:::
16+
17+
## Requirements
18+
19+
* [CLP][clp-releases] (clp-json) v0.4.0 or higher
20+
* [Docker] v28 or higher
21+
* [Docker Compose][docker-compose] v2.20.2 or higher
22+
* Python
23+
* python3-venv (for the version of Python installed)
24+
25+
## Set up
26+
27+
Using Presto with CLP requires:
28+
29+
* [Setting up CLP](#setting-up-clp) and compressing some logs.
30+
* [Setting up Presto](#setting-up-presto) to query CLP's metadata database and archives.
31+
32+
### Setting up CLP
33+
34+
Follow the [quick-start](./quick-start/index.md) guide to set up CLP and compress your logs. A
35+
sample dataset that works well with Presto is [postgresql].
36+
37+
### Setting up Presto
38+
39+
1. Clone the CLP repository:
40+
41+
```bash
42+
git clone https://github.com/y-scope/clp.git
43+
```
44+
45+
2. Navigate to the `tools/deployment/presto-clp` directory in your terminal.
46+
3. Generate the necessary config for Presto to work with CLP:
47+
48+
```bash
49+
scripts/set-up-config.sh <clp-json-dir>
50+
```
51+
52+
* Replace `<clp-json-dir>` with the location of the clp-json package you set up in the previous
53+
section.
54+
55+
4. Configure Presto to use CLP's metadata database as follows:
56+
57+
* Open and edit `coordinator/config-template/metadata-filter.json`.
58+
* For each dataset you want to query, add a filter config of the form:
59+
60+
```json
61+
{
62+
"clp.default.<dataset>": [
63+
{
64+
"columnName": "<timestamp-key>",
65+
"rangeMapping": {
66+
"lowerBound": "begin_timestamp",
67+
"upperBound": "end_timestamp"
68+
},
69+
"required": false
70+
}
71+
]
72+
}
73+
```
74+
75+
* Replace `<dataset>` with the name of the dataset you want to query. (If you didn't specify a
76+
dataset when compressing your logs, they were compressed into the `default` dataset.)
77+
* Replace `<timestamp-key>` with the timestamp key you specified when compressing logs for
78+
this particular dataset.
79+
* The complete syntax for this file is documented [here][clp-connector-docs].
80+
81+
5. Start a Presto cluster by running:
82+
83+
```bash
84+
docker compose up
85+
```
86+
87+
* To use more than one Presto worker, you can use the `--scale` option as follows:
88+
89+
```bash
90+
docker compose up --scale presto-worker=<num-workers>
91+
```
92+
93+
* Replace `<num-workers>` with the number of Presto worker nodes you want to run.
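
The filter config from step 4 has the same shape for every dataset, so it can be generated rather than hand-edited when there are many datasets. The following is a minimal Python sketch, not part of the CLP tooling: `build_metadata_filter`, the dataset names, and the timestamp keys in the example are hypothetical, and the function only reproduces the JSON structure shown in step 4.

```python
import json


def build_metadata_filter(timestamp_keys):
    """Build a metadata-filter config with one timestamp range filter per dataset.

    timestamp_keys: dict mapping dataset name -> timestamp key used when the
    dataset's logs were compressed.
    """
    config = {}
    for dataset, timestamp_key in timestamp_keys.items():
        # Table names follow the `clp.default.<dataset>` form from step 4.
        config[f"clp.default.{dataset}"] = [
            {
                "columnName": timestamp_key,
                "rangeMapping": {
                    "lowerBound": "begin_timestamp",
                    "upperBound": "end_timestamp",
                },
                "required": False,
            }
        ]
    return config


if __name__ == "__main__":
    # Hypothetical datasets and timestamp keys; substitute your own.
    example = build_metadata_filter({"default": "timestamp", "web": "ts"})
    print(json.dumps(example, indent=2))
```

The printed JSON could then be copied into `coordinator/config-template/metadata-filter.json`. The connector documentation linked in step 4 remains the reference for the full syntax.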
94+
95+
### Stopping the Presto cluster
96+
97+
To stop the Presto cluster, use CTRL + C.
98+
99+
To clean up the Presto cluster entirely:
100+
101+
```bash
102+
docker compose rm
103+
```
104+
105+
## Querying your logs through Presto
106+
107+
To query your logs through Presto, you can use the Presto CLI:
108+
109+
```bash
110+
docker compose exec presto-coordinator \
111+
presto-cli \
112+
--catalog clp \
113+
--schema default
114+
```
115+
116+
Each dataset in CLP shows up as a table in Presto. To show all available datasets:
117+
118+
```sql
119+
SHOW TABLES;
120+
```
121+
122+
If you didn't specify a dataset when compressing your logs in CLP, your logs were stored
123+
in the `default` dataset. To query the logs in this dataset:
124+
125+
```sql
126+
SELECT * FROM default LIMIT 1;
127+
```
128+
129+
All kv-pairs in each log event can be queried directly using dot-notation. For example, if your logs
130+
contain the field `foo.bar`, you can query it using:
131+
132+
```sql
133+
SELECT foo.bar FROM default LIMIT 1;
134+
```
135+
136+
## Limitations
137+
138+
The Presto CLP integration has the following limitations at present:
139+
140+
* Nested fields containing special characters cannot be queried (see [y-scope/presto#8]). Allowed
141+
characters are alphanumeric characters and underscores. To get around this limitation, you'll
142+
need to preprocess your logs to remove any special characters.
143+
* Only logs stored on the filesystem, rather than S3, can be queried through Presto.
144+
145+
These limitations will be addressed in a future release of the Presto integration.
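
The preprocessing workaround for the first limitation could look like the following Python sketch. It is only an illustration under stated assumptions: `sanitize_keys` is a hypothetical helper, "alphanumeric" is taken to mean ASCII letters and digits, and the function rewrites keys at every nesting level of a parsed JSON log event, not only the nested ones.

```python
import re

# Anything other than ASCII letters, digits, and underscores is treated as "special".
_DISALLOWED = re.compile(r"[^A-Za-z0-9_]")


def sanitize_keys(value, replacement="_"):
    """Recursively replace disallowed characters in the keys of a parsed JSON value."""
    if isinstance(value, dict):
        return {
            _DISALLOWED.sub(replacement, key): sanitize_keys(child, replacement)
            for key, child in value.items()
        }
    if isinstance(value, list):
        return [sanitize_keys(item, replacement) for item in value]
    # Scalars (strings, numbers, booleans, None) are returned unchanged.
    return value
```

Passing `replacement=""` removes the special characters instead of replacing them. Either way, two distinct keys can collapse into one (for example `a-b` and `a_b`), so it is worth checking a dataset for such collisions before compressing it.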
146+
147+
[clp-connector-docs]: https://docs.yscope.com/presto/connector/clp.html#metadata-filter-config-file
148+
[clp-releases]: https://github.com/y-scope/clp/releases
149+
[docker-compose]: https://docs.docker.com/compose/install/
150+
[Docker]: https://docs.docker.com/engine/install/
151+
[postgresql]: https://zenodo.org/records/10516401
152+
[Presto]: https://prestodb.io/
153+
[y-scope/presto#8]: https://github.com/y-scope/presto/issues/8
154+
[yscope-presto]: https://github.com/y-scope/presto

docs/src/user-guide/index.md

Lines changed: 1 addition & 0 deletions
@@ -62,6 +62,7 @@ quick-start/clp-text
6262
guides-overview
6363
guides-using-object-storage/index
6464
guides-multi-node
65+
guides-using-presto
6566
:::
6667

6768
:::{toctree}

taskfiles/lint.yaml

Lines changed: 3 additions & 1 deletion
@@ -103,7 +103,8 @@ tasks:
103103
components/package-template/src/etc \
104104
docs \
105105
taskfile.yaml \
106-
taskfiles
106+
taskfiles \
107+
tools/deployment
107108
108109
check-cpp-format:
109110
sources: &cpp_source_files
@@ -772,6 +773,7 @@ tasks:
772773
- "components/clp-py-utils/clp_py_utils"
773774
- "components/core/tools/scripts/utils"
774775
- "components/job-orchestration/job_orchestration"
776+
- "tools/deployment"
775777
- "tools/scripts"
776778
- "docs/conf"
777779
cmd: |-
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
PRESTO_COORDINATOR_HTTPPORT=8080
2+
PRESTO_COORDINATOR_SERVICENAME=presto-coordinator
3+
4+
# node.properties
5+
PRESTO_COORDINATOR_NODEPROPERTIES_ENVIRONMENT=production
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
1+
# clp.properties
2+
PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_PROVIDER_TYPE=mysql
3+
PRESTO_COORDINATOR_CLPPROPERTIES_SPLIT_PROVIDER=mysql
4+
5+
# config.properties
6+
PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY=1GB
7+
PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY_PER_NODE=1GB
8+
9+
# jvm.config
10+
PRESTO_COORDINATOR_CONFIG_JVMCONFIG_MAXHEAPSIZE=4G
11+
PRESTO_COORDINATOR_JVMCONFIG_G1HEAPREGIONSIZE=32M
12+
13+
# log.properties
14+
PRESTO_COORDINATOR_LOGPROPERTIES_LEVEL=INFO
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
connector.name=clp
2+
clp.metadata-provider-type=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_PROVIDER_TYPE}
3+
clp.metadata-db-url=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_URL}
4+
clp.metadata-db-name=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_NAME}
5+
clp.metadata-db-user=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_USER}
6+
clp.metadata-db-password=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_PASSWORD}
7+
clp.metadata-table-prefix=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_TABLE_PREFIX}
8+
clp.split-provider-type=${PRESTO_COORDINATOR_CLPPROPERTIES_SPLIT_PROVIDER}
9+
clp.metadata-filter-config=/opt/presto-server/etc/metadata-filter.json
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
coordinator=true
2+
node-scheduler.include-coordinator=false
3+
http-server.http.port=${PRESTO_COORDINATOR_HTTPPORT}
4+
query.max-memory=${PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY}
5+
query.max-memory-per-node=${PRESTO_COORDINATOR_CONFIGPROPERTIES_QUERY_MAX_MEMORY_PER_NODE}
6+
discovery-server.enabled=true
7+
discovery.uri=${PRESTO_COORDINATOR_CONFIGPROPERTIES_DISCOVERY_URI}
8+
optimizer.optimize-hash-generation=false
9+
regex-library=RE2J
10+
use-alternative-function-signatures=true
11+
inline-sql-functions=false
12+
nested-data-serialization-enabled=false
13+
native-execution-enabled=true
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
-server
2+
-Xmx${PRESTO_COORDINATOR_CONFIG_JVMCONFIG_MAXHEAPSIZE}
3+
-XX:+UseG1GC
4+
-XX:G1HeapRegionSize=${PRESTO_COORDINATOR_JVMCONFIG_G1HEAPREGIONSIZE}
5+
-XX:+UseGCOverheadLimit
6+
-XX:+ExplicitGCInvokesConcurrent
7+
-XX:+HeapDumpOnOutOfMemoryError
8+
-XX:+ExitOnOutOfMemoryError
9+
-Djdk.attach.allowAttachSelf=true
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
com.facebook.presto=${PRESTO_COORDINATOR_LOGPROPERTIES_LEVEL}
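
The templates above reference settings through `${NAME}` placeholders, with defaults coming from `KEY=VALUE` files like the two env files earlier in this diff. How the deployment actually renders the templates is not shown here, so the sketch below only illustrates the placeholder convention using Python's standard `string.Template`. `parse_env` and `render` are hypothetical helpers, not functions from the CLP repository.

```python
from string import Template


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def render(template_text, env):
    """Replace ${NAME} placeholders; raises KeyError if a variable is undefined."""
    return Template(template_text).substitute(env)


env = parse_env(
    """
    # clp.properties
    PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_PROVIDER_TYPE=mysql
    PRESTO_COORDINATOR_CLPPROPERTIES_SPLIT_PROVIDER=mysql
    """
)
template = (
    "connector.name=clp\n"
    "clp.metadata-provider-type=${PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_PROVIDER_TYPE}\n"
    "clp.split-provider-type=${PRESTO_COORDINATOR_CLPPROPERTIES_SPLIT_PROVIDER}\n"
)
print(render(template, env))
```

Variables such as `PRESTO_COORDINATOR_CLPPROPERTIES_METADATA_DATABASE_URL` are not defined in the env files shown in this diff, so they are presumably supplied elsewhere. With a strict `substitute` like the one above, a missing variable fails loudly instead of leaving a literal `${...}` in the rendered file.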
