Commit b4c5100

Update automating_json_log_loading_with_vector.md
1 parent 8f6cd48 commit b4c5100


docs/en/tutorials/load/automating_json_log_loading_with_vector.md

Lines changed: 104 additions & 0 deletions
@@ -91,9 +91,113 @@ aws s3 ls s3://databend-doc/logs/
If the log file has been successfully synced to S3, you should see output similar to this:

```bash
2024-12-10 15:22:13 0
2024-12-10 17:52:42 112 1733871161-7b89e50a-6eb4-4531-8479-dd46981e4674.log.gz
```

You can now download the synced log file from your bucket:

```bash
aws s3 cp s3://databend-doc/logs/1733871161-7b89e50a-6eb4-4531-8479-dd46981e4674.log.gz ~/Documents/
```

Compared to the original log, the synced log is in NDJSON format, with each record wrapped in an outer `log` field:

```json
{"log":{"event":"login","timestamp":"2024-12-08T10:00:00Z","user_id":1}}
{"log":{"event":"purchase","timestamp":"2024-12-08T10:05:00Z","user_id":2}}
```
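If you want to double-check the contents locally, you can decompress the downloaded file and pretty-print each record. This is just a quick sanity check and assumes `gzip` and `jq` are installed; adjust the path if you saved the file elsewhere:

```bash
# Decompress the synced file and pretty-print each NDJSON record
gunzip -c ~/Documents/1733871161-7b89e50a-6eb4-4531-8479-dd46981e4674.log.gz | jq .
```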

## Step 4: Create Task in Databend Cloud

1. Open a worksheet, and create an external stage that links to the `logs` folder in your bucket:

```sql
CREATE STAGE mylog 's3://databend-doc/logs/' CONNECTION=(
  ACCESS_KEY_ID = '<your-access-key-id>',
  SECRET_ACCESS_KEY = '<your-secret-access-key>'
);
```

Once the stage is successfully created, you can list the files in it:

```sql
LIST @mylog;

┌────────────────────────────────────────────────────────┬──────┬────────────────────────────────────┬───────────────────────────────┬─────────┐
│ name                                                   │ size │ md5                                │ last_modified                 │ creator │
├────────────────────────────────────────────────────────┼──────┼────────────────────────────────────┼───────────────────────────────┼─────────┤
│ 1733871161-7b89e50a-6eb4-4531-8479-dd46981e4674.log.gz │  112 │ "231ddcc590222bfaabd296b151154844" │ 2024-12-10 22:52:42.000 +0000 │ NULL    │
└────────────────────────────────────────────────────────┴──────┴────────────────────────────────────┴───────────────────────────────┴─────────┘
```

2. Create a table with columns mapped to the fields in the log:

```sql
CREATE TABLE logs (
  event String,
  timestamp Timestamp,
  user_id Int32
);
```
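Before wiring up the task in the next step, you can optionally verify that the staged NDJSON records map onto these columns by querying the stage directly. The sketch below follows Databend's staged-file query syntax; the `ndjson_gz` file format name is just an illustrative choice, and its options mirror the ones the task uses later:

```sql
-- Illustrative named file format matching the task settings (NDJSON, gzip-compressed files)
CREATE FILE FORMAT IF NOT EXISTS ndjson_gz TYPE = NDJSON COMPRESSION = AUTO;

-- Preview how the wrapped records parse into event, timestamp, and user_id
SELECT $1:log:event, $1:log:timestamp, $1:log:user_id
FROM @mylog/ (FILE_FORMAT => 'ndjson_gz')
LIMIT 10;
```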

3. Create a scheduled task to load logs from the external stage into the `logs` table:

```sql
CREATE TASK IF NOT EXISTS myvectortask
  WAREHOUSE = 'eric'
  SCHEDULE = 1 MINUTE
  SUSPEND_TASK_AFTER_NUM_FAILURES = 3
AS
COPY INTO logs
FROM (
  SELECT $1:log:event, $1:log:timestamp, $1:log:user_id
  FROM @mylog/
)
FILE_FORMAT = (TYPE = NDJSON, COMPRESSION = AUTO)
MAX_FILES = 10000
PURGE = TRUE;
```

4. Start the task:

```sql
ALTER TASK myvectortask RESUME;
```
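Optionally, you can confirm that the task has been created and resumed. A minimal check, assuming your Databend edition exposes `SHOW TASKS`:

```sql
-- myvectortask should be listed with a 1-minute schedule and a non-suspended state
SHOW TASKS;
```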

Wait for a moment, then check if the logs have been loaded into the table:

```sql
SELECT * FROM logs;

┌───────────┬─────────────────────┬─────────┐
│ event     │ timestamp           │ user_id │
├───────────┼─────────────────────┼─────────┤
│ login     │ 2024-12-08 10:00:00 │       1 │
│ purchase  │ 2024-12-08 10:05:00 │       2 │
└───────────┴─────────────────────┴─────────┘
```

If you run `LIST @mylog;` now, you will see no files listed. This is because the task is configured with `PURGE = TRUE`, which deletes the synced files from S3 after the logs are loaded.

Now, let's simulate generating two more logs in the local log file `app.log`:

```bash
echo '{"user_id": 3, "event": "logout", "timestamp": "2024-12-08T10:10:00Z"}' >> /Users/eric/Documents/logs/app.log
echo '{"user_id": 4, "event": "login", "timestamp": "2024-12-08T10:15:00Z"}' >> /Users/eric/Documents/logs/app.log
```

Wait for a moment for the log to sync to S3 (a new file should appear in the logs folder). The scheduled task will then load the new logs into the table. If you query the table again, you will find these logs:

```sql
SELECT * FROM logs;

┌───────────┬─────────────────────┬─────────┐
│ event     │ timestamp           │ user_id │
├───────────┼─────────────────────┼─────────┤
│ logout    │ 2024-12-08 10:10:00 │       3 │
│ login     │ 2024-12-08 10:15:00 │       4 │
│ login     │ 2024-12-08 10:00:00 │       1 │
│ purchase  │ 2024-12-08 10:05:00 │       2 │
└───────────┴─────────────────────┴─────────┘
```
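When you are done experimenting, you may want to stop the scheduled run so the task no longer polls the stage every minute. A small cleanup step mirroring the `RESUME` used earlier:

```sql
-- Pause the task; it can be re-enabled later with ALTER TASK myvectortask RESUME;
ALTER TASK myvectortask SUSPEND;
```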
