Commit 9f8ca0d

feat: initial implementation of merge-streams library
- Added package.json with metadata, dependencies, and scripts.
- Implemented core merging functions for Apache Arrow, CSV, and JSON_ARRAY formats.
- Created a unified API for merging multiple data streams.
- Developed utility functions for handling streams and HTTP requests.
- Added integration tests for merging streams from Databricks external links.
- Implemented unit tests for merging Arrow, CSV, and JSON streams.
- Configured TypeScript with strict settings and output directory.

19 files changed, +3447 −0 lines changed

.github/release-drafter.yml

Lines changed: 40 additions & 0 deletions

```yaml
name-template: 'v$RESOLVED_VERSION'
tag-template: 'v$RESOLVED_VERSION'
categories:
  - title: 'Features'
    labels:
      - 'feature'
      - 'enhancement'
  - title: 'Bug Fixes'
    labels:
      - 'bug'
      - 'fix'
  - title: 'Documentation'
    labels:
      - 'documentation'
      - 'docs'
  - title: 'Dependencies'
    labels:
      - 'dependencies'
change-template: '- $TITLE @$AUTHOR (#$NUMBER)'
change-title-escapes: '\<*_&'
version-resolver:
  major:
    labels:
      - 'major'
  minor:
    labels:
      - 'minor'
      - 'feature'
  patch:
    labels:
      - 'patch'
      - 'bug'
      - 'fix'
  default: patch
template: |
  ## Changes

  $CHANGES

  **Full Changelog**: https://github.com/$OWNER/$REPOSITORY/compare/$PREVIOUS_TAG...v$RESOLVED_VERSION
```

.github/workflows/publish.yml

Lines changed: 28 additions & 0 deletions

```yaml
name: Publish to npm

on:
  release:
    types:
      - published

jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          registry-url: 'https://registry.npmjs.org'

      - run: npm ci

      - run: npm test

      - run: npm run build

      - run: npm publish --provenance --access public
```
Lines changed: 25 additions & 0 deletions

```yaml
name: Release Drafter

on:
  push:
    branches:
      - main
  pull_request:
    types:
      - opened
      - reopened
      - synchronize

permissions:
  contents: read

jobs:
  update_release_draft:
    permissions:
      contents: write
      pull-requests: write
    runs-on: ubuntu-latest
    steps:
      - uses: release-drafter/release-drafter@v6
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

.gitignore

Lines changed: 7 additions & 0 deletions

```
node_modules/
dist/
*.log
.DS_Store
coverage/
.env
.env.*
```

LICENSE

Lines changed: 21 additions & 0 deletions

```
MIT License

Copyright (c) 2024

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

README.md

Lines changed: 149 additions & 0 deletions

# merge-streams

**When Databricks gives you 90+ presigned URLs, merge them into one.**

> *Because nobody wants to explain to their MCP client why it needs to juggle dozens of chunk URLs.*

---

## Why I Made This

I was building an MCP Server that queries Databricks SQL for large datasets. I chose the External Links format because INLINE would blow up memory.

But then Databricks handed me back something like this:

```
chunk_0.arrow (presigned URL)
chunk_1.arrow (presigned URL)
chunk_2.arrow (presigned URL)
...
chunk_89.arrow (presigned URL)
```

My client would have to:

1. Fetch each chunk sequentially
2. Parse and merge them correctly (CSV headers? JSON array brackets? Arrow EOS markers?)
3. Handle errors across 90 HTTP requests
4. Pray nothing times out

That was unacceptable. So I built this.

---

## The Solution

`merge-streams` takes those chunked External Links and merges them into a single, unified stream.

```
90+ presigned URLs → merge-streams → 1 clean stream → S3 → 1 presigned URL
```

Now my MCP client gets one URL. Done.

### What Makes It Fast

- **Pre-connected**: the next chunk's connection opens while the current chunk streams. No idle time.
- **Zero accumulation**: pure stream piping. Memory stays flat regardless of data size.
- **Format-aware**: not byte concatenation — actual format understanding.

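The pre-connection idea can be sketched with a small async generator — an illustrative model only, not the library's actual internals, and `prefetched` is a name assumed here:

```ts
// Sketch: while the consumer is draining chunk i, the factory for
// chunk i+1 has already been invoked, so its connection is opening
// in the background instead of waiting its turn.
async function* prefetched<T>(sources: Array<() => Promise<T>>): AsyncGenerator<T> {
  let next: Promise<T> | undefined = sources.length > 0 ? sources[0]() : undefined
  for (let i = 0; i < sources.length; i++) {
    const current = next as Promise<T>
    // Kick off the following chunk before consuming the current one.
    next = i + 1 < sources.length ? sources[i + 1]() : undefined
    yield await current
  }
}
```

The same pattern applies whether `T` is a `Readable`, a `Response`, or a parsed batch: the factories stay lazy, so at most one chunk is "in flight ahead" at a time and memory stays bounded.
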
---
## Features

- **CSV**: automatically deduplicates headers across chunks
- **JSON_ARRAY**: properly concatenates JSON arrays (handles brackets and commas)
- **ARROW_STREAM**: merges Arrow IPC streams batch by batch (doesn't just byte-concat)
- **Memory-efficient**: streaming-based, never loads entire files into memory
- **AbortSignal support**: cancel mid-stream when needed

---
## Installation

```bash
npm install merge-streams
```

Requires Node.js 18+ (uses native `fetch()` and `Readable.fromWeb()`).

---
## Quick Start: The Databricks Use Case

See [test/databricks.spec.ts](test/databricks.spec.ts) for a complete working example.

```bash
# Run the integration test
DATABRICKS_TOKEN=dapi... \
DATABRICKS_HOST=xxx.cloud.databricks.com \
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/xxx \
npm test -- test/databricks.spec.ts
```

---
## API

### URL-based (for Databricks External Links)

```ts
import { mergeStreamsFromUrls } from 'merge-streams'

await mergeStreamsFromUrls('CSV', urls, outputStream)
await mergeStreamsFromUrls('JSON_ARRAY', urls, outputStream)
await mergeStreamsFromUrls('ARROW_STREAM', urls, outputStream)
```

### Options

```ts
const controller = new AbortController()

await mergeCsvFromUrls(urls, output, { signal: controller.signal })

// Cancel anytime
controller.abort()
```

---
## Format Details

| Format | Databricks name | Behavior |
|--------|-----------------|----------|
| CSV | `CSV` | Writes header once, skips duplicate headers from subsequent chunks |
| JSON_ARRAY | `JSON_ARRAY` | Wraps output in `[]`, strips brackets from chunks, inserts commas |
| ARROW_STREAM | `ARROW_STREAM` | Re-encodes RecordBatches into a single IPC stream (not byte-concat) |

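The CSV rule is the simplest to picture. As a non-streaming illustration (the helper name `mergeCsvChunks` is assumed here, not part of the library's API — the real merger applies the same rule incrementally over streams):

```ts
// Keep the header line from the first chunk; for every later chunk,
// drop its first line (the duplicate header) and append the rest.
function mergeCsvChunks(chunks: string[]): string {
  return chunks
    .map((chunk, i) => (i === 0 ? chunk : chunk.slice(chunk.indexOf('\n') + 1)))
    .join('')
}
```
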
---
## Types

```ts
type InputSource = Readable | (() => Readable) | (() => Promise<Readable>)
type MergeFormat = 'ARROW_STREAM' | 'CSV' | 'JSON_ARRAY'
```

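A union like `InputSource` is typically normalized to a plain `Readable` before merging. A sketch under the assumption of a helper named `toReadable` (not necessarily how the library does it internally):

```ts
import { Readable } from 'node:stream'

type InputSource = Readable | (() => Readable) | (() => Promise<Readable>)

// Accept a ready stream, a sync factory, or an async factory,
// and hand back a Readable either way.
async function toReadable(source: InputSource): Promise<Readable> {
  if (source instanceof Readable) return source
  return await source() // awaiting a non-Promise is a no-op, so both factories work
}
```

Factories matter for the presigned-URL case: they let the merger defer each `fetch()` until that chunk's turn, instead of opening all 90 connections up front.
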
---
## Why Not Just Byte-Concatenate?

- **CSV**: you'd get duplicate headers scattered throughout
- **JSON_ARRAY**: `[1,2][3,4]` is not valid JSON
- **Arrow**: most Arrow readers stop at the first EOS marker

Each format needs format-aware merging. That's what this library does.

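The JSON_ARRAY case makes the point concretely. A whole-string sketch of the repair (the real merger does this incrementally; `mergeJsonArrayChunks` is a hypothetical helper, not the library's API):

```ts
// '[1,2]' + '[3,4]' byte-concatenated is '[1,2][3,4]' — invalid JSON.
// A format-aware merge strips each chunk's outer brackets, drops empty
// chunks, and rejoins the elements inside a single pair of brackets.
function mergeJsonArrayChunks(chunks: string[]): string {
  const inner = chunks
    .map((c) => c.trim().replace(/^\[/, '').replace(/\]$/, ''))
    .filter((c) => c.length > 0)
  return `[${inner.join(',')}]`
}
```
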
---
## Scope

This library was born from a specific pain point: making Databricks External Links usable in MCP Server development. It does that one thing well.

If you have other use cases in mind, PRs are welcome.

---

## License

MIT
