
Commit a07fc65 ("first commit", 0 parents)

17 files changed: +1650, -0 lines

.github/workflows/ci.yml

Lines changed: 49 additions & 0 deletions
```yaml
name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        php: [ '8.2', '8.3' ]

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup PHP
        uses: shivammathur/setup-php@v2
        with:
          php-version: ${{ matrix.php }}
          coverage: none
          extensions: sqlite3, pdo_sqlite
          ini-values: memory_limit=-1

      - name: Validate composer.json and composer.lock
        run: composer validate --no-check-publish

      - name: Cache Composer packages
        uses: actions/cache@v4
        with:
          path: |
            ~/.composer/cache/files
            vendor
          key: ${{ runner.os }}-php-${{ matrix.php }}-composer-${{ hashFiles('**/composer.lock') }}
          restore-keys: |
            ${{ runner.os }}-php-${{ matrix.php }}-composer-

      - name: Install dependencies
        run: |
          composer install --no-interaction --no-progress --prefer-dist
          python3 -m pip install --upgrade pip
          python3 -m pip install pyarrow

      - name: Run test suite
        run: vendor/bin/phpunit --no-coverage
```

README.md

Lines changed: 117 additions & 0 deletions
# ParqBridge

Export your Laravel database tables to real Apache Parquet files on any Storage disk (local, S3, etc.) with a simple artisan command.

ParqBridge focuses on zero PHP dependency bloat while still producing spec-compliant Parquet files by delegating the final write step to a tiny, embedded Python script using PyArrow (or any custom CLI you prefer). You keep the full Laravel DX for configuration and Storage; we bridge your data to Parquet.

## Installation

- Require the package in your app (path repo or VCS):

```bash
composer require dgtlss/parqbridge
```

- Laravel will auto-discover the service provider. Alternatively, register `ParqBridge\ParqBridgeServiceProvider` manually.

- Publish the config if you want to customize the defaults:

```bash
php artisan vendor:publish --tag="parqbridge-config"
```

## Configuration

Set your export disk and options in `.env` or `config/parqbridge.php`.

- `PARQUET_DISK`: which filesystem disk to use (e.g., `s3`, `local`).
- `PARQUET_OUTPUT_DIR`: directory prefix within the disk (default `parquet-exports`).
- `PARQUET_CHUNK_SIZE`: rows per DB chunk when exporting (default 1000).
- `PARQUET_INFERENCE`: `database|sample|hybrid` (default `hybrid`).
- `PARQUET_COMPRESSION`: compression codec for Parquet (`UNCOMPRESSED`/`NONE`, `SNAPPY`, `GZIP`, `ZSTD`, `BROTLI`, `LZ4_RAW`) when using the PyArrow backend.
- `PARQBRIDGE_WRITER`: `pyarrow` (default) or `custom`. If `custom`, set `PARQBRIDGE_CUSTOM_CMD`.
- `PARQBRIDGE_PYTHON`: Python executable for PyArrow (default `python3`).

Example `.env`:

```ini
PARQUET_DISK=s3
PARQUET_OUTPUT_DIR=parquet-exports
PARQUET_CHUNK_SIZE=2000
```

Ensure your `filesystems` disk is configured (e.g., `s3`) in `config/filesystems.php`.

## Usage

- List tables:

```bash
php artisan parqbridge:tables
```

- Export a table to the configured disk:

```bash
php artisan parqbridge:export users --where="active = 1" --limit=1000 --output="parquet-exports" --disk=s3
```

On success, the command prints the full path written within the disk. Files are named `{table}-{YYYYMMDD_HHMMSS}.parquet`.
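The naming scheme above can be illustrated with a short sketch (a hypothetical helper for illustration; the package builds the name internally):

```python
from datetime import datetime


def parquet_filename(table: str, when: datetime) -> str:
    """Build a {table}-{YYYYMMDD_HHMMSS}.parquet file name."""
    return f"{table}-{when.strftime('%Y%m%d_%H%M%S')}.parquet"


print(parquet_filename("users", datetime(2024, 5, 1, 13, 30, 5)))
# -> users-20240501_133005.parquet
```

The timestamp component makes repeated exports of the same table non-colliding within one output directory.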

- Export ALL tables into one folder (timestamped subfolder inside `parqbridge.output_directory`):

```bash
php artisan parqbridge:export-all --disk=s3 --output="parquet-exports" --exclude=migrations,password_resets
```

Options:
- `--include=`: comma-separated allowlist of table names
- `--exclude=`: comma-separated denylist of table names
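The allowlist/denylist semantics can be modeled with a minimal sketch (an illustrative re-statement of the filtering, not the package's actual code):

```python
def filter_tables(tables, include=None, exclude=None):
    """Apply the allowlist first, then the denylist, preserving order."""
    if include:
        tables = [t for t in tables if t in include]
    if exclude:
        tables = [t for t in tables if t not in exclude]
    return tables


print(filter_tables(["users", "migrations", "orders"], exclude=["migrations"]))
# -> ['users', 'orders']
```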
## Data types

The schema inferrer maps common DB types to a set of Parquet primitive types and logical annotations. With the PyArrow backend, an Arrow schema is constructed so the types are written faithfully:

- Primitive: `BOOLEAN`, `INT32`, `INT64`, `FLOAT`, `DOUBLE`, `BYTE_ARRAY`, `FIXED_LEN_BYTE_ARRAY`
- Logical: `UTF8`, `DATE`, `TIME_MILLIS`, `TIME_MICROS`, `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, `DECIMAL`

For decimals we write Arrow decimal types (`decimal128`/`decimal256`) with the declared `precision`/`scale`.
## Testing

Run the test suite:

```bash
composer install
vendor/bin/phpunit
```

The tests bootstrap a minimal container, create a SQLite database, and verify that:
- listing tables works on SQLite
- exporting a table writes a Parquet file to the configured disk (magic `PAR1`)
- schema inference on SQLite maps the major type families
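The `PAR1` magic check mentioned above can be sketched like this (an illustrative helper, not the package's test code): a Parquet file begins and ends with the 4-byte magic `PAR1`.

```python
def looks_like_parquet(data: bytes) -> bool:
    """A Parquet file starts and ends with the 4-byte magic b'PAR1'."""
    return len(data) >= 8 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"


print(looks_like_parquet(b"PAR1" + b"\x00" * 16 + b"PAR1"))
# -> True
```

This is a cheap sanity check only; it does not validate the footer metadata, just the framing.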
## Backend requirements

- By default ParqBridge uses Python + PyArrow. Ensure `python3` is available and install PyArrow:

```bash
python3 -m pip install --upgrade pip
python3 -m pip install pyarrow
```

- Alternatively, set a custom converter command via `PARQBRIDGE_WRITER=custom` and `PARQBRIDGE_CUSTOM_CMD` (the command must read the `{input}` CSV and write the `{output}` Parquet file).

You can automate the setup via the included command:

```bash
php artisan parqbridge:setup --write-env
```

Options:
- `--python=`: path/name of Python (default from config `parqbridge.pyarrow_python`)
- `--venv=`: location for the virtualenv (default `./parqbridge-venv`)
- `--no-venv`: install into the global Python instead of a venv
- `--write-env`: append `PARQBRIDGE_PYTHON` and `PARQBRIDGE_WRITER` to `.env`
- `--upgrade`: upgrade pip first
- `--dry-run`: print commands without executing

composer.json

Lines changed: 38 additions & 0 deletions
```json
{
    "name": "dgtlss/parqbridge",
    "description": "Export Laravel database tables to Parquet files using Storage disks (no external deps).",
    "type": "library",
    "license": "MIT",
    "authors": [
        { "name": "ParqBridge", "email": "dev@example.com" }
    ],
    "require": {
        "php": ">=8.2"
    },
    "require-dev": {
        "phpunit/phpunit": "^10.5",
        "illuminate/support": "^12.0",
        "illuminate/container": "^12.0",
        "illuminate/config": "^12.0",
        "illuminate/filesystem": "^12.0",
        "illuminate/database": "^12.0",
        "illuminate/console": "^12.0"
    },
    "autoload": {
        "psr-4": {
            "ParqBridge\\": "src/"
        }
    },
    "autoload-dev": {
        "psr-4": {
            "ParqBridge\\Tests\\": "tests/"
        }
    },
    "extra": {
        "laravel": {
            "providers": [
                "ParqBridge\\ParqBridgeServiceProvider"
            ]
        }
    }
}
```

config/parqbridge.php

Lines changed: 95 additions & 0 deletions
```php
<?php

return [
    /*
    |--------------------------------------------------------------------------
    | Export Disk
    |--------------------------------------------------------------------------
    | The filesystem disk where Parquet files will be written. This uses
    | Laravel's Storage facade under the hood, so any disk configured in
    | config/filesystems.php is supported (e.g., "local", "s3").
    |
    | .env: PARQUET_DISK=s3
    */
    'disk' => env('PARQUET_DISK', env('FILESYSTEM_DISK', 'local')),

    /*
    |--------------------------------------------------------------------------
    | Output Directory
    |--------------------------------------------------------------------------
    | Directory path prefix inside the selected disk. The final path will be
    | {output_directory}/{table}-{timestamp}.parquet
    |
    | .env: PARQUET_OUTPUT_DIR=parquet-exports
    */
    'output_directory' => env('PARQUET_OUTPUT_DIR', 'parquet-exports'),

    /*
    |--------------------------------------------------------------------------
    | Chunk Size
    |--------------------------------------------------------------------------
    | Number of rows fetched per chunk when streaming data out of the database.
    | Larger chunks are faster but use more memory.
    |
    | .env: PARQUET_CHUNK_SIZE=1000
    */
    'chunk_size' => (int) env('PARQUET_CHUNK_SIZE', 1000),

    /*
    |--------------------------------------------------------------------------
    | Date/Time Formatting for Fallbacks
    |--------------------------------------------------------------------------
    | When a database driver returns date/time types as strings or DateTime,
    | these formats are used for the Parquet logical annotations we emit.
    | You usually don't need to change these.
    */
    'date_format' => 'Y-m-d',
    'datetime_format' => \DateTimeInterface::ATOM,
    'time_format' => 'H:i:s',

    /*
    |--------------------------------------------------------------------------
    | Schema Inference Strategy
    |--------------------------------------------------------------------------
    | "database" will use the database column types from the schema to choose
    | Parquet primitive/logical types. "sample" will inspect the first chunk
    | of data to refine types (e.g., booleans stored as tinyint(1)).
    | Options: database | sample | hybrid
    */
    'inference' => env('PARQUET_INFERENCE', 'hybrid'),

    /*
    |--------------------------------------------------------------------------
    | Compression
    |--------------------------------------------------------------------------
    | Compression codec for Parquet files. When using the PyArrow backend you
    | may choose from: NONE (alias UNCOMPRESSED), SNAPPY, GZIP, ZSTD, BROTLI,
    | LZ4_RAW. Default is UNCOMPRESSED.
    */
    'compression' => env('PARQUET_COMPRESSION', 'UNCOMPRESSED'),

    /*
    |--------------------------------------------------------------------------
    | Writer Backend
    |--------------------------------------------------------------------------
    | Controls how ParqBridge produces Apache Parquet files.
    | - pyarrow: Uses Python + PyArrow (requires `python3` with `pyarrow` installed)
    | - custom: Uses a custom shell command template provided below
    |
    | .env: PARQBRIDGE_WRITER=pyarrow
    */
    'writer' => env('PARQBRIDGE_WRITER', 'pyarrow'),

    /*
    | Python executable name/path for the PyArrow backend. E.g., python3 or /usr/bin/python3
    | .env: PARQBRIDGE_PYTHON=python3
    */
    'pyarrow_python' => env('PARQBRIDGE_PYTHON', 'python3'),

    /*
    | Custom command template when writer=custom. Use {input} and {output} placeholders.
    | Example (DuckDB CLI): duckdb -c "COPY (SELECT * FROM read_csv_auto({input})) TO {output} (FORMAT PARQUET)"
    | .env: PARQBRIDGE_CUSTOM_CMD="duckdb -c \"COPY (SELECT * FROM read_csv_auto({input})) TO {output} (FORMAT PARQUET)\""
    */
    'custom_command' => env('PARQBRIDGE_CUSTOM_CMD', ''),
];
```
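The `{input}`/`{output}` placeholder substitution described for `custom_command` can be modeled with a short sketch (illustrative only; the package performs this replacement internally before shelling out):

```python
def render_command(template: str, input_path: str, output_path: str) -> str:
    """Replace the {input}/{output} placeholders in a custom writer template.

    Real code should also shell-escape the paths before executing the result.
    """
    return template.replace("{input}", input_path).replace("{output}", output_path)


print(render_command("converter {input} {output}", "/tmp/users.csv", "/tmp/users.parquet"))
# -> converter /tmp/users.csv /tmp/users.parquet
```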

phpunit.xml

Lines changed: 13 additions & 0 deletions
```xml
<?xml version="1.0" encoding="UTF-8"?>
<phpunit colors="true" bootstrap="tests/bootstrap.php" cacheResultFile=".phpunit.result.cache">
    <testsuites>
        <testsuite name="ParqBridge Test Suite">
            <directory suffix="Test.php">tests</directory>
        </testsuite>
    </testsuites>
    <coverage processUncoveredFiles="true">
        <include>
            <directory suffix=".php">src</directory>
        </include>
    </coverage>
</phpunit>
```
src/Console/ExportAllTablesCommand.php

Lines changed: 76 additions & 0 deletions

```php
<?php

namespace ParqBridge\Console;

use Illuminate\Console\Command;
use Illuminate\Support\Facades\DB;

class ExportAllTablesCommand extends Command
{
    protected $signature = 'parqbridge:export-all {--disk=} {--output=} {--exclude=} {--include=}';
    protected $description = 'Export all database tables to Parquet files into a single folder on the chosen disk.';

    public function handle(): int
    {
        $disk = (string) ($this->option('disk') ?: config('parqbridge.disk'));
        $rootOutput = (string) ($this->option('output') ?: config('parqbridge.output_directory'));

        $include = $this->parseCsvOption('include');
        $exclude = $this->parseCsvOption('exclude');

        $tables = $this->getTables();
        if (!empty($include)) {
            $tables = array_values(array_intersect($tables, $include));
        }
        if (!empty($exclude)) {
            $tables = array_values(array_diff($tables, $exclude));
        }

        if (empty($tables)) {
            $this->warn('No tables to export.');
            return self::SUCCESS;
        }

        $subdir = now()->format('Ymd_His');
        $finalOutput = trim($rootOutput, '/').'/'.$subdir;

        $this->info('Exporting '.count($tables).' tables to folder: '.$finalOutput.' on disk '.$disk);

        $ok = 0;
        $fail = 0;
        foreach ($tables as $t) {
            $exit = $this->call('parqbridge:export', [
                'table' => $t,
                '--output' => $finalOutput,
                '--disk' => $disk,
            ]);
            if ($exit === self::SUCCESS) {
                $ok++;
            } else {
                $fail++;
            }
        }

        $this->line("Completed. Success: {$ok}, Failed: {$fail}. Folder: {$finalOutput}");
        $this->line($finalOutput);
        return $fail === 0 ? self::SUCCESS : self::FAILURE;
    }

    private function parseCsvOption(string $name): array
    {
        $raw = (string) ($this->option($name) ?: '');
        if ($raw === '') {
            return [];
        }

        return array_values(array_filter(
            array_map(fn ($v) => trim($v), explode(',', $raw)),
            fn ($v) => $v !== ''
        ));
    }

    private function getTables(): array
    {
        $driver = DB::getDriverName();

        return match ($driver) {
            'mysql', 'mariadb' => collect(DB::select('SHOW TABLES'))->map(fn ($r) => array_values((array) $r)[0])->all(),
            'pgsql' => collect(DB::select("SELECT tablename FROM pg_tables WHERE schemaname = 'public'"))->pluck('tablename')->all(),
            'sqlite' => collect(DB::select("SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%'"))->pluck('name')->all(),
            'sqlsrv' => collect(DB::select("SELECT table_name FROM information_schema.tables WHERE table_type = 'BASE TABLE'"))->pluck('table_name')->all(),
            default => throw new \RuntimeException("Unsupported driver: {$driver}"),
        };
    }
}
```
