
Commit ea2e21a: 0.3.0
1 parent 34033a2 commit ea2e21a

File tree

9 files changed: +131 -198 lines changed

Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

README.md

Lines changed: 20 additions & 98 deletions
@@ -32,14 +32,28 @@ The function will partition the query by **evenly** splitting the specified colu
 ConnectorX will assign one thread for each partition to load and write data in parallel.
 Currently, we support partitioning on **numerical** columns (**cannot contain NULL**) for **SPJA** queries.
 
-Check out more detailed usage and examples [here](#detailed-usage-and-examples). A general introduction of the project can be found in this [blog post](https://towardsdatascience.com/connectorx-the-fastest-way-to-load-data-from-databases-a65d4d4062d5).
+We now provide federated query support (experimental; PostgreSQL only, and partitioning is not supported yet): you can write a single query that joins tables from two Postgres databases. A JRE (Java Runtime Environment) is required.
+
+```python
+import connectorx as cx
+
+db1 = "postgresql://username1:password1@server1:port1/database1"
+db2 = "postgresql://username2:password2@server2:port2/database2"
+
+cx.read_sql({"db1": db1, "db2": db2}, "SELECT * FROM db1.nation n, db2.region r WHERE n.n_regionkey = r.r_regionkey")
+```
+
+Check out more detailed usage and examples [here](https://sfu-db.github.io/connector-x/api.html). A general introduction of the project can be found in this [blog post](https://towardsdatascience.com/connectorx-the-fastest-way-to-load-data-from-databases-a65d4d4062d5).
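For intuition, the "evenly splitting the partition column" behavior described at the top of this README can be sketched in plain Python. This is an illustration only (ConnectorX's real implementation is in Rust), and the table and column names are taken from the usage examples, not from any real schema:

```python
# Illustrative sketch: evenly split the partition column's value range so
# each partition becomes one range-predicated query, run by one thread.
def split_query(query: str, column: str, lo: int, hi: int, num: int) -> list:
    step, rem = divmod(hi - lo + 1, num)
    queries, start = [], lo
    for i in range(num):
        # distribute the remainder over the first `rem` partitions
        end = start + step + (1 if i < rem else 0) - 1
        queries.append(
            f"SELECT * FROM ({query}) AS t WHERE t.{column} BETWEEN {start} AND {end}"
        )
        start = end + 1
    return queries

for q in split_query("SELECT * FROM lineitem", "l_orderkey", 1, 100, 4):
    print(q)
```

With a range of 1 to 100 and four partitions, this yields the predicates `BETWEEN 1 AND 25`, `26 AND 50`, `51 AND 75`, and `76 AND 100`.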
 
 # Installation
 
 ```bash
 pip install connectorx
 ```
 
+Check out [here](https://sfu-db.github.io/connector-x/install.html#build-from-source-code) to see how to build the Python wheel from source.
+
 # Performance
 
 We compared different solutions in Python that provide the `read_sql` function by loading a 10x TPC-H lineitem table (8.6GB) from Postgres into a DataFrame with 4 cores of parallelism.
@@ -76,18 +90,10 @@ Finally, ConnectorX will use the schema info as well as the count info to alloca
 Once the downloading begins, there will be one thread for each partition so that the data are downloaded in parallel at the partition level. The thread will issue the query of the corresponding
 partition to the database and then write the returned data to the destination row-wise or column-wise (depending on the database) in a streaming fashion.
 
-#### How to specify the partition number?
-
-`partition_num` determines how many queries we split the original one into and issue to the database. Under the hood, we use [rayon](https://github.com/rayon-rs/rayon) as our parallel executor, which uses a pool of threads to handle each partitioned query. The number of threads in the pool equals the number of logical cores on the machine. It is recommended to set `partition_num` to the number of available logical cores.
-
-#### How to choose the partition column?
-
-`partition_on` specifies the column on which we partition the query, following the procedure above. To achieve the best performance, ideally each partitioned query returns the same number of rows. Since we split the column's range evenly, it is recommended that the numerical `partition_on` column be evenly distributed. Whether a column has an index may also affect performance, depending on the source database; give it a try if you have multiple candidates. You can also partition the query manually if our partition method does not match your needs; ConnectorX will still return a single dataframe combining the results of the list of queries you pass in.
-
 # Supported Sources & Destinations
 
-Supported protocols, data types and type mappings can be found [here](Types.md).
+Example connection strings, supported protocols and data types for each data source can be found [here](https://sfu-db.github.io/connector-x/databases.html).
 For more planned data sources, please check out our [discussion](https://github.com/sfu-db/connector-x/discussions/61).
 
 ## Sources
@@ -100,7 +106,7 @@ For more planned data sources, please check out our [discussion](https://github.
 - [x] SQL Server
 - [x] Azure SQL Database (through mssql protocol)
 - [x] Oracle
-- [x] Big Query - Experimental: need docs and benchmark (also more tests)
+- [x] Big Query
 - [ ] ODBC (WIP)
 - [ ] ...
 
@@ -110,95 +116,11 @@ For more planned data sources, please check out our [discussion](https://github.
 - [x] Modin (through Pandas)
 - [x] Dask (through Pandas)
 - [x] Polars (through PyArrow)
-
-# Detailed Usage and Examples
-
-Rust docs: [stable](https://docs.rs/connectorx) [nightly](https://sfu-db.github.io/connector-x/connectorx/)
-
-## API
-
-```python
-connectorx.read_sql(conn: str, query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)
-```
-
-Run the SQL query and download the data from the database into a Pandas dataframe.
-
-## Parameters
-- `conn: str`: Connection string URI.
-  - General supported URI scheme: `(postgres|postgresql|mysql|mssql)://username:password@addr:port/dbname`.
-  - For now SQLite only supports absolute paths, for example: `sqlite:///home/user/path/test.db`.
-  - Google BigQuery requires the absolute path of the authentication JSON file, for example: `bigquery:///home/user/path/auth.json`
-  - Please check out [here](Types.md) for more connection URI parameters supported for each database (e.g. `trusted_connection` and `encrypt` for Mssql, `sslmode` for Postgres)
-- `query: Union[str, List[str]]`: SQL query or list of SQL queries for fetching data.
-- `return_type: str = "pandas"`: The return type of this function. It can be `arrow`, `pandas`, `modin`, `dask` or `polars`.
-- `protocol: str = "binary"`: The protocol used to fetch data from the source, default is `binary`. Check out [here](Types.md) for more details.
-- `partition_on: Optional[str]`: The column to partition the result on.
-- `partition_range: Optional[Tuple[int, int]]`: The value range of the partition column.
-- `partition_num: Optional[int]`: The number of partitions to generate.
-- `index_col: Optional[str]`: The index column to set for the result dataframe. Only applicable when `return_type` is `pandas`, `modin` or `dask`.
-
-## Examples
-- Read a DataFrame from a SQL query using a single thread
-
-```python
-import connectorx as cx
-
-postgres_url = "postgresql://username:password@server:port/database"
-query = "SELECT * FROM lineitem"
-
-cx.read_sql(postgres_url, query)
-```
-
-- Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL on the partition column (`partition_range` will be automatically queried if not given)
-
-```python
-import connectorx as cx
-
-postgres_url = "postgresql://username:password@server:port/database"
-query = "SELECT * FROM lineitem"
-
-cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
-```
-
-- Read a DataFrame in parallel using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be the same)
-
-```python
-import connectorx as cx
-
-postgres_url = "postgresql://username:password@server:port/database"
-queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]
-
-cx.read_sql(postgres_url, queries)
-```
-
-- Read a DataFrame in parallel using 4 threads from a more complex query
-
-```python
-import connectorx as cx
-
-postgres_url = "postgresql://username:password@server:port/database"
-query = f"""
-SELECT l_orderkey,
-       SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
-       o_orderdate,
-       o_shippriority
-FROM customer,
-     orders,
-     lineitem
-WHERE c_mktsegment = 'BUILDING'
-  AND c_custkey = o_custkey
-  AND l_orderkey = o_orderkey
-  AND o_orderdate < DATE '1995-03-15'
-  AND l_shipdate > DATE '1995-03-15'
-GROUP BY l_orderkey,
-         o_orderdate,
-         o_shippriority
-"""
-
-cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)
-```
+
+# Documentation
+
+Doc: https://sfu-db.github.io/connector-x/intro.html
+Rust docs: [stable](https://docs.rs/connectorx) [nightly](https://sfu-db.github.io/connector-x/connectorx/)
 
 # Next Plan
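The FAQ removed from the README above recommended setting `partition_num` to the number of available logical cores, since rayon's thread pool defaults to that size. A minimal hedged sketch of that rule of thumb (the helper name is mine, not part of the ConnectorX API):

```python
import os

# Rule of thumb from the removed FAQ: one partition per logical core,
# matching rayon's default thread-pool size.
# os.cpu_count() can return None, so fall back to a small default.
def recommended_partition_num(default: int = 4) -> int:
    return os.cpu_count() or default

partition_num = recommended_partition_num()
# e.g. cx.read_sql(url, query, partition_on="l_orderkey", partition_num=partition_num)
```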

connectorx-python/Cargo.lock

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default.

connectorx-python/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 authors = ["Weiyuan Wu <youngw@sfu.ca>"]
 edition = "2018"
 name = "connectorx-python"
-version = "0.2.6-alpha.6"
+version = "0.3.0"
 
 [workspace]
 # prevents package from thinking it's in the workspace

connectorx-python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ license = "MIT"
 maintainers = ["Weiyuan Wu <youngw@sfu.ca>"]
 name = "connectorx"
 readme = "README.md" # Markdown files are supported
-version = "0.2.6-alpha.6"
+version = "0.3.0"
 
 [project]
 name = "connectorx" # Target file name of maturin build

connectorx/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ license = "MIT"
 name = "connectorx"
 readme = "../README.md"
 repository = "https://github.com/sfu-db/connector-x"
-version = "0.2.6-alpha.6"
+version = "0.3.0"
 
 [dependencies]
 anyhow = "1"

docs/_toc.yml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ root: intro
 
 chapters:
 - file: install
-# - title: Databases
+- file: api
 - file: databases
   sections:
   - file: databases/bigquery

docs/api.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
+# Basic usage
+ConnectorX enables you to run a SQL query and load data from databases into a Pandas DataFrame in the fastest and most memory-efficient way.
+
+## API
+```python
+connectorx.read_sql(conn: Union[str, Dict[str, str]], query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)
+```
+
+## Parameters
+- `conn: Union[str, Dict[str, str]]`: Connection string URI for querying a single database, or a dict of database names (keys) and connection string URIs (values) for querying multiple databases.
+  - Please check out [here](https://sfu-db.github.io/connector-x/databases.html) for connection string examples for each database
+- `query: Union[str, List[str]]`: SQL query or list of partitioned SQL queries for fetching data.
+- `return_type: str = "pandas"`: The return type of this function. It can be `arrow` (`arrow2`), `pandas`, `modin`, `dask` or `polars`.
+- `protocol: str = "binary"`: The protocol used to fetch data from the source, default is `binary`. Check out [here](./databases.md) for more details.
+- `partition_on: Optional[str]`: The column to partition the result on.
+- `partition_range: Optional[Tuple[int, int]]`: The value range of the partition column.
+- `partition_num: Optional[int]`: The number of partitions to generate.
+- `index_col: Optional[str]`: The index column to set for the result dataframe. Only applicable when `return_type` is `pandas`, `modin` or `dask`.
+
+## Examples
+- Read a DataFrame from a SQL query using a single thread
+
+```python
+import connectorx as cx
+
+postgres_url = "postgresql://username:password@server:port/database"
+query = "SELECT * FROM lineitem"
+
+cx.read_sql(postgres_url, query)
+```
+
+- Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL on the partition column (`partition_range` will be automatically queried if not given)
+
+```python
+import connectorx as cx
+
+postgres_url = "postgresql://username:password@server:port/database"
+query = "SELECT * FROM lineitem"
+
+cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
+```
+
+- Read a DataFrame in parallel using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be the same)
+
+```python
+import connectorx as cx
+
+postgres_url = "postgresql://username:password@server:port/database"
+queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]
+
+cx.read_sql(postgres_url, queries)
+```
+
+- Read a DataFrame in parallel using 4 threads from a more complex query
+
+```python
+import connectorx as cx
+
+postgres_url = "postgresql://username:password@server:port/database"
+query = f"""
+SELECT l_orderkey,
+       SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
+       o_orderdate,
+       o_shippriority
+FROM customer,
+     orders,
+     lineitem
+WHERE c_mktsegment = 'BUILDING'
+  AND c_custkey = o_custkey
+  AND l_orderkey = o_orderkey
+  AND o_orderdate < DATE '1995-03-15'
+  AND l_shipdate > DATE '1995-03-15'
+GROUP BY l_orderkey,
+         o_orderdate,
+         o_shippriority
+"""
+
+cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)
+```
+
+- Read a DataFrame from a SQL query joined across multiple databases (experimental, only supports PostgreSQL for now)
+
+```python
+import connectorx as cx
+
+db1 = "postgresql://username1:password1@server1:port1/database1"
+db2 = "postgresql://username2:password2@server2:port2/database2"
+query = "SELECT * FROM db1.nation n, db2.region r WHERE n.n_regionkey = r.r_regionkey"
+
+cx.read_sql({"db1": db1, "db2": db2}, query)
+```
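The examples above note that `partition_range` "will be automatically queried if not given". A hedged sketch of how such a range could be discovered; the helper below is hypothetical and only builds the probe query string, it is not ConnectorX's internal API:

```python
# Hypothetical helper: build the MIN/MAX probe query used to discover the
# partition column's value range when partition_range is not supplied.
def range_probe_query(query: str, partition_on: str) -> str:
    return f"SELECT MIN({partition_on}), MAX({partition_on}) FROM ({query}) AS t"

print(range_probe_query("SELECT * FROM lineitem", "l_orderkey"))
# SELECT MIN(l_orderkey), MAX(l_orderkey) FROM (SELECT * FROM lineitem) AS t
```

In practice the probe would be sent to the database once, and the returned bounds would feed the even range split before the partitioned queries are issued.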
