The function will partition the query by **evenly** splitting the value range of the specified column into `partition_num` partitions.
ConnectorX will assign one thread for each partition to load and write data in parallel.
Currently, we support partitioning on **numerical** columns (**cannot contain NULL**) for **SPJA** queries.
We now provide federated query support (experimental; PostgreSQL only, and partitioning is not yet supported): you can write a single query that joins tables from two Postgres databases. A JRE (Java Runtime Environment) is required.

```python
import connectorx as cx

# Placeholder connection strings; point these at your own databases.
db1 = "postgresql://username:password@host1:5432/db1"
db2 = "postgresql://username:password@host2:5432/db2"

cx.read_sql({"db1": db1, "db2": db2}, "SELECT * FROM db1.nation n, db2.region r where n.n_regionkey = r.r_regionkey")
```
Check out more detailed usage and examples [here](https://sfu-db.github.io/connector-x/api.html). A general introduction of the project can be found in this [blog post](https://towardsdatascience.com/connectorx-the-fastest-way-to-load-data-from-databases-a65d4d4062d5).
# Installation
```bash
pip install connectorx
```
Check out [here](https://sfu-db.github.io/connector-x/install.html#build-from-source-code) to see how to build the Python wheel from source.
# Performance
We compared different solutions in Python that provide the `read_sql` function by loading a 10x TPC-H lineitem table (8.6GB) from Postgres into a DataFrame, with 4 cores of parallelism.
Finally, ConnectorX will use the schema info as well as the count info to allocate memory for the destination dataframe in advance.
Once downloading begins, there will be one thread for each partition so that the data are downloaded in parallel at the partition level. Each thread will issue the query of its corresponding partition to the database and then write the returned data to the destination row-wise or column-wise (depending on the database) in a streaming fashion.
#### How to specify the partition number?
`partition_num` determines how many queries we split the original query into and issue to the database. Under the hood, we use [rayon](https://github.com/rayon-rs/rayon) as our parallel executor, which adopts a pool of threads to handle each partitioned query. The number of threads in the pool equals the number of logical cores on the machine. It is recommended to set `partition_num` to the number of available logical cores.
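As a minimal sketch (the connection string, table, and column below are placeholders, not values from this README), matching `partition_num` to the machine's logical core count might look like:

```python
import os

# Hypothetical connection string; replace with your own database URI.
conn = "postgresql://username:password@localhost:5432/dbname"

# rayon's thread pool has one thread per logical core, so matching
# partition_num to the core count keeps every thread busy.
partition_num = os.cpu_count() or 4

# Running this requires connectorx and a reachable database, so the
# call itself is left commented here:
# import connectorx as cx
# df = cx.read_sql(conn, "SELECT * FROM lineitem",
#                  partition_on="l_orderkey", partition_num=partition_num)
```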
#### How to choose the partition column?
`partition_on` specifies the column on which we partition the query, following the procedure above. To achieve the best performance, ideally each partitioned query returns the same number of rows. Since we split the column evenly, it is recommended that the numerical `partition_on` column be evenly distributed. Whether a column is indexed may also affect performance, depending on the source database; give each candidate a try if you have several. You can also manually partition the query if our partition method does not match your needs: ConnectorX will still return a single dataframe containing the combined results of the list of queries you pass in.
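For manual partitioning, here is a hedged sketch (the table, column, and value range are assumptions for illustration) that builds equal-width range queries; such a list can be passed to `cx.read_sql`, which combines the per-query results into one dataframe:

```python
# Split an assumed l_orderkey value range [1, 60_000_000] into 4
# equal-width range predicates.
lo, hi, n = 1, 60_000_000, 4
step = (hi - lo + 1) // n
queries = [
    f"SELECT * FROM lineitem "
    f"WHERE l_orderkey >= {lo + i * step} AND l_orderkey < {lo + (i + 1) * step}"
    for i in range(n)
]

# Requires connectorx and a live database, so left commented:
# import connectorx as cx
# df = cx.read_sql("postgresql://username:password@localhost:5432/dbname", queries)
```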
# Supported Sources & Destinations
Example connection string, supported protocols and data types for each data source can be found [here](https://sfu-db.github.io/connector-x/databases.html).
For more planned data sources, please check out our [discussion](https://github.com/sfu-db/connector-x/discussions/61).
## Sources
- [x] SQL Server
- [x] Azure SQL Database (through mssql protocol)
- [x] Oracle
- [x] Big Query
- [ ] ODBC (WIP)
- [ ] ...
# Documentation

`read_sql` runs the SQL query and downloads the data from the database into a dataframe.
## Parameters
- `conn: Union[str, Dict[str, str]]`: Connection string URI for querying a single database, or a dict mapping database names (keys) to connection string URIs (values) for querying multiple databases.
  - Please check out [here](https://sfu-db.github.io/connector-x/databases.html) for connection string examples for each database.
- `query: Union[str, List[str]]`: SQL query or list of partitioned SQL queries for fetching data.
- `return_type: str = "pandas"`: The return type of this function. It can be `arrow` (`arrow2`), `pandas`, `modin`, `dask` or `polars`.
- `protocol: str = "binary"`: The protocol used to fetch data from the source; the default is `binary`. Check out [here](https://sfu-db.github.io/connector-x/databases.html) to see more details.
- `partition_on: Optional[str]`: The column on which to partition the result.
- `partition_range: Optional[Tuple[int, int]]`: The value range of the partition column.
- `partition_num: Optional[int]`: The number of partitions to generate.
- `index_col: Optional[str]`: The index column to set for the result dataframe. Only applicable when `return_type` is `pandas`, `modin` or `dask`.

## Examples

- Read a DataFrame from a SQL query using a single thread
- Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL query on the partition column (`partition_range` will be queried automatically if not given)
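The two examples above could look like the following sketch (the connection string and query are placeholders; running it needs `connectorx` installed and a live Postgres instance, so the calls are wrapped in a function rather than executed here):

```python
def example_reads():
    # Requires `pip install connectorx` and a reachable database;
    # the URI below is a placeholder.
    import connectorx as cx

    postgres_url = "postgresql://username:password@server:5432/dbname"
    query = "SELECT * FROM lineitem"

    # Read a DataFrame using a single thread.
    df_single = cx.read_sql(postgres_url, query)

    # Read in parallel with 10 threads by partitioning on l_orderkey.
    df_parallel = cx.read_sql(
        postgres_url, query, partition_on="l_orderkey", partition_num=10
    )
    return df_single, df_parallel
```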