
Commit 80b4653: Update docs

1 parent 85a1fce

File tree: 8 files changed (+165, -70 lines)


README.md

Lines changed: 48 additions & 5 deletions
@@ -27,13 +27,14 @@
 * Pandas -> Glue Catalog Table
 * Pandas -> Athena (Parallel)
 * Pandas -> Redshift (Append/Overwrite/Upsert) (Parallel)
-* Parquet (S3) -> Pandas (Parallel) (NEW :star:)
+* Pandas -> Aurora (MySQL/PostgreSQL) (Append/Overwrite) (Via S3) (NEW :star:)
+* Parquet (S3) -> Pandas (Parallel)
 * CSV (S3) -> Pandas (One shot or Batching)
-* Glue Catalog Table -> Pandas (Parallel) (NEW :star:)
-* Athena -> Pandas (One shot, Batching or Parallel (NEW :star:))
-* Redshift -> Pandas (Parallel) (NEW :star:)
-* Redshift -> Parquet (S3) (NEW :star:)
+* Glue Catalog Table -> Pandas (Parallel)
+* Athena -> Pandas (One shot, Batching or Parallel)
+* Redshift -> Pandas (Parallel)
 * CloudWatch Logs Insights -> Pandas
+* Aurora -> Pandas (MySQL) (Via S3) (NEW :star:)
 * Encrypt Pandas Dataframes on S3 with KMS keys

 ### PySpark

@@ -60,6 +61,8 @@
 * Get EMR step state
 * Athena query to receive the result as python primitives (*Iterable[Dict[str, Any]]*)
 * Load and Unzip SageMaker jobs outputs
+* Redshift -> Parquet (S3)
+* Aurora -> CSV (S3) (MySQL) (NEW :star:)

 ## Installation

@@ -147,6 +150,22 @@ df = sess.pandas.read_sql_athena(
 )
 ```

+#### Reading from Glue Catalog (Parquet) to Pandas
+
+```py3
+import awswrangler as wr
+
+df = wr.pandas.read_table(database="DATABASE_NAME", table="TABLE_NAME")
+```
+
+#### Reading from S3 (Parquet) to Pandas
+
+```py3
+import awswrangler as wr
+
+df = wr.pandas.read_parquet(path="s3://...", columns=["c1", "c3"], filters=[("c5", "=", 0)])
+```
+
 #### Reading from S3 (CSV) to Pandas

 ```py3

@@ -227,6 +246,30 @@ df = wr.pandas.read_sql_redshift(
     temp_s3_path="s3://temp_path")
 ```

+#### Loading Pandas Dataframe to Aurora (MySQL/PostgreSQL)
+
+```py3
+import awswrangler as wr
+
+wr.pandas.to_aurora(
+    dataframe=df,
+    connection=con,
+    schema="...",
+    table="..."
+)
+```
+
+#### Extract Aurora query to Pandas DataFrame (MySQL)
+
+```py3
+import awswrangler as wr
+
+df = wr.pandas.read_sql_aurora(
+    sql="SELECT ...",
+    connection=con
+)
+```
+
 ### PySpark

 #### Loading PySpark Dataframe to Redshift
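
Note: the new `read_table` example above passes only `database` and `table`; the signature updated in `awswrangler/pandas.py` (next file in this commit) also exposes `columns`, `filters`, and `procs_cpu_bound`. A minimal sketch combining them; the column names, filter, and process count are illustrative and not part of the commit:

```py3
import awswrangler as wr

# Hypothetical Glue table: fetch only two columns, keep rows where c5 == 0,
# and cap the number of reader processes at 4
df = wr.pandas.read_table(
    database="DATABASE_NAME",
    table="TABLE_NAME",
    columns=["c1", "c3"],
    filters=[("c5", "=", 0)],
    procs_cpu_bound=4
)
```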

awswrangler/pandas.py

Lines changed: 8 additions & 7 deletions
@@ -1211,7 +1211,7 @@ def drop_duplicated_columns(dataframe: pd.DataFrame, inplace: bool = True) -> pd
     def read_parquet(self,
                      path: Union[str, List[str]],
                      columns: Optional[List[str]] = None,
-                     filters: Optional[Union[List[Tuple[Any]], List[Tuple[Any]]]] = None,
+                     filters: Optional[Union[List[Tuple[Any]], List[List[Tuple[Any]]]]] = None,
                      procs_cpu_bound: Optional[int] = None) -> pd.DataFrame:
         """
         Read parquet data from S3

@@ -1274,7 +1274,7 @@ def _read_parquet_paths_remote(send_pipe: mp.connection.Connection,
                                    session_primitives: Any,
                                    path: Union[str, List[str]],
                                    columns: Optional[List[str]] = None,
-                                   filters: Optional[Union[List[Tuple[Any]], List[Tuple[Any]]]] = None,
+                                   filters: Optional[Union[List[Tuple[Any]], List[List[Tuple[Any]]]]] = None,
                                    procs_cpu_bound: Optional[int] = None):
         df: pd.DataFrame = Pandas._read_parquet_paths(session_primitives=session_primitives,
                                                       path=path,

@@ -1288,7 +1288,7 @@ def _read_parquet_paths_remote(send_pipe: mp.connection.Connection,
     def _read_parquet_paths(session_primitives: Any,
                             path: Union[str, List[str]],
                             columns: Optional[List[str]] = None,
-                            filters: Optional[Union[List[Tuple[Any]], List[Tuple[Any]]]] = None,
+                            filters: Optional[Union[List[Tuple[Any]], List[List[Tuple[Any]]]]] = None,
                             procs_cpu_bound: Optional[int] = None) -> pd.DataFrame:
         """
         Read parquet data from S3

@@ -1327,7 +1327,7 @@ def _read_parquet_paths(session_primitives: Any,
     def _read_parquet_path(session_primitives: Any,
                            path: str,
                            columns: Optional[List[str]] = None,
-                           filters: Optional[Union[List[Tuple[Any]], List[Tuple[Any]]]] = None,
+                           filters: Optional[Union[List[Tuple[Any]], List[List[Tuple[Any]]]]] = None,
                            procs_cpu_bound: Optional[int] = None) -> pd.DataFrame:
         """
         Read parquet data from S3

@@ -1369,7 +1369,7 @@ def read_table(self,
                    database: str,
                    table: str,
                    columns: Optional[List[str]] = None,
-                   filters: Optional[Union[List[Tuple[Any]], List[Tuple[Any]]]] = None,
+                   filters: Optional[Union[List[Tuple[Any]], List[List[Tuple[Any]]]]] = None,
                    procs_cpu_bound: Optional[int] = None) -> pd.DataFrame:
         """
         Read PARQUET table from S3 using the Glue Catalog location skipping Athena's necessity

@@ -1408,6 +1408,7 @@ def read_sql_redshift(self,
         temp_s3_path = temp_s3_path[:-1] if temp_s3_path[-1] == "/" else temp_s3_path
         temp_s3_path = f"{temp_s3_path}/{name}"
         logger.debug(f"temp_s3_path: {temp_s3_path}")
+        self._session.s3.delete_objects(path=temp_s3_path)
         paths: Optional[List[str]] = None
         try:
             paths = self._session.redshift.to_parquet(sql=sql,

@@ -1416,11 +1417,11 @@ def read_sql_redshift(self,
                                                       connection=connection)
             logger.debug(f"paths: {paths}")
             df: pd.DataFrame = self.read_parquet(path=paths, procs_cpu_bound=procs_cpu_bound)  # type: ignore
-            self._session.s3.delete_listed_objects(objects_paths=paths)
+            self._session.s3.delete_listed_objects(objects_paths=paths + [temp_s3_path + "/manifest"])  # type: ignore
             return df
         except Exception as e:
             if paths is not None:
-                self._session.s3.delete_listed_objects(objects_paths=paths)
+                self._session.s3.delete_listed_objects(objects_paths=paths + [temp_s3_path + "/manifest"])
             else:
                 self._session.s3.delete_objects(path=temp_s3_path)
             raise e
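
The signature change above widens `filters` from a flat list of tuples to also accept a list of lists of tuples. Assuming the value is passed through to PyArrow's Parquet reader (which treats a flat list as an AND of predicates and a list of lists as an OR of AND groups), a minimal sketch with hypothetical partition columns:

```py3
import awswrangler as wr

# Flat list: year == 2019 AND month == 10
df1 = wr.pandas.read_parquet(
    path="s3://...",
    filters=[("year", "=", 2019), ("month", "=", 10)]
)

# Nested list (now covered by the type hint): (year == 2019 AND month == 10) OR (year == 2019 AND month == 11)
df2 = wr.pandas.read_parquet(
    path="s3://...",
    filters=[
        [("year", "=", 2019), ("month", "=", 10)],
        [("year", "=", 2019), ("month", "=", 11)],
    ]
)
```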

building/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ RUN yum install -y \
     bison \
     flex \
     autoconf \
-    python37-devel
+    python36-devel
 
 RUN pip3 install --upgrade pip
 
building/build-lambda-layer.sh

Lines changed: 4 additions & 7 deletions
@@ -1,18 +1,15 @@
 #!/usr/bin/env bash
 set -e
 
-
 # Go back to AWSWRANGLER directory
 cd /aws-data-wrangler/
 
 rm -rf dist/*.zip
 
-# Build PyArrow files if necessary
-if [ ! -d "dist/pyarrow_files" ] ; then
-  cd building
-  ./build-pyarrow.sh
-  cd ..
-fi
+# Build PyArrow files
+cd building
+./build-pyarrow.sh
+cd ..
 
 # Preparing directories
 mkdir -p dist

building/build-pyarrow.sh

Lines changed: 0 additions & 1 deletion
@@ -7,7 +7,6 @@ rm -rf \
     dist \
     /aws-data-wrangler/dist/pyarrow_wheels \
     /aws-data-wrangler/dist/pyarrow_files \
-    /aws-data-wrangler/dist/pyarrow_wheels/
 
 # Clone desired Arrow version
 git clone \

docs/source/examples.rst

Lines changed: 43 additions & 0 deletions
@@ -83,6 +83,23 @@ Reading from AWS Athena to Pandas with the blazing fast CTAS approach
         database="database"
     )
 
+Reading from Glue Catalog (Parquet) to Pandas
+`````````````````````````````````````````````
+
+.. code-block:: python
+
+    import awswrangler as wr
+
+    df = wr.pandas.read_table(database="DATABASE_NAME", table="TABLE_NAME")
+
+Reading from S3 (Parquet) to Pandas
+```````````````````````````````````
+
+.. code-block:: python
+
+    import awswrangler as wr
+
+    df = wr.pandas.read_parquet(path="s3://...", columns=["c1", "c3"], filters=[("c5", "=", 0)])
 
 Reading from S3 (CSV) to Pandas
 ```````````````````````````````

@@ -174,6 +191,32 @@ Extract Redshift query to Pandas DataFrame
         connection=con,
         temp_s3_path="s3://temp_path")
 
+Loading Pandas Dataframe to Aurora (MySQL/PostgreSQL)
+`````````````````````````````````````````````````````
+
+.. code-block:: python
+
+    import awswrangler as wr
+
+    wr.pandas.to_aurora(
+        dataframe=df,
+        connection=con,
+        schema="...",
+        table="..."
+    )
+
+
+Extract Aurora query to Pandas DataFrame (MySQL)
+````````````````````````````````````````````````
+
+.. code-block:: python
+
+    import awswrangler as wr
+
+    df = wr.pandas.read_sql_aurora(
+        sql="SELECT ...",
+        connection=con
+    )
 
 PySpark
 -------

docs/source/index.rst

Lines changed: 11 additions & 7 deletions
@@ -20,13 +20,14 @@ Pandas
 * Pandas -> Glue Catalog Table
 * Pandas -> Athena (Parallel)
 * Pandas -> Redshift (Append/Overwrite/Upsert) (Parallel)
+* Pandas -> Aurora (MySQL/PostgreSQL) (Append/Overwrite) (Via S3) (NEW)
 * Parquet (S3) -> Pandas (Parallel)
 * CSV (S3) -> Pandas (One shot or Batching)
 * Glue Catalog Table -> Pandas (Parallel)
 * Athena -> Pandas (One shot, Batching or Parallel)
 * Redshift -> Pandas (Parallel)
-* Redshift -> Parquet (S3)
 * CloudWatch Logs Insights -> Pandas
+* Aurora -> Pandas (MySQL) (Via S3) (NEW)
 * Encrypt Pandas Dataframes on S3 with KMS keys
 
 PySpark

@@ -45,13 +46,16 @@ General
 * Get the size of S3 objects (Parallel)
 * Get CloudWatch Logs Insights query results
 * Load partitions on Athena/Glue table (repair table)
-* Create EMR cluster (For humans) (NEW)
-* Terminate EMR cluster (NEW)
-* Get EMR cluster state (NEW)
-* Submit EMR step(s) (For humans) (NEW)
-* Get EMR step state (NEW)
-* Athena query to receive the result as python primitives (Iterable[Dict[str, Any]) (NEW)
+* Create EMR cluster (For humans)
+* Terminate EMR cluster
+* Get EMR cluster state
+* Submit EMR step(s) (For humans)
+* Get EMR step state
+* Get EMR step state
+* Athena query to receive the result as python primitives (*Iterable[Dict[str, Any]]*)
 * Load and Unzip SageMaker jobs outputs
+* Redshift -> Parquet (S3)
+* Aurora -> CSV (S3) (MySQL) (NEW :star:)
 
 
 Table Of Contents
