
> DataFrames on AWS

[](https://pypi.org/project/awswrangler/)
[](https://pypi.org/project/awswrangler/)
[](https://pypi.org/project/awswrangler/)
[](https://aws-data-wrangler.readthedocs.io/en/latest/?badge=latest)
[](https://pypi.org/project/awswrangler/)
[](http://isitmaintained.com/project/awslabs/aws-data-wrangler "Average time to resolve an issue")
[](https://opensource.org/licenses/Apache-2.0)

## [Read the Docs](https://aws-data-wrangler.readthedocs.io)

## [Read the Tutorials](https://github.com/awslabs/aws-data-wrangler/tree/master/tutorials)
- [Catalog & Metadata](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/catalog_and_metadata.ipynb)
- [Athena Nested](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/athena_nested.ipynb)
- [S3 Write Modes](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/s3_write_modes.ipynb)

## Contents
- [Use Cases](#use-cases)
- [Installation](#installation)
- [Examples](#examples)
- [Diving Deep](#diving-deep)
- [Step By Step](#step-by-step)
- [Contributing](#contributing)

## Use Cases

### Pandas

| FROM | TO | Features |
|--------------------------|------------------|----------|
| Pandas DataFrame | Amazon S3 | Parquet, CSV, Partitions, Parallelism, Overwrite/Append/Partitions-Upsert modes,<br>KMS Encryption, Glue Metadata (Athena, Spectrum, Spark, Hive, Presto) |
| Amazon S3 | Pandas DataFrame | Parquet (Pushdown filters), CSV, Partitions, Parallelism,<br>KMS Encryption, Multiple files |
| Amazon Athena | Pandas DataFrame | Workgroups, S3 output path, Encryption, and two different engines:<br><br>- ctas_approach=False **->** Batching, suited to restricted-memory environments<br>- ctas_approach=True **->** Blazing fast, parallelism and enhanced data types |
| Pandas DataFrame | Amazon Redshift | Blazing fast using parallel Parquet on S3 behind the scenes<br>Append/Overwrite/Upsert modes |
| Amazon Redshift | Pandas DataFrame | Blazing fast using parallel Parquet on S3 behind the scenes |
| Pandas DataFrame | Amazon Aurora | Supported engines: MySQL, PostgreSQL<br>Blazing fast using parallel CSV on S3 behind the scenes<br>Append/Overwrite modes |
| Amazon Aurora | Pandas DataFrame | Supported engines: MySQL<br>Blazing fast using parallel CSV on S3 behind the scenes |
| CloudWatch Logs Insights | Pandas DataFrame | Query results |
| Glue Catalog | Pandas DataFrame | List and get table details. Good fit with Jupyter Notebooks. |

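To make the Pandas round trip concrete, here is a minimal sketch assuming the `wr.pandas.to_parquet` and `wr.pandas.read_sql_athena` entry points and an existing Glue database; the bucket, database, and table names are placeholders, and exact signatures may vary between releases, so check the tutorials and API docs for your version.

```python
import pandas as pd

import awswrangler as wr

# Placeholder data and resource names.
df = pd.DataFrame({"id": [1, 2, 3], "name": ["foo", "boo", "bar"]})

# Pandas DataFrame -> Parquet on S3, registering the table in the Glue Catalog
# so Athena, Redshift Spectrum, Spark, Hive and Presto can see it.
wr.pandas.to_parquet(
    dataframe=df,
    database="my_database",        # Glue database assumed to already exist
    path="s3://my-bucket/my_table/",
    partition_cols=["name"],
)

# Amazon Athena -> Pandas DataFrame, reading the table back.
df2 = wr.pandas.read_sql_athena(
    sql="SELECT * FROM my_table",
    database="my_database",
)
```
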
### PySpark

| FROM | TO | Features |
|-----------------------------|----------------------------|----------|
| PySpark DataFrame | Amazon Redshift | Blazing fast using parallel Parquet on S3 behind the scenes<br>Append/Overwrite/Upsert modes |
| PySpark DataFrame | Glue Catalog | Register Parquet or CSV DataFrame on Glue Catalog |
| Nested PySpark<br>DataFrame | Flat PySpark<br>DataFrames | Flatten structs and break up arrays in child tables |

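The sketch below illustrates the PySpark helpers. The `Session(spark_session=...)` constructor and the `session.spark.create_glue_table` and `session.spark.flatten` calls, along with their parameters, are assumptions based on the capabilities listed above, and the S3 path and database name are placeholders; verify the exact API in the docs.

```python
import awswrangler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wrangler-example").getOrCreate()
session = awswrangler.Session(spark_session=spark)

df = spark.createDataFrame([(1, "foo"), (2, "boo")], ["id", "name"])

# Write the DataFrame to S3 with plain Spark, then register it in the Glue Catalog
# so Athena, Redshift Spectrum, Hive and Presto can query it.
df.write.mode("overwrite").parquet("s3://my-bucket/my_table/")
session.spark.create_glue_table(
    dataframe=df,
    file_format="parquet",
    path="s3://my-bucket/my_table/",
    database="my_database",
)

# Nested PySpark DataFrame -> flat PySpark DataFrames:
# structs are flattened and arrays are broken out into child tables.
flat_dfs = session.spark.flatten(dataframe=df)
```
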
### General

| Feature | Details |
|---------------------------------------------|-------------------------------------|
| List S3 objects | e.g. wr.s3.list_objects("s3://...") |
| Delete S3 objects | Parallel |
| Delete listed S3 objects | Parallel |
| Delete NOT listed S3 objects | Parallel |
| Copy listed S3 objects | Parallel |
| Get the size of S3 objects | Parallel |
| Get CloudWatch Logs Insights query results | |
| Load partitions on Athena/Glue table | Through "MSCK REPAIR TABLE" |
| Create EMR cluster | "For humans" |
| Terminate EMR cluster | "For humans" |
| Get EMR cluster state | "For humans" |
| Submit EMR step(s) | "For humans" |
| Get EMR step state | "For humans" |
| Query Athena to receive Python primitives | Returns *Iterable[Dict[str, Any]]* |
| Load and unzip SageMaker job outputs | |
| Dump Amazon Redshift as Parquet files on S3 | |
| Dump Amazon Aurora as CSV files on S3 | Only for MySQL engine |

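A short sketch of a few of the general helpers follows, starting from the `wr.s3.list_objects` call quoted in the table. The delete and repair-table calls are assumed names for the features listed above, not confirmed signatures, and the bucket, database, and table names are placeholders.

```python
import awswrangler as wr

# List objects under a prefix (the call quoted in the table above).
keys = wr.s3.list_objects("s3://my-bucket/my-prefix/")

# Delete the listed objects in parallel. Function name and parameter are
# assumptions based on the feature list; check the API reference.
wr.s3.delete_listed_objects(objects_paths=keys)

# Load new partitions on an Athena/Glue table ("MSCK REPAIR TABLE" behind the
# scenes). Again an assumed name; verify against the docs.
wr.athena.repair_table(table="my_table", database="my_database")
```
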
## Installation
