
---

*Contents:* **[Use Cases](#Use-Cases)** | **[Installation](#Installation)** | **[Examples](#Examples)** | **[Diving Deep](#Diving-Deep)** | **[Contributing](#Contributing)**

---

## Use Cases

### Pandas
* Pandas -> Parquet (S3) (Parallel)
* Pandas -> CSV (S3) (Parallel)
* Pandas -> Glue Catalog
* Pandas -> Athena (Parallel)
* Pandas -> Redshift (Parallel)
* CSV (S3) -> Pandas (One shot or Batching)
* Athena -> Pandas (One shot or Batching)
* CloudWatch Logs Insights -> Pandas (NEW :star:)

### PySpark
* PySpark -> Redshift (Parallel :rocket:) (NEW :star:)

### General
* List S3 objects (Parallel)
* Delete S3 objects (Parallel)
* Delete listed S3 objects (Parallel)
* Delete NOT listed S3 objects (Parallel)
* Copy listed S3 objects (Parallel :rocket:)
* Get the size of S3 objects (Parallel :rocket:)
* Get CloudWatch Logs Insights query results (NEW :star:)

### Spark to Redshift Flow

![Spark to Redshift Flow](docs/source/_static/step-by-step/spark-to-redshift-flow.jpg?raw=true "Spark to Redshift Flow")

## Contributing

* AWS Data Wrangler is essentially a collection of integrations with AWS services, so we prefer to dedicate our energy and time to writing integration tests instead of unit tests. We really like an end-to-end approach for all features.

* All integration tests run between a local Docker container and real AWS services.

* We have a Docker recipe to set up the local end (testing/Dockerfile); see the sketch after this list.

* We have a CloudFormation template to set up the AWS end (testing/template.yaml).
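
For reference, the local end can be built straight from that recipe with plain Docker. This is only a sketch: the image tag and build context are assumptions, and the step-by-step below uses the **./open-image.sh** helper instead.

```bash
# Build the local testing image from the provided recipe.
# The tag "awswrangler-testing" is arbitrary, and the build context may need
# to be the testing/ directory instead of the repository root.
docker build -t awswrangler-testing -f testing/Dockerfile .

# Optional sanity check: open a shell in the image (assumes bash is available).
docker run --rm -it awswrangler-testing /bin/bash
```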

### Step-by-step

**DISCLAIMER**: Make sure you know what you are doing. These steps will incur charges for some services on your AWS account and require minimum security skills to keep your environment safe.

* Use Linux or macOS.

* Install Python 3.6+.

* Install Docker and configure it with at least 4 cores and 8 GB of memory.

* Fork the AWS Data Wrangler repository and clone your fork into your development environment.

* Go to the project's directory and create a Python virtual environment for the project: **python -m venv venv && source venv/bin/activate**

* Run **./install-dev.sh**

* Go to the *testing* directory.

* Configure the parameters.json file with your AWS environment information (make sure your Redshift cluster will not be open to the world!).

* Deploy the CloudFormation stack: **./deploy-cloudformation.sh**

* Open the Docker image: **./open-image.sh**

* Inside the image you can finally run **./run-tests.sh** (the whole flow is condensed in the sketch below).
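
Condensed, the whole flow looks roughly like the sketch below. The fork URL is a placeholder, and the comments mark the manual steps (editing parameters.json and working inside the container).

```bash
# Clone your fork (placeholder URL) and set up the local environment.
git clone https://github.com/<your-user>/aws-data-wrangler.git
cd aws-data-wrangler
python -m venv venv
source venv/bin/activate
./install-dev.sh

# Configure and deploy the AWS end.
cd testing
# ...edit parameters.json with your AWS environment information first...
./deploy-cloudformation.sh

# Open the Docker image; this drops you into the container.
./open-image.sh

# From INSIDE the container, run the test suite.
./run-tests.sh
```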