
Commit adb1319

Update tutorials.
1 parent 3fad112 commit adb1319

13 files changed: +244 / -448 lines changed

tutorials/001 - Introduction.ipynb

Lines changed: 6 additions & 5 deletions
@@ -17,7 +17,7 @@
"\n",
"An [open-source](https://github.com/awslabs/aws-data-wrangler>) Python package that extends the power of [Pandas](https://github.com/pandas-dev/pandas>) library to AWS connecting **DataFrames** and AWS data related services (**Amazon Redshift**, **AWS Glue**, **Amazon Athena**, **Amazon EMR**, etc).\n",
"\n",
-"Built on top of other open-source projects like [Pandas](https://github.com/pandas-dev/pandas), [Apache Arrow](https://github.com/apache/arrow), [Boto3](https://github.com/boto/boto3), [s3fs](https://github.com/dask/s3fs), [SQLAlchemy](https://github.com/sqlalchemy/sqlalchemy), [Psycopg2](https://github.com/psycopg/psycopg2) and [PyMySQL](https://github.com/PyMySQL/PyMySQL), it offers abstracted functions to execute usual ETL tasks like load/unload data from **Data Lakes**, **Data Warehouses** and **Databases**.\n",
+"Built on top of other open-source projects like [Pandas](https://github.com/pandas-dev/pandas), [Apache Arrow](https://github.com/apache/arrow), [Boto3](https://github.com/boto/boto3), [SQLAlchemy](https://github.com/sqlalchemy/sqlalchemy), [Psycopg2](https://github.com/psycopg/psycopg2) and [PyMySQL](https://github.com/PyMySQL/PyMySQL), it offers abstracted functions to execute usual ETL tasks like load/unload data from **Data Lakes**, **Data Warehouses** and **Databases**.\n",
"\n",
"Check our [list of functionalities](https://aws-data-wrangler.readthedocs.io/en/latest/api.html)."
]
@@ -33,7 +33,8 @@
" - [PyPi (pip)](https://aws-data-wrangler.readthedocs.io/en/latest/install.html#pypi-pip)\n",
" - [Conda](https://aws-data-wrangler.readthedocs.io/en/latest/install.html#conda)\n",
" - [AWS Lambda Layer](https://aws-data-wrangler.readthedocs.io/en/latest/install.html#aws-lambda-layer)\n",
-" - [AWS Glue Wheel](https://aws-data-wrangler.readthedocs.io/en/latest/install.html#aws-glue-wheel)\n",
+" - [AWS Glue Python Shell Jobs](https://aws-data-wrangler.readthedocs.io/en/latest/install.html#aws-glue-python-shell-jobs)\n",
+" - [AWS Glue PySpark Jobs](https://aws-data-wrangler.readthedocs.io/en/latest/install.html#aws-glue-pyspark-jobs)\n",
" - [Amazon SageMaker Notebook](https://aws-data-wrangler.readthedocs.io/en/latest/install.html#amazon-sagemaker-notebook)\n",
" - [Amazon SageMaker Notebook Lifecycle](https://aws-data-wrangler.readthedocs.io/en/latest/install.html#amazon-sagemaker-notebook-lifecycle)\n",
" - [EMR Cluster](https://aws-data-wrangler.readthedocs.io/en/latest/install.html#emr-cluster)\n",
@@ -69,16 +70,16 @@
},
{
"cell_type": "code",
-"execution_count": 1,
+"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
-"'1.7.0'"
+"'1.9.0'"
]
},
-"execution_count": 1,
+"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
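
The last hunk only bumps the recorded execution count and the printed version string from '1.7.0' to '1.9.0'. A minimal sketch of the cell that produces that output, assuming awswrangler is already installed:

import awswrangler as wr

# Prints the installed release; the updated notebook was executed against 1.9.0.
print(wr.__version__)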

tutorials/002 - Sessions.ipynb

Lines changed: 3 additions & 3 deletions
@@ -36,7 +36,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"## Using the default Session"
+"## Using the default Boto3 Session"
]
},
{
@@ -63,7 +63,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"## Customizing and using the default Session"
+"## Customizing and using the default Boto3 Session"
]
},
{
@@ -92,7 +92,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"## Using a new custom Session"
+"## Using a new custom Boto3 Session"
]
},
{
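
All three renamed headings cover how awswrangler resolves its Boto3 session. A short sketch of the three patterns, assuming a hypothetical bucket and key (the real notebook uses its own test objects):

import boto3
import awswrangler as wr

# 1. Default Boto3 session: awswrangler uses it whenever no session is passed.
wr.s3.does_object_exist("s3://my-example-bucket/some-key")

# 2. Customizing the default Boto3 session (e.g. pinning a region).
boto3.setup_default_session(region_name="us-east-2")
wr.s3.does_object_exist("s3://my-example-bucket/some-key")

# 3. Passing a new custom Boto3 session explicitly.
my_session = boto3.Session(region_name="us-east-2")
wr.s3.does_object_exist("s3://my-example-bucket/some-key", boto3_session=my_session)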

tutorials/003 - Amazon S3.ipynb

Lines changed: 12 additions & 178 deletions
@@ -1180,8 +1180,8 @@
"metadata": {},
"outputs": [],
"source": [
-"begin = datetime.strptime(\"05/06/20 16:30\", \"%d/%m/%y %H:%M\")\n",
-"end = datetime.strptime(\"15/06/21 16:30\", \"%d/%m/%y %H:%M\")\n",
+"begin = datetime.strptime(\"20-07-31 20:30\", \"%y-%m-%d %H:%M\")\n",
+"end = datetime.strptime(\"21-07-31 20:30\", \"%y-%m-%d %H:%M\")\n",
"\n",
"begin_utc = pytz.utc.localize(begin)\n",
"end_utc = pytz.utc.localize(end)"
@@ -1200,198 +1200,32 @@
"metadata": {},
"outputs": [],
"source": [
-"begin = datetime.strptime(\"05/06/20 16:30\", \"%d/%m/%y %H:%M\")\n",
-"end = datetime.strptime(\"10/06/21 16:30\", \"%d/%m/%y %H:%M\")\n",
+"begin = datetime.strptime(\"20-07-31 20:30\", \"%y-%m-%d %H:%M\")\n",
+"end = datetime.strptime(\"21-07-31 20:30\", \"%y-%m-%d %H:%M\")\n",
"\n",
"timezone = pytz.timezone(\"America/Los_Angeles\")\n",
"\n",
"begin_Los_Angeles = timezone.localize(begin)\n",
"end_Los_Angeles = timezone.localize(end)"
]
},
-{
-"cell_type": "code",
-"execution_count": 21,
-"metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"2020-06-05 16:30:00+00:00\n",
-"2021-06-15 16:30:00+00:00\n",
-"2020-06-05 16:30:00-07:00\n",
-"2021-06-10 16:30:00-07:00\n"
-]
-}
-],
-"source": [
-"print(begin_utc)\n",
-"print(end_utc)\n",
-"print(begin_Los_Angeles)\n",
-"print(end_Los_Angeles)"
-]
-},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"### 5.3 Read json with no LastModified filter "
+"### 5.3 Read json using the LastModified filters "
]
},
{
"cell_type": "code",
-"execution_count": 22,
-"metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"# read_fwf\n",
-" id name date\n",
-"0 1 Herfelingen 27-12-18\n",
-"1 2 Lambusart 14-06-18\n",
-"2 3 Spormaggiore 15-04-18\n",
-"3 4 Buizingen 05-09-19\n",
-"4 5 San Rafael 04-09-19\n",
-"\n",
-" read_json\n",
-" id name\n",
-"0 1 foo\n",
-"1 2 boo\n",
-"0 3 bar\n",
-"\n",
-" read_csv\n",
-" id name\n",
-"0 1 foo\n",
-"1 2 boo\n",
-"2 3 bar\n",
-"\n",
-" read_parquet\n"
-]
-},
-{
-"data": {
-"text/html": [
-"<div>\n",
-"<style scoped>\n",
-" .dataframe tbody tr th:only-of-type {\n",
-" vertical-align: middle;\n",
-" }\n",
-"\n",
-" .dataframe tbody tr th {\n",
-" vertical-align: top;\n",
-" }\n",
-"\n",
-" .dataframe thead th {\n",
-" text-align: right;\n",
-" }\n",
-"</style>\n",
-"<table border=\"1\" class=\"dataframe\">\n",
-" <thead>\n",
-" <tr style=\"text-align: right;\">\n",
-" <th></th>\n",
-" <th>id</th>\n",
-" <th>name</th>\n",
-" </tr>\n",
-" </thead>\n",
-" <tbody>\n",
-" <tr>\n",
-" <th>0</th>\n",
-" <td>1</td>\n",
-" <td>foo</td>\n",
-" </tr>\n",
-" <tr>\n",
-" <th>1</th>\n",
-" <td>2</td>\n",
-" <td>boo</td>\n",
-" </tr>\n",
-" <tr>\n",
-" <th>2</th>\n",
-" <td>3</td>\n",
-" <td>bar</td>\n",
-" </tr>\n",
-" </tbody>\n",
-"</table>\n",
-"</div>"
-],
-"text/plain": [
-" id name\n",
-"0 1 foo\n",
-"1 2 boo\n",
-"2 3 bar"
-]
-},
-"execution_count": 22,
-"metadata": {},
-"output_type": "execute_result"
-}
-],
-"source": [
-"print('# read_fwf')\n",
-"print(wr.s3.read_fwf(f\"s3://{bucket}/fwf/\", names=[\"id\", \"name\", \"date\"]))\n",
-"print('\\n read_json')\n",
-"print(wr.s3.read_json(f\"s3://{bucket}/json/\"))\n",
-"print('\\n read_csv')\n",
-"print(wr.s3.read_csv(f\"s3://{bucket}/csv/\"))\n",
-"print('\\n read_parquet')\n",
-"wr.s3.read_parquet(f\"s3://{bucket}/parquet/\")"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"### 5.4 Read json using the LastModified filter "
-]
-},
-{
-"cell_type": "code",
-"execution_count": 23,
+"execution_count": 21,
"metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"# read_fwf\n",
-" id name date\n",
-"0 1 Herfelingen 27-12-18\n",
-"1 2 Lambusart 14-06-18\n",
-"2 3 Spormaggiore 15-04-18\n",
-"3 4 Buizingen 05-09-19\n",
-"4 5 San Rafael 04-09-19\n",
-"\n",
-" read_json\n",
-" id name\n",
-"0 1 foo\n",
-"1 2 boo\n",
-"0 3 bar\n",
-"\n",
-" read_csv\n",
-" id name\n",
-"0 1 foo\n",
-"1 2 boo\n",
-"2 3 bar\n",
-"\n",
-" read_parquet\n",
-" id name\n",
-"0 1 foo\n",
-"1 2 boo\n",
-"2 3 bar\n"
-]
-}
-],
+"outputs": [],
"source": [
-"print('# read_fwf')\n",
-"print(wr.s3.read_fwf(f\"s3://{bucket}/fwf/\", names=[\"id\", \"name\", \"date\"], last_modified_begin=begin_utc, last_modified_end=end_utc))\n",
-"print('\\n read_json')\n",
-"print(wr.s3.read_json(f\"s3://{bucket}/json/\", last_modified_begin=begin_utc, last_modified_end=end_utc))\n",
-"print('\\n read_csv')\n",
-"print(wr.s3.read_csv(f\"s3://{bucket}/csv/\", last_modified_begin=begin_utc, last_modified_end=end_utc))\n",
-"print('\\n read_parquet')\n",
-"print(wr.s3.read_parquet(f\"s3://{bucket}/parquet/\", last_modified_begin=begin_utc, last_modified_end=end_utc))"
+"wr.s3.read_fwf(f\"s3://{bucket}/fwf/\", names=[\"id\", \"name\", \"date\"], last_modified_begin=begin_utc, last_modified_end=end_utc)\n",
+"wr.s3.read_json(f\"s3://{bucket}/json/\", last_modified_begin=begin_utc, last_modified_end=end_utc)\n",
+"wr.s3.read_csv(f\"s3://{bucket}/csv/\", last_modified_begin=begin_utc, last_modified_end=end_utc)\n",
+"wr.s3.read_parquet(f\"s3://{bucket}/parquet/\", last_modified_begin=begin_utc, last_modified_end=end_utc);"
]
},
{
@@ -1403,7 +1237,7 @@
},
{
"cell_type": "code",
-"execution_count": 24,
+"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [

tutorials/004 - Parquet Datasets.ipynb

Lines changed: 11 additions & 11 deletions
@@ -50,7 +50,7 @@
"name": "stdin",
"output_type": "stream",
"text": [
-" ···········································\n"
+" ············\n"
]
}
],
@@ -184,31 +184,31 @@
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
-" <td>3</td>\n",
-" <td>bar</td>\n",
-" <td>2020-01-03</td>\n",
-" </tr>\n",
-" <tr>\n",
-" <th>1</th>\n",
" <td>1</td>\n",
" <td>foo</td>\n",
" <td>2020-01-01</td>\n",
" </tr>\n",
" <tr>\n",
-" <th>2</th>\n",
+" <th>1</th>\n",
" <td>2</td>\n",
" <td>boo</td>\n",
" <td>2020-01-02</td>\n",
" </tr>\n",
+" <tr>\n",
+" <th>2</th>\n",
+" <td>3</td>\n",
+" <td>bar</td>\n",
+" <td>2020-01-03</td>\n",
+" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id value date\n",
-"0 3 bar 2020-01-03\n",
-"1 1 foo 2020-01-01\n",
-"2 2 boo 2020-01-02"
+"0 1 foo 2020-01-01\n",
+"1 2 boo 2020-01-02\n",
+"2 3 bar 2020-01-03"
]
},
"execution_count": 4,

tutorials/005 - Glue Catalog.ipynb

Lines changed: 4 additions & 8 deletions
@@ -39,7 +39,7 @@
"name": "stdin",
"output_type": "stream",
"text": [
-" ···········································\n"
+" ············\n"
]
}
],
@@ -197,9 +197,7 @@
"text": [
" Database Description\n",
"0 aws_data_wrangler AWS Data Wrangler Test Arena - Glue Database\n",
-"1 aws_dataframes AWS DataFrames Test Arena - Glue Database\n",
-"2 covid-19 \n",
-"3 default Default Hive database\n"
+"1 default Default Hive database\n"
]
}
],
@@ -226,10 +224,8 @@
"text": [
" Database Description\n",
"0 aws_data_wrangler AWS Data Wrangler Test Arena - Glue Database\n",
-"1 aws_dataframes AWS DataFrames Test Arena - Glue Database\n",
-"2 awswrangler_test \n",
-"3 covid-19 \n",
-"4 default Default Hive database\n"
+"1 awswrangler_test \n",
+"2 default Default Hive database\n"
]
}
],
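
Both trimmed outputs are the DataFrame returned when listing Glue Catalog databases, before and after a test database is created. A small sketch of those calls, assuming the catalog helpers behave as documented for awswrangler 1.x:

import awswrangler as wr

# DataFrame with "Database" / "Description" columns, as shown in the output above.
print(wr.catalog.databases())

# Creating a database adds the "awswrangler_test" row seen in the second listing.
wr.catalog.create_database(name="awswrangler_test")
print(wr.catalog.databases())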

tutorials/007 - Redshift, MySQL, PostgreSQL.ipynb

Lines changed: 0 additions & 7 deletions
@@ -168,13 +168,6 @@
"wr.db.read_sql_query(\"SELECT * FROM test.tutorial\", con=eng_mysql) # MySQL\n",
"wr.db.read_sql_query(\"SELECT * FROM public.tutorial\", con=eng_redshift) # Redshift"
]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
}
],
"metadata": {
