Commit f100749

Tuning the default_block_size for s3fs
1 parent e6a57b9 commit f100749

4 files changed: 9 additions and 5 deletions

README.md

Lines changed: 4 additions & 0 deletions
@@ -34,6 +34,10 @@
   - [3 - Amazon S3](https://github.com/awslabs/aws-data-wrangler/blob/dev-1.0.0/tutorials/3%20-%20Amazon%20S3.ipynb)
   - [4 - Parquet Datasets](https://github.com/awslabs/aws-data-wrangler/blob/dev-1.0.0/tutorials/4%20-%20Parquet%20Datasets.ipynb)
   - [5 - Glue Catalog](https://github.com/awslabs/aws-data-wrangler/blob/dev-1.0.0/tutorials/5%20-%20Glue%20Catalog.ipynb)
+  - [6 - Amazon Athena](https://github.com/awslabs/aws-data-wrangler/blob/dev-1.0.0/tutorials/6%20-%20Amazon%20Athena.ipynb)
+  - [7 - Databases (Redshift, MySQL and PostgreSQL)](https://github.com/awslabs/aws-data-wrangler/blob/dev-1.0.0/tutorials/7%20-%20Redshift%2C%20MySQL%2C%20PostgreSQL.ipynb)
+  - [8 - Redshift Copy & Unload.ipynb](https://github.com/awslabs/aws-data-wrangler/blob/dev-1.0.0/tutorials/8%20-%20Redshift%20Copy%20%26%20Unload.ipynb)
+  - [9 - Parquet Crawler.ipynb](https://github.com/awslabs/aws-data-wrangler/blob/dev-1.0.0/tutorials/9%20-%20Parquet%20Crawler.ipynb)
 - [**API Reference**](https://aws-data-wrangler.readthedocs.io/en/dev-1.0.0/api.html)
   - [Amazon S3](https://aws-data-wrangler.readthedocs.io/en/dev-1.0.0/api.html#amazon-s3)
   - [AWS Glue Catalog](https://aws-data-wrangler.readthedocs.io/en/dev-1.0.0/api.html#aws-glue-catalog)

awswrangler/_utils.py

Lines changed: 1 addition & 1 deletion
@@ -133,7 +133,7 @@ def get_fs(
         use_ssl=True,
         default_cache_type="none",
         default_fill_cache=False,
-        default_block_size=52_428_800,  # 50 MB (50 * 2**20)
+        default_block_size=134_217_728,  # 128 MB (128 * 2**20)
         config_kwargs={"retries": {"mode": "adaptive", "max_attempts": 10}},
         session=ensure_session(session=session)._session,  # pylint: disable=protected-access
         s3_additional_kwargs=s3_additional_kwargs,
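
The two `default_block_size` literals in this hunk can be checked with a few lines of arithmetic. The sketch below is illustrative only: `blocks_needed` is a hypothetical helper, not part of awswrangler or s3fs, and it merely counts how many block-sized chunks cover an object of a given size. It makes no claim about how s3fs actually schedules its requests.

```python
MIB = 2**20  # bytes in one mebibyte

OLD_BLOCK_SIZE = 50 * MIB   # value before this commit
NEW_BLOCK_SIZE = 128 * MIB  # value after this commit


def blocks_needed(object_size: int, block_size: int) -> int:
    """Return how many block_size chunks cover object_size bytes (hypothetical helper)."""
    if object_size < 0 or block_size <= 0:
        raise ValueError("object_size must be >= 0 and block_size must be > 0")
    # Ceiling division without floating point.
    return -(-object_size // block_size)


# Example: a 1 GiB object split into blocks of each size.
one_gib = 1024 * MIB
old_blocks = blocks_needed(one_gib, OLD_BLOCK_SIZE)
new_blocks = blocks_needed(one_gib, NEW_BLOCK_SIZE)
print(OLD_BLOCK_SIZE, NEW_BLOCK_SIZE, old_blocks, new_blocks)
```

Under these assumptions, the larger block size covers the same object with fewer, larger chunks; whether that is faster in practice depends on the workload and would need measuring.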

awswrangler/athena.py

Lines changed: 3 additions & 3 deletions
@@ -340,15 +340,15 @@ def read_sql_query( # pylint: disable=too-many-branches,too-many-locals
 
     1 - `ctas_approach=True` (`Default`):
     Wrap the query with a CTAS and then reads the table data as parquet directly from s3.
-    PROS: Faster and can handle some level of nested types
+    PROS: Faster and can handle some level of nested types.
     CONS: Requires create/delete table permissions on Glue and Does not support timestamp with time zone
     (A temporary table will be created and then deleted immediately).
 
     2 - `ctas_approach False`:
     Does a regular query on Athena and parse the regular CSV result on s3.
-    PROS: Does not require create/delete table permissions on Glue and give support timestamp with time zone.
+    PROS: Does not require create/delete table permissions on Glue and supports timestamp with time zone.
     CONS: Slower (But stills faster than other libraries that uses the regular Athena API)
-    and does not handle nested types at all
+    and does not handle nested types at all.
 
     Note
     ----
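
The trade-offs listed in this docstring can be read as a small decision rule. The sketch below is one possible reading, not code from awswrangler: `choose_ctas_approach` and its three parameters are hypothetical names introduced here for illustration, and the rule encodes only the PROS/CONS stated in the docstring.

```python
def choose_ctas_approach(
    needs_timestamp_tz: bool,
    has_nested_types: bool,
    can_manage_glue_tables: bool,
) -> bool:
    """Suggest a value for ctas_approach (hypothetical helper, not library code)."""
    # CTAS needs create/delete table permissions on Glue and, per the
    # docstring, does not support timestamp with time zone.
    ctas_usable = can_manage_glue_tables and not needs_timestamp_tz
    # The regular CSV path does not handle nested types at all, so there is
    # no fallback when nested types are required and CTAS is unavailable.
    if has_nested_types and not ctas_usable:
        raise ValueError("nested types need the CTAS approach, which is not usable here")
    # Prefer CTAS whenever it is usable, since the docstring calls it faster.
    return ctas_usable


print(choose_ctas_approach(False, True, True))
print(choose_ctas_approach(True, False, True))
```

The returned flag would then be passed as the `ctas_approach` argument of `read_sql_query`.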

requirements-dev.txt

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ pytest~=5.4.1
 pytest-cov~=2.8.1
 pytest-xdist~=1.31.0
 scikit-learn~=0.22.1
-awscli~=1.18.37
+awscli~=1.18.39
 cfn-lint~=0.29.4
 twine~=3.1.1
 wheel~=0.34.2
