Commit 68e7ec1

Updating branch
1 parent f11ff15 commit 68e7ec1

File tree

4 files changed: +55 -1 lines changed


README.md

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@
 - [AWS Glue Wheel](https://aws-data-wrangler.readthedocs.io/en/dev-1.0.0/install.html#aws-glue-wheel)
 - [Amazon SageMaker Notebook](https://aws-data-wrangler.readthedocs.io/en/dev-1.0.0/install.html#amazon-sagemaker-notebook)
 - [Amazon SageMaker Notebook Lifecycle](https://aws-data-wrangler.readthedocs.io/en/dev-1.0.0/install.html#amazon-sagemaker-notebook-lifecycle)
+- [EMR](https://aws-data-wrangler.readthedocs.io/en/dev-1.0.0/install.html#emr)
+- [From source](https://aws-data-wrangler.readthedocs.io/en/dev-1.0.0/install.html#from-source)
 - [**Tutorials**](https://github.com/awslabs/aws-data-wrangler/tree/dev-1.0.0/tutorials)
 - [**API Reference**](https://aws-data-wrangler.readthedocs.io/en/dev-1.0.0/api.html)
 - [Amazon S3](https://aws-data-wrangler.readthedocs.io/en/dev-1.0.0/api.html#amazon-s3)

awswrangler/s3.py

Lines changed: 2 additions & 1 deletion
@@ -1198,10 +1198,11 @@ def _read_parquet_init(
     """Encapsulate all initialization before the use of the pyarrow.parquet.ParquetDataset."""
     if dataset is False:
         path_or_paths: Union[str, List[str]] = _path2list(path=path, boto3_session=boto3_session)
+    elif isinstance(path, str):
+        path_or_paths = path[:-1] if path.endswith("/") else path
     else:
         path_or_paths = path
     _logger.debug(f"path_or_paths: {path_or_paths}")
-    print(f"path_or_paths: {path_or_paths}")
     fs: s3fs.S3FileSystem = _utils.get_fs(session=boto3_session, s3_additional_kwargs=s3_additional_kwargs)
     cpus: int = _utils.ensure_cpu_count(use_threads=use_threads)
     data: pyarrow.parquet.ParquetDataset = pyarrow.parquet.ParquetDataset(
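The new `elif` branch trims a trailing `/` from a single string path before it reaches `pyarrow.parquet.ParquetDataset`, presumably so the dataset prefix resolves cleanly, and the leftover debug `print` is dropped in favor of the existing `_logger.debug` call. A minimal standalone sketch of the same normalization; the function name here is illustrative, not part of the library:

    from typing import List, Union

    def _normalize_parquet_path(path: Union[str, List[str]]) -> Union[str, List[str]]:
        """Drop a trailing '/' from a single path; lists of explicit paths pass through."""
        if isinstance(path, str):
            return path[:-1] if path.endswith("/") else path
        return path

    assert _normalize_parquet_path("s3://bucket/dataset/") == "s3://bucket/dataset"
    assert _normalize_parquet_path(["s3://bucket/a.parquet"]) == ["s3://bucket/a.parquet"]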

docs/source/install.rst

Lines changed: 49 additions & 0 deletions
@@ -5,6 +5,10 @@ Install
 and on several platforms (AWS Lambda, AWS Glue Python Shell, EMR, EC2,
 on-premises, Amazon SageMaker, local, etc).
 
+Some good practices for most of the methods below are:
+
+- Use a new, dedicated virtual environment for each project (`venv <https://docs.python.org/3/library/venv.html>`_).
+- In notebooks, always restart your kernel after installing.
+
 PyPI (pip)
 ----------
 
@@ -86,3 +90,48 @@ SageMaker kernels (`Reference <https://github.com/aws-samples/amazon-sagemaker-n
     done
 
     EOF
+
+EMR
+---
+
+Even though it is not a distributed library, AWS Data Wrangler can be a good
+helper for complementing Big Data pipelines.
+
+- Configure Python 3 as the default interpreter for PySpark under your cluster
+  configuration:
+
+.. code-block:: json
+
+    [
+        {
+            "Classification": "spark-env",
+            "Configurations": [
+                {
+                    "Classification": "export",
+                    "Properties": {
+                        "PYSPARK_PYTHON": "/usr/bin/python3"
+                    }
+                }
+            ]
+        }
+    ]
+
+- Keep the bootstrap script below on S3 and reference it in your cluster
+  configuration:
+
+.. code-block:: sh
+
+    #!/usr/bin/env bash
+    set -ex
+
+    sudo pip-3.6 install awswrangler
+
+.. note:: Make sure to pin the Wrangler version in the bootstrap script for
+    production environments (e.g. ``awswrangler==1.0.0``).
+
+From Source
+-----------
+
+>>> git clone https://github.com/awslabs/aws-data-wrangler.git
+>>> cd aws-data-wrangler
+>>> pip install .
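For context on where the two snippets above plug in: the ``spark-env`` classification belongs in the cluster's ``Configurations``, and the bootstrap script is referenced as a bootstrap action at cluster creation. A minimal boto3 sketch under assumed values; the cluster name, bucket, subnet, and instance type below are placeholders, not taken from this commit:

    import boto3

    emr = boto3.client("emr")

    response = emr.run_job_flow(
        Name="wrangler-cluster",  # placeholder name
        ReleaseLabel="emr-5.28.0",
        Applications=[{"Name": "Spark"}],
        # Same spark-env classification shown in the docs above.
        Configurations=[
            {
                "Classification": "spark-env",
                "Configurations": [
                    {
                        "Classification": "export",
                        "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"},
                    }
                ],
            }
        ],
        # Points at the bootstrap script kept on S3 (placeholder path).
        BootstrapActions=[
            {
                "Name": "install-awswrangler",
                "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap.sh"},
            }
        ],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1}
            ],
            "Ec2SubnetId": "subnet-0123456789abcdef0",  # placeholder subnet
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])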

testing/test_awswrangler/test_emr.py

Lines changed: 2 additions & 0 deletions
@@ -83,6 +83,7 @@ def test_cluster(bucket, cloudformation_outputs):
     step_state = wr.emr.get_step_state(cluster_id=cluster_id, step_id=step_id)
     assert step_state == "PENDING"
     wr.emr.terminate_cluster(cluster_id=cluster_id)
+    wr.s3.delete_objects(f"s3://{bucket}/emr-logs/")
 
 
 def test_cluster_single_node(bucket, cloudformation_outputs):
@@ -144,3 +145,4 @@ def test_cluster_single_node(bucket, cloudformation_outputs):
     steps.append(wr.emr.build_step(name=cmd, command=cmd))
     wr.emr.submit_steps(cluster_id=cluster_id, steps=steps)
     wr.emr.terminate_cluster(cluster_id=cluster_id)
+    wr.s3.delete_objects(f"s3://{bucket}/emr-logs/")
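Both tests now delete the EMR logs they leave under ``s3://{bucket}/emr-logs/`` after terminating the cluster. If this teardown keeps recurring, a pytest fixture could centralize it; a sketch assuming the same ``wr`` API used in the diff, with a hypothetical fixture name:

    import pytest
    import awswrangler as wr

    @pytest.fixture
    def emr_log_path(bucket):
        """Hand the test its log prefix, then clean up whatever landed there."""
        path = f"s3://{bucket}/emr-logs/"
        yield path
        wr.s3.delete_objects(path)  # same cleanup call the tests add inline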
