 Install
 =======

-**AWS Data Wrangler** runs with Python ``3.6``, ``3.7`` and ``3.8``
+**AWS Data Wrangler** runs with Python ``3.6``, ``3.7``, ``3.8`` and ``3.9``
 and on several platforms (AWS Lambda, AWS Glue Python Shell, EMR, EC2,
 on-premises, Amazon SageMaker, local, etc).

@@ -57,10 +57,13 @@ AWS Glue PySpark Jobs
 Go to your Glue PySpark job and create a new *Job parameters* key/value:

 * Key: ``--additional-python-modules``
-* Value: ``awswrangler==2.3.0``
+* Value: ``pyarrow==2,awswrangler``

-P.S. By now AWS Glue PySpark Jobs does not support PyArrow +3.0.0.
-Please use awswrangler==2.3.0 that uses PyArrow 2.0.0 to overcome this limitation.
+To install a specific version, set the value for above Job parameter as follows:
+
+* Value: ``pyarrow==2,awswrangler==2.4.0``
+
+.. note:: Pyarrow 3 is not currently supported in Glue PySpark Jobs, which is why a previous installation of pyarrow 2 is required.

 `Official Glue PySpark Reference <https://docs.aws.amazon.com/glue/latest/dg/reduced-start-times-spark-etl-jobs.html#reduced-start-times-new-features>`_

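Note on the hunk above: the new Job parameter value is a comma-separated list with the ``pyarrow==2`` pin first, followed by ``awswrangler`` with an optional version pin. A minimal sketch of how that string could be assembled is shown here; the function name ``additional_python_modules`` and its parameters are hypothetical and not part of awswrangler or Glue.

```python
def additional_python_modules(wrangler_version=None, pyarrow_pin="2"):
    """Build the value for the --additional-python-modules Job parameter.

    Hypothetical helper for illustration only. The pyarrow pin comes first,
    matching the order used in the documented value.
    """
    if wrangler_version is None:
        wrangler = "awswrangler"
    else:
        wrangler = f"awswrangler=={wrangler_version}"
    return f"pyarrow=={pyarrow_pin},{wrangler}"


if __name__ == "__main__":
    print(additional_python_modules())
    print(additional_python_modules("2.4.0"))
```

The two calls reproduce the unpinned and pinned values that the hunk adds to the documentation.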
@@ -115,7 +118,7 @@ AWS Data Wrangler could be a good helper to
 complement Big Data pipelines.

 - Configure Python 3 as the default interpreter for
-  PySpark under your cluster configuration
+  PySpark on your cluster configuration [ONLY REQUIRED FOR EMR < 6]

     .. code-block:: json

@@ -135,15 +138,28 @@ complement Big Data pipelines.

 - Keep the bootstrap script above on S3 and reference it on your cluster.

+    - For EMR Release < 6
+
     .. code-block:: sh

         #!/usr/bin/env bash
         set -ex

-        sudo pip-3.6 install awswrangler
+        sudo pip-3.6 install pyarrow==2 awswrangler
+
+    - For EMR Release >= 6
+
+    .. code-block:: sh
+
+        #!/usr/bin/env bash
+        set -ex
+
+        sudo pip install pyarrow==2 awswrangler

 .. note:: Make sure to freeze the Wrangler version in the bootstrap for productive
-          environments (e.g. awswrangler==1.8.1)
+          environments (e.g. awswrangler==2.4.0)
+
+.. note:: Pyarrow 3 is not currently supported in the default EMR image, which is why a previous installation of pyarrow 2 is required.

 From Source
 -----------
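
Note on the EMR bootstrap hunk: the only difference between the two scripts is the pip executable (``pip-3.6`` for EMR releases below 6, ``pip`` for 6 and above). The sketch below generates the matching script text from a release label. It assumes labels of the form ``emr-5.32.0``; the function name ``bootstrap_script`` is hypothetical and not part of awswrangler or EMR.

```python
def bootstrap_script(release_label, wrangler_version=None):
    """Return bootstrap script text for an EMR release label.

    Hypothetical helper for illustration only. Assumes labels shaped like
    "emr-5.32.0", where the number after "emr-" is the major release.
    """
    major = int(release_label.split("-", 1)[1].split(".", 1)[0])
    # EMR < 6 uses pip-3.6; EMR >= 6 uses plain pip, as in the two scripts.
    pip = "pip-3.6" if major < 6 else "pip"
    if wrangler_version is None:
        wrangler = "awswrangler"
    else:
        wrangler = f"awswrangler=={wrangler_version}"
    lines = [
        "#!/usr/bin/env bash",
        "set -ex",
        "",
        f"sudo {pip} install pyarrow==2 {wrangler}",
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(bootstrap_script("emr-5.32.0"))
    print(bootstrap_script("emr-6.2.0", "2.4.0"))
```

Passing a version pins awswrangler in the generated script, in line with the note about freezing the Wrangler version in production environments.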