
Commit b94c13d

Merge pull request #3004 from segmentio/2949-content-create-aws-glue-etl-job
Resolving issue #2949: Create AWS Glue ETL job
2 parents 259b320 + 23109da commit b94c13d

File tree

1 file changed: +26 -38 lines changed
  • src/connections/destinations/catalog/amazon-personalize

src/connections/destinations/catalog/amazon-personalize/index.md

Lines changed: 26 additions & 38 deletions
@@ -17,16 +17,16 @@ Developing the machine-learning capabilities necessary to produce these recommen

These are the pre-requisites you need before getting started:

-1. Segment data flowing into an S3 destination OR a warehouse
+1. Segment data flowing into an S3 destination, a Snowflake warehouse, or Amazon Redshift warehouse.
2. You have the ability to create AWS Glue jobs (only required if using S3 to [train your model](#train-your-model))
3. You have the ability to deploy Lambda functions in Amazon Web Services
4. You have access to AWS Personalize

-If you don't have S3, Redshift warehouse, or Snowflake warehouse configured, you can read more about setting up [S3](/docs/connections/storage/catalog/amazon-s3/), [Redshift](/docs/connections/storage/catalog/redshift/), and [Snowflake](/docs/connections/storage/catalog/snowflake/).
+If you don't have S3, Redshift warehouse, or Snowflake warehouse configured, you can read more about setting up [S3](/docs/connections/storage/catalog/aws-s3/), [Redshift](/docs/connections/storage/catalog/redshift/), and [Snowflake](/docs/connections/storage/catalog/snowflake/).

***If you're a Segment business tier customer, contact your Success contact to initiate a replay to S3 or your Warehouse.***

-There are 3 main parts to using Amazon Personalize with Segment:
+There are three main parts to using Amazon Personalize with Segment:

1. [Train your model](/docs/connections/destinations/catalog/amazon-personalize/#train-your-model) on historical data in S3 or a Warehouse.
2. [Create a Personalize Dataset Group](/docs/connections/destinations/catalog/amazon-personalize/#create-personalize-dataset-group-solution-and-campaign) and Campaign
@@ -135,7 +135,7 @@ DELIMITER AS ','
PARALLEL OFF;
```

-Note: Use `date_part(epoch,"timestamp") as TIMESTAMP` because Personalize requires timestamps to be specified in UNIX/epoch time.
+**Note:** Use `date_part(epoch,"timestamp") as TIMESTAMP` because Personalize requires timestamps to be specified in UNIX/epoch time.

**Verify the Output file**
Browse to the S3 service page in the AWS console and navigate to the bucket path specified in the `unload` command. You should see the output file.
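For reference, the same check can be scripted: a minimal boto3 sketch that lists the unloaded objects, with the bucket name and prefix as placeholders rather than values taken from this guide.

```python
import boto3

# Placeholder bucket and prefix; substitute the path used in your `unload` command.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="my-segment-data", Prefix="unload/interactions_")

# Print each unloaded object and its size to confirm the export succeeded.
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```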
@@ -201,15 +201,7 @@ The following examples show how to configure an AWS Glue job to convert Segment
**Create AWS Glue ETL Job**

To create an AWS Glue ETL Job:
-1. Navigate to the Glue service in your AWS console.
-2. Click **Get started** and then click **Jobs** in the left navigation on the Glue console page.
-
-![](images/GlueJobs.png)
-
-
-3. Click **Add job**.
-4. Enter a job name such as "SegmentEventsJsonToCsv".
-5. For IAM role, create a role and execution policies that gives your Glue job the ability to write to your S3 bucket. For example:
+1. Create a new AWS service IAM role using the following execution policies. These policies give your Glue job the ability to write to your S3 bucket:
* Policy 1:

```json
@@ -326,13 +318,10 @@ To create an AWS Glue ETL Job:
]
}
```
-
-6. Leave Type as **Spark**.
-7. For **This job runs**, select **A new script to be authored by you**.
-8. Leave everything else the same and click **Next** at the bottom of the form.
-9. On the **Connections** step, click **Save job and edit script** since you won't access data in a database for this job.
-
-10. The source code for a generic Glue job is below. Modify this code to reflect the names of the events you wish to extract from the Segment logs (see line #25). Copy the code example to your clipboard and paste it into the Glue editor window.
+1. Navigate to the Glue service in your AWS console.
+2. Click **Get started** and then select **Jobs** from the left navigation on the Glue console page.
+3. Select **Spark script editor** and click **Create**.
+4. The following code sample is the source code for a generic Glue job. Copy the code example to your clipboard and paste it into the Glue editor window, modifying as necessary to reflect the names of the events you wish to extract from the Segment logs (see line #25).

```python
import sys
@@ -354,7 +343,7 @@ To create an AWS Glue ETL Job:
job.init(args['JOB_NAME'], args)

# Load JSON into dynamic frame
-datasource0 = glueContext.create_dynamic_frame.from_options('s3', {'paths': [args['S3_JSON_INPUT_PATH']]}, 'json')
+datasource0 = glueContext.create_dynamic_frame.from_options('s3', {'paths': [args['S3_JSON_INPUT_PATH']], 'recurse': True}, 'json')
print("Input file: ", args['S3_JSON_INPUT_PATH'])
print("Input file total record count: ", datasource0.count())
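For context on where `S3_CSV_OUTPUT_PATH` is used later in the script, a hedged sketch of the typical final step, coalescing the transformed frame to a single CSV part and writing it back to S3; the frame name `interactions` is assumed, not taken from the script above.

```python
from awsglue.dynamicframe import DynamicFrame

# Assumes `interactions` is the filtered/mapped DynamicFrame produced earlier in the
# script, and that `glueContext` and `args` come from the job boilerplate shown above.
single_part = DynamicFrame.fromDF(interactions.toDF().repartition(1), glueContext, "single_part")

# Write one CSV object to the output path passed in as a job parameter.
glueContext.write_dynamic_frame.from_options(
    frame=single_part,
    connection_type="s3",
    connection_options={"path": args["S3_CSV_OUTPUT_PATH"]},
    format="csv",
)
```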
@@ -399,13 +388,12 @@ To create an AWS Glue ETL Job:

job.commit()
```
-
-11. Click **Save** to save the job script.
-
-![](images/GlueEditJobScript.png)
-
-
-To review key parts of the script in more detail:
+5. Select the **Job details** tab.
+6. Enter a name for your Glue job.
+7. Leave Type as **Spark**.
+8. Make any optional changes on the Job details page, and click **Save** to save the job script.
+
+To review key parts of the Python script in more detail:

1. The script is initialized with a few job parameters. You'll see how to specify these parameter values when the job below runs. For now, see that Segment is passing in the location of the raw JSON files using `S3_JSON_INPUT_PATH` and the location where the output CSV should be written through `S3_CSV_OUTPUT_PATH`.

```python
@@ -425,7 +413,7 @@ To review key parts of the script in more detail:
3. The first step in Segment's Job is to load the raw JSON file as a Glue DynamicFrame.

```python
-datasource0 = glueContext.create_dynamic_frame.from_options('s3', {'paths': [args['S3_JSON_INPUT_PATH']]}, 'json')
+datasource0 = glueContext.create_dynamic_frame.from_options('s3', {'paths': [args['S3_JSON_INPUT_PATH']], 'recurse': True}, 'json')
```

4. Since not all events that are written to S3 by Segment are relevant to training a Personalize model, Segment uses Glue's `Filter` transformation to keep the records needed.
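As an illustration of that filtering step, a minimal sketch using Glue's `Filter` transform; the event names here are placeholders, not the ones used in the documented script.

```python
from awsglue.transforms import Filter

# Keep only the track events that are useful as Personalize interactions.
# "Product Viewed" and "Order Completed" are example event names; substitute your own.
supported_events = ["Product Viewed", "Order Completed"]

interactions = Filter.apply(
    frame=datasource0,
    f=lambda record: record["event"] in supported_events,
)
print("Filtered record count: ", interactions.count())
```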
@@ -489,7 +477,7 @@ With Segment's ETL Job script created and saved, it's time to run the job to cre


4. Scroll down to the **Job parameters** section. This is where Segment will specify the job parameters that Segment's script expects for the path to the input data and the path to the output file.
-5. Create 2 job parameters with the following key and value.
+5. Create two job parameters with the following key and value.
* Be sure to prefix each key with `--` as shown. Substitute your account ID for `[ACCOUNT_ID]` in the values below. You copy the bucket name to your clipboard from the S3 service page in the tab/window you opened above. The order they are specified does not matter.

| **Key** | **Value** |
@@ -584,7 +572,7 @@ To create a personalize dataset group:

10. Click **Next** to save the schema and move to the next step.

-11. The **Import user-item interaction data** step is displayed next. To complete this form Segment needs to get 2 pieces of information from IAM and S3. Give your import job a name and set the automatic import to **Off**.
+11. The **Import user-item interaction data** step is displayed next. To complete this form Segment needs to get two pieces of information from IAM and S3. Give your import job a name and set the automatic import to **Off**.

12. For the **IAM service role**, select **Create a new role** from the dropdown.
13. In the next pop-up, Segment recommends listing your bucket name in the **Specific S3 buckets** option, but you're free to choose the option that best suits your needs.
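If you prefer to script this step instead of using the console, the same import can be started through the Personalize API; a minimal sketch with placeholder ARNs and an example S3 path, not values from this guide.

```python
import boto3

personalize = boto3.client("personalize")

# Placeholder ARNs and S3 location; use the dataset, role, and bucket created in the steps above.
response = personalize.create_dataset_import_job(
    jobName="segment-interactions-import",
    datasetArn="arn:aws:personalize:us-east-1:123456789012:dataset/segment-dataset-group/INTERACTIONS",
    dataSource={"dataLocation": "s3://my-segment-data/interactions.csv"},
    roleArn="arn:aws:iam::123456789012:role/PersonalizeS3Role",
)
print(response["datasetImportJobArn"])
```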
@@ -601,9 +589,9 @@ Be patient as this process can take a long time to complete.

### Create Personalize Solution

-Once Segment's event CSV is finished importing into a user-item interaction dataset, Segment can create a Personalize Solution. To do thi:
+Once Segment's event CSV is finished importing into a user-item interaction dataset, Segment can create a Personalize Solution. To do this:

-1. From the Dashboard page for the dataset group we created above, click **Start** in the **Create solutions** column.
+1. From the Dashboard page for the dataset group created above, click **Start** in the **Create solutions** column.

![](images/PersonalizeCreateSolution.png)
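The same solution can be created programmatically; a minimal sketch, with the dataset group ARN as a placeholder and the user-personalization recipe chosen only as an example.

```python
import boto3

personalize = boto3.client("personalize")

# Placeholder dataset group ARN; pick the recipe that matches your use case.
solution = personalize.create_solution(
    name="segment-user-personalization",
    datasetGroupArn="arn:aws:personalize:us-east-1:123456789012:dataset-group/segment-dataset-group",
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
)

# Training happens per solution version; this kicks off the first one.
version = personalize.create_solution_version(solutionArn=solution["solutionArn"])
print(version["solutionVersionArn"])
```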
@@ -621,7 +609,7 @@ Once Segment's event CSV is finished importing into a user-item interaction data

A deployed solution is known as a campaign, and is able to make recommendations for your users. To deploy a solution, you create a campaign in the console or by calling the CreateCampaign API. You can choose which version of the solution to use. By default, a campaign uses the latest version of a solution.

-To create a Personlize campaign:
+To create a Personalize campaign:

1. From the Dataset Group Dashboard, click **Create new campaign**.
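Since the text above mentions the CreateCampaign API, here is a minimal boto3 sketch with a placeholder solution version ARN.

```python
import boto3

personalize = boto3.client("personalize")

# Placeholder solution version ARN; minProvisionedTPS sets the baseline capacity you pay for.
campaign = personalize.create_campaign(
    name="segment-recommendations",
    solutionVersionArn="arn:aws:personalize:us-east-1:123456789012:solution/segment-user-personalization/1",
    minProvisionedTPS=1,
)
print(campaign["campaignArn"])
```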
@@ -694,7 +682,7 @@ To create an IAM role:
> note ""
> **NOTE:** Your Source ID can be found by navigating to **Settings > API Keys** from your Segment source homepage.
>
-> For security purposes, Segment will set your Workspace ID as your External ID. If you are currently using an External ID different from your Workspace ID, reach out to our support team so they can change it and make your account more secure.
+> For security purposes, Segment will set your Workspace ID as your External ID. If you are currently using an External ID different from your Workspace ID, reach out to Segment support so they can change it and make your account more secure.

```json
{
@@ -811,7 +799,7 @@ To install Segment's Layer:

**Update your IAM role for your Lambda to call Personalize**

-You need to modify the IAM Role & Policy originally created with this Lambda to allow it to send and recieve data from Personalize. To do this:
+You need to modify the IAM Role & Policy originally created with this Lambda to allow it to send and receive data from Personalize. To do this:

1. From the **Execution role** section of your Lambda function, click the **View the <your-role-name>** link.

@@ -972,7 +960,7 @@ You need to create a Personalize Event Tracker for the Dataset Group you created
![](images/PersonalizeCampaignArn.png)


-12. Return to our Lambda function and scroll down to the **Environment variables** panel.
+12. Return to your Lambda function and scroll down to the **Environment variables** panel.

13. Add an environment variable with the key `personalize_campaign_arn` and value of the Campaign ARN in your clipboard.
14. Scroll to the top of the page and click **Save** to save your changes.
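To show how a Lambda typically reads that environment variable, a minimal handler sketch; the event shape and recommendation call are illustrative and not Segment's published Lambda code.

```python
import os
import boto3

personalize_runtime = boto3.client("personalize-runtime")

def handler(event, context):
    # The campaign ARN comes from the environment variable configured in step 13.
    campaign_arn = os.environ["personalize_campaign_arn"]

    # "userId" is a placeholder; in practice it comes from the incoming Segment event payload.
    response = personalize_runtime.get_recommendations(
        campaignArn=campaign_arn,
        userId=event.get("userId", "anonymous"),
        numResults=10,
    )
    return [item["itemId"] for item in response["itemList"]]
```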
@@ -1008,7 +996,7 @@ Segment allows you to send each call type to a different Lambda. If you leave th

There are two settings relevant for track calls:

-1. Lambda for track calls - the Lambda where we should route track calls.
+1. Lambda for track calls - the Lambda where the Segment app should route track calls.
2. Events - a list of specific events to send. You may send *all* track events (see setting details for instructions on how), but use caution with this option, as it may significantly increase your Lambda costs.

@@ -1020,4 +1008,4 @@ This setting controls the [Log Type](https://docs.aws.amazon.com/lambda/latest/d

**My Lambda <> Segment connection is timing out, what do I do?**

-Due to how our event delivery system, [Centrifuge](https://segment.com/blog/introducing-centrifuge/), works, your Lambda can't take more than 5 seconds to run per message. If you're consistently running into timeout issues, you should consult the [AWS Lambda docs](https://docs.aws.amazon.com/lambda/index.html#lang/en_us), as well as docs for your language of choice, for tips on optimizing performance.
+Due to how Segment's event delivery system, [Centrifuge](https://segment.com/blog/introducing-centrifuge/), works, your Lambda can't take more than five seconds to run per message. If you're consistently running into timeout issues, you should consult the [AWS Lambda docs](https://docs.aws.amazon.com/lambda/index.html#lang/en_us), as well as docs for your language of choice, for tips on optimizing performance.
