This repository was archived by the owner on Dec 18, 2024. It is now read-only.

Commit dd1ec00

Tech report pipeline - add flex template (#251)

* added pipeline runner run_tech_report.py
* added flex template metadata flex_template_metadata_tech_report.json
* added tech_report to flex template build script build_flex_template.sh
* added tech_report to flex template run script run_flex_template.sh
* added .venv to .gitignore
* updated README with minimal information

1 parent de42ed9

6 files changed: +75 -4 lines

.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -4,6 +4,7 @@
 /build
 /env
 /venv
+/.venv
 *.pyc
 /.vscode
 .coverage
```

README.md

Lines changed: 5 additions & 0 deletions

````diff
@@ -37,11 +37,15 @@ The new HTTP Archive data pipeline built entirely on GCP
 
 This repo handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves them to the `httparchive` dataset in BigQuery.
 
+A secondary pipeline is responsible for populating the Technology Report Firestore collections.
+
 There are currently two main pipelines:
 
 - The `all` pipeline, which saves data to the new `httparchive.all` dataset
 - The `combined` pipeline, which saves data to the legacy tables. This processes both the `summary` tables (`summary_pages` and `summary_requests`) and the `non-summary` pipeline (`pages`, `requests`, `response_bodies`, etc.)
 
+The secondary `tech_report` pipeline saves data to a Firestore database (e.g. `tech-report-apis-prod`) across various collections ([see `TECHNOLOGY_QUERIES` in constants.py](modules/constants.py))
+
 The pipelines are run in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion, based on the code in the `main` branch, which is deployed to GCP on each merge.
 
 The [`data-pipeline` workflow](https://console.cloud.google.com/workflows/workflow/us-west1/data-pipeline/executions?project=httparchive), as defined by the [data-pipeline-workflows.yaml](./data-pipeline-workflows.yaml) file, runs the whole process from start to finish, including generating the manifest file for each of the two runs (desktop and mobile) and then starting the four Dataflow jobs (desktop all, mobile all, desktop combined, mobile combined) in sequence to upload the HAR files to the BigQuery tables. This can be rerun in case of failure by [publishing a crawl-complete message](#publishing-a-pubsub-message), provided no data was saved to the final BigQuery tables.
@@ -156,6 +160,7 @@ This method is best used when developing locally, as a convenience for running t
 # run the pipeline using a flex template
 ./run_flex_template all [...]
 ./run_flex_template combined [...]
+./run_flex_template tech_report [...]
 ```
 
 ### Running a flex template from the Cloud Console
````
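As background for the README addition above: the `tech_report` pipeline reads query results and upserts them into Firestore collections. The sketch below is a minimal illustration of that pattern with Apache Beam and the `google-cloud-firestore` client; the collection name, document-id scheme, and row shape are assumptions for illustration only, since the real logic (and the `TECHNOLOGY_QUERIES` it runs) lives in the pipeline modules, not in this diff.

```python
import apache_beam as beam


class WriteToFirestore(beam.DoFn):
    """Upserts one query-result row as a Firestore document (illustrative)."""

    def __init__(self, project, database, collection):
        self.project = project
        self.database = database
        self.collection = collection

    def setup(self):
        # One client per worker, created lazily so the DoFn stays picklable.
        # The named-database argument needs a recent google-cloud-firestore.
        from google.cloud import firestore
        self.client = firestore.Client(project=self.project, database=self.database)

    def process(self, row):
        # Hypothetical id: one document per technology/date pair.
        doc_id = f"{row['technology']}-{row['date']}"
        self.client.collection(self.collection).document(doc_id).set(row)


with beam.Pipeline() as p:
    (p
     | beam.Create([{"technology": "WordPress", "date": "2024-01-01", "origins": 123}])
     | beam.ParDo(WriteToFirestore("tech-report-apis-prod", "(default)", "adoption")))
```

Creating the client in `setup()` rather than `__init__()` reuses one connection per worker and avoids pickling the client with the DoFn.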

build_flex_template.sh

Lines changed: 5 additions & 1 deletion

```diff
@@ -5,9 +5,13 @@ set -u
 
 BUILD_TAG=$(date -u +"%Y-%m-%d_%H-%M-%S")
 
+# all and combined pipelines
 for type in all combined
 do
-  gcloud builds submit --substitutions=_TYPE="${type}",_BUILD_TAG="${BUILD_TAG}" .
+  gcloud builds submit --substitutions=_TYPE="${type}",_BUILD_TAG="${BUILD_TAG}",_WORKER_TYPE=n1-standard-32 .
 done
 
+# tech_report pipeline
+gcloud builds submit --substitutions=_TYPE=tech_report,_BUILD_TAG="${BUILD_TAG}",_WORKER_TYPE=n1-standard-1 .
+
 echo "${BUILD_TAG}"
```
flex_template_metadata_tech_report.json

Lines changed: 36 additions & 0 deletions

```diff
@@ -0,0 +1,36 @@
+{
+  "name": "Technology Report API pipeline",
+  "description": "Runs a pipeline to generate firestore API results",
+  "parameters": [
+    {
+      "name": "query_type",
+      "label": "Query type",
+      "helpText": "Technology query type",
+      "isOptional": true
+    },
+    {
+      "name": "firestore_project",
+      "label": "Firestore project",
+      "helpText": "Google Cloud project with Firestore",
+      "isOptional": true
+    },
+    {
+      "name": "firestore_collection",
+      "label": "Firestore collection",
+      "helpText": "Firestore collection with HTTPArchive data",
+      "isOptional": true
+    },
+    {
+      "name": "firestore_database",
+      "label": "Firestore database",
+      "helpText": "Firestore database with HTTPArchive data",
+      "isOptional": true
+    },
+    {
+      "name": "date",
+      "label": "Date",
+      "helpText": "Date to query",
+      "isOptional": true
+    }
+  ]
+}
```
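In a Beam flex template, parameters like these are usually surfaced to the job as pipeline options. A minimal sketch of how the five parameters above could be declared, assuming a custom `PipelineOptions` subclass (the class name is hypothetical; the real option definitions live in the pipeline modules, which are not part of this diff):

```python
from apache_beam.options.pipeline_options import PipelineOptions


class TechReportOptions(PipelineOptions):
    """Illustrative options mirroring the flex template parameters above."""

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--query_type", help="Technology query type")
        parser.add_argument("--firestore_project", help="Google Cloud project with Firestore")
        parser.add_argument("--firestore_collection", help="Firestore collection with HTTPArchive data")
        parser.add_argument("--firestore_database", help="Firestore database with HTTPArchive data")
        parser.add_argument("--date", help="Date to query")


# Example: parse argv-style flags, as a flex template launch would pass them.
opts = TechReportOptions(["--query_type=adoption", "--date=2024-01-01"])
print(opts.get_all_options()["query_type"])  # -> adoption
```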

run_flex_template.sh

Lines changed: 9 additions & 3 deletions

```diff
@@ -12,26 +12,32 @@ DF_JOB_ID="${REPO}-${TYPE}-$(date -u +%Y%m%d-%H%M%S)"
 DF_TEMP_BUCKET="gs://${PROJECT}-staging/dataflow"
 TEMPLATE_BASE_PATH="gs://${PROJECT}/dataflow/templates"
 
+# find the latest template if unset
+: "${TEMPLATE_PATH:=$(gsutil ls ${TEMPLATE_BASE_PATH}/${REPO}-"${TYPE}"*.json | sort -r | head -n 1)}"
+
 case "${TYPE}~${TEMPLATE_PATH}" in
 all~|combined~) : ;;
 all~gs://*all*) : ;;
 combined~gs://*combined*) : ;;
+tech_report~gs://*tech_report*) : ;;
 *)
-  echo "Expected an argument of either [all|combined] and optionally TEMPLATE_PATH to be set (otherwise the latest template will be used)"
+  echo "Expected an argument of either [all|combined|tech_report] and optionally TEMPLATE_PATH to be set (otherwise the latest template will be used)"
   echo "Examples"
   echo " $(basename "$0") all ..."
   echo " $(basename "$0") combined ..."
+  echo " $(basename "$0") tech_report ..."
   echo " TEMPLATE_PATH=${TEMPLATE_BASE_PATH}/${REPO}-all-2022-10-12_00-19-44.json $(basename "$0") all ..."
   echo " TEMPLATE_PATH=${TEMPLATE_BASE_PATH}/${REPO}-combined-2022-10-12_00-19-44.json $(basename "$0") combined ..."
+  echo " TEMPLATE_PATH=${TEMPLATE_BASE_PATH}/${REPO}-tech_report-2022-10-12_00-19-44.json $(basename "$0") tech_report ..."
   exit 1
   ;;
 esac
 
 # drop the first argument
 shift
 
-# find the latest template if unset
-: "${TEMPLATE_PATH:=$(gsutil ls ${TEMPLATE_BASE_PATH}/${REPO}-"${TYPE}"*.json | sort -r | head -n 1)}"
+# replace underscores with hyphens in the job id
+DF_JOB_ID=${DF_JOB_ID//_/-}
 
 set -u
 
```

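The new `DF_JOB_ID=${DF_JOB_ID//_/-}` line exists because Dataflow job names only allow lowercase letters, digits, and hyphens, so the underscore in `tech_report` would otherwise make the job id invalid. A small Python sketch of the same sanitization (the exact regex is an assumption based on Dataflow's documented naming rules):

```python
import re

# Dataflow job names: lowercase letters, digits and hyphens, starting with a
# letter and ending with a letter or digit (assumed from Dataflow's rules).
JOB_NAME_RE = re.compile(r"^[a-z]([-a-z0-9]*[a-z0-9])?$")


def sanitize_job_id(job_id: str) -> str:
    """Mirrors the shell rewrite: replace underscores with hyphens."""
    candidate = job_id.replace("_", "-").lower()
    if not JOB_NAME_RE.match(candidate):
        raise ValueError(f"invalid Dataflow job name: {candidate!r}")
    return candidate


print(sanitize_job_id("data-pipeline-tech_report-20240101-000000"))
# -> data-pipeline-tech-report-20240101-000000
```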
run_tech_report.py

Lines changed: 19 additions & 0 deletions

```diff
@@ -0,0 +1,19 @@
+#!/usr/bin/env python3
+
+import logging
+
+from apache_beam.runners import DataflowRunner
+
+from modules import tech_report_pipeline
+
+
+def run(argv=None):
+    logging.getLogger().setLevel(logging.INFO)
+    p = tech_report_pipeline.create_pipeline()
+    pipeline_result = p.run(argv)
+    if not isinstance(p.runner, DataflowRunner):
+        pipeline_result.wait_until_finish()
+
+
+if __name__ == "__main__":
+    run()
```
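The `isinstance(p.runner, DataflowRunner)` guard means the script only blocks on `wait_until_finish()` for local runs; when submitted to Dataflow, the job is left to run asynchronously. A self-contained illustration of the same pattern using Beam's default DirectRunner (the pipeline contents here are placeholder data, not the tech report queries):

```python
import apache_beam as beam
from apache_beam.runners import DataflowRunner

p = beam.Pipeline()  # DirectRunner by default
_ = p | beam.Create([1, 2, 3]) | beam.Map(print)

result = p.run()
if not isinstance(p.runner, DataflowRunner):
    # Local runs block here until the pipeline finishes;
    # on Dataflow this branch is skipped and the job runs detached.
    result.wait_until_finish()
```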
