Commit 23ecf3e

feat: Adding Dockerfile for stats and CI for building container images (#24)
* feat: Adding CI for building Docker images
* fixed ci
* removed install r package
* fixed ci
* Added pytest as a dependency
* Added missing dependencies
* running container workflow on plots
* fixed CI image tag
* CI image push not for forks
* Set white background explicitly
* Set theme kwargs
* increased font size
* force colors
* update font family
* update font size
* increased line width
1 parent 086d23d commit 23ecf3e

30 files changed, +227 -22 lines changed

.github/workflows/ci.yml

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
+name: CI Pipeline
+
+on:
+  push:
+  pull_request:
+
+env:
+  REGISTRY: ghcr.io
+  IMAGE_NAME: ${{ github.repository }}
+
+jobs:
+  test-and-build-stats:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      packages: write
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Log in to Container Registry
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Extract metadata for stats image
+        id: meta-stats
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}/stats
+          tags: |
+            type=ref,event=branch
+            type=ref,event=pr
+            type=sha
+            type=raw,value=latest,enable={{is_default_branch}}
+
+      - name: Build stats Docker image
+        uses: docker/build-push-action@v5
+        with:
+          context: .
+          file: ./stats.Dockerfile
+          push: false
+          tags: |
+            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}/stats:${{ github.sha }}
+            ${{ steps.meta-stats.outputs.tags }}
+          labels: ${{ steps.meta-stats.outputs.labels }}
+
+      - name: Run unit tests
+        run: |
+          docker run --rm \
+            -v ${{ github.workspace }}/tests:/app/tests \
+            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}/stats:${{ github.sha }} \
+            python -m pytest -s tests/
+
+      - name: Push stats Docker image
+        if: success() && github.event.pull_request.head.repo.full_name == github.repository
+        uses: docker/build-push-action@v5
+        with:
+          context: .
+          file: ./stats.Dockerfile
+          push: true
+          tags: ${{ steps.meta-stats.outputs.tags }}
+          labels: ${{ steps.meta-stats.outputs.labels }}
+
+  build-site:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      packages: write
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Log in to Container Registry
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Extract metadata for site image
+        id: meta-site
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}/site
+          tags: |
+            type=ref,event=branch
+            type=ref,event=pr
+            type=sha
+            type=raw,value=latest,enable={{is_default_branch}}
+
+      - name: Build and push site Docker image
+        uses: docker/build-push-action@v5
+        with:
+          context: .
+          file: ./site.Dockerfile
+          push: ${{ github.event.pull_request.head.repo.full_name == github.repository }}
+          tags: ${{ steps.meta-site.outputs.tags }}
+          labels: ${{ steps.meta-site.outputs.labels }}
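
Both jobs gate the image push on `github.event.pull_request.head.repo.full_name == github.repository`. The comparison can be modelled in Python as a rough sketch. It assumes the event payload is a plain dict and that a missing `pull_request` object (as on a `push` event) evaluates to an empty value. The `should_push` helper and the sample payloads are hypothetical and not part of the commit.

```python
def should_push(event: dict, repository: str) -> bool:
    """Model of `github.event.pull_request.head.repo.full_name == github.repository`."""
    # Walk the nested keys; a missing level yields None, like a null in the expression.
    value = event
    for key in ("pull_request", "head", "repo", "full_name"):
        value = value.get(key) if isinstance(value, dict) else None
    return value == repository


REPO = "commoncrawl/cc-crawl-statistics"

# Hypothetical, heavily trimmed event payloads
same_repo_pr = {"pull_request": {"head": {"repo": {"full_name": REPO}}}}
fork_pr = {"pull_request": {"head": {"repo": {"full_name": "someone/cc-crawl-statistics"}}}}
push_event = {"ref": "refs/heads/main"}  # a push payload carries no pull_request object

print(should_push(same_repo_pr, REPO))  # True
print(should_push(fork_pr, REPO))       # False
print(should_push(push_event, REPO))    # False
```

Under this model the condition holds only for pull requests opened from the repository itself. Plain `push` events, including pushes to the default branch, would not push an image. The commit does not say whether that is intended.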

README.md

Lines changed: 24 additions & 0 deletions
@@ -129,6 +129,30 @@ To preview local changes, it's possible to serve the site locally:
 
 ... and then the site will be served on http://0.0.0.0:4000 instead. (You will of course need to rebuild the Docker image after updating the Dockerfile.)
 
+
+Run via Container
+-----------------
+
+The whole workflow can be run as a container (docker or podman) including downloading stats files from Common Crawl's S3 bucket and generating new plots.
+
+```bash
+# clone the repository (to have the latest crawl IDs)
+git clone https://github.com/commoncrawl/cc-crawl-statistics.git
+cd cc-crawl-statistics
+
+# download stats and generate plots
+# SSH, AWS keys, and stats and plots directories must be mounted into the container
+podman run --rm -v ~/.ssh:/root/.ssh:ro -v ~/.aws:/root/.aws:ro -v $(pwd -P)/stats:/app/stats -v $(pwd -P)/plots:/app/plots ghcr.io/commoncrawl/cc-crawl-statistics/stats:latest
+
+# if needed you can manually build the container image
+podman build -f stats.Dockerfile -t ghcr.io/commoncrawl/cc-crawl-statistics/stats:latest
+
+# for development it is recommended to mount the whole repository into the container
+podman run -it -v ~/.ssh:/root/.ssh:ro -v ~/.aws:/root/.aws:ro -v $(pwd -P):/app ghcr.io/commoncrawl/cc-crawl-statistics/stats:latest /bin/bash
+
+```
+
+
 Related Projects
 ----------------
 
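
The `podman run` line in the README addition packs four volume mounts into one command. As an illustration only, the same argument list can be assembled in Python, which makes the host-to-container mapping easier to read. The `build_run_command` helper is hypothetical; the paths and the image name are copied from the README text, and nothing is executed.

```python
from pathlib import Path

IMAGE = "ghcr.io/commoncrawl/cc-crawl-statistics/stats:latest"


def build_run_command(repo_dir: Path, home: Path, engine: str = "podman") -> list:
    """Return the argv for the stats container, mirroring the README command."""
    mounts = [
        (home / ".ssh", "/root/.ssh", "ro"),       # SSH keys, read-only
        (home / ".aws", "/root/.aws", "ro"),       # AWS credentials, read-only
        (repo_dir / "stats", "/app/stats", None),  # downloaded stats files
        (repo_dir / "plots", "/app/plots", None),  # generated plots
    ]
    cmd = [engine, "run", "--rm"]
    for host, target, mode in mounts:
        spec = f"{host}:{target}" + (f":{mode}" if mode else "")
        cmd += ["-v", spec]
    return cmd + [IMAGE]


# Print the command instead of running it
print(" ".join(build_run_command(Path.cwd().resolve(), Path.home())))
```

Passing `engine="docker"` yields the equivalent Docker invocation, since the README names both engines.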

crawlplot.py

Lines changed: 11 additions & 3 deletions
@@ -11,7 +11,14 @@
 from rpy2.robjects.lib import ggplot2
 from rpy2.robjects import pandas2ri
 pandas2ri.activate()
-GGPLOT2_THEME = ggplot2.theme_minimal()
+# use minimal theme with white background set in plot constructor
+# https://ggplot2.tidyverse.org/reference/ggtheme.html
+GGPLOT2_THEME = ggplot2.theme_minimal(base_size=12, base_family="Helvetica")
+
+GGPLOT2_THEME_KWARGS = {
+    'panel.background': ggplot2.element_rect(fill='white', color='white'),
+    'plot.background': ggplot2.element_rect(fill='white', color='white')
+}
 # GGPLOT2_THEME = ggplot2.theme_grey()
 
 
@@ -48,10 +55,11 @@ def line_plot(self, data, title, ylabel, img_file,
         data['size'] = data['size'].astype(float)
         p = ggplot2.ggplot(data) \
             + ggplot2.aes_string(x=x, y=y, color=c) \
-            + ggplot2.geom_line(linewidth=.2) + ggplot2.geom_point() \
+            + ggplot2.geom_line(linewidth=.5) + ggplot2.geom_point() \
             + GGPLOT2_THEME \
             + ggplot2.theme(**{'legend.position': 'bottom',
-                               'aspect.ratio': ratio}) \
+                               'aspect.ratio': ratio,
+                               **GGPLOT2_THEME_KWARGS}) \
             + ggplot2.labs(title=title, x='', y=ylabel, color=clabel)
         img_path = os.path.join(PLOTDIR, img_file)
         p.save(img_path)
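
The change above merges the shared `GGPLOT2_THEME_KWARGS` into each `ggplot2.theme(...)` call with dict unpacking. The sketch below shows only that Python mechanism and needs no R installation. The `theme` function and the string values are stand-ins for `ggplot2.theme` and `ggplot2.element_rect(...)`, not the real rpy2 objects.

```python
# Stand-ins for ggplot2.element_rect(...) objects; plain strings keep the sketch R-free.
THEME_KWARGS = {
    "panel.background": "rect(fill=white)",
    "plot.background": "rect(fill=white)",
}


def theme(**settings):
    """Hypothetical stand-in for ggplot2.theme(): returns the settings it received."""
    return settings


ratio = 1.0
# Keys such as 'legend.position' are not valid Python identifiers,
# so they have to be passed through **{...} rather than as literal keywords.
merged = theme(**{"legend.position": "bottom",
                  "aspect.ratio": ratio,
                  **THEME_KWARGS})
print(sorted(merged))
# ['aspect.ratio', 'legend.position', 'panel.background', 'plot.background']

# Within one dict literal, a later key wins over an earlier one.
overridden = {**THEME_KWARGS, "plot.background": "rect(fill=grey)"}
print(overridden["plot.background"])  # rect(fill=grey)
```

Because later keys win, the position of `**GGPLOT2_THEME_KWARGS` inside a dict literal only matters when a plot sets `panel.background` or `plot.background` itself.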

get_stats_and_plot.sh

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+#!/bin/bash
+set -e
+
+echo "Starting ..."
+
+./get_stats.sh
+
+# make sure plot directories exist
+mkdir -p plots/crawler
+mkdir -p plots/crawloverlap
+mkdir -p plots/crawlsize
+mkdir -p plots/throughput
+mkdir -p plots/tld
+
+./plot.sh
+
+echo "Done."
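
The script creates the plot directories with `mkdir -p` before plotting, so a freshly mounted, empty `plots` volume does not break `plot.sh`. For comparison, a Python sketch of the same idempotent step follows. The `ensure_plot_dirs` helper is hypothetical; the directory names are taken from the script.

```python
import os
import tempfile

PLOT_SUBDIRS = ["crawler", "crawloverlap", "crawlsize", "throughput", "tld"]


def ensure_plot_dirs(base: str) -> list:
    """Create plots/<subdir> under base; existing directories are left alone (like mkdir -p)."""
    created = []
    for name in PLOT_SUBDIRS:
        path = os.path.join(base, "plots", name)
        os.makedirs(path, exist_ok=True)  # no error if the directory already exists
        created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    ensure_plot_dirs(tmp)
    ensure_plot_dirs(tmp)  # second call changes nothing and raises no error
    print(sorted(os.listdir(os.path.join(tmp, "plots"))))
    # ['crawler', 'crawloverlap', 'crawlsize', 'throughput', 'tld']
```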

plot/crawl_size.py

Lines changed: 5 additions & 3 deletions
@@ -9,8 +9,9 @@
 
 from rpy2.robjects.lib import ggplot2
 from rpy2.robjects import pandas2ri
+from rpy2 import robjects
 
-from crawlplot import CrawlPlot, PLOTDIR, GGPLOT2_THEME
+from crawlplot import CrawlPlot, PLOTDIR, GGPLOT2_THEME, GGPLOT2_THEME_KWARGS
 
 from crawlstats import CST, CrawlStatsJSONDecoder, HYPERLOGLOG_ERROR,\
     MonthlyCrawl
@@ -286,9 +287,10 @@ def plot(self):
                           color='black', size=2,
                           position=ggplot2.position_dodge(width=.5)) \
             + GGPLOT2_THEME \
-            + ggplot2.scale_fill_hue() \
+            + ggplot2.scale_fill_manual(values=robjects.r('c("duplicate"="#00BA38", "revisit"="#619CFF", "new"="#F8766D")')) \
             + ggplot2.theme(**{'legend.position': 'right',
-                               'aspect.ratio': .7},
+                               'aspect.ratio': .7,
+                               **GGPLOT2_THEME_KWARGS},
                             **{'axis.text.x':
                                ggplot2.element_text(angle=45, size=10,
                                                     vjust=1, hjust=1)}) \
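
The new `scale_fill_manual` call passes its colors as an R named vector written out as a string literal. One possible way to keep that mapping readable is to hold it in a Python dict and render the R expression from it, as sketched below. The `r_named_vector` helper is hypothetical and does no escaping, so it is only suitable for simple names and values like the ones in this diff.

```python
def r_named_vector(mapping: dict) -> str:
    """Render a dict of strings as an R expression of the form c("name"="value", ...)."""
    pairs = ", ".join(f'"{name}"="{value}"' for name, value in mapping.items())
    return f"c({pairs})"


# Category-to-color mapping copied from the diff
FILL_COLORS = {"duplicate": "#00BA38", "revisit": "#619CFF", "new": "#F8766D"}

print(r_named_vector(FILL_COLORS))
# c("duplicate"="#00BA38", "revisit"="#619CFF", "new"="#F8766D")
```

The resulting string matches the literal in the diff and could be handed to `robjects.r(...)` in the same way.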

plot/crawler_metrics.py

Lines changed: 5 additions & 3 deletions
@@ -7,7 +7,7 @@
 from rpy2.robjects.lib import ggplot2
 from rpy2.robjects import pandas2ri
 
-from crawlplot import PLOTDIR, GGPLOT2_THEME
+from crawlplot import PLOTDIR, GGPLOT2_THEME, GGPLOT2_THEME_KWARGS
 
 from crawlstats import CST, MultiCount
 from crawl_size import CrawlSizePlot
@@ -143,7 +143,8 @@ def plot_fetch_status(self, data, row_filter, img_file, ratio=1.0):
                 guide=ggplot2.guide_legend(reverse=True)) \
             + GGPLOT2_THEME \
             + ggplot2.theme(**{'legend.position': 'bottom',
-                               'aspect.ratio': ratio}) \
+                               'aspect.ratio': ratio,
+                               **GGPLOT2_THEME_KWARGS}) \
             + ggplot2.labs(title='Percentage of Fetch Status',
                            x='', y='', fill='')
         img_path = os.path.join(PLOTDIR, img_file)
@@ -172,7 +173,8 @@ def plot_crawldb_status(self, data, row_filter, img_file, ratio=1.0):
                 guide=ggplot2.guide_legend(reverse=False)) \
             + GGPLOT2_THEME \
             + ggplot2.theme(**{'legend.position': 'bottom',
-                               'aspect.ratio': ratio}) \
+                               'aspect.ratio': ratio,
+                               **GGPLOT2_THEME_KWARGS}) \
             + ggplot2.labs(title='CrawlDb Size and Status Counts',
                            x='', y='', fill='')
         img_path = os.path.join(PLOTDIR, img_file)

plot/histogram.py

Lines changed: 2 additions & 1 deletion
@@ -9,7 +9,7 @@
 from rpy2.robjects.lib import ggplot2
 from rpy2.robjects import pandas2ri
 
-from crawlplot import CrawlPlot, PLOTDIR, GGPLOT2_THEME
+from crawlplot import CrawlPlot, PLOTDIR, GGPLOT2_THEME, GGPLOT2_THEME_KWARGS
 
 pandas2ri.activate()
 
@@ -119,6 +119,7 @@ def plot_domain_cumul(self, crawl):
             + ggplot2.aes_string(x='cum_domains', y='cum_urls') \
             + ggplot2.geom_line() + ggplot2.geom_point() \
             + GGPLOT2_THEME \
+            + ggplot2.theme(**GGPLOT2_THEME_KWARGS) \
             + ggplot2.labs(title=title, x='domains cumulative',
                            y='URLs cumulative') \
             + ggplot2.scale_y_log10() \

plot/overlap.py

Lines changed: 3 additions & 2 deletions
@@ -12,7 +12,7 @@
 
 import pygraphviz
 
-from crawlplot import CrawlPlot, PLOTDIR, GGPLOT2_THEME
+from crawlplot import CrawlPlot, PLOTDIR, GGPLOT2_THEME, GGPLOT2_THEME_KWARGS
 
 pandas2ri.activate()
 
@@ -135,7 +135,8 @@ def plot_similarity_matrix(self, item_type, image_file, title):
             + ggplot2.coord_fixed() \
             + ggplot2.theme(**{'axis.text.x':
                                ggplot2.element_text(angle=45,
-                                                    vjust=1, hjust=1)}) \
+                                                    vjust=1, hjust=1),
+                               **GGPLOT2_THEME_KWARGS}) \
             + ggplot2.labs(title=title, x='', y='') \
             + ggplot2.geom_text(color='black', size=textsize)
         img_path = os.path.join(PLOTDIR, image_file)

plot/tld_by_continent.py

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,7 @@
 
 from rpy2.robjects.lib import ggplot2
 
-from crawlplot import PLOTDIR, GGPLOT2_THEME
+from crawlplot import PLOTDIR, GGPLOT2_THEME, GGPLOT2_THEME_KWARGS
 from crawlstats import MonthlyCrawl, MultiCount
 from top_level_domain import TopLevelDomain
 
@@ -226,6 +226,7 @@ def tld2continent(tld):
                    x='', y='Percentage', fill='TLD / Continent') \
     + ggplot2.theme(**{'legend.position': 'right',
                        'aspect.ratio': .7,
+                       **GGPLOT2_THEME_KWARGS,
                        'axis.text.x':
                        ggplot2.element_text(angle=45,
                                             vjust=1, hjust=1)})

plots/crawler/crawldb_status.png

-164 KB