Commit a1199a5 ("Init", 0 parents)

15 files changed: 808 additions, 0 deletions

.dockerignore

Lines changed: 4 additions & 0 deletions

```
kubernetes/
Dockerfile
settings.yaml
README.md
```

.gitignore

Lines changed: 138 additions & 0 deletions
```
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/
```

Dockerfile

Lines changed: 11 additions & 0 deletions
```dockerfile
FROM python:3.7

ENV PYTHONUNBUFFERED 1

COPY . /app
WORKDIR /app

RUN pip install --upgrade pip && \
    pip install -r requirements.txt

CMD ["python", "index.py"]
```
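For local testing, the image can be built and run directly with Docker. This is a sketch, not part of the commit: the `es-reporting:dev` tag is illustrative, and the volume mount mirrors how the Kubernetes manifests below inject `settings.yaml`.

```shell
# Build the image from the repository root (tag is illustrative)
docker build -t es-reporting:dev .

# Run a one-off indexing pass, mounting a local settings.yaml
# over the copy baked into the image
docker run --rm \
  -v "$(pwd)/settings.yaml:/app/settings.yaml" \
  es-reporting:dev \
  python index.py --days=1
```

This requires a running Docker daemon plus reachable PostgreSQL and Elasticsearch instances, so it is not self-verifying.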

README.md

Lines changed: 28 additions & 0 deletions
````markdown
## Data reporting for MyTardis

This script populates an initial Elasticsearch index for reporting from MyTardis datafile records. It can also add new data, or update existing data, for a given time period.

MyTardis version 4.2+ is supported.

### Technical details

Settings are read from the default `settings.yaml` config file.

You must specify database credentials and the location of the Elasticsearch server. You can also increase the number of rows fetched per single bulk call.

Run from the command line:

```
python index.py [--config CONFIG] [--days DAYS] [--rebuild]

optional arguments:
  --config CONFIG  Config file location.
  --days DAYS      Populate only the past DAYS of data; use -1 to index all data.
  --rebuild        Delete and re-create the index.
```

### Docker and Kubernetes

The latest version of the Docker image is built automatically and published on DockerHub as mytardis/es-reporting:latest.

Sample files in the [kubernetes](./kubernetes/) folder provide an example of running this tool in Kubernetes.
````
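The README references a `settings.yaml` config file; the keys it needs can be read off `index.py` in this same commit. A minimal example for a local setup (hostnames and credentials are placeholders):

```yaml
database:
  host: localhost
  port: 5432
  username: user
  password: pass
  database: mytardis

elasticsearch:
  host: localhost
  port: 9200

index:
  name: reporting
  limit: 10000    # rows fetched per single bulk call
```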

index.py

Lines changed: 123 additions & 0 deletions
```python
import sys
import argparse
import os
import yaml
import json

from psycopg2 import connect
from psycopg2.extras import RealDictCursor
from elasticsearch import Elasticsearch, helpers

from reporting import count_from_db, data_from_db, get_extras, data_to_es


def init_es_index(index_name):

    with open("{}.json".format(index_name)) as f:
        config = json.load(f)

    es.indices.delete(
        index=index_name,
        ignore_unavailable=True
    )

    es.indices.create(
        index=index_name,
        body=config
    )


parser = argparse.ArgumentParser()
parser.add_argument(
    "--config",
    default="settings.yaml",
    help="Config file location."
)
parser.add_argument(
    "--days",
    type=int,
    default=1,
    help="Populate past days of data."
)
parser.add_argument(
    "--rebuild",
    action="store_true",
    help="Delete and create index."
)

args = parser.parse_args()

if os.path.isfile(args.config):
    with open(args.config) as f:
        settings = yaml.load(f, Loader=yaml.Loader)
else:
    sys.exit("Can't find settings.")

try:
    con = connect(
        host=settings["database"]["host"],
        port=settings["database"]["port"],
        user=settings["database"]["username"],
        password=settings["database"]["password"],
        database=settings["database"]["database"]
    )
except Exception:
    sys.exit("Can't connect to the database.")

try:
    es_host = "{}:{}".format(
        settings["elasticsearch"]["host"],
        settings["elasticsearch"]["port"]
    )
    es = Elasticsearch([es_host])
except Exception:
    con.close()
    sys.exit("Can't connect to Elasticsearch.")

cur = con.cursor(cursor_factory=RealDictCursor)

if args.rebuild:
    print("Rebuilding index.")
    init_es_index(settings["index"]["name"])

start = 0
to_go = 1
cache = {}

while to_go > 0:

    to_go = count_from_db(cur, args.days, start)
    print(
        "{:,} datafileobjects to index, {:,} datasets cached"
        .format(to_go, len(cache))
    )

    if to_go > 0:

        rows = data_from_db(cur, args.days, start, settings["index"]["limit"])

        # Fetch dataset-level extras for any datasets not cached yet
        dataset_ids = list(set([row["dataset_id"] for row in rows]))
        extra_ds_ids = []
        for ds_id in dataset_ids:
            if ds_id not in cache:
                extra_ds_ids.append(ds_id)
        if len(extra_ds_ids) != 0:
            extras = get_extras(cur, extra_ds_ids)
            for k in extras:
                cache[k] = extras[k]

        # Advance the keyset cursor to the highest dfo_id seen,
        # merging cached dataset extras into each row
        data = []
        for row in rows:
            if row["dfo_id"] > start:
                start = row["dfo_id"]
            ds_id = row["dataset_id"]
            if ds_id in cache:
                data.append({**row, **cache[ds_id]})
            else:
                data.append(row)

        helpers.bulk(es, data_to_es(settings["index"]["name"], data))

print("Completed.")
cur.close()
con.close()
```
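The `reporting` module imported by `index.py` (`count_from_db`, `data_from_db`, `get_extras`, `data_to_es`) is among the 15 files of this commit but not shown in this section of the diff. As a rough illustration of the contract `helpers.bulk` expects, `data_to_es` presumably yields one action dict per row, along these lines (everything beyond the `dfo_id` field is an assumption):

```python
def data_to_es(index_name, data):
    """Yield one Elasticsearch bulk action per row (sketch).

    helpers.bulk(es, actions) accepts any iterable of dicts; using the
    datafileobject id as _id makes re-running the indexer idempotent.
    """
    for row in data:
        yield {
            "_index": index_name,
            "_id": row["dfo_id"],   # assumed unique per datafileobject
            "_source": row,
        }


# Small self-contained check with fake rows
rows = [{"dfo_id": 1, "size": 100}, {"dfo_id": 2, "size": 250}]
actions = list(data_to_es("reporting", rows))
```

Because the generator is lazy, `helpers.bulk` can stream arbitrarily many rows without materialising all actions in memory.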

kubernetes/README.md

Lines changed: 37 additions & 0 deletions
````markdown
## Kubernetes deployment

1. Start by changing the ConfigMap values in the `configmap.yaml` file to match your setup (namespaces, names, credentials).

   Deploy the config map:

   `kubectl create -f configmap.yaml`

2. Run the initial job to build the index and populate data.

   `kubectl create -f job-create.yaml`

   It runs the script with two optional arguments: `--days=-1` to index all data, and `--rebuild` to create the index in Elasticsearch.

3. Schedule a cron job to update only the last 24 hours of data.

   `kubectl create -f cronjob-update.yaml`

### Kibana

You can deploy Kibana:

`kubectl create -f kibana.yaml`

and expose it to the public using an (nginx) Ingress with username/password authentication.

First, generate an auth file and load it as a secret:

```
htpasswd -c auth.txt mytardis
kubectl -n mytardis create secret generic reporting --from-file=auth.txt
rm auth.txt
```

Second, deploy the Ingress with an annotation to use the secret:

`kubectl create -f ingress.yaml`
````
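The `ingress.yaml` referenced above is not shown in this section of the diff. With the nginx ingress controller, wiring the `reporting` secret into basic auth is done with annotations along these lines (the host, service name, and port are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: reporting
  namespace: mytardis
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: reporting
    nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
spec:
  rules:
    - host: reporting.example.com   # placeholder
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kibana      # assumed Kibana service name
                port:
                  number: 5601
```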

kubernetes/configmap.yaml

Lines changed: 21 additions & 0 deletions
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: es-reporting
  namespace: mytardis
data:
  settings.yaml: |
    database:
      host: pgbouncer.postgres.svc.cluster.local
      port: 5432
      username: user
      password: pass
      database: postgres

    elasticsearch:
      host: elasticsearch.mytardis.svc.cluster.local
      port: 9200

    index:
      name: reporting
      limit: 10000
```

kubernetes/cronjob-update.yaml

Lines changed: 28 additions & 0 deletions
```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: es-reporting-update
  namespace: mytardis
spec:
  schedule: "15 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: go
              image: mytardis/es-reporting:latest
              imagePullPolicy: Always
              command:
                - python
                - index.py
                - --days=1
              volumeMounts:
                - name: settings
                  mountPath: /app/settings.yaml
                  subPath: settings.yaml
          volumes:
            - name: settings
              configMap:
                name: es-reporting
```
