
Commit 5122a3e

Author: syncmachineuser
Commit message: Updated
Parent: ef7b5a2

File tree

7 files changed: +133 lines, -1553 lines


.gitignore

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/

README.md

Lines changed: 10 additions & 14 deletions
@@ -1,30 +1,26 @@
-# spark_log_parser
-The **Spark log parser** parses unmodified Spark output logs.
+# log_parser
+The **Parser for Apache Spark** parses unmodified Apache Spark History Server Event logs.
 
-Parsed logs contain metadata pertaining to your Spark application execution. Particularly, the runtime for a task, the amount of data read & written, the amount of memory used, etc. These logs do not contain
-sensitive information such as the data that your Spark application is processing. Below is an example of the output of the log parser
-![Output of Log Parser](https://github.com/synccomputingcode/spark_log_parser/blob/main/docs/output.png)
+Parsed logs contain metadata pertaining to your Apache Spark application execution. Particularly, the runtime for a task, the amount of data read & written, the amount of memory used, etc. These logs do not contain
+sensitive information such as the data that your Apache Spark application is processing. Below is an example of the output of the log parser
+![Output of Log Parser](docs/output.png)
 
 # Installation
 Clone this repo to the desired directory.
 
 # Getting Started
-### Step 0: Generate the appropriate Apache Spark EMR log
-If you have not already done so, complete the [instructions](https://github.com/synccomputingcode/spark_log_parser/blob/main/docs/event_log_download.pdf) to download the spark event log.
+### Step 0: Generate the appropriate Apache Spark History Server Event log
+If you have not already done so, complete the [instructions](docs/event_log_download.pdf) to download the Apache Spark event log.
 
 ### Step 1: Parse the log to strip away sensitive information
 1. To process a log file, execute the parse.py script in the sync_parser folder, and provide a
 log file destination with the -d flag.
 
 `python3 sync_parser/parse.py -d [log file location]`
 
-The parsed file `[log file name].spk` will appear in the sync_parser/results directory.
-
-To re-process and overwrite a previously generated parsed log add the -o flag:
+The parsed file `parsed-[log file name]` will appear in the results directory.
 
-`python3 sync_parser/parse.py -d [log file location] -o`
 
-3. Send Sync Computing the parsed log
+2. Send Sync Computing the parsed log
 
-The parsed file `[log file name].spk` will appear in the sync_parser/results directory. Email
-your contact at Sync Computing the parsed file.
+Email Sync Computing (or upload to the Sync Auto-tuner) the parsed event log.
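
To illustrate the kind of filtering Step 1 of the updated README describes, here is a minimal, hypothetical sketch. It is not the repository's parse.py; it only assumes the standard Spark History Server event-log format (newline-delimited JSON, one listener event per line) and the common top-level fields "Event", "Stage ID", and "Task Metrics", keeping coarse runtime metadata and dropping everything else.

```python
# Minimal, hypothetical sketch of the kind of filtering Step 1 describes.
# It is NOT the repository's parse.py. It assumes the input is a Spark
# History Server event log: newline-delimited JSON, one listener event per line.
import json
import sys


def parse_event_log(path):
    """Keep coarse task metadata (runtime, I/O, memory) and drop everything else."""
    parsed = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            # "SparkListenerTaskEnd" events carry per-task metrics; the data the
            # application processes never appears in these fields.
            if event.get("Event") == "SparkListenerTaskEnd":
                parsed.append({
                    "Event": event.get("Event"),
                    "Stage ID": event.get("Stage ID"),
                    "Task Metrics": event.get("Task Metrics"),
                })
    return parsed


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 sketch_parser.py [log file location]
    print(json.dumps(parse_event_log(sys.argv[1]), indent=2))
```

A hypothetical invocation mirroring the -d workflow above would be `python3 sketch_parser.py [log file location] > parsed-output.json`; the real parse.py additionally handles naming and placing the output in the results directory.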

docs/event_log_download.pdf

-339 KB (binary file not shown)

docs/output.png

-1.53 MB (binary file not shown)
