README.md
+43 -39 (43 additions & 39 deletions)
@@ -3,22 +3,24 @@
 This coding challenge is a collection of _Python_ jobs that are supposed to extract, transform and load data.
 These jobs are using _PySpark_ to process larger volumes of data and are supposed to run on a _Spark_ cluster (via `spark-submit`).
 
-## Gearing Up for the Pairing Session
+## Preparing for the interview
 
-**✅ Goals**
+> [!WARNING]
+> The exercises will be given at the time of interview, and **solved by pairing with the interviewer**.
+> Please do not solve the exercises before the interview.
 
-1. **Get a working environment** See local [local](#local-setup)
-2. **Get a high-level understanding of the code and test dataset structure**
-3. Have your preferred text editor or IDE setup and ready to go.
-
-**❌ Non-Goals**
+**✅ Goals:**
 
-- solving the exercises / writing code
-> ⚠️ The exercises will be given at the time of interview, and solved by pairing with the interviewer.
+1. **Get a [working environment set up](#setup-the-environment).** You can set up a [local environment](#option-1-local-setup), use a [devcontainer](#option-2-devcontainer-setup), or use [GitHub Codespaces](#option-3-github-codespaces).
+2. **Get a high-level understanding of the code and test dataset structure**
+3. Have your preferred text editor or IDE set up and ready to go.
+4. ⚠️ Don't solve the exercises before the interview. ⚠️
 
-### Local Setup
+## Setup the environment
+### Option 1: Local Setup
 
-> 💡 Use the [Devcontainer setup](#devcontainer-setup) if you encounter issues.
+> [!TIP]
+> Use the [Devcontainer setup](#option-2-devcontainer-setup) if you encounter issues.
 
 #### Pre-requisites
 
@@ -32,23 +34,23 @@ Please make sure you have the following installed and can run them
 
 We recommend using WSL 2 on Windows for this exercise, due to the [lack of support](https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems) of windows paths from Hadoop/Spark.
 
-Follow instructions on the [Windows official page](https://learn.microsoft.com/en-us/windows/wsl/setup/environment) and then the linux install.
-
-> 💡 Use the [Devcontainer setup](#devcontainer-setup) if you encounter issues.
+Follow instructions on the [Windows official page](https://learn.microsoft.com/en-us/windows/wsl/setup/environment) and then the linux install.
+Use the [Devcontainer setup](#option-2-devcontainer-setup) if you encounter issues.
 
 #### Install all dependencies
 
 ```bash
 poetry install
 ```
 
-### Devcontainer setup
+### Option 2: Devcontainer setup
 
 Configuration to use dev containers is provided in `.devcontainer`
 
-> ⚠️ this take up to 7 minutes to setup, make sure to have things running before the interview.
+> [!WARNING]
+> This takes up to 7 minutes to set up; make sure to have things running before the interview.
 
-### In Github codespaces
+### Option 3: GitHub Codespaces
 
 1. [Fork](https://github.com/techops-recsys-lateral-hiring/dataengineer-transformations-python/fork) this repository.
 2. Follow [codespace instructions](https://docs.github.com/en/codespaces/developing-in-a-codespace/creating-a-codespace-for-a-repository#the-codespace-creation-process) from the forked repository, to create the environment.
@@ -59,23 +61,23 @@ This requires a working local docker setup matching your OS and licensing situat
 
 If you have all of these, follow instructions in https://code.visualstudio.com/docs/devcontainers/containers. Otherwise, consider using codespaces.
 
-###Verify setup
+## Verify setup
 
-> All of the following commands should be running successfully
+All of the following tests should run successfully.
 
-####Run unit tests
+### Run unit tests
 
 ```bash
 poetry run pytest tests/unit
 ```
 
-####Run integration tests
+### Run integration tests
 
 ```bash
 poetry run pytest tests/integration
 ```
 
-####Run style checks
+### Run style checks
 
 ```bash
 poetry run mypy --ignore-missing-imports --disallow-untyped-calls --disallow-untyped-defs --disallow-incomplete-defs \
@@ -84,24 +86,29 @@ poetry run mypy --ignore-missing-imports --disallow-untyped-calls --disallow-unt
 poetry run ruff format && poetry run ruff check
 ```
 
-### Anything else?
+### Done!
 
 All commands are passing?
 You are good to go!
 
-> ⚠️ do not try to solve the exercises ahead of the interview
+> [!WARNING]
+> Remember, do not try to solve the exercises ahead of the interview.
+
+> [!TIP]
+> You are allowed to customize your environment (having the test in vscode directly for example): feel free to spend the time making this comfortable for you. This is not an expectation.
 
-You are allowed to customize your environment (having the test in vscode directly for example): feel free to spend the time making this comfortable for you. This is not an expectation.
 
-## Jobs
 
-There are two exercises in this repo: Word Count, and Citibike.
+## Interview Exercises
+
+There are two exercises in this repo: [Word Count](#word-count), and [Citibike](#citibike).
 
 Currently, these exist as skeletons, and have some **initial test cases** which are defined but some are skipped.
 
-The following section provides context over them.
+The following section provides context on them. Read it before the interview to familiarise yourself with the exercises and their structure.
 
-> ⚠️ do not try to solve the exercises ahead of the interview
+> [!WARNING]
+> Please, do not try to solve the exercises ahead of the interview.
 
 ### Code walk
 
@@ -191,7 +198,7 @@ flowchart TD
 
 There is a dump of the datalake for this under `resources/citibike/citibike.csv` with historical data.
 
-#### Ingest
+#### 1. Ingest
 
 Reads a `*.csv` file and transforms it to parquet format. The column names will be sanitized (whitespaces replaced).
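
To make "sanitized (whitespaces replaced)" concrete, here is a minimal sketch in plain Python rather than PySpark. The README text shown here does not say which character replaces the whitespace or how the repository implements it, so the helper name `sanitize_column_names`, the underscore replacement, and the sample column names are all assumptions made for illustration.

```python
import re


def sanitize_column_names(columns: list[str], replacement: str = "_") -> list[str]:
    """Replace every run of whitespace in each column name with `replacement`.

    Hypothetical helper: the name and the default underscore are assumptions,
    not taken from the repository.
    """
    return [re.sub(r"\s+", replacement, name) for name in columns]


if __name__ == "__main__":
    # Illustrative column names only; the real CSV header may differ.
    print(sanitize_column_names(["start time", "end station id", "bikeid"]))
```

In a Spark job the same renaming would presumably be applied to the DataFrame's columns before writing parquet, since column names containing spaces can be rejected by parquet writers.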