14 changes: 14 additions & 0 deletions Pipfile
@@ -0,0 +1,14 @@
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
flask = "*"
requests = "*"
docker = "==5.0.3"

[dev-packages]

[requires]
python_version = "3.9"
178 changes: 178 additions & 0 deletions Pipfile.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

94 changes: 56 additions & 38 deletions README.md
@@ -1,66 +1,84 @@
# Python Systems / Infrastructure Hiring Challenge
# Infra Hiring Challenge

# Introduction
This is my submission to the Infra Hiring Challenge for the second interview for Datapane.

Datapane is an API-driven product for building analytics reports in Python - part of this includes running users' Python scripts and controlling their execution from a central server. This project simulates some of the tasks required in developing such a system.
## Requirements
Docker, Python 3 and pipenv must all be installed on the machine running this script.
## Setup

# Task
Use the package manager [pipenv] to install the necessary modules. Make sure you're in the [python-files directory] and run the following commands:

For this task we'll be building a simple API server with a few endpoints that revolve around accepting arbitrary Python code and running it "securely".
```bash
pipenv shell
pipenv install
```

The system must run on Linux, and can make use of any client/server technologies of your choice.
Edit the `app.py` file and set `UPLOAD_FOLDER` to the directory where the clients' projects should be saved. Both endpoints use this directory.
Edit the `python-api.py` file and set the root directory to this project directory.

## Server
## Usage

The API server supports a few endpoints.
```
python3 python-api.py
```

`/run-file/`
## Testing Uploads of files /run-file
Using any method of your choice, such as curl or Postman, send a POST request as shown below:
```bash
curl --location --request POST 'http://localhost:5000/run-file' \
--form 'files[]=@"/home/rentan/main.py"' \
--form 'client="test-client"'
```

this takes as a payload an uploaded python script containing the Python code to run

`/run-json/`
## Testing Uploads of JSON blob /run-json
```bash
curl --location --request POST 'http://localhost:5000/run-json' \
--header 'Content-Type: application/json' \
--data-raw '{
"script_name" : "main.py",
"client" : "new-company",
"code" :
["import flask, marshmallow, os, zipp, docker,time","print(\"hello world\")","os.chdir(\"/\")","print(os.getcwd())","print(\"goodbye world\")"]
}'
```

as per `run-file` above, but takes a JSON blob with a field called `code` containing the Python code to run

### Results
### Output
The response should show the output of the script once it has completed.

You may decide if the `/run-*` endpoints block and return a status code, or whether to implement a non-blocking model with a separate `/status/` endpoint to query each run.
## Notes
### Security
As a security precaution, the image and container that are built and run have a `python` user without root privileges, and this user executes the script.

Either way, the server should listen for commands from client and act upon them - it should always be able to accept new messages.
The Docker containers are also run with namespace isolation to prevent them from communicating with each other, and they use virtual environments.

## Running Code
Each client has their own project directory and container to be used.

You need to be able to run arbitrary Python code in a clean environment - i.e. each invocation should not affect the others. You will need to make decisions around venvs, installed libraries and dependencies, and more.
The files uploaded by the clients are also passed through a module that makes sure the filename is secure, and the size of each upload is limited.
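
As an illustration of the filename check described above, here is a minimal stdlib-only sketch. The helper name `safe_script_name` and the exact rules are assumptions made for this example; the README does not name the module the project actually uses.

```python
from pathlib import PurePosixPath

# The README states that only .py files are accepted for now.
ALLOWED_EXTENSIONS = {".py"}


def safe_script_name(filename: str) -> str:
    """Return a bare, safe filename or raise ValueError.

    Hypothetical helper: it drops any directory components so that a name
    such as "../../etc/passwd.py" cannot escape the client's project directory.
    """
    # Treat backslashes as separators too, then keep only the final component.
    name = PurePosixPath(filename.replace("\\", "/")).name
    if not name or name.startswith("."):
        raise ValueError(f"invalid filename: {filename!r}")
    if PurePosixPath(name).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"only .py files are accepted: {filename!r}")
    return name
```

Rejecting a bad name outright, instead of silently rewriting it, makes it easier to report a clear error back to the client.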

## Securing code
There is also an option to limit the memory and cpu usage.
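
A rough sketch of how such limits could be expressed with the Docker SDK for Python: `mem_limit`, `nano_cpus` and `pids_limit` are keyword arguments accepted by `containers.run`. The helper name `container_limits` and the default values are assumptions for illustration, not the project's actual settings.

```python
def container_limits(memory_mb: int = 256, cpus: float = 0.5, max_pids: int = 64) -> dict:
    """Build keyword arguments for docker-py's `containers.run` (hypothetical helper)."""
    if memory_mb <= 0 or cpus <= 0 or max_pids <= 0:
        raise ValueError("limits must be positive")
    return {
        "mem_limit": f"{memory_mb}m",            # hard memory cap, e.g. "256m"
        "nano_cpus": int(cpus * 1_000_000_000),  # CPU quota in units of 1e-9 CPUs
        "pids_limit": max_pids,                  # caps process count (fork bombs)
    }


# Assumed usage with a docker-py client (not executed here):
#   client = docker.from_env()
#   client.containers.run("datapane-py-slim", **container_limits(), detach=True)
```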

The uploaded Python code needs to be executed as securely as possible and handle code that may be hostile. As such you'll need to provide protections against user code that may attempt to use excess resources, e.g. time, space, cpu, etc.
For the time being, only `.py` files are accepted.

You can look at any collection of technologies to perform sandboxing, such as systemd slices/scopes, podman, docker, chroots, seccomp filtering, and/or anything else
### Performance
A pre-built custom image called `datapane-py-slim`, derived from `python:3.10-slim`, was created. It has the necessary Python modules installed, up-to-date Linux security packages, the `python` user, and the necessary directories.

## Technologies
The image being used can be changed if performance is not suitable.

- Build systems, tools, and scripts of your choice, e.g. poetry, `setup.py`, docker, etc.
- The system must run on Linux and be simple to setup and run
- Any libraries you may find useful to help your task, we prioritise using existing libraries to accomplish tasks rather than building in-house and/or writing custom code that wouldn't scale to larger use-cases
The `runner` stage, which uses this image, only needs to switch to the `python` user, set environment variables if needed, copy the clients' files, install the required Python packages from those files, and run the `main.py` script.

## Requirements
Execution is fairly fast as most of the time depends on the clients' script.

- You do not need to worry about client/server service discovery - the locations of the systems can be hard-coded, provided as env vars, command-line parameters, etc.
- Instructions should be provided on how to build / bundle / start the system
- You should aim to use the latest Python language features, ecosystem, tooling, and libraries where possible

### Optional Features
### Results
For the time being, the results of the scripts are only shown on stdout. It wasn't clear from the task what else was needed.

- Defence in depth is a valid strategy, how many of the sandboxing techniques can be combined
- Consider how you would improve this approach and productise it - what issues do you foresee and how would you attempt to solve them
- How would you tackle performance, i.e. delays in spinning up a pod on kubernetes, container startup delays, Python VM startup, reusing / caching files?
- Tests
## Improvements

# Review
On further reflection, I realised that a `/status` endpoint could let clients check their results. This was not implemented in time, but in the future a simple client ID and cookie/key could be used to make sure clients can't access each other's results.
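
One possible shape for the bookkeeping behind such an endpoint is sketched below. Everything here is hypothetical (the `RunStore` class, its method names, and the in-memory dict); it only illustrates handing each run a secret key and checking that key before returning results.

```python
import hmac
import secrets
from typing import Dict, Optional, Tuple


class RunStore:
    """In-memory record of runs, keyed by run ID (illustrative only)."""

    def __init__(self) -> None:
        self._runs: Dict[str, dict] = {}

    def create(self, client: str) -> Tuple[str, str]:
        """Register a new run and return (run_id, secret key) for the client."""
        run_id = secrets.token_hex(8)
        key = secrets.token_urlsafe(32)
        self._runs[run_id] = {"client": client, "key": key,
                              "status": "running", "output": None}
        return run_id, key

    def finish(self, run_id: str, output: str) -> None:
        """Store the script output once the container has exited."""
        self._runs[run_id].update(status="finished", output=output)

    def status(self, run_id: str, key: str) -> Optional[dict]:
        """Return status and output, or None if the run or key is wrong."""
        run = self._runs.get(run_id)
        # Constant-time comparison avoids leaking the key through timing.
        if run is None or not hmac.compare_digest(run["key"].encode(), key.encode()):
            return None
        return {"status": run["status"], "output": run["output"]}
```

A `/status` route would then call `status(run_id, key)` and return a 404 when it yields `None`, so a client cannot tell a wrong key apart from a missing run.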

Please don't spend more than 2-4 hours on this - we're looking to see how you approached the problem and the decisions made rather than a complete solution. This should be a fun challenge rather than a stressful endeavour.
Smaller Docker images could be used, as each invocation currently takes up 500 MB.

There is no right answer as such, we will mainly be looking at code quality, software architecture skills, completeness of the solution from a software engineering perspective, and clarity of thought.
Currently there is no limit on how long the Python script can take to execute.
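
A minimal sketch of a wall-clock limit follows. It uses a plain `subprocess` child in place of the Docker container this project runs scripts in, purely to show the idea; the function name `run_with_timeout` and the default of 5 seconds are assumptions.

```python
import subprocess
import sys
from typing import Tuple


def run_with_timeout(code: str, timeout_s: float = 5.0) -> Tuple[str, bool]:
    """Run `code` in a child interpreter and return (stdout, timed_out)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed once this elapses
        )
    except subprocess.TimeoutExpired:
        return "", True
    return proc.stdout, False
```

With containers, the same pattern would apply: wait on the container with a timeout and stop it when the wait expires.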

Once completed, please create a PR containing your work, send us an email, and schedule a [second follow-up interview](https://calendar.google.com/calendar/selfsched?sstoken=UU1sbG9QV1hfcHlGfGRlZmF1bHR8ODI1ZjRlZWJlZTY0ZTQ1ZTI4MzNkZThhOGQ5MjZkNzg).
With more time, better logging could be implemented to identify why an image wasn't built or why a container failed to start, for example exceeded resource limits or failed scripts.
10 changes: 10 additions & 0 deletions app.py
@@ -0,0 +1,10 @@
from flask import Flask

# Where the clients' projects are created
UPLOAD_FOLDER = '/home/user/datapane-infra-challenge/'

app = Flask(__name__)
#app.secret_key = "secret key"
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
# Maximum allowed size of an uploaded script
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024 # 16 MB
20 changes: 20 additions & 0 deletions default_dir/requirements.txt
@@ -0,0 +1,20 @@
certifi==2021.10.8
charset-normalizer==2.0.12
click==8.1.2
docker==5.0.3
docker-pycreds==0.4.0
docopt==0.6.2
Flask==2.1.1
idna==3.3
importlib-metadata==4.11.3
itsdangerous==2.1.2
Jinja2==3.1.1
MarkupSafe==2.1.1
pipreqs==0.4.11
requests==2.27.1
six==1.16.0
urllib3==1.26.9
websocket-client==1.3.2
Werkzeug==2.1.1
yarg==0.1.9
zipp==3.8.0