23 changes: 23 additions & 0 deletions docs/source/analytics/cloud-costs.md
@@ -0,0 +1,23 @@
(analytics/cloud-costs)=

# Cloud Costs Data

In an effort to be transparent about how we use our funds, we publish the amount of money spent each day on cloud compute for running mybinder.org.

## Interpreting the data

You can find the data in the [Analytics Archive](https://archive.analytics.mybinder.org) at [cloud-costs.jsonl](https://archive.analytics.mybinder.org/cloud-costs.jsonl). Each line in the file is a JSON object, with the following keys:

1. **version**

Currently _1_; this will be incremented when the structure of this format changes.

2. **start_time** and **end_time**

The start and end of the billing period this item represents. These times are inclusive, and in Pacific Time observing DST (so PDT or PST). This timezone choice is unfortunate, but our cloud provider (Google Cloud Platform) provides detailed billing reports in this timezone only.

3. **cost**

The cost of all cloud compute resources used during this time period. This is denominated in US Dollars.

The lines are sorted by `start_time`.
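
The structure above can be read line by line with just the standard library. The following is a minimal sketch; the sample line and its cost value are made up for illustration, not real billing data:

```python
import io
import json

# A hypothetical cloud-costs.jsonl line in the documented version-1 format;
# the values here are invented for illustration.
sample = io.StringIO(
    '{"version": 1, "start_time": "2019-01-01T00:00:00-08:00", '
    '"end_time": "2019-01-01T23:59:59-08:00", "cost": 123.45}\n'
)

def read_costs(stream):
    """Yield one record per jsonl line, skipping versions we don't understand."""
    for line in stream:
        record = json.loads(line)
        if record.get("version") == 1:
            yield record

records = list(read_costs(sample))
total = sum(r["cost"] for r in records)
```

Because records are sorted by `start_time`, summing costs over a date range only requires scanning the file once.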
37 changes: 0 additions & 37 deletions docs/source/analytics/cloud-costs.rst

This file was deleted.

79 changes: 79 additions & 0 deletions docs/source/analytics/events-archive.md
@@ -0,0 +1,79 @@
(analytics/events-archive)=

# The Analytics Events Archive

BinderHub emits an event each time a repository is launched. These events are recorded as JSON and made available to the public at [archive.analytics.mybinder.org](https://archive.analytics.mybinder.org).

This page describes what is available in the Events Archive & how to interpret it.

## File format

All data files are in [jsonl](https://jsonlines.org/) format. Each line, delimited by a `\n`, is a well-formed JSON object. These files can be read and written in a streaming fashion, one line at a time, without having to read the entire file into memory.

## Launch data by date

For each day since we started keeping track (2018-11-03), there is a file named `events-<YYYY>-<MM>-<DD>.jsonl` that contains data for all the launches performed by mybinder.org on that date. All timestamps and dates are in [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time).

Each line is a JSON object that conforms to [this JSON Schema](https://github.com/jupyterhub/binderhub/blob/HEAD/binderhub/event-schemas/launch.json). A description of these fields is provided below.

1. **schema** and **version**

Currently set to `binderhub.jupyter.org/launch` and `1` respectively. These identify the kind of event this is (a launch event from BinderHub) and the current version of the event schema. This lets us evolve the format of the events emitted without breaking existing analytics code. New versions of the launch schema may add fields or change the meaning of current ones. We will also add other kinds of events here in the future, for example successful builds.

Your analytics code **must** verify that the event you are parsing has the schema and version you expect before proceeding. If you don't do this, your code might fail in unexpected ways in the future.

2. **timestamp**

An ISO 8601-formatted timestamp of when the event was emitted, rounded down to the nearest minute. The lines in the file are ordered by timestamp, starting at the earliest.

3. **provider**

Where the launched repository was hosted. Current options are `GitHub`, `GitLab` and `Git`.

4. **spec**

A specification that immutably and uniquely identifies the repository and commit within the provider.

For GitHub, it is `<repo>/<commit-spec>`; for example, `yuvipanda/example-requirements/HEAD`. For GitLab, it is also `<repo>/<commit-spec>`, except `repo` is URL-escaped. For raw Git repositories, it is `<repo-url>/<commit-spec>`, where `repo-url` is the full, URL-escaped URL of the repository and `commit-spec` is a full commit hash.

5. **status**

Whether the launch succeeded (`success`) or failed (`failure`). Currently, only successful launches are recorded.
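
The `spec` rules above can be sketched as a small parser. This is an illustrative sketch based only on the rules described here, not code taken from BinderHub itself:

```python
from urllib.parse import unquote

def parse_spec(provider, spec):
    """Split a launch spec into (repo, ref), per the per-provider rules above."""
    if provider == "GitHub":
        # <owner>/<repo>/<commit-spec>; the commit spec may itself contain "/"
        owner, repo, ref = spec.split("/", 2)
        return f"{owner}/{repo}", ref
    if provider == "GitLab":
        # repo is URL-escaped, so the first "/" separates it from the ref
        repo, ref = spec.split("/", 1)
        return unquote(repo), ref
    if provider == "Git":
        # <url-escaped-repo-url>/<full-commit-hash>
        repo, ref = spec.rsplit("/", 1)
        return unquote(repo), ref
    raise ValueError(f"unknown provider: {provider}")

repo, ref = parse_spec("GitHub", "yuvipanda/example-requirements/HEAD")
```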

### Example code

Some popular ways of reading this event data into a useful data structure are provided here.

#### `pandas`

```python
import pandas as pd
df = pd.read_json("https://archive.analytics.mybinder.org/events-2018-11-05.jsonl", lines=True)
df
```

#### Plain Python

```python
import requests
import json

response = requests.get("https://archive.analytics.mybinder.org/events-2018-11-05.jsonl")
data = [json.loads(l) for l in response.iter_lines()]
```
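
As noted above, analytics code must check the schema and version before trusting an event. A minimal filter might look like the following; the sample lines here are made up for illustration:

```python
import json

EXPECTED_SCHEMA = "binderhub.jupyter.org/launch"
EXPECTED_VERSION = 1

def launch_events(lines):
    """Yield only events whose schema and version this code understands."""
    for line in lines:
        event = json.loads(line)
        if event.get("schema") != EXPECTED_SCHEMA:
            continue
        if event.get("version") != EXPECTED_VERSION:
            continue
        yield event

# Made-up sample lines: one matching launch event, one hypothetical other event
sample = [
    '{"schema": "binderhub.jupyter.org/launch", "version": 1,'
    ' "timestamp": "2018-11-05T10:29:00+00:00", "provider": "GitHub",'
    ' "spec": "yuvipanda/example-requirements/HEAD", "status": "success"}',
    '{"schema": "binderhub.jupyter.org/build", "version": 1}',
]
events = list(launch_events(sample))
```

Skipping (rather than erroring on) unknown schemas keeps the code working when new event types are added to the archive.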

## `index.jsonl`

The [index.jsonl](https://archive.analytics.mybinder.org/index.jsonl) file lists all the dates an event archive is available for. The following fields are present for each line:

1. **date**

The UTC date the event archive is for

2. **name**

The name of the file containing the events. This is a relative path: since we fetched the `index.jsonl` file from [https://archive.analytics.mybinder.org](https://archive.analytics.mybinder.org), that is the base URL used to resolve these names. For example, when `name` is `events-2018-11-05.jsonl`, the full URL to the file is `https://archive.analytics.mybinder.org/events-2018-11-05.jsonl`.

3. **count**

Total number of events in the file.
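
Putting the three fields together, resolving each relative `name` against the base URL is straightforward. A minimal sketch, with invented sample index lines (the `count` values below are not real):

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://archive.analytics.mybinder.org/"

# Hypothetical index.jsonl lines in the documented format
index_lines = [
    '{"date": "2018-11-03", "name": "events-2018-11-03.jsonl", "count": 7057}',
    '{"date": "2018-11-04", "name": "events-2018-11-04.jsonl", "count": 7489}',
]

entries = [json.loads(line) for line in index_lines]
# Resolve each relative `name` against the base URL the index was fetched from
urls = [urljoin(BASE_URL, e["name"]) for e in entries]
# `count` lets you estimate download sizes before fetching anything
total_events = sum(e["count"] for e in entries)
```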
122 changes: 0 additions & 122 deletions docs/source/analytics/events-archive.rst

This file was deleted.

9 changes: 9 additions & 0 deletions docs/source/analytics/index.md
@@ -0,0 +1,9 @@
# Analytics

A public events archive with data about daily Binder launches.

```{toctree}
:maxdepth: 2
events-archive.md
cloud-costs.md
```
10 changes: 0 additions & 10 deletions docs/source/analytics/index.rst

This file was deleted.

12 changes: 12 additions & 0 deletions docs/source/components/index.md
@@ -0,0 +1,12 @@
# Components

These pages describe the different technical pieces that make up the mybinder.org deployment.

```{toctree}
:maxdepth: 2
metrics.md
dashboards.md
ingress.md
cloud.md
matomo.md
```
15 changes: 0 additions & 15 deletions docs/source/components/index.rst

This file was deleted.

41 changes: 41 additions & 0 deletions docs/source/components/matomo.md
@@ -0,0 +1,41 @@
# Matomo (formerly Piwik) analytics

[Matomo](https://matomo.org/) is a self-hosted free & open source alternative to [Google Analytics](https://analytics.google.com).

## Why?

Matomo gives us better control over what is tracked, how long it is stored, and what we can do with the data. We want to collect as little data as possible and share it with the world in safe ways wherever we can. Matomo is an important step in making this possible.

## How is it set up?

Matomo is a PHP+MySQL application. We run it using the Apache-based upstream [docker image](https://hub.docker.com/_/matomo/). We could improve performance in the future by switching to `nginx+fpm`.

We use [Google CloudSQL for MySQL](https://cloud.google.com/sql/docs/mysql/) to provision a fully managed, standard MySQL database. The [sidecar pattern](https://cloud.google.com/sql/docs/mysql/connect-kubernetes-engine) is used to connect Matomo to this database. A service account with appropriate credentials to connect to the database has been provisioned and checked in to the repo. A MySQL user named `matomo` and a MySQL database named `matomo` should also be created in the Google Cloud Console.

## Initial Installation

Matomo is a PHP application, and this has a number of drawbacks. The initial install **[must](https://github.com/matomo-org/matomo/issues/10257)** be completed manually through the web interface. Matomo will error if it finds a complete `config.ini.php` file (which we provide) but no database tables.

The first time you install Matomo, you need to do the following:

1. Do a deploy. This sets up Matomo, but not the database tables.
2. Use `kubectl --namespace=<namespace> exec -it <matomo-pod> -- /bin/bash` to get a shell in the Matomo container.
3. Run `rm config/config.ini.php`.
4. Visit the web interface & complete the installation. The database username & password, as well as the admin username and password, are available in the encrypted secret files in this repo. Completing the installation creates the database tables.
5. When the setup is complete, delete the pod. The replacement pod will pick up our `config.ini.php` file, and everything should work normally.

This is not ideal.

## Admin access

The admin username for Matomo is `admin`. You can find the password in `secret/staging.yaml` for staging & `secret/prod.yaml` for prod.

## Security

PHP code is notoriously hard to secure. Matomo has had security audits, so it's not the worst offender, but we should still treat it with suspicion and wall off as much of it as possible. Arbitrary code execution vulnerabilities are common in PHP applications, so our security model assumes one may exist.

We currently have:

1. A firewall hole (in Google Cloud) allowing it access to the CloudSQL instance it needs to store data in. Only port 3307 (which is used by the OAuth2+ServiceAccount authenticated CloudSQLProxy) is open. This helps prevent random MySQL password grabbers from inside the cluster.
2. A Kubernetes NetworkPolicy is in place that limits what outbound connections Matomo can make. This should be further tightened: ingress should only be allowed on the nginx port from our ingress controllers.
3. We do not mount a Kubernetes ServiceAccount in the Matomo pod. This denies it access to the Kubernetes API.