diff --git a/docs/source/analytics/cloud-costs.md b/docs/source/analytics/cloud-costs.md new file mode 100644 index 000000000..931556158 --- /dev/null +++ b/docs/source/analytics/cloud-costs.md @@ -0,0 +1,23 @@ +(analytics/cloud-costs)= + +# Cloud Costs Data + +In an effort to be transparent about how we use our funds, we publish the amount of money spent each day in cloud compute costs for running mybinder.org. + +## Interpreting the data + +You can find the data in the [Analytics Archive](https://archive.analytics.mybinder.org) at [cloud-costs.jsonl](https://archive.analytics.mybinder.org/cloud-costs.jsonl). Each line in the file is a JSON object, with the following keys: + +1. **version** + + Currently _1_. It will be incremented when the structure of this format changes. + +2. **start_time** and **end_time** + + The start and end of the billing period this item represents. These times are inclusive, and in Pacific Time, observing DST (so PDT or PST). The timezone choice is unfortunate, but our cloud provider (Google Cloud Platform) provides detailed billing reports in this timezone only. + +3. **cost** + + The cost of all cloud compute resources used during this time period. This is denominated in US Dollars. + +The lines are sorted by `start_time`. diff --git a/docs/source/analytics/cloud-costs.rst b/docs/source/analytics/cloud-costs.rst deleted file mode 100644 index f91c2e55b..000000000 --- a/docs/source/analytics/cloud-costs.rst +++ /dev/null @@ -1,37 +0,0 @@ -.. _analytics/cloud-costs: - -================ -Cloud Costs Data -================ - -In an effort to be transparent about how we use our funds, -we publish the amount of money spent each day in cloud -compute costs for running mybinder.org. - -Interpreting the data -===================== - -You can find the data in the `Analytics Archive -`_ at `cloud-costs.jsonl -`_. Each line in -the file is a JSON object, with the following keys: - -#. 
**version** - - Currently *1*, will be incremented when the structure of this format - changes. - -#. **start_time** and **end_time** - - The start and end of the billing period this item represents. These - times are inclusive, and in pacific time observing DST (so PDT or PST). - The timezone choice is unfortunate, but unfortunately our cloud provider - (Google Cloud Platform) provides detailed billing reports in this timezone - only. - -#. **cost** - - The cost of all cloud compute resources used during this time period. This - is denominated in US Dollars. - -The lines are sorted by ``start_time``. diff --git a/docs/source/analytics/events-archive.md b/docs/source/analytics/events-archive.md new file mode 100644 index 000000000..408cc8d86 --- /dev/null +++ b/docs/source/analytics/events-archive.md @@ -0,0 +1,79 @@ +(analytics/events-archive)= + +# The Analytics Events Archive + +BinderHub emits an event each time a repository is launched. They are recorded as JSON, and made available to the public at [archive.analytics.mybinder.org](https://archive.analytics.mybinder.org). + +This page describes what is available in the Events Archive & how to interpret it. + +## File format + +All data files are in [jsonl](https://jsonlines.org/) format. Each line, delimited by a `\n`, is a well-formed JSON object. These files can be read / written in a streaming fashion, one line at a time, without having to read the entire file into memory. + +## Launch data by date + +For each day since we started keeping track (2018-11-03), there is a file named `events---
.jsonl` that contains data for all the launches performed by mybinder.org on that date. All timestamps and dates are in [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time). + +Each line is a JSON object that conforms to [this JSON Schema](https://github.com/jupyterhub/binderhub/blob/HEAD/binderhub/event-schemas/launch.json). A description of these fields is provided below. + +1. **schema** and **version** + + Currently set to `binderhub.jupyter.org/launch` and `1` respectively. These identify the kind of event this is (a launch event from BinderHub) and the current version of the event schema. This lets us evolve the format of the events emitted without breaking existing analytics code. New versions of the launch schema may add additional fields, or change meanings of current ones. We will definitely add other events that are available here too, for example successful builds. + + Your analytics code **must** make sure the event you are parsing has the schema and version you are expecting before proceeding. If you don't do this, your code might fail in unexpected ways in the future. + +2. **timestamp** + + ISO8601 formatted timestamp when the event was emitted. These are rounded down to the closest minute. The lines in the file are ordered by timestamp, starting at the earliest. + +3. **provider** + + Where the launched repository was hosted. Current options are `GitHub`, `GitLab` and `Git`. + +4. **spec** + + Specification identifying the repository / commit immutably & uniquely in the provider. + + For GitHub, it is `/`. An example would be `yuvipanda/example-requirements/HEAD`. For GitLab, it is `/`, except `repo` is URL escaped. For raw Git repositories, it is `/`. `repo-url` is the full URL of the repo, URL escaped, and `commit-spec` is a full commit hash. + +5. **status** + + Whether the launch succeeded (`success`) or failed (`failure`). Currently only successful launches are recorded. 
+ +### Example code + +Some popular ways of reading this event data into a useful data structure are provided here. + +#### `pandas` + +```python +import pandas as pd +df = pd.read_json("https://archive.analytics.mybinder.org/events-2018-11-05.jsonl", lines=True) +df +``` + +#### Plain Python + +```python +import requests +import json + +response = requests.get("https://archive.analytics.mybinder.org/events-2018-11-05.jsonl") +data = [json.loads(line) for line in response.iter_lines() if line] +``` + +## `index.jsonl` + +The [index.jsonl](https://archive.analytics.mybinder.org/index.jsonl) file lists all the dates an event archive is available for. The following fields are present for each line: + +1. **date** + + The UTC date the event archive is for. + +2. **name** + + The name of the file containing the events. This is a relative path. Since we got the `index.jsonl` file from `https://archive.analytics.mybinder.org`, that is the base URL used to resolve these. For example, when `name` is `events-2018-11-05.jsonl`, the full URL to the file is `https://archive.analytics.mybinder.org/events-2018-11-05.jsonl`. + +3. **count** + + Total number of events in the file. diff --git a/docs/source/analytics/events-archive.rst b/docs/source/analytics/events-archive.rst deleted file mode 100644 index 27d003607..000000000 --- a/docs/source/analytics/events-archive.rst +++ /dev/null @@ -1,122 +0,0 @@ -.. _analytics/events-archive: - -============================ -The Analytics Events Archive -============================ - -BinderHub emits an event each time a repository is launched. They -are recorded as JSON, and made available to the public at -`archive.analytics.mybinder.org `_. - -This page describes what is available in the Events Archive & how to -interpret it. - -File format -=========== - -All data files are in `jsonl `_ format. Each line, -delimited by a ``\n`` is a is a well formed JSON object. 
These files can -be read / written in a streaming fashion, one line at a time, without -having to read the entire file into memory. - -Launch data by date -=================== - -For each day since we started keeping track (2018-11-03), there is a -file named ``events---
.jsonl`` that contains data for -all the launches performed by mybinder.org on that date. All timestamps -and dates are in `UTC `_. - -Each line is a JSON object that conforms to `this JSON Schema -`_. -A description of these fields is provided below. - -#. **schema** and **version** - - Currently set to ``binderhub.jupyter.org/launch`` and ``1`` respectively. These - identify the kind of event this is (a launch event from BinderHub) and the - current version of the event schema. This lets us evolve the format of the - events emitted without breaking existing analytics code. New versions of - the launch schema may add additional fields, or change meanings of current - ones. We will definitely add other events that are available here too - - for example, successful builds. - - Your analytics code **must** make sure the event you are parsing has - the schema and version you are expecting before proceeding. If you - don't do this, your code might fail in unexpected ways in the future. - -#. **timestamp** - - ISO8601 formatted timestamp when the event was emitted. These are rounded - down to the closest minute. The lines in the file are ordered by timestamp, - starting at the earliest. - -#. **provider** - - Where the launched repository was hosted. Current options are ``GitHub``, - ``GitLab`` and ``Git``. - -#. **spec** - - Specification identifying the repository / commit immutably & uniquely in - the provider. - - For GitHub, it is ``/``. Example would be ``yuvipanda/example-requirements/HEAD``. - For GitLab, it is ``/``, except ``repo`` is URL escaped. - For raw Git repositories, it is ``/``. ``repo-url`` is full URL escaped - to the repo and ``commit-spec`` is a full commit hash. - -#. **status** - - Wether the launch succeeded (``success``) or failed (``failure``). Currently - only successful launches are recorded. - -Example code ------------- - -Some popular ways of reading this event data into a useful data structure are -provided here. 
- -``pandas`` -~~~~~~~~~~ - -.. code-block:: python - - import pandas as pd - df = pd.read_json("https://archive.analytics.mybinder.org/events-2018-11-05.jsonl", lines=True) - df - -Plain Python -~~~~~~~~~~~~ - -.. code-block:: python - - import requests - import json - - response = requests.get("https://archive.analytics.mybinder.org/events-2018-11-05.jsonl") - data = [json.loads(l) for l in response.iter_lines()] - - -``index.jsonl`` -=============== - -The `index.jsonl `_ file lists -all the dates an event archive is available for. The following fields are present -for each line: - -#. **date** - - The UTC date the event archive is for - -#. **name** - - The name of the file containing the events. This is a relative path - since we - got the ``index.jsonl`` file from `https://archive.analytics.mybinder.org`, that - is the base URL used to resolve these. For example when ``name`` is - ``events-2018-11-05.jsonl``, the full URL to the file is - ``https://archive.analytics.mybinder.org/events-2018-11-05.jsonl``. - -#. **count** - - Total number of events in the file. diff --git a/docs/source/analytics/index.md b/docs/source/analytics/index.md new file mode 100644 index 000000000..bcf0e42e1 --- /dev/null +++ b/docs/source/analytics/index.md @@ -0,0 +1,9 @@ +# Analytics + +A public events archive with data about daily Binder launches. + +```{toctree} +:maxdepth: 2 +events-archive.md +cloud-costs.md +``` diff --git a/docs/source/analytics/index.rst b/docs/source/analytics/index.rst deleted file mode 100644 index 469f35774..000000000 --- a/docs/source/analytics/index.rst +++ /dev/null @@ -1,10 +0,0 @@ -Analytics ---------- - -A public events archive with data about daily Binder launches. - -.. 
toctree:: - :maxdepth: 2 - - events-archive - cloud-costs diff --git a/docs/source/components/index.md b/docs/source/components/index.md new file mode 100644 index 000000000..6b8813982 --- /dev/null +++ b/docs/source/components/index.md @@ -0,0 +1,12 @@ +# Components + +These pages describe the different technical pieces that make up the mybinder.org deployment. + +```{toctree} +:maxdepth: 2 +metrics.md +dashboards.md +ingress.md +cloud.md +matomo.md +``` diff --git a/docs/source/components/index.rst b/docs/source/components/index.rst deleted file mode 100644 index 42ab5f157..000000000 --- a/docs/source/components/index.rst +++ /dev/null @@ -1,15 +0,0 @@ -========== -Components -========== - -These pages describe the different technical pieces that make up the -mybinder.org deployment. - -.. toctree:: - :maxdepth: 2 - - metrics.md - dashboards.md - ingress.md - cloud.md - matomo.rst diff --git a/docs/source/components/matomo.md b/docs/source/components/matomo.md new file mode 100644 index 000000000..7893be1b9 --- /dev/null +++ b/docs/source/components/matomo.md @@ -0,0 +1,41 @@ +# Matomo (formerly Piwik) analytics + +[Matomo](https://matomo.org/) is a self-hosted free & open source alternative to [Google Analytics](https://analytics.google.com). + +## Why? + +Matomo gives us better control of what is tracked, how long it is stored & what we can do with the data. We would like to collect as little data as possible & share it with the world in safe ways as much as possible. Matomo is an important step in making this possible. + +## How it is set up? + +Matomo is a PHP+MySQL application. We use the apache based upstream [docker image](https://hub.docker.com/_/matomo/) to run it. We can improve performance in the future if we wish by switching to `nginx+fpm`. + +We use [Google CloudSQL for MySQL](https://cloud.google.com/sql/docs/mysql/) to provision a fully managed, standard mysql database. 
The [sidecar pattern](https://cloud.google.com/sql/docs/mysql/connect-kubernetes-engine) is used to connect Matomo to this database. A service account with appropriate credentials to connect to the database has been provisioned & checked in to the repo. A MySQL user with name `matomo` & a MySQL database with name `matomo` should also be created in the Google Cloud Console. + +## Initial Installation + +Matomo is a PHP application, and this has a number of drawbacks. The initial install **[must](https://github.com/matomo-org/matomo/issues/10257)** be completed manually through the web interface. Matomo will error if it finds a complete `config.ini.php` file (which we provide) but no database tables exist. + +The first time you install Matomo, you need to do the following: + +1. Do a deploy. This sets up Matomo, but not the database tables. +2. Use `kubectl --namespace= exec -it /bin/bash` to get a shell on the Matomo container. +3. Run `rm config/config.ini.php`. +4. Visit the web interface & complete installation. The database username & password are available in the secret encrypted files in this repo. So are the admin username and password. This creates the database tables. +5. When the setup is complete, delete the pod. This should bring up our `config.ini.php` file, and everything should work normally. + +This is not ideal. + +## Admin access + +The admin username for Matomo is `admin`. You can find the password in `secret/staging.yaml` for staging & `secret/prod.yaml` for prod. + +## Security + +PHP code is notoriously hard to secure. Matomo has had security audits, so it's not the worst. However, we should treat it with suspicion & wall off as much of it as possible. Arbitrary code execution vulnerabilities often happen in PHP, so we gotta use that as our security model. + +We currently have: + +1. A firewall hole (in Google Cloud) allowing it access to the CloudSQL instance it needs to store data in. 
Only port 3307 (which is used by the OAuth2+ServiceAccount authenticated CloudSQLProxy) is open. This helps prevent random MySQL password grabbers from inside the cluster. +2. A Kubernetes NetworkPolicy is in place that limits what outbound connections Matomo can make. This should be further tightened down: ingress should only be allowed on the nginx port from our ingress controllers. +3. We do not mount a Kubernetes ServiceAccount in the Matomo pod. This denies it access to the Kubernetes API. diff --git a/docs/source/components/matomo.rst b/docs/source/components/matomo.rst deleted file mode 100644 index e22588884..000000000 --- a/docs/source/components/matomo.rst +++ /dev/null @@ -1,79 +0,0 @@ -================================= -Matomo (formerly Piwik) analytics -================================= - -`Matomo `_ is a self-hosted free & -open source alternative to `Google Analytics `_. - -Why? -==== - -Matomo gives us better control of what is tracked, how long it is stored -& what we can do with the data. We would like to collect as -little data as possible & share it with the world in safe ways -as much as possible. Matomo is an important step in making this possible. - -How it is set up? -================= - -Matomo is a PHP+MySQL application. We use the apache based upstream -`docker image `_ to run it. We can -improve performance in the future if we wish by switching to ``nginx+fpm``. - -We use `Google CloudSQL for MySQL `_ -to provision a fully managed, standard mysql database. The -`sidecar pattern `_ -is used to connect Matomo to this database. A service account with appropriate -credentials to connect to the database has been provisioned & checked-in -to the repo. A MySQL user with name ``matomo`` & a MySQL database with name ``matomo`` -should also be created in the Google Cloud Console. - -Initial Installation -==================== - -Matomo is a PHP application, and this has a number of drawbacks. 
The initial -install **`must `_** be completed -with a manual web interface. Matomo will error if it finds a complete ``config.ini.php`` -file (which we provide) but no database tables exist. - -The first time you install Matomo, you need to do the following: - -1. Do a deploy. This sets up Matomo, but not the database tables -2. Use ``kubectl --namespace= exec -it /bin/bash`` to - get shell on the matomo container. -3. Run ``rm config/config.ini.php``. -4. Visit the web interface & complete installation. The database username & password - are available in the secret encrypted files in this repo. So is the admin username - and password. This creates the database tables. -5. When the setup is complete, delete the pod. This should bring up our ``config.ini.php`` - file, and everything should work normally. - -This is not ideal. - -Admin access -============ - -The admin username for Matomo is ``admin``. You can find the password in -``secret/staging.yaml`` for staging & ``secret/prod.yaml`` for prod. - -Security -======== - -PHP code is notoriously hard to secure. Matomo has had security audits, -so it's not the worst. However, we should treat it with suspicion & -wall off as much of it away as possible. Arbitrary code execution -vulnerabilities often happen in PHP, so we gotta use that as our -security model. - -We currently have: - -1. A firewall hole (in Google Cloud) allowing it access to the CloudSQL - instance it needs to store data in. Only port 3307 (which is used by - the OAuth2+ServiceAccount authenticated CloudSQLProxy) is open. This - helps prevent random MySQL password grabbers from inside the cluster. -2. A Kubernetes NetworkPolicy is in place that limits what outbound - connections Matomo can make. This should be further tightened down - - ingress should only be allowed on the nginx port from our ingress - controllers. -3. We do not mount a Kubernetes ServiceAccount in the Matomo pod. This - denies it access to the KubernetesAPI. 
diff --git a/docs/source/conf.py b/docs/source/conf.py index 3cc5ea8d1..148e975af 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -19,9 +19,6 @@ "jupyterhub_sphinx_theme", ] -# The suffix(es) of source filenames. -source_suffix = [".rst", ".md"] - # The root toctree document. root_doc = master_doc = "index" @@ -35,6 +32,9 @@ # This patterns also effect to html_static_path and html_extra_path exclude_patterns = [] +# A string of reStructuredText that will be included at the end of every source file that is read. +with open("hyperlink-targets.md", encoding="utf-8") as _hyperlink_targets: + rst_epilog = _hyperlink_targets.read() # -- Options for HTML output ---------------------------------------------- # ref: http://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output diff --git a/docs/source/deployment/index.md b/docs/source/deployment/index.md new file mode 100644 index 000000000..29884ae80 --- /dev/null +++ b/docs/source/deployment/index.md @@ -0,0 +1,9 @@ +# Deployment and Operation + +```{toctree} +:maxdepth: 2 +prereqs.md +how.md +what.md +k3s.md +``` diff --git a/docs/source/deployment/index.rst b/docs/source/deployment/index.rst deleted file mode 100644 index a30909a1c..000000000 --- a/docs/source/deployment/index.rst +++ /dev/null @@ -1,11 +0,0 @@ -======================== -Deployment and Operation -======================== - -.. toctree:: - :maxdepth: 2 - - prereqs - how - what - k3s diff --git a/docs/source/deployment/prereqs.md b/docs/source/deployment/prereqs.md index af93ea5d6..9fc38bd99 100644 --- a/docs/source/deployment/prereqs.md +++ b/docs/source/deployment/prereqs.md @@ -1,14 +1,14 @@ # Pre-requisite technologies -The following are tools and technologies that mybinder.org uses. You should have -a working familiarity with them in order to make changes to the mybinder.org deployment. +The following are tools and technologies that [mybinder.org] uses. 
You should have +a working familiarity with them in order to make changes to the [mybinder.org] deployment. -This is a non-exhaustive list. Feel free to ask us questions on the gitter channel or -here if something specific could be clearer! +This is a non-exhaustive list. Feel free to ask us questions on the [Jupyter instance of Zulip] or +at the [`mybinder.org-deploy`] Git repository if something specific could be clearer! ## Google Cloud Platform -MyBinder.org currently runs on Google Cloud. There are two Google Cloud projects +[mybinder.org] currently runs on Google Cloud. There are two Google Cloud projects that we use: 1. `binder-staging` contains all resources for the staging deployment @@ -17,13 +17,13 @@ that we use: We'll hand out credentials to anyone who wants to play with the staging deployment, so please just ask! -While you only need merge access in this repository to deploy changes, ideally -you should also have access to the two Google Cloud Projects so you can debug +While you only need merge access in the [`mybinder.org-deploy`] Git repository to deploy changes, ideally +you should also have access to the two Google Cloud projects so you can debug things when deployments fail. ## Kubernetes -We heavily use [Kubernetes](https://kubernetes.io/) for the mybinder.org deployment, and it is important you +We heavily use [Kubernetes] for the [mybinder.org] deployment, and it is important you have a working knowledge of how to use Kubernetes. Detailed explanations are out of the scope of this repository, but there is a good [list of tutorials](https://kubernetes.io/docs/tutorials/). Specifically, going through the [interactive tutorial](https://kubernetes.io/docs/tutorials/kubernetes-basics/) @@ -31,9 +31,9 @@ to get comfortable using `kubectl` is required. ## Helm -We use [helm](https://helm.sh) to manage our deployments, and it is important you -have a working knowledge of how to use helm. 
Detailed explanations are out of the -scope of this repository, but [docs.helm.sh](https://docs.helm.sh) is an excellent +We use [Helm](https://helm.sh) to manage our deployments, and it is important you +have a working knowledge of how to use Helm. Detailed explanations are out of the +scope of this repository, but [Helm's official documentation](https://docs.helm.sh) is an excellent source of information. At a minimum, you must at least understand: - [What is a chart?](https://helm.sh/docs/chart_template_guide/getting_started/#charts) @@ -42,11 +42,8 @@ source of information. At a minimum, you must at least understand: ## GitHub Actions -We use [GitHub Actions](https://docs.github.com/en/actions) for doing all our deployments. Our +We use [GitHub Actions] for doing all our deployments. Our [`.github/workflows/cd.yml`](https://github.com/jupyterhub/mybinder.org-deploy/blob/main/.github/workflows/cd.yml) file contains the entire configuration for our **continuous** deployment. -Because mybinder.org dependes on JupyterHub, BinderHub and repo2docker, we also use [GitHub Actions to watch those dependencies](https://github.com/jupyterhub/mybinder.org-deploy/blob/main/.github/workflows/watch-dependencies.yaml) once every day and, if needed, create a pull request. mybinder.org operators can manually trigger a dependency check by clicking the "[Run workflow](https://github.com/jupyterhub/mybinder.org-deploy/actions/workflows/watch-dependencies.yaml)" button. - -[mybinder.org]: https://mybinder.org -[staging.mybinder.org]: https://staging.mybinder.org +Because [mybinder.org] dependes on JupyterHub, BinderHub and `repo2docker`, we also use [GitHub Actions to watch those dependencies](https://github.com/jupyterhub/mybinder.org-deploy/blob/main/.github/workflows/watch-dependencies.yaml) once every day and, if needed, create a pull request. 
[mybinder.org] operators can manually trigger a dependency check by clicking the "[Run workflow](https://github.com/jupyterhub/mybinder.org-deploy/actions/workflows/watch-dependencies.yaml)" button. diff --git a/docs/source/getting_started/getting_started.md b/docs/source/getting_started/getting_started.md index 400b64dfe..dbb6d045c 100644 --- a/docs/source/getting_started/getting_started.md +++ b/docs/source/getting_started/getting_started.md @@ -6,7 +6,7 @@ maintain the BinderHub deployment at . ## Make sure you have access on the Google Cloud project Go to and see if you have `binderhub` listed -in your projects. If not, message one of the Binder devs on [Jupyter instance of Zulip](https://jupyter.zulipchat.com/) +in your projects. If not, message one of the Binder devs on [Jupyter instance of Zulip] to get access. ## Install `kubectl` @@ -93,6 +93,6 @@ useful in spotting and debugging problems in the future. ## Start helping out! There are many ways that you can help debug/maintain/improve the `mybinder.org` -deployment. The best way to get started is to keep an eye on the [Jupyter instance of Zulip](https://jupyter.zulipchat.com/) +deployment. The best way to get started is to keep an eye on the [Jupyter instance of Zulip] as well as the Grafana dashboard. If you see something interesting, don't hesitate to ask questions or make suggestions! diff --git a/docs/source/getting_started/index.md b/docs/source/getting_started/index.md new file mode 100644 index 000000000..558bfad7a --- /dev/null +++ b/docs/source/getting_started/index.md @@ -0,0 +1,14 @@ +# Getting started + +These resources describe how to get started with the mybinder.org +operations team. It contains checklists of steps to take to make sure +you have the right permissions, as well as contextual information about +the mybinder.org deployment. 
+ +```{toctree} +:maxdepth: 3 +local_environment.md +production_environment.md +getting_started.md +terminology.md +``` diff --git a/docs/source/getting_started/index.rst b/docs/source/getting_started/index.rst deleted file mode 100644 index 10f32ec82..000000000 --- a/docs/source/getting_started/index.rst +++ /dev/null @@ -1,15 +0,0 @@ -=============== -Getting started -=============== - -These resources describe how to get started with the mybinder.org operations -team. It contains checklists of steps to take to make sure you have the right -permissions, as well as contextual information about the mybinder.org deployment. - -.. toctree:: - :maxdepth: 3 - - local_environment - getting_started - production_environment - terminology diff --git a/docs/source/getting_started/production_environment.md b/docs/source/getting_started/production_environment.md index 10707d17a..e4c03bbd8 100644 --- a/docs/source/getting_started/production_environment.md +++ b/docs/source/getting_started/production_environment.md @@ -1,23 +1,23 @@ # Production environment This section is an overview of the repositories, projects, and -systems used in a mybinder.org production deployment. +systems used in the [mybinder.org] production deployment. Reference: [Google SRE book section on Production Environment](https://sre.google/sre-book/production-environment/) ## Repository structure -This repository contains a 'meta chart' (`mybinder`) that fully captures the -state of the deployment on mybinder.org. Since it is a full helm chart, you -can read the [official helm chart structure](https://docs.helm.sh/developing_charts/#the-chart-file-structure) -document to know more about its structure. +This repository contains a Helm "meta chart" (`mybinder`) that fully captures the +state of the deployment on . Since it is a full Helm chart, you +can read the [official Helm chart structure documentation](https://docs.helm.sh/developing_charts/#the-chart-file-structure) +to learn more about its structure. 
## Dependent charts -The core of the meta-chart pattern is to install a bunch of [dependent charts](https://docs.helm.sh/developing_charts/#chart-dependencies), +The core of the "meta chart" pattern is to install a bunch of [dependent charts](https://docs.helm.sh/developing_charts/#chart-dependencies), specified in `mybinder/Chart.yaml`. This contains both support -charts like nginx-ingress, grafana, prometheus, but also the core application chart -`binderhub`. Everything is version pinned here. +charts like `nginx-ingress`, `grafana`, `prometheus`, **and** the core application chart +`binderhub`. Everything is version pinned in `mybinder/Chart.yaml`. ## Configuration values @@ -39,22 +39,29 @@ The following files fully capture the state of the production deployment: 3. `config/prod.yaml` - Non-secret values specific to the production deployment -**Important**: For maintainability and consistency, we try to keep the contents +```{important} +For maintainability and consistency, we try to keep the contents of `staging.yaml` and `prod.yaml` super minimal - they should be as close to each other as possible. We want all common config in `values.yaml` so testing -on staging gives us confidence it will work on prod. We also never share the same -secrets between staging & prod for security boundary reasons. +on staging gives us confidence it will work on production. We also never share the same +secrets between staging and production for security boundary reasons. +``` ## Deployment nodes and pools +### Staging + The staging cluster has one node pool, which makes things simple. + +### Production + The production cluster has two, one for "core" pods (the hub, etc.) and another dedicated to "user" pods (builds and user servers). This strategy helps protect our key services from potential issues caused by users and helps us drain user nodes when we need to. 
-Since ~only user pods should be running on the user nodes, +Since "only" user pods should be running on the user nodes, cordoning that node should result in it being drained and reclaimed -after the max-pod-age lifetime limit +after the `max-pod-age` lifetime limit which often wouldn't happen without manual intervention. It is still _not quite true_ that only user pods are running on the user nodes at this point. @@ -69,13 +76,13 @@ Users and core pods are assigned to their pools via a `nodeSelector` in `config/ We use a custom label `mybinder.org/node-purpose = core | user` to select which node a pod should run on. -## mybinder.org specific extra software +## `mybinder.org` specific extra software -We sometimes want to run additional software for the mybinder deployment that +We sometimes want to run additional software for the deployment that does not already have a chart, or would be too cumbersome to use with a chart. For those cases, we can create kubernetes objects directly from the `mybinder` meta chart. You can see an example of this under `mybinder/templates/redirector` -that is used to set up a simple nginx based HTTP redirector. +that is used to set up a simple NGINX based HTTP redirector. ## Related repositories @@ -277,15 +284,3 @@ People who currently have the git-crypt secret include: - _add yourself here if you have it_ Contact one of them if you need access to the git-crypt key. 
- -[mybinder.org-deploy]: https://github.com/jupyterhub/mybinder.org-deploy -[prod]: https://mybinder.org -[mybinder.org]: https://mybinder.org -[staging.mybinder.org]: https://staging.mybinder.org -[staging]: https://staging.mybinder.org -[binderhub]: https://github.com/jupyterhub/binderhub -[`jupyterhub/binderhub`]: https://github.com/jupyterhub/binderhub -[binderhub documentation]: https://binderhub.readthedocs.io/en/latest/ -[repo2docker]: https://github.com/jupyterhub/repo2docker -[git-crypt]: https://github.com/AGWA/git-crypt -[ssh-vault]: https://github.com/ssh-vault/ssh-vault diff --git a/docs/source/getting_started/terminology.md b/docs/source/getting_started/terminology.md new file mode 100644 index 000000000..e28103031 --- /dev/null +++ b/docs/source/getting_started/terminology.md @@ -0,0 +1,13 @@ +# Terminology for the deployment + +This page contains common words or phrases in the dev-ops community that will be useful in understanding and maintaining the `mybinder.org` deployment. + +(term-cordoning)= + +## "cordoning" a node + +[Kubernetes page on cordoning](https://kubernetes.io/docs/concepts/architecture/nodes/#manual-node-administration). + +Sometimes you want to ensure that **no new pods** will be started on a given node. Usually this is because you suspect the node has a problem, or you wish to remove the node but need it to be free of pods first. + +Cordoning the node tells Kubernetes to make it **unschedulable**, which means new pods won't start on the node. It is common to wait several hours, then manually remove any remaining pods before removing the node itself.
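
The cordoning workflow described above can be sketched roughly as follows. This is a minimal illustration, not part of the `mybinder.org` tooling: it assumes `kubectl` is already configured for the target cluster, and the `run` helper, the `DRY_RUN` variable, and the `cordon_and_inspect` function are hypothetical names introduced only for this example.

```shell
# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Cordon a node, then list the pods that are still scheduled on it.
cordon_and_inspect() {
  local node="$1"
  # Mark the node unschedulable so no new pods start on it.
  run kubectl cordon "$node"
  # Show pods (in all namespaces) that are still running on the node.
  run kubectl get pods --all-namespaces --field-selector "spec.nodeName=$node"
}
```

With the default dry-run setting, `cordon_and_inspect <node-name>` only prints the two `kubectl` commands. Setting `DRY_RUN=0` runs them against the current cluster context, so check the context first.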
diff --git a/docs/source/getting_started/terminology.rst b/docs/source/getting_started/terminology.rst deleted file mode 100644 index fb46a0a5d..000000000 --- a/docs/source/getting_started/terminology.rst +++ /dev/null @@ -1,21 +0,0 @@ -============================== -Terminology for the deployment -============================== - -This page contains common words or phrases in the dev-ops community that will -be useful in understanding and maintaining the ``mybinder.org`` deployment. - -.. _term-cordoning: - -"cordoning" a node ------------------- - -`Kubernetes page on cordoning `_. - -Sometimes you want to ensure that **no new pods** will be started on a given -node. Usually this is because you suspect the node has a problem with it, or -you wish to remove the node but need it to be free of pods first. - -Cordoning the node tells Kubernetes to make it **unschedulable**, which means new -pods won't start on the node. It is common to wait several hours, then manually -remove any remaining pods on the node before removing it manually. 
diff --git a/docs/source/hyperlink-targets.md b/docs/source/hyperlink-targets.md new file mode 100644 index 000000000..4f60bd422 --- /dev/null +++ b/docs/source/hyperlink-targets.md @@ -0,0 +1,13 @@ +[`binderhub` source code]: https://github.com/jupyterhub/binderhub +[`git-crypt`]: https://github.com/AGWA/git-crypt +[`mybinder.org-deploy`]: https://github.com/jupyterhub/mybinder.org-deploy +[`repo2docker` source code]: https://github.com/jupyterhub/repo2docker +[`ssh-vault`]: https://github.com/ssh-vault/ssh-vault +[BinderHub documentation]: https://binderhub.readthedocs.io/en/latest/ +[GitHub Actions]: https://docs.github.com/en/actions +[Helm]: https://helm.sh +[Jupyter instance of Zulip]: https://jupyter.zulipchat.com/ +[Kubernetes]: https://kubernetes.io/ +[mybinder.org]: https://mybinder.org +[prod]: https://mybinder.org +[staging.mybinder.org]: https://staging.mybinder.org diff --git a/docs/source/incident-reports/index.md b/docs/source/incident-reports/index.md new file mode 100644 index 000000000..b9d0edfdf --- /dev/null +++ b/docs/source/incident-reports/index.md @@ -0,0 +1,26 @@ +(incident-reporting)= + +# Incident reporting + +This page contains information and guidelines for how the Binder team handles incidents and incident reports. Remember, **incidents are opportunities to learn**! + +## Principles and guidelines for incident reporting + +- Inspiration for our guidelines: [Google SRE guide, Managing Incidents](https://sre.google/sre-book/managing-incidents/). +- Team management and takeaways from incidents: [Etsy Debriefing Facilitation Guide](https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf). 
+ +## Example template for incident report + +- [Example template for incident report](./template-incident-report.md) + +## Incident history + +(in reverse chronological order) + +```{toctree} +:maxdepth: 1 +:glob: +:reversed: true +./2* +template-incident-report +``` diff --git a/docs/source/incident-reports/index.rst b/docs/source/incident-reports/index.rst deleted file mode 100644 index 1ed3700a0..000000000 --- a/docs/source/incident-reports/index.rst +++ /dev/null @@ -1,32 +0,0 @@ -.. _incident-reporting: - -================== -Incident reporting -================== - -This page contains information and guidelines for how the Binder team handles -incidents and incident reports. Remember, **incidents are opportunities to learn**! - -Principles and guidelines for incident reporting ------------------------------------------------- - -- Inspiration for our guidelines: `Google SRE guide, Managing Incidents `_. -- Team management and takeaways from incidents: `Etsy Debriefing Facilitation Guide `_. - -Example template for incident report ------------------------------------- - -- :doc:`Example template for incident report ` - -Incident history ----------------- - -(in reverse chronological order) - -.. toctree:: - :maxdepth: 1 - :glob: - :reversed: - - ./2* - template-incident-report diff --git a/docs/source/index.md b/docs/source/index.md new file mode 100644 index 000000000..fff643d94 --- /dev/null +++ b/docs/source/index.md @@ -0,0 +1,46 @@ +# Site Reliability Guide for `mybinder.org` + +This site is a collection of wisdom, tools, and other helpful +information to assist in the maintenance and team-processes around the +[BinderHub] deployment at . + +```{tip} +If you are looking for documentation on how to use , [see +the mybinder.org user documentation](https://docs.mybinder.org). +``` + +```{tip} +If you are looking for information on deploying your own BinderHub, +[see the BinderHub documentation][BinderHub]. +``` + +## What is `mybinder.org`? 
+ + is a federation, named [the `mybinder.org` federation](mybinder-federation), of public deployments of [BinderHub]. + acts as a proxy to computational resources donated by federation members. + +## What is the `mybinder.org` operations team? + +Behind the is a team of contributors that donate +their time to keeping running smoothly. This role is often +called a [Site Reliability +Engineer](https://en.wikipedia.org/wiki/Site_Reliability_Engineering) +(or SRE). We informally call this team the "`mybinder.org` operators". + +**If you are interested in helping the `mybinder.org` operations team**, +first check out ["The Operators (no Binder isn’t forming a rock band)"](https://discourse.jupyter.org/t/the-operators-no-binder-isnt-forming-a-rock-band/694) on Jupyter instance of Discourse. +To show your interest in helping, please reach out to the operations +team via ["Interested in joining the mybinder.org operations team?"](https://discourse.jupyter.org/t/interested-in-joining-the-mybinder-org-operations-team/761) thread on Jupyter instance of Discourse. + +```{toctree} +:maxdepth: 2 +:caption: Guide +getting_started/index.md +deployment/index.rst +operation_guide/index.rst +components/index.rst +analytics/index.rst +incident-reports/index.rst +``` + +[BinderHub]: https://binderhub.readthedocs.io/ diff --git a/docs/source/index.rst b/docs/source/index.rst deleted file mode 100644 index ee2d9e7e3..000000000 --- a/docs/source/index.rst +++ /dev/null @@ -1,48 +0,0 @@ -Site Reliability Guide for mybinder.org -======================================= - -This site is a collection of wisdom, tools, and other helpful information -to assist in the maintenance and team-processes around the BinderHub deployment -at `mybinder.org `_. - -If you are looking for documentation on how to use mybinder.org, -`see the mybinder.org user documentation `_. If you are looking -for information on deploying your own BinderHub -`see the BinderHub documentation `_. 
- - -What is the mybinder.org operations team? ------------------------------------------ - -Behind the mybinder.org deployment is a team of contributors that -donate their time to keeping mybinder.org running smoothly. This -role is often called a -`Site Reliability Engineer `_ -(or SRE). We informally call this team the "mybinder.org operators". - -This site is a collection of wisdom, tools, and other helpful information that the -mybinder.org operations team uses for maintenance and team-processes -around the BinderHub deployment at `mybinder.org `_. - -**If you are interested in helping the mybinder.org operations team**, first -check out `this post on what an operator does `_. -To show your interest in helping, please reach out to the operations team -via `this Discourse thread `_. - - -.. toctree:: - :maxdepth: 2 - - getting_started/index - deployment/index - operation_guide/index - components/index - analytics/index - incident-reports/index - -Indices and tables ------------------- - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` diff --git a/docs/source/operation_guide/federation.md b/docs/source/operation_guide/federation.md new file mode 100644 index 000000000..b127e9f33 --- /dev/null +++ b/docs/source/operation_guide/federation.md @@ -0,0 +1,67 @@ +(mybinder-federation)= + +# The `mybinder.org` Federation + +The current status of the `mybinder.org` federation can be found +[here](https://mybinder.readthedocs.io/en/latest/about/status.html). + +## Adding or removing a federation member + +The following files contain references to the federation, and should be +updated when a federation member is added or removed: + +1. pages for : + [status](https://github.com/jupyterhub/mybinder.org-user-guide/blob/HEAD/doc/about/status.rst) + and [federation + info](https://github.com/jupyterhub/mybinder.org-user-guide/blob/HEAD/doc/_data/support/federation.yml) + +2. 
[deployment to the + cluster](https://github.com/jupyterhub/mybinder.org-deploy/blob/main/.github/workflows/cd.yml) + +3. [testing of the cluster + configuration](https://github.com/jupyterhub/mybinder.org-deploy/blob/main/.github/workflows/test-helm-template.yaml) + +4. membership in [federationRedirect.hosts config for + prod](https://github.com/jupyterhub/mybinder.org-deploy/blob/7aa58e033efe1ed1cee1b5cb7e789c1296deb36a/config/prod.yaml#L220) + +5. add/remove data source for the cluster's prometheus at + <https://grafana.mybinder.org> + +6. **if outside the default Google Cloud project, make sure launches are published to the events archive:** + - If not deployed from this repo, publishing events to the + archive is configured + [here](https://github.com/jupyterhub/mybinder.org-deploy/blob/339ccb1de8107dc7854cac45f0a5b6e99937a91b/mybinder/values.yaml#L200-L219) + - GKE clusters don't need further configuration, but clusters outside + GKE (or outside our GCP project, maybe?) need a service + account. These accounts are configured [in + terraform](https://github.com/jupyterhub/mybinder.org-deploy/blob/339ccb1de8107dc7854cac45f0a5b6e99937a91b/terraform/gcp/prod/main.tf#L17), + and can be retrieved via + `terraform output events_archiver_keys`. For OVH, a secret is + added to the chart + [here](https://github.com/jupyterhub/mybinder.org-deploy/blob/main/mybinder/templates/events-archiver/secret.yaml) + and mounted in the binder pod + [here](https://github.com/jupyterhub/mybinder.org-deploy/blob/339ccb1de8107dc7854cac45f0a5b6e99937a91b/config/ovh2.yaml#L25-L34) + (in our chart, the secret itself is added to + [eventsArchiver.serviceAccountKey](https://github.com/jupyterhub/mybinder.org-deploy/blob/339ccb1de8107dc7854cac45f0a5b6e99937a91b/mybinder/values.yaml#L555-L557) + helm config, in `secrets/config/ovh2.yaml`). + +## Temporarily removing a federation member from rotation + +There are a few reasons why you may wish to remove a Federation member +from rotation. 
For example, maintenance work, a problem with the +deployment, and so on. + +There are 3 main files you may wish to edit in order to remove a cluster +from the Federation: + +1. _Required._ Set the `binderhub.config.BinderHub.pod_quota` key to + `0` in the cluster\'s config file under the + [config](https://github.com/jupyterhub/mybinder.org-deploy/tree/HEAD/config) + directory +2. _Recommended._ Set the `weight` key for the cluster to `0` in the + [helm chart values + file](https://github.com/jupyterhub/mybinder.org-deploy/blob/7aa58e033efe1ed1cee1b5cb7e789c1296deb36a/config/prod.yaml#L220) + in order to remove it from the redirector\'s pool +3. _Optional._ Comment out the cluster from the [continuous + deployment](https://github.com/jupyterhub/mybinder.org-deploy/blob/4f42d791f92dcb3156e7c4ea92a236246bbf9135/.github/workflows/cd.yml#L168) + file diff --git a/docs/source/operation_guide/federation.rst b/docs/source/operation_guide/federation.rst deleted file mode 100644 index f61c34641..000000000 --- a/docs/source/operation_guide/federation.rst +++ /dev/null @@ -1,45 +0,0 @@ -.. _mybinder-federation: - -=========================== -The mybinder.org Federation -=========================== - -The current status of the mybinder.org federation can be found `here `__. - - -Adding or removing a federation member --------------------------------------- - -The following files contain references to the federation, -and should be updated when a federation member is added or removed: - -#. pages for https://mybinder.readthedocs.io: `status `_ and `federation info `_ -#. `deployment to the cluster `_ -#. `testing of the cluster configuration `_ -#. membership in `federationRedirect.hosts config for prod `__ -#. add/remove data source for the cluster's prometheus at https://grafana.mybinder.org -#. 
if outside the default Google Cloud project, make sure launches are published to the events archive: - - If not deployed from this repo, publishing events to the archive is configured `here `__ - - GKE clusters don't need further configuration, but outside GKE (or outside our GCP project, maybe?) need a service account. - These accounts are configured `in terraform `__, and can be retrieved via `terraform output events_archiver_keys`. - For OVH, a secret is added to the chart `here `__ and mounted in the binder pod `here `__ (in our chart, the secret itself is added to `eventsArchiver.serviceAccountKey `__ helm config, in secrets/config/ovh2.yaml). - - -Temporarily removing a federation member from rotation ------------------------------------------------------- - -There are a few reasons why you may wish to remove a Federation member from -rotation. For example, maintenance work, a problem with the deployment, and so -on. - -There are 3 main files you may wish to edit in order to remove a cluster from the Federation: - -#. *Required.* Set the ``binderhub.config.BinderHub.pod_quota`` key to ``0`` in the - cluster's config file under the `config `_ - directory -#. *Recommended.* Set the ``weight`` key for the cluster to ``0`` in the - `helm chart values file `_ - in order to remove it from the redirector's pool -#. *Optional.* Comment out the cluster from the - `continuous deployment `_ - file diff --git a/docs/source/operation_guide/index.md b/docs/source/operation_guide/index.md new file mode 100644 index 000000000..618a35ed7 --- /dev/null +++ b/docs/source/operation_guide/index.md @@ -0,0 +1,11 @@ +# Operations Guide + +Team processes as well as useful information about what you might run into when maintaining mybinder.org. 
+ +```{toctree} +:maxdepth: 2 +common_problems.md +command_snippets.md +grafana_plots.md +federation.md +``` diff --git a/docs/source/operation_guide/index.rst b/docs/source/operation_guide/index.rst deleted file mode 100644 index 65c1d9aa0..000000000 --- a/docs/source/operation_guide/index.rst +++ /dev/null @@ -1,14 +0,0 @@ -================ -Operations Guide -================ - -Team processes as well as useful information about what you might -run into when maintaining mybinder.org. - -.. toctree:: - :maxdepth: 2 - - common_problems - command_snippets - grafana_plots - federation