Commit e35b686

docs: Setup Centralised documentation of cogstack. Migrate pre existing public docs from confluence into markdown.
- docs: Move existing observability docs into sub folder - docs: Setup links to other readthedocs pages
1 parent f6d861a commit e35b686

41 files changed

+735
-28
lines changed

.github/workflows/doc-build.yml

Lines changed: 2 additions & 1 deletion
@@ -23,4 +23,5 @@ jobs:
2323
pip3 install -r requirements.txt
2424
make clean
2525
# Fail build on any docs warning
26-
make html O=-W
26+
# make html O=-W # Removed whilst migrating existing docs
27+
make html

.readthedocs.yml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ build:
1111

1212
sphinx:
1313
configuration: docs/conf.py
14-
fail_on_warning: true
14+
fail_on_warning: false # Removed warnings to migrate existing docs
1515

1616
python:
1717
install:

docs/Makefile

Lines changed: 3 additions & 0 deletions
@@ -18,3 +18,6 @@ help:
1818
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
1919
%: Makefile
2020
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
21+
22+
build:
23+
sphinx-autobuild . _build/

docs/cogstack-logo.png

19.9 KB

docs/conf.py

Lines changed: 27 additions & 3 deletions
@@ -10,11 +10,11 @@
1010
# -- Project information -----------------------------------------------------
1111
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
1212

13-
project = 'CogStack Platform Toolkit'
13+
project = 'CogStack Documentation'
1414
copyright = '2025, CogStack Org'
1515
author = 'CogStack Org'
1616
release = 'latest'
17-
html_title = "CogStack Platform Toolkit"
17+
html_title = "CogStack Documentation"
1818

1919
# -- General configuration ---------------------------------------------------
2020
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
@@ -24,14 +24,38 @@
2424
'sphinx.ext.autodoc',
2525
'myst_parser',
2626
'sphinx.ext.inheritance_diagram',
27+
'sphinx.ext.intersphinx'
2728
]
29+
2830
templates_path = ['_templates']
2931
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
3032

3133

32-
3334
# -- Options for HTML output -------------------------------------------------
3435
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
3536

3637
html_theme = "furo"
3738
html_static_path = ['_static']
39+
html_logo = "cogstack-logo.png"
40+
41+
intersphinx_mapping = {
42+
"sphinx": ("https://www.sphinx-doc.org/en/master/", None),
43+
}
44+
intersphinx_disabled_reftypes = ["*"]
45+
46+
myst_enable_extensions = [
47+
"amsmath",
48+
"attrs_inline",
49+
"colon_fence",
50+
"deflist",
51+
"dollarmath",
52+
"fieldlist",
53+
"html_admonition",
54+
"html_image",
55+
# "linkify",
56+
"replacements",
57+
"smartquotes",
58+
"strikethrough",
59+
"substitution",
60+
"tasklist",
61+
]

docs/index.md

Lines changed: 33 additions & 5 deletions
@@ -1,15 +1,43 @@
1-
2-
# Cogstack Platform Toolit
1+
# Cogstack Documentation
32

4-
This project provides utilities for running Cogstack in production.
3+
Welcome to the CogStack Documentation site.
4+
5+
Get started by looking at the [CogStack Overview](overview/cogstack-documentation.md).
6+
7+
If you have any broad questions, please reach out in our community space [here](https://discourse.cogstack.org/).
8+
9+
Further projects in development are listed [here](https://github.com/orgs/CogStack/repositories).
10+
11+
![](./overview/attachments/43c14755-e565-4ae0-a0a3-ec6dc18a691c.png)
12+
13+
| Tool | Description |
14+
|:-----|:------------|
15+
| ![CogStack-Nifi](overview/attachments/36c0d23f-a632-4fbf-9f7c-6669e88bbd39.png){width=100} <br/> [**CogStack-Nifi**](https://cogstack-nifi.readthedocs.io/en/latest/main.html) | Data flow orchestration using Apache NiFi |
16+
| ![MedCAT](overview/attachments/09a8bb60-9864-41fa-be7b-cf9a9dc04498.png){width=100} <br/> [**MedCAT**](https://medcat.readthedocs.io/en/latest/) | Medical Concept Annotation Toolkit |
17+
| ![MedCATTrainer](overview/attachments/09a8bb60-9864-41fa-be7b-cf9a9dc04498.png){width=100} <br/> [**MedCATTrainer**](https://medcattrainer.readthedocs.io/en/latest/) | Web-based annotation and training interface for MedCAT |
518

6-
- [CogStack Observability](observability/_index.md)
719

820
```{toctree}
921
:hidden:
22+
:maxdepth: 5
23+
overview/_index
24+
25+
```
1026

11-
observability/_index
27+
```{toctree}
28+
:hidden:
29+
:caption: CogStack NLP
30+
MedCAT <https://docs.cogstack.org/projects/nlp>
31+
MedCAT Trainer <https://medcattrainer.readthedocs.io/>
1232
1333
```
1434

35+
```{toctree}
36+
:hidden:
37+
:caption: CogStack Platform
38+
39+
NiFi <https://cogstack-nifi.readthedocs.io/en/latest/>
40+
41+
platform/_index
42+
```
1543

Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
1+
# CogStack ecosystem (v1)
2+
3+
This part covers the services that can run in an example CogStack deployment. We refer to such a deployment, with many running services, as an *ecosystem* or a *platform*. Below is a high-level view of the CogStack platform and the possibilities it enables through its many components and services. In practice, many of the functionalities that the CogStack platform provides are implemented as separate but interconnected services working inside the ecosystem.
4+
5+
## Core services
6+
7+
In most scenarios the CogStack platform consists of *core* services tailored to specific use-cases. Additional applications and services can be run on top of it, such as [SemEHR](../../CogStack%20General/CogStack%20Wiki/CogStack%20projects/SemEHR.md), [Patient Timeline](../../CogStack%20General/CogStack%20Wiki/CogStack%20projects/Patient%20Timeline.md), Live Alerting (through ElasticSearch plugins) or any other custom-developed applications. For ease of use, we recommend Docker Compose when deploying a sample CogStack platform (see: [Running CogStack](Running%20CogStack.md)).
8+
9+
Below is one of the simplest and most common scenarios: ingesting and processing EHR data from a proprietary data source.
10+
11+
![](./attachments/5503ea0e-ac74-40ba-936a-1287ad3f1cf5.png)
12+
13+
The CogStack platform presented here consists of the following core services:
14+
15+
- *CogStack Pipeline* service for ingesting and processing the EHR data from the source database,
16+
- *CogStack Job Repository* (PostgreSQL database) serving for job status control,
17+
- *ElasticSearch* sink where the processed EHR records are stored,
18+
- (optional) *Kibana* user interface to easily perform exploratory data analysis over the processed records.
19+
20+
Note that this is a very simplified scenario, which can be easily deployed even on a local machine with limited resources. Kibana is used here as an optional, out-of-the-box and easy-to-use solution for exploring the data, although many other data analysis or BI tools can be used. Moreover, ElasticSearch connectors are available in many languages, such as Java, Python, R or JavaScript, allowing for fast development of custom user applications.
21+
22+
:::{tip}
23+
Note
24+
25+
The picture shows ElasticSearch running as a single node. However, in practice, one should consider using at least 3 ElasticSearch nodes deployed as a cluster, which greatly improves resilience, query performance and reliability.
26+
Similarly, the picture shows only one CogStack Pipeline instance and only one data source. However, in practice, there may be multiple sources available with multiple Pipeline components running in parallel. This is why, when deploying the CogStack platform in production, one should keep in mind the scalability and resilience of the platform and its running services.
27+
:::
28+
29+
30+
### CogStack Pipeline
31+
32+
CogStack Pipeline is the main data processing service used inside the CogStack platform. Within the ecosystem its main responsibility is to ingest the EHR data from a specified data source, process the data (e.g. by applying text extraction methods, de-identifying records or extracting NLP annotations) and store the resulting data in the specified sink.
33+
34+
Usually, the sink will be the ElasticSearch store, which keeps the processed EHRs ready for use by other applications. However, when performing computationally expensive processing tasks, such as running OCR-based text extraction from documents, one may prefer to store the partial results in a cache. In such a case, PostgreSQL can be used as a temporary store – [Examples](Examples.md) covers this case.
35+
36+
Information about the data processing components offered by CogStack Pipeline can be found in the [CogStack Pipeline](CogStack%20Pipeline.md) part.
37+
38+
:::{info}
39+
We recommend using the newest version of the CogStack Pipeline component, 1.3.0.
40+
:::
41+
42+
---
43+
44+
---
45+
46+
47+
48+
### PostgreSQL
49+
50+
[PostgreSQL](https://www.postgresql.org/) is a widely used object-relational database management system. In the CogStack platform it is primarily used as a job repository, storing the job execution status of running CogStack Pipeline instances. However, there may be cases where one needs to store partial results, treating the PostgreSQL DB either as a data cache (see: [Examples](Examples.md)) or as an auxiliary data sink.
51+
52+
When used as a job repository, it requires defining the appropriate tables and a user that will be used by the running CogStack Pipeline instance(s). This schema is defined by the [Spring Batch META-DATA schema definition](https://docs.spring.io/spring-batch/trunk/reference/html/metaDataSchema.html) and is also available in the `CogStack-Pipeline/examples/docker-common/pgjobrepo/create_repo.sh` script.
53+
54+
:::{Info}
55+
We recommend using PostgreSQL in versions >= 10.
56+
In the [Examples](Examples.md) part we use PostgreSQL in version 11.1.
57+
:::
58+
59+
:::{warning}
60+
Note
61+
62+
PostgreSQL by default has a connection limit of 100. Since a single CogStack Pipeline instance using multiple processing threads uses a connection pool both to retrieve the EHR data from the source database and to update the job repository, one may need to increase the default connection limit together with the available memory buffers. To do so, one may specify the parameters `"-c 'shared_buffers=256MB' -c 'max_connections=1000'"` when initialising the database.
63+
:::
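
As a rough illustration of the warning above, the following sketch assembles the `-c name=value` server options from a dictionary of settings. The helper name `postgres_server_args` is hypothetical, and how the resulting arguments are passed to the database container is left as an assumption; only the two setting values come from the text.

```python
from typing import Dict, List


def postgres_server_args(settings: Dict[str, str]) -> List[str]:
    """Turn a mapping of PostgreSQL settings into `-c name=value` arguments."""
    args: List[str] = []
    for name, value in settings.items():
        args.extend(["-c", f"{name}={value}"])
    return args


# Values quoted in the warning above; adjust them to the available memory
# and to the number of Pipeline instances and threads actually in use.
ARGS = postgres_server_args({"shared_buffers": "256MB", "max_connections": "1000"})
print(" ".join(ARGS))
```

Keeping the settings in one mapping makes it easier to raise `max_connections` and `shared_buffers` together, as the warning suggests.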
64+
65+
### ElasticSearch
66+
67+
[ElasticSearch](https://www.elastic.co/guide/) is a popular NoSQL search engine based on the Lucene library. It provides distributed full-text search and stores the data as schema-free JSON documents. Inside the CogStack platform it is usually used as the primary data store for EHR data processed by CogStack Pipeline.
68+
69+
Depending on the use-case, the processed EHR data is usually stored in indices as defined in the corresponding CogStack Pipeline job description property files (see: [CogStack Pipeline](CogStack%20Pipeline.md)). Once stored, it can be easily queried using ElasticSearch's own REST API (see: [ElasticSearch Search API](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html)), using [Kibana](#kibana) or using an ElasticSearch connector available in many programming languages. Apart from the standard functionality and features provided in its free, open-source version, ElasticSearch also offers more advanced ones distributed as [Elastic Stack](https://www.elastic.co/products/stack) (formerly: X-Pack extension), which require a license. These include modules for machine learning, alerting, monitoring, security and more.
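
To make the REST API option more concrete, here is a minimal sketch that builds (but does not send) a `match` query against the `_search` endpoint using only the Python standard library. The index name `sample_ehr`, the field name `document` and the local node address are hypothetical placeholders; the real names depend on the job description property files in use.

```python
import json
from urllib.request import Request


def build_search_request(base_url: str, index: str, field: str, text: str) -> Request:
    """Build a POST request for the ElasticSearch `_search` endpoint."""
    # A simple full-text `match` query on a single field.
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{base_url.rstrip('/')}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical node address, index and field, for illustration only.
request = build_search_request("http://localhost:9200", "sample_ehr", "document", "diabetes")
print(request.full_url)
print(request.data.decode("utf-8"))
```

Against a running node, the request could then be sent with `urllib.request.urlopen(request)` and the JSON response parsed with `json.load`. A secured deployment would additionally need credentials, which are omitted here.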
70+
71+
:::{tip}
72+
In our [Examples](Examples.md) we use the free, open-source version of ElasticSearch without the Elastic Stack modules included. Note that when one requires secure and/or granular access to the processed EHR data in the ElasticSearch sink, one should explore the [Security](https://www.elastic.co/guide/en/x-pack/current/elasticsearch-security.html) module (formerly: Shield) offered in the Elastic Stack. Some of the features include (as stated on the official website):
73+
- Preventing unauthorised access with password protection, role-based access control (even per index- or single document-level), and IP filtering.
74+
- Preserving the integrity of your data with message authentication and SSL/TLS encryption.
75+
- Maintaining an audit trail so you know who’s doing what to your cluster and the data it stores.
76+
CogStack Pipeline fully supports the functionality provided by the ElasticSearch Security module used to securely access the node(s).
77+
:::
78+
79+
:::{Info}
80+
In our [Examples](Examples.md) we use a simple, single-node ElasticSearch deployment. However, in practice, one should consider using at least 3 ElasticSearch nodes deployed as a cluster, which greatly improves resilience, query performance and reliability.
81+
:::
82+
83+
:::{important}
84+
We recommend using ElasticSearch in versions >= 6.0.
85+
:::
86+
87+
88+
:::{warning}
89+
Note
90+
91+
If the ElasticSearch service does not start up and the following error is reported:
92+
93+
> elasticsearch    | ERROR: [1] bootstrap checks failed
94+
> elasticsearch    | [1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
95+
96+
one may need to increase the number of available file descriptors on the **host** machine – please refer to: <https://www.elastic.co/guide/en/elasticsearch/reference/current/file-descriptors.html>
97+
:::
98+
99+
:::{warning}
100+
Note
101+
102+
If the ElasticSearch service does not start up and the following error is reported:
103+
104+
> elasticsearch    | ERROR: [1] bootstrap checks failed
105+
> elasticsearch    | [1]: max virtual memory areas vm.max\_map\_count [65530] is too low, increase to at least [262144]
106+
107+
one may need to increase the limit on virtual memory areas (`vm.max_map_count`) on the **host** machine – please refer to: <https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html>
108+
:::
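
The two bootstrap checks above can be approximated before starting the service. The sketch below is an illustration only, assuming a Linux host: it reads the open-file limit of the *current* process (which may differ from the limit applied to the ElasticSearch process or container) and the `vm.max_map_count` value from `/proc`. The helper names are hypothetical; the thresholds are the ones quoted in the error messages.

```python
import resource
from pathlib import Path
from typing import Optional

MIN_FILE_DESCRIPTORS = 65536  # threshold quoted in the first error message
MIN_MAX_MAP_COUNT = 262144    # threshold quoted in the second error message


def limit_ok(current: int, required: int) -> bool:
    """Return True if the current limit meets the required minimum."""
    # An unlimited resource is reported as RLIM_INFINITY.
    return current == resource.RLIM_INFINITY or current >= required


def read_max_map_count(path: str = "/proc/sys/vm/max_map_count") -> Optional[int]:
    """Read vm.max_map_count, or return None if the file is not available."""
    proc_file = Path(path)
    if not proc_file.exists():
        return None
    return int(proc_file.read_text().strip())


soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print("file descriptors ok:", limit_ok(soft_limit, MIN_FILE_DESCRIPTORS))

max_map_count = read_max_map_count()
if max_map_count is not None:
    print("vm.max_map_count ok:", limit_ok(max_map_count, MIN_MAX_MAP_COUNT))
```

If either check reports `False`, the linked ElasticSearch pages describe how to raise the corresponding limit on the host.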
109+
110+
---
111+
112+
---
113+
114+
### Kibana
115+
116+
[Kibana](https://www.elastic.co/products/kibana) is a data visualisation module for ElasticSearch that can be easily used to explore and query the data. In sample CogStack platform deployments it can be used as a ready-to-use data exploration tool.
117+
118+
Apart from providing exploratory data analysis functionality, it also offers administrative options over the ElasticSearch data store, such as adding, removing or updating documents using the command line, or creating and removing indices. Moreover, custom user dashboards can be created according to use-case requirements. For a more detailed description of the available functionality please refer to the [official documentation](https://www.elastic.co/guide/en/kibana/current/introduction.html).
119+
120+
:::{info}
121+
In all our [Examples](Examples.md) we provide ElasticSearch bundled with Kibana.
122+
:::
123+
124+
---
125+
126+
---
127+
128+
### NGINX
129+
130+
NGINX is a popular, open-source web server that can also be used as a reverse proxy, load balancer, HTTP cache and more. In CogStack platform deployments, it can be used as a reverse proxy that provides basic access security for the exposed data stores and service endpoints. The functionality may include general user-based authentication, IP filtering and selective service access. A more detailed description of the security features offered by NGINX can be found in the [official documentation](https://docs.nginx.com/nginx/admin-guide/security-controls/).
131+
132+
[Examples](Examples.md) covers a simple use-case with NGINX serving as a basic authentication module. The example configuration of NGINX running as a proxy can be found in the `CogStack-Pipeline/examples/docker-common/nginx/config/` directory.
133+
134+
:::{info}
135+
Note, however, that the security and granularity of access to the data stored in ElasticSearch offered by NGINX is inferior to that of the [Security](https://www.elastic.co/guide/en/x-pack/current/elasticsearch-security.html) module from Elastic Stack.
136+
:::
137+
138+
---
139+
140+
---
141+
142+
### Fluentd
143+
144+
[Fluentd](https://www.fluentd.org/) is an open-source data collector providing a unified logging layer. In sample CogStack platform deployments it can run as a service that collects the logs from all the running services, which can then be used for auditing.
145+
146+
Fluentd provides a highly configurable and flexible set of rules, filters and plugins that can be used to set up logging for any running service inside the platform. The [official Fluentd documentation](https://docs.fluentd.org/v1.0/articles/quickstart) covers many Fluentd examples with detailed descriptions.
147+
148+
[Examples](Examples.md) covers a simple use-case using Fluentd for logging. The example configuration file can be found in the `CogStack-Pipeline/examples/docker-common/fluentd/conf/` directory.
149+
150+
---
151+
152+
---
