Datashare Server Mode
Datashare server mode is used by ICIJ to share document corpora (or projects) among several users (journalists).
Users are authenticated with OAuth2 or HTTP Basic authentication, and their backend session should contain the list of projects they are granted access to.
No data is exchanged with external services or the cloud, except for:
- the Datashare Docker image, downloaded from Docker Hub
- the NER models, downloaded from the ICIJ S3 service
Once your container and models are downloaded, you can run Datashare in an isolated local network.
Datashare is launched with --mode SERVER and you have to provide:
- the external elasticsearch index address
- a Redis store address
- a Redis data bus address
- a database JDBC URL
- an authentication mechanism and its parameters
- the host of datashare (for batch search results URL generation)
docker run -ti icij/datashare:version --mode SERVER \
--redisAddress redis://my.redis-server.org:6379 \
--elasticsearchAddress https://my.elastic-server.org:9200 \
--messageBusAddress my.redis-server.org \
--dataSourceUrl 'jdbc:postgresql://db-server/ds-database?user=ds-user&password=ds-password' \
--rootHost https://my.datashare-server.org
# ... +auth parameters (see below)
Basic authentication is a simple protocol that uses HTTP headers and the browser to authenticate users. The credentials are sent to the server in the Authorization header, as user:password encoded in base64:
Authorization: Basic dXNlcjpwYXNzd29yZA==
It is secure as long as the communication with the server is encrypted (with TLS/SSL for example).
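The value after Basic is just the base64 encoding of user:password, which you can reproduce on the command line (with the illustrative credentials user/password):
$ echo -n 'user:password' | base64
dXNlcjpwYXNzd29yZA==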
On the server side, you have to provide a user store for Datashare. For now we are using a Redis data store, so you have to provision the users yourself. The passwords are SHA-256 hex encoded (for example with Python):
$ python
Python 3.6.9 (default, Jul 3 2019, 15:36:16)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> hashlib.sha256(b"bar").hexdigest()
'fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9'
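The same hash can be computed with coreutils if you prefer a one-liner:
$ echo -n "bar" | sha256sum
fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9  -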
Then insert the user like this in Redis:
$ redis-cli -h my.redis-server.org
my.redis-server.org:6379> set foo '{"uid":"foo", "password":"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9", "datashare_projects":["local-datashare"]}'
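You can check that the user record is stored as expected before starting the server:
my.redis-server.org:6379> get foo
"{\"uid\":\"foo\", \"password\":\"fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9\", \"datashare_projects\":[\"local-datashare\"]}"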
Then launch Datashare with the basic auth filter. When you open it in a browser, the browser's authentication popup will ask for the credentials you provisioned:
docker run -ti icij/datashare:version --mode SERVER \
--redisAddress redis://my.redis-server.org:6379 \
--elasticsearchAddress https://my.elastic-server.org:9200 \
--messageBusAddress my.redis-server.org \
--dataSourceUrl 'jdbc:postgresql://db-server/ds-database?user=ds-user&password=ds-password' \
--rootHost https://my.datashare-server.org \
--authFilter org.icij.datashare.session.BasicAuthAdaptorFilter
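You can also check the authentication from the command line with curl and the user provisioned above (foo/bar here; adapt to your own users):
# without credentials the request should be rejected with a 401
$ curl -s -o /dev/null -w '%{http_code}\n' https://my.datashare-server.org/
# with valid credentials it should return a 200
$ curl -s -o /dev/null -w '%{http_code}\n' -u foo:bar https://my.datashare-server.org/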
With OAuth2 you will need an authorization service. The workflow is the standard authorization code flow: Datashare redirects the user's browser to the authorization server (oauthAuthorizeUrl), which sends it back to the callback path with a code; Datashare exchanges this code for a token (oauthTokenUrl) and uses it to fetch the user profile (oauthApiUrl). Datashare is launched with:
docker run -ti icij/datashare:version --mode SERVER \
--oauthClientId 30045255030c6740ce4c95c \
--oauthClientSecret 10af3d46399a8143179271e6b726aaf63f20604092106 \
--oauthAuthorizeUrl https://my.oauth-server.org/oauth/authorize \
--oauthTokenUrl https://my.oauth-server.org/oauth/token \
--oauthApiUrl https://my.oauth-server.org/api/v1/me.json \
--oauthCallbackPath /auth/callback
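Under the hood, the exchange on the callback looks roughly like this. This is only a sketch of the standard flow with the example values above, not something you normally run yourself:
# 1. Datashare redirects the browser to the authorization server:
#    https://my.oauth-server.org/oauth/authorize?client_id=30045255030c6740ce4c95c&redirect_uri=https://my.datashare-server.org/auth/callback&response_type=code
# 2. after login, the browser comes back to /auth/callback?code=AUTH_CODE
# 3. Datashare exchanges the code for a token, then fetches the user profile with it:
curl -XPOST https://my.oauth-server.org/oauth/token \
  -d grant_type=authorization_code \
  -d code=AUTH_CODE \
  -d client_id=30045255030c6740ce4c95c \
  -d client_secret=10af3d46399a8143179271e6b726aaf63f20604092106 \
  -d redirect_uri=https://my.datashare-server.org/auth/callback
curl https://my.oauth-server.org/api/v1/me.json -H 'Authorization: Bearer ACCESS_TOKEN'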
Now you have a Datashare server, but no data in the index/database. We will now see how to index files into Elasticsearch with the CLI mode.
You'll have to provide the Redis address (used for queuing files), the Elasticsearch address and the index name, and you can also tune other indexing parameters (queue name, thread pool sizes, etc.) for your use case and for performance optimization.
First you have to fill the queue that will then be consumed for indexing on several threads/machines. The scanning process is not parallelized because the bottleneck is the filesystem read, and we have found empirically that this stage does not take long, even for millions of documents.
docker run -ti -v /host/data/path:/datashare/container/path icij/datashare:version --mode CLI \
-s SCAN \
-d /datashare/container/path \
--redisAddress {{ ds_redis_url }} \
--queueName {{ ds_queue_name }}
Let's review some parameters:
- you have to give the container access to the document folder with -v /host/data/path:/datashare/container/path and tell Datashare to use this folder inside Docker with -d /datashare/container/path
- queueName is the name of the queue used by datashare/extract in Redis (see the sanity check after this list)
- the other parameters are the addresses of Elasticsearch, the Redis bus and the database
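When the scan is finished, a quick sanity check is to look at how many paths were queued. This assumes the default Redis-backed queue, which is stored as a Redis list:
$ redis-cli -h my.redis-server.org llen {{ ds_queue_name }}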
Once the data is in the Redis queue queueName, we can launch the indexing on several threads and machines (we use Ansible to run this task on up to 30 nodes with 32 threads each).
docker run -ti icij/datashare:version --mode CLI \
-s INDEX \
--ocr true \
--parserParallelism {{ processor_count_cmd.stdout }} \
--defaultProject {{ es_index_name }} \
--redisAddress {{ ds_redis_url }} \
--queueName {{ ds_queue_name }} \
--reportName {{ ds_report_name }} \
--elasticsearchAddress {{ datashare_elasticsearch_url }} \
--messageBusAddress {{ ds_bus_url }} \
--dataSourceUrl {{ datashare_datasource_url }}
The additional parameters of the index stage are the following:
- ocr: you can tell datashare/extract/Tika to do Optical Character Recognition (OCR). OCR will detect text in images, but it divides indexing performance by a factor of 5 to 10
- parserParallelism is the number of threads that are going to be used for parsing documents
- defaultProject is the project name; it will be used as the index name for Elasticsearch
- reportName is the name of the map used by datashare/extract to store the results of text extraction. It is what makes this stage idempotent: if all files have been indexed successfully, launching this stage a second time with the same reportName won't index any file again
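To check that documents made it into the index, you can query Elasticsearch directly with its count API (same URL and index name placeholders as above):
$ curl '{{ datashare_elasticsearch_url }}/{{ es_index_name }}/_count?pretty'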
To find named entities, we resume processing on the documents that have not yet been processed by a given pipeline.
docker run -ti icij/datashare:version --mode CLI \
-s NER \
--nlpp {{ ds_nlpp_pipelines }} \
--resume \
--nlpParallelism {{ processor_count_cmd.stdout }} \
--defaultProject {{ es_index_name }} \
--redisAddress {{ ds_redis_url }} \
--elasticsearchAddress {{ datashare_elasticsearch_url }} \
--messageBusAddress {{ ds_bus_url }} \
--dataSourceUrl {{ datashare_datasource_url }}
The NER parameters are:
- nlpp is the pipeline to use; it can be CORENLP, OPENNLP or MITIE
- nlpParallelism is the number of threads used for named entity finding
- defaultProject is the project name; it will be used as the index name for Elasticsearch
- resume also brings idempotency by first searching for the documents that have not been processed by the pipeline
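Once the NER stage is done, you can count the named entities in the index. This assumes the default Datashare mapping, where named entity documents carry type NamedEntity:
$ curl -XPOST '{{ datashare_elasticsearch_url }}/{{ es_index_name }}/_count?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"term": {"type": "NamedEntity"}}}'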
To execute the batch searches that users have created, run this command:
docker run --rm icij/datashare:version \
-m BATCH \
--dataSourceUrl '{{ datashare_datasource_url }}' \
--elasticsearchAddress '{{ datashare_elasticsearch_url }}'
You can use it in a crontab job for example.
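For example, a crontab entry that runs pending batch searches every 10 minutes could look like this (the schedule and placeholders are yours to adjust):
*/10 * * * * docker run --rm icij/datashare:version -m BATCH --dataSourceUrl '{{ datashare_datasource_url }}' --elasticsearchAddress '{{ datashare_elasticsearch_url }}'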
Datashare is open source and easily extensible. You can implement your own components for your architecture.
For now, the database in Datashare is used to store:
- starred documents
- tags
- batch searches
- batch results
- access rights for downloading sources of documents
It is implemented for PostgreSQL and SQLite with jOOQ (a fairly low-level, SQL-like Java API). It should normally work "as is" for other databases supported by jOOQ (MySQL...), but the repository integration tests are only run in CI against PostgreSQL and SQLite.
You can try changing the database URL with the dataSourceUrl parameter.
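For illustration, these are the kinds of standard JDBC URLs you could pass (the corresponding JDBC driver has to be available, and only PostgreSQL and SQLite are covered by the tests):
# PostgreSQL (used in production at ICIJ)
--dataSourceUrl 'jdbc:postgresql://db-server/ds-database?user=ds-user&password=ds-password'
# SQLite (a single local file, no database server)
--dataSourceUrl 'jdbc:sqlite:/path/to/datashare.db'
# MySQL (supported by jOOQ but not tested in CI)
--dataSourceUrl 'jdbc:mysql://db-server/ds-database?user=ds-user&password=ds-password'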
Datashare is based on fluent-http. It needs two classes to handle sessions:
- Users is the list of referenced users
- SessionIdStore is the list of session ids
We have implemented these with RedisUsers and RedisSessionIdStore, so sessions and users are stored in Redis, but they could be implemented with another persistence backend component.
Here also we used Redis for our needs, but extract provides a MySQL implementation of the Queue/Report components. If that is more convenient for you, you can try to wire it into Datashare and add options for it.
We also implemented in-memory queues/maps so Datashare can run without a Redis dependency (this only works when running on a single machine).
We are using a small Redis data bus for:
- the progress of indexing
- launching NER finding after indexing
For the same reason as above, we also implemented an in-memory data bus.
It could also be implemented with RabbitMQ or other data buses; take a look at DataBus.

