bug: Docker-> Installing extensions and plugins causes Datashare container crash on restart #2015

@sfjhu

Description

Starting from a clean install with the official docker-compose.yml from the documentation, I can use the GUI to install the OPENNLP Pipeline and Neo4J extensions, along with the Neo4J Graph Viewer plugin. The GUI then tells me to restart Datashare, so I run docker compose down. When I bring the stack back up, Datashare fails to start with the following error message:

datashare-1  |
datashare-1  | 2026-01-26 20:26:33,050 [main] ERROR DatashareCli - Failed to parse arguments.
datashare-1  | java.lang.NullPointerException: null
datashare-1  |  at java.base/java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1011)
datashare-1  |  at java.base/java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:1006)
datashare-1  |  at java.base/java.util.Properties.put(Properties.java:1301)
datashare-1  |  at java.base/java.util.Properties.setProperty(Properties.java:229)
datashare-1  |  at org.icij.datashare.cli.DatashareCli.parseArguments(DatashareCli.java:76)
datashare-1  |  at org.icij.datashare.Main.main(Main.java:16)
datashare-1  | Usage:
datashare-1  | Option                                   Description
datashare-1  | ------                                   -----------
datashare-1  | -?, -h, --help
datashare-1  | --apiKey <String>                        existing api key for user
datashare-1  | --artifactDir <String>                   Artifact directory for embedded
datashare-1  |                                            caching. If not provided datashare
datashare-1  |                                            will use memory.
datashare-1  | --authFilter <String>                    Server mode auth filter class
datashare-1  | --authUsersProvider <String>             Server mode auth users provider class
datashare-1  | --batchDownloadDir <String>              Directory where Batch Download
datashare-1  |                                            archives are downloaded. (default:
datashare-1  |                                            /home/datashare/.
datashare-1  |                                            local/share/datashare/tmp)
datashare-1  | --batchDownloadEncrypt <Boolean>         Whether Batch download zip files are
datashare-1  |                                            encrypted or not. SmtpUrl should be
datashare-1  |                                            set to send the password. (default
datashare-1  |                                            false)
datashare-1  | --batchDownloadMaxNbFiles <Integer>      Maximum file number that can be
datashare-1  |                                            archived in a zip (Default 10,000)
datashare-1  |                                            (default: 10000)
datashare-1  | --batchDownloadMaxSize <[0-9]+[KMG]?>    Maximum total files size that can be
datashare-1  |                                            zipped. Human readable suffix K/M/G
datashare-1  |                                            for KB/MB/GB (Default 100M)
datashare-1  |                                            (default: 100M)
datashare-1  | --batchDownloadScroll <String>           Scroll duration used for elasticsearch
datashare-1  |                                            scrolls (Batch Download) (default:
datashare-1  |                                            60000ms)
datashare-1  | --batchDownloadScrollSize <Integer>      Scroll size used for elasticsearch
datashare-1  |                                            scrolls (Batch Download) (default:
datashare-1  |                                            1000)
datashare-1  | --batchDownloadTimeToLive <Integer>      Time to live in hour for batch
datashare-1  |                                            download zip files (Default 24)
datashare-1  |                                            (default: 24)
datashare-1  | --batchQueueType <QueueType>             (default: MEMORY)
datashare-1  | --batchSearchMaxTimeSeconds <Integer>    Max time for batch search in seconds
datashare-1  | --batchSearchScroll <String>             Scroll duration used for elasticsearch
datashare-1  |                                            scrolls (Batch Search) (default:
datashare-1  |                                            60000ms)
datashare-1  | --batchSearchScrollSize <Integer>        Scroll size used for elasticsearch
datashare-1  |                                            scrolls (Batch Search) (default:
datashare-1  |                                            1000)
datashare-1  | --batchSize <Integer>                    Batch size of NLP extraction task in
datashare-1  |                                            number of documents. (default: 1024)
datashare-1  | --batchThrottleMilliseconds <Integer>    Throttle for batch in milliseconds
datashare-1  | --browserOpenLink <Boolean>              try to open link in the default
datashare-1  |                                            browser (default: false)
datashare-1  | --busType <QueueType>                    Backend data bus type. (default:
datashare-1  |                                            MEMORY)
datashare-1  | --charset <String>                       Datashare default charset. Example:
datashare-1  |                                            [UTF-8, ISO-8859-1] (default: UTF-8)
datashare-1  | --clusterName <String>                   Cluster name (default: datashare)
datashare-1  | --cors <String>                          CORS headers (needs the web option)
datashare-1  |                                            (default: no-cors)
datashare-1  | --createIndex <String>                   creates an index with the given name
datashare-1  | -d, --dataDir <File>                     Document source files directory
datashare-1  |                                            (default: /home/datashare/Datashare)
datashare-1  | --dataSourceUrl <String>                 Datasource URL. For using memory you
datashare-1  |                                            can use 'jdbc:sqlite:file:memorydb.
datashare-1  |                                            db?mode=memory&cache=shared'
datashare-1  |                                            (default: jdbc:sqlite:file:
datashare-1  |                                            /home/datashare/.
datashare-1  |                                            local/share/datashare/dist/datashare.
datashare-1  |                                            db)
datashare-1  | --deleteApiKey <String>                  Delete api key for user
datashare-1  | --digestAlgorithm <SHA-[1|256|384|512]   (default: SHA-384)
datashare-1  |   or MD5>
datashare-1  | --digestProjectName <String>             Includes the project name in the hash
datashare-1  |                                            of documents when indexing. It is
datashare-1  |                                            set by default to the defaultProject
datashare-1  |                                            value. See noDigestProject option to
datashare-1  |                                            disable it.
datashare-1  | --elasticsearchAddress <String>          Elasticsearch host address (default:
datashare-1  |                                            http://elasticsearch:9200)
datashare-1  | --elasticsearchDataPath <String>         Data path used for embedded
datashare-1  |                                            Elasticsearch (default:
datashare-1  |                                            /home/datashare/.
datashare-1  |                                            local/share/datashare/es)
datashare-1  | --embeddedDocumentDownloadMaxSize <[0-   Maximum download size of embedded
datashare-1  |   9]+[KMG]?>                               documents. Human readable suffix
datashare-1  |                                            K/M/G for KB/MB/GB (Default 1G)
datashare-1  |                                            (default: 1G)
datashare-1  | --ext <String>                           Run CLI extension
datashare-1  | --extensionDelete <String>               Delete extension with its id or base
datashare-1  |                                            directory (needs extensionsDir
datashare-1  |                                            option)
datashare-1  | --extensionInstall <String>              Install extension with either id or
datashare-1  |                                            URL or file path (needs
datashare-1  |                                            extensionsDir option)
datashare-1  | --extensionList [String]                 Extensions list matching provided
datashare-1  |                                            string
datashare-1  | --extensionsDir <String>                 Extensions directory (backend)
datashare-1  |                                            (default: /home/datashare/.
datashare-1  |                                            local/share/datashare/extensions)
datashare-1  | --followSymlinks <Boolean>               Follow symlinks while scanning
datashare-1  |                                            documents (default: true)
datashare-1  | --full-import                            Performs a full import, importing all
datashare-1  |                                            available documents and named
datashare-1  |                                            entities from Datashare to neo4j
datashare-1  | --grantAdmin <String>                    Grant admin policy to user if there is
datashare-1  |                                            none
datashare-1  | --indexTimeout <positive integer>        Time to wait in minutes before
datashare-1  |                                            consumer termination during document
datashare-1  |                                            indexing (Default 30m) (default: 30)
datashare-1  | -k, --createApiKey <String>              Generate and store api key for user
datashare-1  |                                            defaultUser (see opt)
datashare-1  | -l, --language <String>                  Explicitly specify language of indexed
datashare-1  |                                            documents (instead of detecting
datashare-1  |                                            automatically)
datashare-1  | --logLevel <String>                      Sets the log level of Datashare
datashare-1  |                                            ([ERROR, WARN, INFO, DEBUG, TRACE])
datashare-1  |                                            (default: INFO)
datashare-1  | -m, --mode <Mode>                        Datashare run mode [LOCAL, SERVER,
datashare-1  |                                            CLI, NER, TASK_WORKER, EMBEDDED]
datashare-1  |                                            (default: LOCAL)
datashare-1  | --maxContentLength <[0-9]+[KMG]?>        Maximum length (in bytes) of extracted
datashare-1  |                                            text that could be indexed (-1 means
datashare-1  |                                            no limit and value should be less or
datashare-1  |                                            equal than 2G). Human readable
datashare-1  |                                            suffix K/M/G for KB/MB/GB (Default
datashare-1  |                                            20M) (default: 20000000)
datashare-1  | --messageBusAddress <String>             Message bus address (default: redis:
datashare-1  |                                            //redis:6379)
datashare-1  | --neo4jAppLogInJson <Boolean>            Should the Python process log in JSON
datashare-1  |                                            format (default: false)
datashare-1  | --neo4jAppMaxDumpedDocuments <Long>      Maximum number for document nodes
datashare-1  |                                            allowed during export on SERVER mode
datashare-1  |                                            (default: 10000)
datashare-1  | --neo4jAppPort <Integer>                 Python neo4j service port (default:
datashare-1  |                                            8008)
datashare-1  | --neo4jAppStartTimeoutS <Integer>        Python neo4j service start timeout.
datashare-1  |                                            (default: 30)
datashare-1  | --neo4jCliTaskPollIntervalS <Integer>    Interval in second used to poll task
datashare-1  |                                            statuses when in CLI mode (default:
datashare-1  |                                            2)
datashare-1  | --neo4jHost <String>                     Hostname of the neo4j DB. (default:
datashare-1  |                                            127.0.0.1)
datashare-1  | --neo4jPassword <String>                 Password used to connect to the neo4j
datashare-1  |                                            DB (default: please-change-this-
datashare-1  |                                            password)
datashare-1  | --neo4jPort <Integer>                    Port of the neo4j DB. (default: 7687)
datashare-1  | --neo4jProcessInheritOutputs <Boolean>   Should the Python process outputs be
datashare-1  |                                            redirected to the Java process
datashare-1  |                                            outputs ? (default: true)
datashare-1  | --neo4jSingleProject <String>            Name of the single project which will
datashare-1  |                                            be able to user the extension when
datashare-1  |                                            using neo4j Community Edition
datashare-1  |                                            (default: local-datashare)
datashare-1  | --neo4jUriScheme <String>                URI scheme used to connect to the
datashare-1  |                                            neo4j DB (can be: bolt, neo4j,
datashare-1  |                                            bolt+s, neo4j+s, ....) (default:
datashare-1  |                                            bolt)
datashare-1  | --neo4jUser <String>                     User name used to connect to the neo4j
datashare-1  |                                            DB (default: neo4j)
datashare-1  | --nlpParallelism, --np <Integer>         Number of NLP extraction threads per
datashare-1  |                                            pipeline. (default: 1)
datashare-1  | --nlpPipeline, --nlpp <String>           NLP pipeline to be run. (default:
datashare-1  |                                            CORENLP)
datashare-1  | --noDigestProject <Boolean>              Disable the project name in document
datashare-1  |                                            hash processing (only using binary
datashare-1  |                                            contents). (default: false)
datashare-1  | -o, --ocr <Boolean>                      Run optical character recognition at
datashare-1  |                                            file parsing time. (Tesseract must
datashare-1  |                                            be installed beforehand). (default:
datashare-1  |                                            true)
datashare-1  | --oauthApiUrl <String>                   OAuth2 api url
datashare-1  | --oauthAuthorizeUrl <String>             OAuth2 authorize url
datashare-1  | --oauthCallbackPath <String>             OAuth2 callback path (in datashare)
datashare-1  | --oauthClaimIdAttribute <String>         Json field name sent by the Identity
datashare-1  |                                            Provider that contains user
datashare-1  |                                            identifier value.
datashare-1  | --oauthClientId <String>                 OAuth2 client id
datashare-1  | --oauthClientSecret <String>             OAuth2 client secret key
datashare-1  | --oauthDefaultProject <String>           Default project to use for Oauth2 users
datashare-1  | --oauthScope <String>                    Set scope in oauth2 callback url,
datashare-1  |                                            needed for OIDC providers
datashare-1  | --oauthTokenUrl <String>                 OAuth2 token url
datashare-1  | --oauthUserProjectsAttribute <String>    Json field name sent by the Identity
datashare-1  |                                            Provider that contains user
datashare-1  |                                            projects. (default:
datashare-1  |                                            groups_by_applications.datashare)
datashare-1  | --ocrLanguage <String>                   Explicitly specify OCR languages for
datashare-1  |                                            tesseract. 3-character ISO 639-2
datashare-1  |                                            language codes and + sign for
datashare-1  |                                            multiple languages
datashare-1  | --ocrType <String>                       OCR implementation: TESSERACT or
datashare-1  |                                            TESS4J (default: TESSERACT)
datashare-1  | -p, --project <String>                   Name of the datashare project
datashare-1  | --parallelism <Integer>                  Number of threads allocated for task
datashare-1  |                                            management. (default: 16)
datashare-1  | --parserParallelism, --pp <Integer>      Number of file parser threads.
datashare-1  |                                            (default: 1)
datashare-1  | --pluginDelete <String>                  Delete plugin with its id or base
datashare-1  |                                            directory (needs pluginsDir option)
datashare-1  | --pluginInstall <String>                 Install plugin with either id or URL
datashare-1  |                                            or file path (needs pluginsDir
datashare-1  |                                            option)
datashare-1  | --pluginList [String]                    Plugins list matching provided string
datashare-1  | --pluginsDir <String>                    Plugins directory (default:
datashare-1  |                                            /home/datashare/.
datashare-1  |                                            local/share/datashare/plugins)
datashare-1  | --pollingInterval <String>               Queue polling interval. (default: 60)
datashare-1  | --port, --tcpListenPort <Integer>        Port used by the HTTP server (default:
datashare-1  |                                            8080)
datashare-1  | --protectedUriPrefix <String>            Protected URI prefix (default: /api/)
datashare-1  | --queueCapacity <positive integer>       Queue capacity is the size of the
datashare-1  |                                            internal file path buffer used by
datashare-1  |                                            the queue. (default: 1000000)
datashare-1  | --queueName <String>                     Extract queue name (default: extract:
datashare-1  |                                            queue)
datashare-1  | --queueType <QueueType>                  Backend queues and sets type.
datashare-1  |                                            (default: MEMORY)
datashare-1  | -r, --resume                             Resume pending operations
datashare-1  | --redisAddress <String>                  Redis queue address (default: redis:
datashare-1  |                                            //redis:6379)
datashare-1  | --redisPoolSize <Integer>                Pool size for main Redis client
datashare-1  |                                            (default: 5)
datashare-1  | --reportName <String>                    name of the map for the report map
datashare-1  |                                            (where index results are stored). No
datashare-1  |                                            report records are saved if not
datashare-1  |                                            provided
datashare-1  | --rootHost <String>                      Datashare host for urls
datashare-1  | -s, --settings <String>                  Property settings file
datashare-1 exited with code 1
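
For context on the trace above: the top frames (Properties.setProperty into ConcurrentHashMap.putVal) are what the JDK produces when setProperty receives a null key or a null value. The following is a minimal, standalone sketch of that behaviour only. It is not Datashare code: the class name NullPropertyDemo, both helper methods, and the sample key and value are hypothetical, and the log does not show which argument is null at DatashareCli.java:76.

```java
import java.util.Properties;

// Hypothetical demo class; not part of Datashare.
final class NullPropertyDemo {

    // Returns true if Properties.setProperty throws NullPointerException
    // for the given key/value pair, false if the pair is stored normally.
    static boolean throwsNpe(String key, String value) {
        Properties props = new Properties();
        try {
            props.setProperty(key, value);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // Illustrative defensive variant (an assumption, not the project's fix):
    // store the pair only when both parts are non-null, and report whether it was stored.
    static boolean setIfPresent(Properties props, String key, String value) {
        if (key == null || value == null) {
            return false;
        }
        props.setProperty(key, value);
        return true;
    }

    public static void main(String[] args) {
        // Sample key and value are placeholders chosen for illustration.
        System.out.println("non-null pair throws: " + throwsNpe("someOption", "someValue"));
        System.out.println("null value throws:    " + throwsNpe("someOption", null));
        System.out.println("null key throws:      " + throwsNpe(null, "someValue"));
    }
}
```

If the null comes from an option whose value is missing after the extensions and plugins are installed, a guard like setIfPresent would avoid the crash, but that is a guess until the code at DatashareCli.java:76 is checked.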

I do have the /home/datashare/.local/share/datashare/extensions and /home/datashare/.local/share/datashare/plugins directories each mapped to a named volume in my docker compose file, so I would expect the Datashare container to come back up cleanly.

Here are the relevant sections from my docker-compose.yml:

services:
  datashare:
      image: ${DATASHARE_IMAGE}
      hostname: datashare
      ports:
        - 8080:8080
      environment:
        - DS_DOCKER_MOUNTED_DATA_DIR=/home/datashare/data
      volumes:
        - ${DATASHARE_DATA_DIR}:/home/datashare/data
        - datashare-models:/home/datashare/dist
        - datashare-extensions:/home/datashare/.local/share/datashare/extensions
        - datashare-plugins:/home/datashare/.local/share/datashare/plugins
      command: >-
        --dataSourceUrl jdbc:postgresql://postgresql/datashare?user=datashare\&password=password 
        --mode LOCAL
        --tcpListenPort 8080
      depends_on:
        - postgresql
        - redis
        - elasticsearch

...

volumes:
  datashare-models:
  datashare-extensions:
  datashare-plugins:
  elasticsearch-data:
  postgresql-data:
  neo4j_data:
  neo4j_conf:

System specs:

  • Host: Ubuntu 24
  • Datashare version: 20.8.2

Expected behavior
I would expect the Datashare container to restart cleanly after the extensions and plugin are installed.
