
Commit 79f5cf5

Merge pull request #11850 from GlobalDataverseCommunityConsortium/TDL-BigDataDocs

TDL: Provide guidance for site admins w.r.t. big data

2 parents 739052f + 252c9b3

File tree

7 files changed: +388, -21 lines
Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
+### Big Data Admin Section
+
+- A new section - Scaling Dataverse with Data Size - has been added to the Admin Guide. It is intended to help administrators configure Dataverse appropriately to handle larger amounts of data.

doc/sphinx-guides/source/admin/big-data-administration.rst

Lines changed: 321 additions & 0 deletions
Large diffs are not rendered by default.

doc/sphinx-guides/source/admin/index.rst

Lines changed: 1 addition & 0 deletions

@@ -35,3 +35,4 @@ This guide documents the functionality only available to superusers (such as "da
    maintenance
    backups
    troubleshooting
+   big-data-administration

doc/sphinx-guides/source/developers/big-data-support.rst

Lines changed: 1 addition & 1 deletion

@@ -196,6 +196,6 @@ As described in that document, Globus transfers can be initiated by choosing the
 
 An overview of the control and data transfer interactions between components was presented at the 2022 Dataverse Community Meeting and can be viewed in the `Integrations and Tools Session Video <https://youtu.be/3ek7F_Dxcjk?t=5289>`_ around the 1 hr 28 min mark.
 
-See also :ref:`Globus settings <:GlobusSettings>`.
+See also :ref:`Globus settings <:GlobusSettings>` and :ref:`globus-stores`.
 
 An alternative, experimental implementation of Globus polling of ongoing upload transfers has been added in v6.4. This framework does not rely on the instance staying up continuously for the duration of the transfer and saves the state information about Globus upload requests in the database. Due to its experimental nature it is not enabled by default. See the ``globus-use-experimental-async-framework`` feature flag (see :ref:`feature-flags`) and the JVM option :ref:`dataverse.files.globus-monitoring-server`.
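
For readers trying this out, a minimal sketch of turning both of these on with Payara's ``asadmin``. This is a hypothetical illustration, assuming the usual ``dataverse.feature.*`` prefix for feature flags and that the monitoring option is a boolean; check the sections linked above for the authoritative details::

  # Enable the experimental async Globus monitoring framework (feature flag).
  ./asadmin create-jvm-options "-Ddataverse.feature.globus-use-experimental-async-framework=true"
  # Designate this instance as the server that polls ongoing Globus transfers.
  ./asadmin create-jvm-options "-Ddataverse.files.globus-monitoring-server=true"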

doc/sphinx-guides/source/installation/config.rst

Lines changed: 57 additions & 18 deletions

@@ -1036,15 +1036,18 @@ File Storage
 
 By default, a Dataverse installation stores all data files (files uploaded by end users) on the filesystem at ``/usr/local/payara6/glassfish/domains/domain1/files``. This path can vary based on answers you gave to the installer (see the :ref:`dataverse-installer` section of the Installation Guide) or afterward by reconfiguring the ``dataverse.files.\<id\>.directory`` JVM option described below.
 
-A Dataverse installation can alternately store files in a Swift or S3-compatible object store, or on a Globus endpoint, and can now be configured to support multiple stores at once. With a multi-store configuration, the location for new files can be controlled on a per-Dataverse collection basis.
-
+A Dataverse installation can alternately store files in a Swift or S3-compatible object store, or on a Globus endpoint, and can now be configured to support multiple stores at once.
 A Dataverse installation may also be configured to reference some files (e.g. large and/or sensitive data) stored in a web or Globus accessible trusted remote store.
+With a multi-store configuration, the location for new files can be controlled on a per-Dataverse collection or per-dataset basis.
+:doc:`/admin/big-data-administration` provides more detail about the pros and cons of different types of storage.
 
 A Dataverse installation can be configured to allow out of band upload by setting the ``dataverse.files.\<id\>.upload-out-of-band`` JVM option to ``true``.
 By default, Dataverse supports uploading files via the :ref:`add-file-api`. With S3 stores, a direct upload process can be enabled to allow sending the file directly to the S3 store (without any intermediate copies on the Dataverse server).
 With the upload-out-of-band option enabled, it is also possible for file upload to be managed manually or via third-party tools, with the :ref:`Adding the Uploaded file to the Dataset <direct-add-to-dataset-api>` API call (described in the :doc:`/developers/s3-direct-upload-api` page) used to add metadata and inform Dataverse that a new file has been added to the relevant store.
 
-The following sections describe how to set up various types of stores and how to configure for multiple stores.
+The following sections describe how to set up various types of stores and how to configure for multiple stores. See also :ref:`choose-store`.
+
+.. _multiple-stores:
 
 Multi-store Basics
 ++++++++++++++++++
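
To make the multi-store mechanics above concrete, here is a hedged sketch of defining a file store and an S3 store and selecting the default; the store ids (``local``, ``s3archive``), directory, and bucket name are invented for illustration::

  # Define a filesystem-backed store with id "local".
  ./asadmin create-jvm-options "-Ddataverse.files.local.type=file"
  ./asadmin create-jvm-options "-Ddataverse.files.local.label=Local"
  ./asadmin create-jvm-options "-Ddataverse.files.local.directory=/usr/local/dvn/data"
  # Define an S3-backed store with id "s3archive".
  ./asadmin create-jvm-options "-Ddataverse.files.s3archive.type=s3"
  ./asadmin create-jvm-options "-Ddataverse.files.s3archive.label=S3Archive"
  ./asadmin create-jvm-options "-Ddataverse.files.s3archive.bucket-name=my-archive-bucket"
  # New files land in this store unless a collection or dataset overrides it.
  ./asadmin create-jvm-options "-Ddataverse.files.storage-driver-id=local"
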
@@ -1105,6 +1108,8 @@ File stores have one option - the directory where files should be stored. This c
 
 Multiple file stores should specify different directories (which would nominally be the reason to use multiple file stores), but one may share the same directory as "\-Ddataverse.files.directory" option - this would result in temp files being stored in the /temp subdirectory within the file store's root directory.
 
+See also :ref:`file-stores`.
+
 Swift Storage
 +++++++++++++
 
@@ -1200,6 +1205,8 @@ The Dataverse Software S3 driver supports multi-part upload for large files (ove
 
 **Note:** The Dataverse Project Team is most familiar with AWS S3, and can provide support on its usage with the Dataverse Software. Thanks to community contributions, the application's architecture also allows non-AWS S3 providers. The Dataverse Project Team can provide very limited support on these other providers. We recommend reaching out to the wider Dataverse Project Community if you have questions.
 
+See also :ref:`s3-stores`.
+
 First: Set Up Accounts and Access Credentials
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -1430,6 +1437,8 @@ You may provide the values for these via any `supported MicroProfile Config API
 2. A non-empty ``dataverse.files.<id>.profile`` will be ignored when no credentials can be found for this profile name.
    Current codebase does not make use of "named profiles" as seen for AWS CLI besides credentials.
 
+.. _s3-compatible:
+
 Reported Working S3-Compatible Storage
 ######################################
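
As one illustration of pointing an S3 store at a non-AWS provider, a sketch using the custom endpoint options documented in this section; the store id ``minio01``, endpoint URL, and bucket name are placeholders::

  # Hypothetical MinIO-style endpoint; adjust the URL and bucket for your provider.
  ./asadmin create-jvm-options "-Ddataverse.files.minio01.type=s3"
  ./asadmin create-jvm-options "-Ddataverse.files.minio01.label=MinIO"
  ./asadmin create-jvm-options "-Ddataverse.files.minio01.bucket-name=dataverse-files"
  ./asadmin create-jvm-options "-Ddataverse.files.minio01.custom-endpoint-url=https://minio.example.edu"
  # Many S3-compatible services require path-style (rather than virtual-hosted) access.
  ./asadmin create-jvm-options "-Ddataverse.files.minio01.path-style-access=true"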

@@ -1516,29 +1525,33 @@ In addition to having the type "remote" and requiring a label, Trusted Remote St
 These and other available options are described in the table below.
 
 Trusted remote stores can range from being a static trusted website to a sophisticated service managing access requests and logging activity
-and/or managing access to a secure enclave. See :doc:`/developers/big-data-support` for additional information on how to use a trusted remote store. For specific remote stores, consult their documentation when configuring the remote store in your Dataverse installation.
+and/or managing access to a secure enclave. See :doc:`/admin/big-data-administration` (specifically :ref:`remote-stores`) and :doc:`/developers/big-data-support` for additional information on how to use a trusted remote store. For specific remote stores, consult their documentation when configuring the remote store in your Dataverse installation.
 
-Note that in the current implementation, activites where Dataverse needs access to data bytes, e.g. to create thumbnails or validate hash values at publication will fail if a remote store does not allow Dataverse access. Implementers of such trusted remote stores should consider using Dataverse's settings to disable ingest, validation of files at publication, etc. as needed.
+Note that in the current implementation, activities where Dataverse needs access to data bytes, e.g. to create thumbnails or validate hash values at publication will fail if a remote store does not allow Dataverse access. Implementers of such trusted remote stores should consider using Dataverse's settings to disable ingest, validation of files at publication, etc. as needed.
 
 Once you have configured a trusted remote store, you can point your users to the :ref:`add-remote-file-api` section of the API Guide.
 
 .. table::
    :align: left
 
-   =========================================== ================== ========================================================================== ===================
-   JVM Option                                  Value              Description                                                                Default value
-   =========================================== ================== ========================================================================== ===================
-   dataverse.files.<id>.type                   ``remote``         **Required** to mark this storage as remote.                              (none)
-   dataverse.files.<id>.label                  <?>                **Required** label to be shown in the UI for this storage.                (none)
-   dataverse.files.<id>.base-url               <?>                **Required** All files must have URLs of the form <baseUrl>/* .           (none)
-   dataverse.files.<id>.base-store             <?>                **Required** The id of a base store (of type file, s3, or swift).         (the default store)
-   dataverse.files.<id>.download-redirect      ``true``/``false`` Enable direct download (should usually be true).                          ``false``
-   dataverse.files.<id>.secret-key             <?>                A key used to sign download requests sent to the remote store. Optional.  (none)
-   dataverse.files.<id>.url-expiration-minutes <?>                If direct downloads and using signing: time until links expire. Optional. 60
-   dataverse.files.<id>.remote-store-name      <?>                A short name used in the UI to indicate where a file is located. Optional. (none)
-   dataverse.files.<id>.remote-store-url       <?>                A url to an info page about the remote store used in the UI. Optional.    (none)
+   ======================================================= ================== ========================================================================== ===================
+   JVM Option                                              Value              Description                                                                Default value
+   ======================================================= ================== ========================================================================== ===================
+   dataverse.files.<id>.type                               ``remote``         **Required** to mark this storage as remote.                               (none)
+   dataverse.files.<id>.label                              <?>                **Required** label to be shown in the UI for this storage.                 (none)
+   dataverse.files.<id>.base-url                           <?>                **Required** All files must have URLs of the form <baseUrl>/* .            (none)
+   dataverse.files.<id>.base-store                         <?>                **Required** The id of a base store (of type file, s3, or swift).          (the default store)
+   dataverse.files.<id>.upload-out-of-band                 ``true``           **Required to be true** Dataverse does not manage file placement           ``false``
+   dataverse.files.<id>.download-redirect                  ``true``/``false`` Enable direct download (should usually be true).                           ``false``
+   dataverse.files.<id>.secret-key                         <?>                A key used to sign download requests sent to the remote store. Optional.   (none)
+   dataverse.files.<id>.public                             ``true``/``false`` True if the remote store does not enforce Dataverse access controls        ``false``
+   dataverse.files.<id>.ingestsizelimit                    <size in bytes>    Maximum size of files that should be ingested                              (none)
+   dataverse.files.<id>.url-expiration-minutes             <?>                If direct downloads and using signing: time until links expire. Optional.  60
+   dataverse.files.<id>.remote-store-name                  <?>                A short name used in the UI to indicate where a file is located. Optional. (none)
+   dataverse.files.<id>.remote-store-url                   <?>                A URL to an info page about the remote store used in the UI. Optional.     (none)
+   dataverse.files.<id>.files-not-accessible-by-dataverse  ``true``/``false`` True if the file is at the URL provided, false if that is a landing page   ``false``
 
-   =========================================== ================== ========================================================================== ===================
+   ======================================================= ================== ========================================================================== ===================
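
Pulling the required rows of the table together, a minimal sketch of a trusted remote store definition; the store id ``rs``, base URL, and base store id are hypothetical::

  ./asadmin create-jvm-options "-Ddataverse.files.rs.type=remote"
  ./asadmin create-jvm-options "-Ddataverse.files.rs.label=TrustedRemote"
  # Every file in this store must resolve to a URL under this base.
  ./asadmin create-jvm-options "-Ddataverse.files.rs.base-url=https://data.example.org/files"
  # Per the table above: the id of an existing file/s3/swift store.
  ./asadmin create-jvm-options "-Ddataverse.files.rs.base-store=local"
  # Required to be true: Dataverse does not manage file placement for remote stores.
  ./asadmin create-jvm-options "-Ddataverse.files.rs.upload-out-of-band=true"
  ./asadmin create-jvm-options "-Ddataverse.files.rs.download-redirect=true"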
 
 .. _globus-storage:
 
@@ -1578,6 +1591,7 @@ Once you have configured a globus store, or configured an S3 store for Globus ac
                                                                             for a managed store) - using a microprofile alias is recommended          (none)
 dataverse.files.<id>.reference-endpoints-with-basepaths <?>                A comma separated list of *remote* trusted Globus endpoint id/<basePath>s (none)
 dataverse.files.<id>.files-not-accessible-by-dataverse  ``true``/``false`` Should be false for S3 Connector-based *managed* stores, true for others  ``false``
+dataverse.files.<id>.public                             ``true``/``false`` True can be used to disable users' ability to restrict/embargo files      ``false``
 
 ======================================================= ================== ========================================================================== ===================
 
@@ -2804,6 +2818,8 @@ when using it to configure your core name!
 
 Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_SOLR_PATH``.
 
+.. _dataverse.solr.min-files-to-use-proxy:
+
 dataverse.solr.min-files-to-use-proxy
 +++++++++++++++++++++++++++++++++++++
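
Because this option reads from MicroProfile Config, the same value can also come from the environment instead of a JVM option; a brief sketch, using the ~1000 value the next hunk's context recommends::

  export DATAVERSE_SOLR_MIN_FILES_TO_USE_PROXY=1000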

@@ -2815,6 +2831,8 @@ A recommended value would be ~1000 but the optimal value may vary depending on d
 
 Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_SOLR_MIN_FILES_TO_USE_PROXY``.
 
+.. _dataverse.solr.concurrency.max-async-indexes:
+
 dataverse.solr.concurrency.max-async-indexes
 ++++++++++++++++++++++++++++++++++++++++++++
 
@@ -4447,6 +4465,8 @@ Notes:
 
 - For larger file upload sizes, you may need to configure your reverse proxy timeout. If using apache2 (httpd) with Shibboleth, add a timeout to the ProxyPass defined in etc/httpd/conf.d/ssl.conf (which is described in the :doc:`/installation/shibboleth` setup).
 
+.. _:MultipleUploadFilesLimit:
+
 :MultipleUploadFilesLimit
 +++++++++++++++++++++++++
 
@@ -4525,6 +4545,8 @@ Examples:
 
 ``curl -X PUT -d '{"default":"0", "CSV":"268435456"}' http://localhost:8080/api/admin/settings/:TabularIngestSizeLimit``
 
+.. _:ZipUploadFilesLimit:
+
 :ZipUploadFilesLimit
 ++++++++++++++++++++
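
Like the other database settings in this guide, this one is set through the admin settings API; a sketch with an arbitrary example value::

  # Allow up to 5000 files per uploaded .zip (the default limit is 1000).
  curl -X PUT -d 5000 http://localhost:8080/api/admin/settings/:ZipUploadFilesLimit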

@@ -4543,13 +4565,17 @@ By default your Dataverse installation will attempt to connect to Solr on port 8
 
 **Note:** instead of using a database setting, you could alternatively use JVM settings like :ref:`dataverse.solr.host`.
 
+.. _:SolrFullTextIndexing:
+
 :SolrFullTextIndexing
 +++++++++++++++++++++
 
 Whether or not to index the content of files such as PDFs. The default is false.
 
 ``curl -X PUT -d true http://localhost:8080/api/admin/settings/:SolrFullTextIndexing``
 
+.. _:SolrMaxFileSizeForFullTextIndexing:
+
 :SolrMaxFileSizeForFullTextIndexing
 +++++++++++++++++++++++++++++++++++
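
This setting follows the same API pattern; a sketch assuming the limit is expressed in bytes, like the other size limits in this guide::

  # Only send files up to ~100 MB (104857600 bytes) for full-text indexing.
  curl -X PUT -d 104857600 http://localhost:8080/api/admin/settings/:SolrMaxFileSizeForFullTextIndexing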

@@ -4571,12 +4597,15 @@ To enable the setting::
 
   curl -X PUT -d true "http://localhost:8080/api/admin/settings/:DisableSolrFacets"
 
+.. _:DisableSolrFacetsForGuestUsers:
 
 :DisableSolrFacetsForGuestUsers
 +++++++++++++++++++++++++++++++
 
 Similar to the above, but will disable the facets for Guest (unauthenticated) users only.
 
+.. _:DisableSolrFacetsWithoutJsession:
+
 :DisableSolrFacetsWithoutJsession
 +++++++++++++++++++++++++++++++++

@@ -5079,6 +5108,8 @@ If you don’t want date facets to be sorted chronologically, set:
 
 ``curl -X PUT -d 'false' http://localhost:8080/api/admin/settings/:ChronologicalDateFacets``
 
+.. _:CustomZipDownloadServiceUrl:
+
 :CustomZipDownloadServiceUrl
 ++++++++++++++++++++++++++++
 
@@ -5210,6 +5241,8 @@ A suggested minimum includes author, datasetContact, and contributor, but additi
 
 ``curl -X PUT -d 'author, datasetContact, contributor, depositor, grantNumber, publication' http://localhost:8080/api/admin/settings/:AnonymizedFieldTypeNames``
 
+.. _:DatasetChecksumValidationSizeLimit:
+
 :DatasetChecksumValidationSizeLimit
 +++++++++++++++++++++++++++++++++++
 
@@ -5225,6 +5258,8 @@ Refer to "Physical Files Validation in a Dataset" API :ref:`dataset-files-valida
 
 Also refer to the "Datafile Integrity" API :ref:`datafile-integrity`
 
+.. _:DataFileChecksumValidationSizeLimit:
+
 :DataFileChecksumValidationSizeLimit
 ++++++++++++++++++++++++++++++++++++
 
@@ -5486,6 +5521,8 @@ To use the current GDCC version directly:
 
 ``curl -X PUT -d 'https://gdcc.github.io/dvwebloader/src/dvwebloader.html' http://localhost:8080/api/admin/settings/:WebloaderUrl``
 
+.. _:CategoryOrder:
+
 :CategoryOrder
 ++++++++++++++
 
@@ -5495,6 +5532,8 @@ The default is category ordering disabled.
 
 ``curl -X PUT -d 'Documentation,Data,Code' http://localhost:8080/api/admin/settings/:CategoryOrder``
 
+.. _:OrderByFolder:
+
 :OrderByFolder
 ++++++++++++++

doc/sphinx-guides/source/installation/prep.rst

Lines changed: 1 addition & 1 deletion

@@ -117,7 +117,7 @@ Decisions to Make
 
 Here are some questions to keep in the back of your mind as you test and move into production:
 
-- How much storage do I need?
+- How much storage do I need? What is the scale of data I will need to handle (see :doc:`/admin/big-data-administration`)?
 - Which features do I want based on :ref:`architecture`?
 - How do I want my users to log in to the Dataverse installation? With local accounts? With Shibboleth/SAML? With OAuth providers such as ORCID, GitHub, or Google?
 - Do I want to run my app server on the standard web ports (80 and 443) or do I want to "front" my app server with a proxy such as Apache or nginx? See "Network Ports" in the :doc:`config` section.

doc/sphinx-guides/source/user/dataset-management.rst

Lines changed: 4 additions & 1 deletion

@@ -391,14 +391,17 @@ If the bounding box was successfully populated, :ref:`geospatial-search` should
 Compressed Files
 ----------------
 
-Compressed files in .zip format are unpacked automatically. If a .zip file fails to unpack for whatever reason, it will upload as is. If the number of files inside are more than a set limit (1,000 by default, configurable by the Administrator), you will get an error message and the .zip file will upload as is.
+Depending on the configuration, compressed files in .zip format are unpacked automatically. If a .zip file is not unpacked, it will upload as is.
+If the number of files inside is more than a set limit (1,000 by default, configurable by the Administrator), you will get an error message and the .zip file will upload as is.
 
 If the uploaded .zip file contains a folder structure, the Dataverse installation will keep track of this structure. A file's location within this folder structure is displayed in the file metadata as the File Path. When you download the contents of the dataset, this folder structure will be preserved and files will appear in their original locations.
 
 These folder names are subject to strict validation rules. Only the following characters are allowed: the alphanumerics, '_', '-', '.' and ' ' (white space). When a zip archive is uploaded, the folder names are automatically sanitized, with any invalid characters replaced by the '.' character. Any sequences of dots are further replaced with a single dot. For example, the folder name ``data&info/code=@137`` will be converted to ``data.info/code.137``. When uploading through the Web UI, the user can change the values further on the edit form presented, before clicking the 'Save' button.
 
 .. note:: If you upload multiple .zip files to one dataset, any subdirectories that are identical across multiple .zips will be merged together when the user downloads the full dataset.
 
+If a .zip file is not unpacked and the Zip Previewer is installed (see :ref:`file-previews`), it will be possible for users to view the contents of the zip file and to download individual files from within the .zip.
+
 Other File Types
 ----------------
