Commit bceda42

Merge branch 'develop' into 10909-datacite-oai-harvesting
resolved conflicts: doc/sphinx-guides/source/api/native-api.rst src/main/java/edu/harvard/iq/dataverse/harvest/client/HarvestingClient.java
2 parents 8ec87f1 + 7a29522 commit bceda42

90 files changed: +1699 -514 lines changed

.env

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 APP_IMAGE=gdcc/dataverse:unstable
 POSTGRES_VERSION=17
 DATAVERSE_DB_USER=dataverse
-SOLR_VERSION=9.3.0
-SKIP_DEPLOY=0
+SOLR_VERSION=9.8.0
+SKIP_DEPLOY=0

.github/workflows/copy_labels.yml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+name: Copy labels from issue to pull request
+
+on:
+  pull_request:
+    types: [opened]
+
+jobs:
+  copy-labels:
+    runs-on: ubuntu-latest
+    name: Copy labels from linked issues
+    steps:
+      - name: copy-labels
+        uses: michalvankodev/[email protected]
+        with:
+          repo-token: ${{ secrets.GITHUB_TOKEN }}

.github/workflows/deploy_beta_testing.yml

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ jobs:
           overwrite: true
 
       - name: Execute payara war deployment remotely
-        uses: appleboy/[email protected].0
+        uses: appleboy/[email protected].1
         env:
           INPUT_WAR_FILE: ${{ env.war_file }}
         with:

conf/solr/schema.xml

Lines changed: 26 additions & 25 deletions
@@ -38,36 +38,37 @@
 catchall "text" field, and use that for searching.
 -->
 
-<schema name="default-config" version="1.6">
+<schema name="default-config" version="1.7">
 <!-- attribute "name" is the name of this schema and is only used for display purposes.
-version="x.y" is Solr's version number for the schema syntax and
+version="x.y" is Solr's version number for the schema syntax and
 semantics. It should not normally be changed by applications.
 
-1.0: multiValued attribute did not exist, all fields are multiValued
+1.0: multiValued attribute did not exist, all fields are multiValued
 by nature
-1.1: multiValued attribute introduced, false by default
-1.2: omitTermFreqAndPositions attribute introduced, true by default
+1.1: multiValued attribute introduced, false by default
+1.2: omitTermFreqAndPositions attribute introduced, true by default
 except for text fields.
 1.3: removed optional field compress feature
 1.4: autoGeneratePhraseQueries attribute introduced to drive QueryParser
-behavior when a single string produces multiple tokens. Defaults
+behavior when a single string produces multiple tokens. Defaults
 to off for version >= 1.4
-1.5: omitNorms defaults to true for primitive field types
+1.5: omitNorms defaults to true for primitive field types
 (int, float, boolean, string...)
 1.6: useDocValuesAsStored defaults to true.
+1.7: docValues defaults to true, uninvertible defaults to false.
 -->
 
 <!-- Valid attributes for fields:
 name: mandatory - the name for the field
-type: mandatory - the name of a field type from the
+type: mandatory - the name of a field type from the
 fieldTypes section
 indexed: true if this field should be indexed (searchable or sortable)
 stored: true if this field should be retrievable
 docValues: true if this field should have doc values. Doc Values is
 recommended (required, if you are using *Point fields) for faceting,
 grouping, sorting and function queries. Doc Values will make the index
-faster to load, more NRT-friendly and more memory-efficient.
-They are currently only supported by StrField, UUIDField, all
+faster to load, more NRT-friendly and more memory-efficient.
+They are currently only supported by StrField, UUIDField, all
 *PointFields, and depending on the field type, they might require
 the field to be single-valued, be required or have a default value
 (check the documentation of the field type you're interested in for
@@ -82,9 +83,9 @@
 given field.
 When using MoreLikeThis, fields used for similarity should be
 stored for best performance.
-termPositions: Store position information with the term vector.
+termPositions: Store position information with the term vector.
 This will increase storage costs.
-termOffsets: Store offset information with the term vector. This
+termOffsets: Store offset information with the term vector. This
 will increase storage costs.
 required: The field is required. It will throw an error if the
 value does not exist
@@ -102,10 +103,10 @@
 <!-- In this _default configset, only four fields are pre-declared:
 id, _version_, and _text_ and _root_. All other fields will be type guessed and added via the
 "add-unknown-fields-to-the-schema" update request processor chain declared in solrconfig.xml.
-
-Note that many dynamic fields are also defined - you can use them to specify a
+
+Note that many dynamic fields are also defined - you can use them to specify a
 field's type via field naming conventions - see below.
-
+
 WARNING: The _text_ catch-all field will significantly increase your index size.
 If you don't need it, consider removing it and the corresponding copyField directive."
 -->
@@ -115,12 +116,12 @@
 <field name="_version_" type="plong" indexed="false" stored="false"/>
 <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
 
-
-
-
-
-<!-- Start: Dataverse-specific -->
-
+
+
+
+
+<!-- Start: Dataverse-specific -->
+
 <!-- catchall field, containing all other searchable text fields (implemented
 via copyField further on in this schema -->
 <!-- Dataverse solr 7.3.0: for some reason the old text wasn't working so switched to _text_ for copyfields -->
@@ -216,7 +217,7 @@
 <!-- https://redmine.hmdc.harvard.edu/issues/3482 -->
 <!-- 'Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer)' http://wiki.apache.org/solr/CommonQueryParameters#sort -->
 <!-- http://stackoverflow.com/questions/13360706/solr-4-0-alphabetical-sorting-trouble/13361226#13361226 -->
-<field name="nameSort" type="alphaOnlySort" indexed="true" stored="true"/>
+<field name="nameSort" type="string" indexed="true" stored="true"/>
 
 <field name="dateSort" type="pdate" indexed="true" stored="true"/>
 
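The `nameSort` change from `alphaOnlySort` to `string` may alter sort order, because a `StrField` is indexed verbatim. The sketch below contrasts the two orderings in Python. It assumes the old `alphaOnlySort` type lowercased and trimmed values: a `TrimFilterFactory` appears in a later hunk, but the lowercasing is an assumption, since the full type definition is not shown in this diff. Python's code-point comparison stands in for Solr's string ordering.

```python
# Sample values are made up for illustration.
names = ["beta", "Alpha", " gamma", "Delta"]

# Verbatim ordering: roughly how an unanalyzed string field would compare values.
verbatim = sorted(names)

# Normalized ordering: approximates a sort analyzer that trims and lowercases
# (assumed behaviour of the replaced alphaOnlySort type).
normalized = sorted(names, key=lambda s: s.strip().lower())

print(verbatim)    # leading space and uppercase letters sort first
print(normalized)  # case and surrounding whitespace are ignored
```

If case-insensitive ordering matters for `nameSort`, the value written to the field would presumably need to be normalized before indexing. The diff does not show whether that happens.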

@@ -785,7 +786,7 @@
 <filter class="solr.TrimFilterFactory" />
 <!-- The PatternReplaceFilter gives you the flexibility to use
 Java Regular expression to replace any sequence of characters
-matching a pattern with an arbitrary replacement string,
+matching a pattern with an arbitrary replacement string,
 which may include back references to portions of the original
 string matched by the pattern.
 
@@ -798,8 +799,8 @@
 <!-- https://redmine.hmdc.harvard.edu/issues/3482#note-11 -->
 <!-- <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" /> -->
 </analyzer>
-</fieldType>
-
+</fieldType>
+
 <!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
 <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />
 <fieldType name="strings" class="solr.StrField" sortMissingLast="true" multiValued="true" docValues="true" />

conf/solr/solrconfig.xml

Lines changed: 21 additions & 71 deletions
@@ -35,52 +35,7 @@
 that you fully re-index after changing this setting as it can
 affect both how text is indexed and queried.
 -->
-<luceneMatchVersion>9.7</luceneMatchVersion>
-
-<!-- <lib/> directives can be used to instruct Solr to load any Jars
-identified and use them to resolve any "plugins" specified in
-your solrconfig.xml or schema.xml (ie: Analyzers, Request
-Handlers, etc...).
-
-All directories and paths are resolved relative to the
-instanceDir.
-
-Please note that <lib/> directives are processed in the order
-that they appear in your solrconfig.xml file, and are "stacked"
-on top of each other when building a ClassLoader - so if you have
-plugin jars with dependencies on other jars, the "lower level"
-dependency jars should be loaded first.
-
-If a "./lib" directory exists in your instanceDir, all files
-found in it are included as if you had used the following
-syntax...
-
-<lib dir="./lib" />
--->
-
-<!-- A 'dir' option by itself adds any files found in the directory
-to the classpath, this is useful for including all jars in a
-directory.
-
-When a 'regex' is specified in addition to a 'dir', only the
-files in that directory which completely match the regex
-(anchored on both ends) will be included.
-
-If a 'dir' option (with or without a regex) is used and nothing
-is found that matches, a warning will be logged.
-
-The example below can be used to load a Solr Module along
-with their external dependencies.
--->
-<!-- <lib dir="${solr.install.dir:../../../..}/modules/ltr/lib" regex=".*\.jar" /> -->
-
-<!-- an exact 'path' can be used instead of a 'dir' to specify a
-specific jar file. This will cause a serious error to be logged
-if it can't be loaded.
--->
-<!--
-<lib path="../a-jar-that-does-not-exist.jar" />
--->
+<luceneMatchVersion>9.11</luceneMatchVersion>
 
 <!-- Data Directory
 
@@ -256,16 +211,9 @@
 is recommended (see below).
 "dir" - the target directory for transaction logs, defaults to the
 solr data directory.
-"numVersionBuckets" - sets the number of buckets used to keep
-track of max version values when checking for re-ordered
-updates; increase this value to reduce the cost of
-synchronizing access to version buckets during high-volume
-indexing, this requires 8 bytes (long) * numVersionBuckets
-of heap space per Solr core.
 -->
 <updateLog>
 <str name="dir">${solr.ulog.dir:}</str>
-<int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
 </updateLog>
 
 <!-- AutoCommit
@@ -360,6 +308,21 @@
 -->
 <maxBooleanClauses>${solr.max.booleanClauses:1024}</maxBooleanClauses>
 
+<!-- Minimum acceptable prefix-size for prefix-based queries.
+
+Prefix-based queries consume memory in proportion to the number of terms in the index
+that start with that prefix. Short prefixes tend to match many many more indexed-terms
+and consume more memory as a result, sometimes causing stability issues on the node.
+
+This setting allows administrators to require that prefixes meet or exceed a specified
+minimum length requirement. Prefix queries that don't meet this requirement return an
+error to users. The limit may be overridden on a per-query basis by specifying a
+'minPrefixQueryTermLength' local-param value.
+
+The flag value of '-1' can be used to disable enforcement of this limit.
+-->
+<minPrefixQueryTermLength>${solr.query.minPrefixLength:-1}</minPrefixQueryTermLength>
+
 <!-- Solr Internal Query Caches
 Starting with Solr 9.0 the default cache implementation used is CaffeineCache.
 -->
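The comment added in this hunk describes a length gate on prefix queries, and the `${solr.query.minPrefixLength:-1}` placeholder follows the `${property:default}` substitution form used elsewhere in this file. The Python sketch below is a small model of that behaviour, not Solr's implementation. The helper names `resolve_property` and `prefix_allowed` are hypothetical.

```python
import re

# Hypothetical helpers that model the behaviour described in the comment;
# this is an illustration, not Solr code.
_PLACEHOLDER = re.compile(r"^\$\{([^:}]+):([^}]*)\}$")


def resolve_property(placeholder: str, props: dict[str, str]) -> str:
    """Resolve a '${name:default}' placeholder against a property map."""
    match = _PLACEHOLDER.match(placeholder)
    if match is None:
        return placeholder  # literal value, nothing to substitute
    name, default = match.groups()
    return props.get(name, default)


def prefix_allowed(prefix: str, min_length: int) -> bool:
    """Return True if a prefix query term passes the minimum-length check.

    A min_length of -1 disables enforcement, as the comment states.
    """
    if min_length == -1:
        return True
    return len(prefix) >= min_length


setting = "${solr.query.minPrefixLength:-1}"

# No property set: the default of -1 applies and no prefix is rejected.
default_limit = int(resolve_property(setting, {}))
print(default_limit, prefix_allowed("a", default_limit))

# Property set to 3: prefixes shorter than three characters would be rejected.
strict_limit = int(resolve_property(setting, {"solr.query.minPrefixLength": "3"}))
print(strict_limit, prefix_allowed("ab", strict_limit), prefix_allowed("abc", strict_limit))
```

As committed, the default is `-1`, so the limit stays off unless an administrator sets the `solr.query.minPrefixLength` property.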
@@ -494,23 +457,6 @@
 -->
 <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
 
-<!-- Use Filter For Sorted Query
-
-A possible optimization that attempts to use a filter to
-satisfy a search. If the requested sort does not include
-score, then the filterCache will be checked for a filter
-matching the query. If found, the filter will be used as the
-source of document ids, and then the sort will be applied to
-that.
-
-For most situations, this will not be useful unless you
-frequently get the same search repeatedly with different sort
-options, and none of them ever use "score"
--->
-<!--
-<useFilterForSortedQuery>true</useFilterForSortedQuery>
--->
-
 <!-- Query Related Event Listeners
 
 Various IndexSearcher related events can trigger Listeners to
@@ -1015,6 +961,10 @@
 <str name="pattern">[^\w-\.]</str>
 <str name="replacement">_</str>
 </updateProcessor>
+<updateProcessor class="solr.NumFieldLimitingUpdateRequestProcessorFactory" name="max-fields">
+<int name="maxFields">1000</int>
+<bool name="warnOnly">true</bool>
+</updateProcessor>
 <updateProcessor class="solr.ParseBooleanFieldUpdateProcessorFactory" name="parse-boolean"/>
 <updateProcessor class="solr.ParseLongFieldUpdateProcessorFactory" name="parse-long"/>
 <updateProcessor class="solr.ParseDoubleFieldUpdateProcessorFactory" name="parse-double"/>
@@ -1061,7 +1011,7 @@
 
 <!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
 <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:false}"
-processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
+processor="uuid,remove-blank,field-name-mutating,max-fields,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
 <processor class="solr.LogUpdateProcessorFactory"/>
 <processor class="solr.DistributedUpdateProcessorFactory"/>
 <processor class="solr.RunUpdateProcessorFactory"/>
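The new `max-fields` processor is configured with `maxFields` set to 1000 and `warnOnly` set to true, and it sits in the chain ahead of the `parse-*` and `add-schema-fields` steps. The Python model below is one reading of those two settings, not Solr's code. It assumes the guard compares the schema's current field count against the limit, and that with `warnOnly` off an over-limit update would be rejected. `check_field_limit` and `FieldLimitExceeded` are hypothetical names.

```python
import logging

logger = logging.getLogger("max-fields-sketch")


class FieldLimitExceeded(Exception):
    """Raised when the modelled limit is enforced rather than only logged."""


def check_field_limit(current_fields: int, max_fields: int = 1000, warn_only: bool = True) -> bool:
    """Model of a field-count guard placed early in an update chain.

    Returns True when the update may proceed. If the schema already holds
    more than max_fields fields, the guard either logs a warning (warn_only)
    or raises, which is assumed to correspond to rejecting the update.
    """
    if current_fields <= max_fields:
        return True
    message = f"schema has {current_fields} fields, limit is {max_fields}"
    if warn_only:
        logger.warning(message)
        return True
    raise FieldLimitExceeded(message)


# With the committed settings (warnOnly=true) an oversized schema only warns.
print(check_field_limit(1200))
```

Under this reading, the committed configuration never blocks indexing. It only makes unexpected growth in the number of fields visible in the logs.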
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+Release Highlights:
+An experimental "Archival" metadata block has been added, [downloadable](https://dataverse-guide--10626.org.readthedocs.build/en/10626/user/appendix.html) from the User Guide. The purpose of the metadata block is to enable repositories to register metadata relating to the potential archiving of the dataset at a depositor archive, whether that being your own institutional archive or an external archive, i.e. a historical archive. See also #10626.
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+Solr 9.8.0 is now the version recommended in our installation guides and used with automated testing. Other libraries Dataverse uses have been updated as well.
+
+For the upgrade instructions section:
+
+[note that 6.6 may contain other solr-related changes, so the instructions may need to contain information merged from multiple release notes!]
+
+If you are upgrading Solr:
+- Install solr-9.8.0 following the instructions from the Installation guide.
+- Run a full reindex to populate the search catalog.
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+- License metadata enhancements (#10883):
+  - Added new fields to licenses: rightsIdentifier, rightsIdentifierScheme, schemeUri, languageCode
+  - Updated DataCite metadata export to include rightsIdentifier, rightsIdentifierScheme, and schemeUri consistent with the DataCite 4.5 schema and examples
+  - Enhanced metadata exports to include all new license fields
+  - Existing licenses from the example set included with Dataverse will be automatically updated with new fields
+  - Existing API calls support the new optional fields
+
+Setup: For existing published datasets, the additional license metadata will not be available from DataCite or in metadata exports until the dataset is republished or
+- the /api/admin/metadata/{id}/reExportDataset is run for the dataset
+- the api/datasets/{id}/modifyRegistrationMetadata API is run for the dataset,
+or the global version of these api calls (/api/admin/metadata/reExportAll, /api/datasets/modifyRegistrationPIDMetadataAll) are used.
+
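The setup note above names two per-dataset endpoints and two global ones. As a rough sketch, the Python helpers below assemble those URLs. The endpoint paths are copied from the note. The base URL, the dataset id and the helper names are placeholders, and the HTTP method and any authentication are deliberately left out because the note does not state them.

```python
from urllib.parse import quote

# Endpoint paths come from the release note; everything else is a placeholder.


def per_dataset_urls(base_url: str, dataset_id: str) -> dict[str, str]:
    """Build the two per-dataset URLs mentioned in the setup note."""
    base = base_url.rstrip("/")
    ident = quote(str(dataset_id), safe="")
    return {
        "reexport": f"{base}/api/admin/metadata/{ident}/reExportDataset",
        "modify_registration": f"{base}/api/datasets/{ident}/modifyRegistrationMetadata",
    }


def global_urls(base_url: str) -> dict[str, str]:
    """Build the two installation-wide URLs mentioned in the setup note."""
    base = base_url.rstrip("/")
    return {
        "reexport_all": f"{base}/api/admin/metadata/reExportAll",
        "modify_registration_all": f"{base}/api/datasets/modifyRegistrationPIDMetadataAll",
    }


# Hypothetical local installation and dataset id.
print(per_dataset_urls("http://localhost:8080", "42"))
print(global_urls("http://localhost:8080/"))
```

Per the note, a dataset needs both its exports regenerated and its registration metadata updated before the new license fields appear everywhere. Republishing the dataset is the alternative.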

doc/release-notes/11095-fix-extcvoc-indexing.md

Lines changed: 1 addition & 1 deletion
@@ -3,5 +3,5 @@ in indexing failure for the dataset (e.g. when the script tried to index both th
 Dataverse has been updated to correctly indicate the need for a multi-valued Solr field in these cases in the call to /api/admin/index/solr/schema.
 Configuring the Solr schema and the update-fields.sh script as usually recommended when using custom metadata blocks will resolve the issue.
 
-The overall release notes should include a Solr update (which hopefully is required by an update to 9.7.0 anyway) and our standard instructions
+The overall release notes should include a Solr update (which hopefully is required by an update to 9.8.0 anyway) and our standard instructions
 should change to recommending use of the update-fields.sh script when using custom metadatablocks *and/or external vocabulary scripts*.
Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 This release fixes a bug that caused Dataverse to generate unnecessary solr documents for files when a file is added/deleted from a draft dataset. These documents could accumulate and potentially impact performance.
 
-Assuming the upgrade to solr 9.7.0 also occurs in this release, there's nothing else needed for this PR. (Starting with a new solr insures the solr db is empty and that a reindex is already required.)
+Assuming the upgrade to solr 9.8.0 also occurs in this release, there's nothing else needed for this PR. (Starting with a new solr insures the solr db is empty and that a reindex is already required.)
 
 