Commit c235c2a

Merge pull request #10713 from QualitativeDataRepository/QDR-solr_and_libs_updates
Solr 9.8.0 and other lib updates from QDR
2 parents 095728f + c24b224 commit c235c2a

13 files changed, 85 additions and 127 deletions

.env

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 APP_IMAGE=gdcc/dataverse:unstable
 POSTGRES_VERSION=17
 DATAVERSE_DB_USER=dataverse
-SOLR_VERSION=9.3.0
-SKIP_DEPLOY=0
+SOLR_VERSION=9.8.0
+SKIP_DEPLOY=0

conf/solr/schema.xml

Lines changed: 26 additions & 25 deletions
@@ -38,36 +38,37 @@
 catchall "text" field, and use that for searching.
 -->

-<schema name="default-config" version="1.6">
+<schema name="default-config" version="1.7">
 <!-- attribute "name" is the name of this schema and is only used for display purposes.
-version="x.y" is Solr's version number for the schema syntax and
+version="x.y" is Solr's version number for the schema syntax and
 semantics. It should not normally be changed by applications.

-1.0: multiValued attribute did not exist, all fields are multiValued
+1.0: multiValued attribute did not exist, all fields are multiValued
 by nature
-1.1: multiValued attribute introduced, false by default
-1.2: omitTermFreqAndPositions attribute introduced, true by default
+1.1: multiValued attribute introduced, false by default
+1.2: omitTermFreqAndPositions attribute introduced, true by default
 except for text fields.
 1.3: removed optional field compress feature
 1.4: autoGeneratePhraseQueries attribute introduced to drive QueryParser
-behavior when a single string produces multiple tokens. Defaults
+behavior when a single string produces multiple tokens. Defaults
 to off for version >= 1.4
-1.5: omitNorms defaults to true for primitive field types
+1.5: omitNorms defaults to true for primitive field types
 (int, float, boolean, string...)
 1.6: useDocValuesAsStored defaults to true.
+1.7: docValues defaults to true, uninvertible defaults to false.
 -->

 <!-- Valid attributes for fields:
 name: mandatory - the name for the field
-type: mandatory - the name of a field type from the
+type: mandatory - the name of a field type from the
 fieldTypes section
 indexed: true if this field should be indexed (searchable or sortable)
 stored: true if this field should be retrievable
 docValues: true if this field should have doc values. Doc Values is
 recommended (required, if you are using *Point fields) for faceting,
 grouping, sorting and function queries. Doc Values will make the index
-faster to load, more NRT-friendly and more memory-efficient.
-They are currently only supported by StrField, UUIDField, all
+faster to load, more NRT-friendly and more memory-efficient.
+They are currently only supported by StrField, UUIDField, all
 *PointFields, and depending on the field type, they might require
 the field to be single-valued, be required or have a default value
 (check the documentation of the field type you're interested in for
@@ -82,9 +83,9 @@
 given field.
 When using MoreLikeThis, fields used for similarity should be
 stored for best performance.
-termPositions: Store position information with the term vector.
+termPositions: Store position information with the term vector.
 This will increase storage costs.
-termOffsets: Store offset information with the term vector. This
+termOffsets: Store offset information with the term vector. This
 will increase storage costs.
 required: The field is required. It will throw an error if the
 value does not exist
@@ -102,10 +103,10 @@
 <!-- In this _default configset, only four fields are pre-declared:
 id, _version_, and _text_ and _root_. All other fields will be type guessed and added via the
 "add-unknown-fields-to-the-schema" update request processor chain declared in solrconfig.xml.
-
-Note that many dynamic fields are also defined - you can use them to specify a
+
+Note that many dynamic fields are also defined - you can use them to specify a
 field's type via field naming conventions - see below.
-
+
 WARNING: The _text_ catch-all field will significantly increase your index size.
 If you don't need it, consider removing it and the corresponding copyField directive."
 -->
@@ -115,12 +116,12 @@
 <field name="_version_" type="plong" indexed="false" stored="false"/>
 <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />

-
-
-
-
-<!-- Start: Dataverse-specific -->
-
+
+
+
+
+<!-- Start: Dataverse-specific -->
+
 <!-- catchall field, containing all other searchable text fields (implemented
 via copyField further on in this schema -->
 <!-- Dataverse solr 7.3.0: for some reason the old text wasn't working so switched to _text_ for copyfields -->
@@ -216,7 +217,7 @@
 <!-- https://redmine.hmdc.harvard.edu/issues/3482 -->
 <!-- 'Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer)' http://wiki.apache.org/solr/CommonQueryParameters#sort -->
 <!-- http://stackoverflow.com/questions/13360706/solr-4-0-alphabetical-sorting-trouble/13361226#13361226 -->
-<field name="nameSort" type="alphaOnlySort" indexed="true" stored="true"/>
+<field name="nameSort" type="string" indexed="true" stored="true"/>

 <field name="dateSort" type="pdate" indexed="true" stored="true"/>

@@ -785,7 +786,7 @@
 <filter class="solr.TrimFilterFactory" />
 <!-- The PatternReplaceFilter gives you the flexibility to use
 Java Regular expression to replace any sequence of characters
-matching a pattern with an arbitrary replacement string,
+matching a pattern with an arbitrary replacement string,
 which may include back references to portions of the original
 string matched by the pattern.

@@ -798,8 +799,8 @@
 <!-- https://redmine.hmdc.harvard.edu/issues/3482#note-11 -->
 <!-- <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" /> -->
 </analyzer>
-</fieldType>
-
+</fieldType>
+
 <!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
 <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />
 <fieldType name="strings" class="solr.StrField" sortMissingLast="true" multiValued="true" docValues="true" />
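
The bump of the `<schema>` version from 1.6 to 1.7 changes the defaults that apply to fields which do not set an attribute explicitly. The sketch below models that rule as read from the changelog comment in this hunk. The function names are hypothetical, the pre-1.7 values for `docValues` and `uninvertible` are assumed to be the opposite of the new defaults, and field-type-specific exceptions are ignored.

```python
from typing import Dict, Optional, Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Turn a schema version string such as "1.7" into a comparable tuple."""
    major, minor = text.split(".")
    return int(major), int(minor)


def schema_defaults(version: str) -> Dict[str, bool]:
    """Defaults for the attributes named in the 1.6 and 1.7 changelog entries.

    Assumption: before 1.7, docValues defaulted to false and
    uninvertible defaulted to true.
    """
    v = parse_version(version)
    return {
        "useDocValuesAsStored": v >= (1, 6),
        "docValues": v >= (1, 7),
        "uninvertible": v < (1, 7),
    }


def effective(attr: str, explicit: Optional[bool], version: str) -> bool:
    """An explicitly set attribute always wins over the version default."""
    if explicit is not None:
        return explicit
    return schema_defaults(version)[attr]


if __name__ == "__main__":
    print(schema_defaults("1.6"))
    print(schema_defaults("1.7"))
    # _root_ declares docValues="false", so the bump does not affect it.
    print(effective("docValues", False, "1.7"))
```

Under this model, a field such as `_root_` that declares `docValues="false"` keeps that value, while fields that omit the attribute pick up the new default.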

conf/solr/solrconfig.xml

Lines changed: 21 additions & 71 deletions
@@ -35,52 +35,7 @@
 that you fully re-index after changing this setting as it can
 affect both how text is indexed and queried.
 -->
-<luceneMatchVersion>9.7</luceneMatchVersion>
-
-<!-- <lib/> directives can be used to instruct Solr to load any Jars
-identified and use them to resolve any "plugins" specified in
-your solrconfig.xml or schema.xml (ie: Analyzers, Request
-Handlers, etc...).
-
-All directories and paths are resolved relative to the
-instanceDir.
-
-Please note that <lib/> directives are processed in the order
-that they appear in your solrconfig.xml file, and are "stacked"
-on top of each other when building a ClassLoader - so if you have
-plugin jars with dependencies on other jars, the "lower level"
-dependency jars should be loaded first.
-
-If a "./lib" directory exists in your instanceDir, all files
-found in it are included as if you had used the following
-syntax...
-
-<lib dir="./lib" />
--->
-
-<!-- A 'dir' option by itself adds any files found in the directory
-to the classpath, this is useful for including all jars in a
-directory.
-
-When a 'regex' is specified in addition to a 'dir', only the
-files in that directory which completely match the regex
-(anchored on both ends) will be included.
-
-If a 'dir' option (with or without a regex) is used and nothing
-is found that matches, a warning will be logged.
-
-The example below can be used to load a Solr Module along
-with their external dependencies.
--->
-<!-- <lib dir="${solr.install.dir:../../../..}/modules/ltr/lib" regex=".*\.jar" /> -->
-
-<!-- an exact 'path' can be used instead of a 'dir' to specify a
-specific jar file. This will cause a serious error to be logged
-if it can't be loaded.
--->
-<!--
-<lib path="../a-jar-that-does-not-exist.jar" />
--->
+<luceneMatchVersion>9.11</luceneMatchVersion>

 <!-- Data Directory

@@ -256,16 +211,9 @@
 is recommended (see below).
 "dir" - the target directory for transaction logs, defaults to the
 solr data directory.
-"numVersionBuckets" - sets the number of buckets used to keep
-track of max version values when checking for re-ordered
-updates; increase this value to reduce the cost of
-synchronizing access to version buckets during high-volume
-indexing, this requires 8 bytes (long) * numVersionBuckets
-of heap space per Solr core.
 -->
 <updateLog>
 <str name="dir">${solr.ulog.dir:}</str>
-<int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
 </updateLog>

 <!-- AutoCommit
@@ -360,6 +308,21 @@
 -->
 <maxBooleanClauses>${solr.max.booleanClauses:1024}</maxBooleanClauses>

+<!-- Minimum acceptable prefix-size for prefix-based queries.
+
+Prefix-based queries consume memory in proportion to the number of terms in the index
+that start with that prefix. Short prefixes tend to match many many more indexed-terms
+and consume more memory as a result, sometimes causing stability issues on the node.
+
+This setting allows administrators to require that prefixes meet or exceed a specified
+minimum length requirement. Prefix queries that don't meet this requirement return an
+error to users. The limit may be overridden on a per-query basis by specifying a
+'minPrefixQueryTermLength' local-param value.
+
+The flag value of '-1' can be used to disable enforcement of this limit.
+-->
+<minPrefixQueryTermLength>${solr.query.minPrefixLength:-1}</minPrefixQueryTermLength>
+
 <!-- Solr Internal Query Caches
 Starting with Solr 9.0 the default cache implementation used is CaffeineCache.
 -->
@@ -494,23 +457,6 @@
 -->
 <queryResultMaxDocsCached>200</queryResultMaxDocsCached>

-<!-- Use Filter For Sorted Query
-
-A possible optimization that attempts to use a filter to
-satisfy a search. If the requested sort does not include
-score, then the filterCache will be checked for a filter
-matching the query. If found, the filter will be used as the
-source of document ids, and then the sort will be applied to
-that.
-
-For most situations, this will not be useful unless you
-frequently get the same search repeatedly with different sort
-options, and none of them ever use "score"
--->
-<!--
-<useFilterForSortedQuery>true</useFilterForSortedQuery>
--->
-
 <!-- Query Related Event Listeners

 Various IndexSearcher related events can trigger Listeners to
@@ -1015,6 +961,10 @@
 <str name="pattern">[^\w-\.]</str>
 <str name="replacement">_</str>
 </updateProcessor>
+<updateProcessor class="solr.NumFieldLimitingUpdateRequestProcessorFactory" name="max-fields">
+<int name="maxFields">1000</int>
+<bool name="warnOnly">true</bool>
+</updateProcessor>
 <updateProcessor class="solr.ParseBooleanFieldUpdateProcessorFactory" name="parse-boolean"/>
 <updateProcessor class="solr.ParseLongFieldUpdateProcessorFactory" name="parse-long"/>
 <updateProcessor class="solr.ParseDoubleFieldUpdateProcessorFactory" name="parse-double"/>
@@ -1061,7 +1011,7 @@

 <!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
 <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:false}"
-processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
+processor="uuid,remove-blank,field-name-mutating,max-fields,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
 <processor class="solr.LogUpdateProcessorFactory"/>
 <processor class="solr.DistributedUpdateProcessorFactory"/>
 <processor class="solr.RunUpdateProcessorFactory"/>
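
The `minPrefixQueryTermLength` comment added in this file describes a simple acceptance rule: a prefix must meet or exceed the configured minimum, a per-query local-param may override that minimum, and `-1` disables the check. The sketch below restates that rule in Python as an illustration only. It is not Solr's implementation, and the function and parameter names are hypothetical.

```python
from typing import Optional

DISABLED = -1  # flag value named in the solrconfig.xml comment


def prefix_allowed(
    prefix: str,
    configured_min: int = DISABLED,
    local_param_min: Optional[int] = None,
) -> bool:
    """Return True if a prefix query term would pass the length check.

    configured_min models <minPrefixQueryTermLength> in solrconfig.xml.
    local_param_min models the per-query 'minPrefixQueryTermLength'
    local-param, which takes precedence when present.
    """
    effective_min = configured_min if local_param_min is None else local_param_min
    if effective_min == DISABLED:
        return True
    # "Meet or exceed": a prefix exactly as long as the minimum passes.
    return len(prefix) >= effective_min


if __name__ == "__main__":
    print(prefix_allowed("da", configured_min=3))
    print(prefix_allowed("dat", configured_min=3))
    print(prefix_allowed("da", configured_min=3, local_param_min=2))
    print(prefix_allowed("d"))
```

With the value shipped in this commit, `${solr.query.minPrefixLength:-1}`, the check stays disabled unless the `solr.query.minPrefixLength` property is set, which corresponds to the final call above.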
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+Solr 9.8.0 is now the version recommended in our installation guides and used with automated testing. Other libraries Dataverse uses have been updated as well.
+
+For the upgrade instructions section:
+
+[note that 6.6 may contain other solr-related changes, so the instructions may need to contain information merged from multiple release notes!]
+
+If you are upgrading Solr:
+- Install solr-9.8.0 following the instructions from the Installation guide.
+- Run a full reindex to populate the search catalog.
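
The install step above refers to the commands in the `prerequisites.rst` hunk of this commit, where the only thing that changes between releases is the version string. The sketch below collects those commands and fills in the version. The function name and its parameters are hypothetical, the commands are only built as strings and not executed, and neither the `jetty.xml` edit nor the reindex step is covered.

```python
from typing import List


def solr_install_commands(version: str, base: str = "/usr/local/solr") -> List[str]:
    """Build the shell commands from the installation guide for one Solr version."""
    home = f"{base}/solr-{version}"
    conf = f"{home}/server/solr/collection1/conf"
    return [
        f"cd {base}",
        f"wget https://archive.apache.org/dist/solr/solr/{version}/solr-{version}.tgz",
        f"tar xvzf solr-{version}.tgz",
        f"cd solr-{version}",
        "cp -r server/solr/configsets/_default server/solr/collection1",
        # The Dataverse-provided config files come from the unzipped dvinstall.zip.
        f"cp /tmp/dvinstall/schema*.xml {conf}",
        f"cp /tmp/dvinstall/solrconfig.xml {conf}",
        f'echo "name=collection1" > {home}/server/solr/collection1/core.properties',
    ]


if __name__ == "__main__":
    for command in solr_install_commands("9.8.0"):
        print(command)
```

Keeping the version in a single argument makes it easier to see every path that has to change on the next upgrade, which is what most of the one-line documentation edits in this commit amount to.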

doc/release-notes/11095-fix-extcvoc-indexing.md

Lines changed: 1 addition & 1 deletion
@@ -3,5 +3,5 @@ in indexing failure for the dataset (e.g. when the script tried to index both th
 Dataverse has been updated to correctly indicate the need for a multi-valued Solr field in these cases in the call to /api/admin/index/solr/schema.
 Configuring the Solr schema and the update-fields.sh script as usually recommended when using custom metadata blocks will resolve the issue.

-The overall release notes should include a Solr update (which hopefully is required by an update to 9.7.0 anyway) and our standard instructions
+The overall release notes should include a Solr update (which hopefully is required by an update to 9.8.0 anyway) and our standard instructions
 should change to recommending use of the update-fields.sh script when using custom metadatablocks *and/or external vocabulary scripts*.
Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 This release fixes a bug that caused Dataverse to generate unnecessary solr documents for files when a file is added/deleted from a draft dataset. These documents could accumulate and potentially impact performance.

-Assuming the upgrade to solr 9.7.0 also occurs in this release, there's nothing else needed for this PR. (Starting with a new solr insures the solr db is empty and that a reindex is already required.)
+Assuming the upgrade to solr 9.8.0 also occurs in this release, there's nothing else needed for this PR. (Starting with a new solr insures the solr db is empty and that a reindex is already required.)


doc/sphinx-guides/source/_static/installation/files/etc/init.d/solr

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 # chkconfig: 35 92 08
 # description: Starts and stops Apache Solr

-SOLR_DIR="/usr/local/solr/solr-9.4.1"
+SOLR_DIR="/usr/local/solr/solr-9.8.0"
 SOLR_COMMAND="bin/solr"
 SOLR_ARGS="-m 1g"
 SOLR_USER=solr

doc/sphinx-guides/source/_static/installation/files/etc/systemd/solr.service

Lines changed: 3 additions & 3 deletions
@@ -5,9 +5,9 @@ After = syslog.target network.target remote-fs.target nss-lookup.target
 [Service]
 User = solr
 Type = forking
-WorkingDirectory = /usr/local/solr/solr-9.4.1
-ExecStart = /usr/local/solr/solr-9.4.1/bin/solr start -m 1g
-ExecStop = /usr/local/solr/solr-9.4.1/bin/solr stop
+WorkingDirectory = /usr/local/solr/solr-9.8.0
+ExecStart = /usr/local/solr/solr-9.8.0/bin/solr start -m 1g
+ExecStop = /usr/local/solr/solr-9.8.0/bin/solr stop
 LimitNOFILE=65000
 LimitNPROC=65000
 Restart=on-failure

doc/sphinx-guides/source/developers/classic-dev-env.rst

Lines changed: 1 addition & 1 deletion
@@ -136,7 +136,7 @@ On Linux, you should just install PostgreSQL using your favorite package manager
 Install Solr
 ^^^^^^^^^^^^

-`Solr <https://lucene.apache.org/solr/>`_ 9.4.1 is required.
+`Solr <https://lucene.apache.org/solr/>`_ 9.8.0 is required.

 Follow the instructions in the "Installing Solr" section of :doc:`/installation/prerequisites` in the main Installation guide.

doc/sphinx-guides/source/installation/prerequisites.rst

Lines changed: 8 additions & 8 deletions
@@ -163,7 +163,7 @@ The Dataverse software search index is powered by Solr.
 Supported Versions
 ==================

-The Dataverse software has been tested with Solr version 9.4.1. Future releases in the 9.x series are likely to be compatible. Please get in touch (:ref:`support`) if you are having trouble with a newer version.
+The Dataverse software has been tested with Solr version 9.8.0. Future releases in the 9.x series are likely to be compatible. Please get in touch (:ref:`support`) if you are having trouble with a newer version.

 Installing Solr
 ===============
@@ -178,19 +178,19 @@ Become the ``solr`` user and then download and configure Solr::

 su - solr
 cd /usr/local/solr
-wget https://archive.apache.org/dist/solr/solr/9.4.1/solr-9.4.1.tgz
-tar xvzf solr-9.4.1.tgz
-cd solr-9.4.1
+wget https://archive.apache.org/dist/solr/solr/9.8.0/solr-9.8.0.tgz
+tar xvzf solr-9.8.0.tgz
+cd solr-9.8.0
 cp -r server/solr/configsets/_default server/solr/collection1

 You should already have a "dvinstall.zip" file that you downloaded from https://github.com/IQSS/dataverse/releases . Unzip it into ``/tmp``. Then copy the files into place::

-cp /tmp/dvinstall/schema*.xml /usr/local/solr/solr-9.4.1/server/solr/collection1/conf
-cp /tmp/dvinstall/solrconfig.xml /usr/local/solr/solr-9.4.1/server/solr/collection1/conf
+cp /tmp/dvinstall/schema*.xml /usr/local/solr/solr-9.8.0/server/solr/collection1/conf
+cp /tmp/dvinstall/solrconfig.xml /usr/local/solr/solr-9.8.0/server/solr/collection1/conf

 Note: The Dataverse Project team has customized Solr to boost results that come from certain indexed elements inside the Dataverse installation, for example prioritizing results from Dataverse collections over Datasets. If you would like to remove this, edit your ``solrconfig.xml`` and remove the ``<str name="qf">`` element and its contents. If you have ideas about how this boosting could be improved, feel free to contact us through our Google Group https://groups.google.com/forum/#!forum/dataverse-dev .

-A Dataverse installation requires a change to the ``jetty.xml`` file that ships with Solr. Edit ``/usr/local/solr/solr-9.4.1/server/etc/jetty.xml`` , increasing ``requestHeaderSize`` from ``8192`` to ``102400``
+A Dataverse installation requires a change to the ``jetty.xml`` file that ships with Solr. Edit ``/usr/local/solr/solr-9.8.0/server/etc/jetty.xml`` , increasing ``requestHeaderSize`` from ``8192`` to ``102400``

 Solr will warn about needing to increase the number of file descriptors and max processes in a production environment but will still run with defaults. We have increased these values to the recommended levels by adding ulimit -n 65000 to the init script, and the following to ``/etc/security/limits.conf``::

@@ -209,7 +209,7 @@ Solr launches asynchronously and attempts to use the ``lsof`` binary to watch fo

 Finally, you need to tell Solr to create the core "collection1" on startup::

-echo "name=collection1" > /usr/local/solr/solr-9.4.1/server/solr/collection1/core.properties
+echo "name=collection1" > /usr/local/solr/solr-9.8.0/server/solr/collection1/core.properties

 Dataverse collection ("dataverse") page uses Solr very heavily. On a busy instance this may cause the search engine to become the performance bottleneck, making these pages take increasingly longer to load, potentially affecting the overall performance of the application and/or causing Solr itself to crash. If this is observed on your instance, we recommend uncommenting the following lines in the ``<circuitBreaker ...>`` section of the ``solrconfig.xml`` file::
