
Commit 988a7bb

Merge branch 'develop' into 11391-displayOnCreate-with-template
2 parents: 4aad960 + 1d7ea40

25 files changed: +644, -531 lines

conf/solr/solrconfig.xml

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@
      have some sort of hard autoCommit to limit the log size.
   -->
   <autoCommit>
-    <maxTime>${solr.autoCommit.maxTime:30000}</maxTime>
+    <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
     <openSearcher>false</openSearcher>
   </autoCommit>

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+### Solr Indexing speed improved
+
+The performance of Solr indexing has been significantly improved, particularly for datasets with many files.
+
+A new dataverse.solr.min-files-to-use-proxy MicroProfile setting can be used to further improve performance and lower memory requirements for datasets with many files (e.g., 500+). It defaults to Integer.MAX_VALUE, which disables the new functionality.
Lines changed: 13 additions & 0 deletions

doc/sphinx-guides/source/api/native-api.rst

Lines changed: 1 addition & 1 deletion
@@ -1167,7 +1167,7 @@ To set or change the storage allocation quota for a collection:

 .. code-block::

-  curl -X PUT -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/$ID/storage/quota/$SIZE_IN_BYTES"
+  curl -X POST -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/$ID/storage/quota/$SIZE_IN_BYTES"

 This API is superuser-only.

doc/sphinx-guides/source/installation/config.rst

Lines changed: 11 additions & 0 deletions
@@ -2689,6 +2689,17 @@ when using it to configure your core name!

 Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_SOLR_PATH``.

+dataverse.solr.min-files-to-use-proxy
++++++++++++++++++++++++++++++++++++++
+
+Specifies when to use a smaller datafile proxy object for the purposes of dataset indexing. This can lower memory requirements
+and improve performance when reindexing large datasets (e.g. those with hundreds or thousands of files). (Creating the proxy may slightly slow indexing of datasets with only a few files.)
+
+This setting is the file-count threshold at which the datafile proxy is used. By default, it is set to Integer.MAX_VALUE, which disables use of the proxy.
+A recommended starting value is ~1000, but the optimal value may vary depending on the details of your installation.
+
+Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_SOLR_MIN_FILES_TO_USE_PROXY``.
+
 dataverse.solr.concurrency.max-async-indexes
 ++++++++++++++++++++++++++++++++++++++++++++
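
As an illustration only (not code from this commit), the sketch below shows how a setting like this can be read through the standard MicroProfile Config API, falling back to the disabling default. The class and method names are invented for the example, and the exact threshold comparison used by Dataverse's indexing code is assumed here.

    import org.eclipse.microprofile.config.ConfigProvider;

    public class SolrProxyThresholdExample {

        // Minimal sketch: read dataverse.solr.min-files-to-use-proxy via the
        // MicroProfile Config API, falling back to Integer.MAX_VALUE, i.e. the
        // proxy stays disabled unless the setting is configured.
        static int minFilesToUseProxy() {
            return ConfigProvider.getConfig()
                    .getOptionalValue("dataverse.solr.min-files-to-use-proxy", Integer.class)
                    .orElse(Integer.MAX_VALUE);
        }

        // Illustrative check: a dataset would take the lighter proxy path only
        // when its file count reaches the configured threshold.
        static boolean shouldUseProxy(int fileCount) {
            return fileCount >= minFilesToUseProxy();
        }
    }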

doc/sphinx-guides/source/style/foundations.rst

Lines changed: 2 additions & 0 deletions
@@ -353,3 +353,5 @@ Create both print and web version of the Dataverse collection logo by downloadin

 .. |image1| image:: ./img/dataverse-icon.jpg
    :class: img-responsive
+
+Here is another vector-based SVG file with three rings: :download:`Dataverse_3ring-brand_icon_EqualSpace.svg <../_static/Dataverse_3ring-brand_icon_EqualSpace.svg>`

doc/sphinx-guides/source/user/tabulardataingest/stata.rst

Lines changed: 33 additions & 0 deletions
@@ -5,3 +5,36 @@ Stata
    :local:

 Of all the third party statistical software providers, Stata does the best job at documenting the internal format of their files, by far. And at making that documentation freely and easily available to developers (yes, we are looking at you, SPSS). Because of that, Stata is the best supported format for tabular data ingest.
+
+Supported Format Versions
+=========================
+
+Of the **"New Stata dta"** formats (variations of the format in use since Stata 13), our ingest supports the following:
+
+=================== ================= =================
+Stata format name   Introduced in     Used by
+=================== ================= =================
+dta_117             Stata 13          Stata 13
+dta_118             Stata 14          Stata 14 - 19
+dta_119             Stata 15          Stata 15 - 19
+=================== ================= =================
+
+This means that, in theory, every dta file produced by Stata v.13 - 17 should be ingestible. (Please see below for more information on Stata 18 and 19.) In practice, we cannot *guarantee* that our code will in fact be able to parse any such file. There is always a possibility that we missed a certain way to compose the data that the ingest will fail to understand. So, if you encounter such an error, where Dataverse **tries but fails** to ingest a Stata file in one of these 3 formats, please open a GitHub issue and we will try to address it. Please note that this is a different scenario from when Dataverse skips even trying to ingest a file (with no ingest errors shown in the UI), as that will in most cases be the result of the file exceeding the size limit set by the Dataverse instance administrators, or of a client uploading the file with the wrong content type attached, so that Dataverse fails to recognize it as Stata.
+
+Please note that there was an issue in older versions of Dataverse where Stata 13-17 files were not ingested when deposited via direct upload to S3. The issue was accompanied by the confusing error message ``The file is not in a STATA format that we can read or support`` shown in the UI. Fortunately, a case like this can be addressed by running the reIngest API on the affected file.
+
+The following two formats were introduced in 2024 and are **not yet supported**:
+
+=================== ================ =================
+Stata format name   Introduced in    Used by
+=================== ================ =================
+dta_120             Stata 18         Stata 18 - 19
+dta_121             Stata 18         Stata 18 - 19
+=================== ================ =================
+
+Please note, however, that this does not mean that no files produced by Stata 18 or 19 are ingestible! In reality, in most cases these versions of Stata still save files in the ``dta_118`` (i.e., Stata 14) format, with the later formats only used when necessary: for example, when the number of variables in the datafile exceeds what ``dta_118`` can handle, or when the file contains the "alias variables" introduced in Stata 18. Case in point: in the year since the introduction of these two newest formats, it appears that not a single file in either of them has been uploaded to the production Dataverse instance at IQSS. We are planning to eventually add support for these formats, but it is not yet considered a priority. However, please feel free to open a GitHub issue if this is an important use case for you.
+
+**"Old Stata"**, a distinctly different format used by Stata versions prior to 13, is supported.
+However, this functionality is considered legacy code that we no longer actively maintain. If any problems or bugs are found in it, we cannot promise that the core development team will be able to prioritize looking into them. We will of course gladly accept a properly submitted pull request from the user community.
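
As a purely illustrative aside (derived only from the tables above, not from Dataverse's actual detection code), the mapping between the "New Stata dta" release numbers and their ingest support status can be summarized as follows:

    import java.util.Map;

    public class StataDtaSupportExample {

        // Release numbers and support status taken from the tables above.
        private static final Map<Integer, String> DTA_FORMATS = Map.of(
                117, "dta_117 (Stata 13) - supported",
                118, "dta_118 (Stata 14) - supported",
                119, "dta_119 (Stata 15) - supported",
                120, "dta_120 (Stata 18) - not yet supported",
                121, "dta_121 (Stata 18) - not yet supported");

        public static String describe(int release) {
            return DTA_FORMATS.getOrDefault(release, "unknown, or an Old Stata format");
        }

        public static void main(String[] args) {
            System.out.println(describe(118)); // dta_118 (Stata 14) - supported
            System.out.println(describe(120)); // dta_120 (Stata 18) - not yet supported
        }
    }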

doc/sphinx-guides/source/user/tabulardataingest/supportedformats.rst

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ Tabular Data ingest supports the following file formats:
 File format                      Versions supported
 ================================ ==================================
 SPSS (POR and SAV formats)       7 to 22
-STATA                            4 to 15
+STATA                            4 to 17 (see the Stata subsection)
 R                                up to 3
 Excel                            XLSX only (XLS is NOT supported)
 CSV (comma-separated values)     (limited support)

src/main/java/edu/harvard/iq/dataverse/DataFile.java

Lines changed: 22 additions & 0 deletions
@@ -13,6 +13,7 @@
 import edu.harvard.iq.dataverse.datasetutility.FileSizeChecker;
 import edu.harvard.iq.dataverse.ingest.IngestReport;
 import edu.harvard.iq.dataverse.ingest.IngestRequest;
+import edu.harvard.iq.dataverse.search.SolrIndexServiceBean;
 import edu.harvard.iq.dataverse.util.BundleUtil;
 import edu.harvard.iq.dataverse.util.FileUtil;
 import edu.harvard.iq.dataverse.util.ShapefileHandler;
@@ -23,6 +24,7 @@
 import java.util.Objects;
 import java.text.SimpleDateFormat;
 import java.util.Arrays;
+import java.util.Date;
 import java.util.HashMap;
 import java.util.Map;
 import java.util.Set;
@@ -50,6 +52,26 @@
     @NamedQuery(name="DataFile.findDataFileThatReplacedId",
                 query="SELECT s.id FROM DataFile s WHERE s.previousDataFileId=:identifier")
 })
+@NamedNativeQuery(
+        name = "DataFile.getDataFileInfoForPermissionIndexing",
+        query = "SELECT fm.label, df.id, dvo.publicationDate " +
+                "FROM filemetadata fm " +
+                "JOIN datafile df ON fm.datafile_id = df.id " +
+                "JOIN dvobject dvo ON df.id = dvo.id " +
+                "WHERE fm.datasetversion_id = ?",
+        resultSetMapping = "DataFileInfoMapping"
+)
+@SqlResultSetMapping(
+        name = "DataFileInfoMapping",
+        classes = @ConstructorResult(
+                targetClass = SolrIndexServiceBean.DataFileProxy.class,
+                columns = {
+                        @ColumnResult(name = "label", type = String.class),
+                        @ColumnResult(name = "id", type = Long.class),
+                        @ColumnResult(name = "publicationDate", type = Date.class)
+                }
+        )
+)
 @Entity
 @Table(indexes = {@Index(columnList="ingeststatus")
     , @Index(columnList="checksumvalue")
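
For context, here is a sketch of how this named native query might be invoked. The caller is not part of the diff above; the injected EntityManager, the helper class, and the assumption that the single positional parameter is a dataset version id are illustrative.

    import edu.harvard.iq.dataverse.search.SolrIndexServiceBean;
    import jakarta.persistence.EntityManager;
    import java.util.List;

    class DataFileProxyQueryExample {

        // Illustrative only: run the new named native query for one dataset version.
        // The @SqlResultSetMapping turns each row into a lightweight
        // SolrIndexServiceBean.DataFileProxy (label, id, publicationDate) instead of
        // loading full DataFile entities.
        @SuppressWarnings("unchecked")
        static List<SolrIndexServiceBean.DataFileProxy> fileProxiesFor(EntityManager em, Long datasetVersionId) {
            return em.createNamedQuery("DataFile.getDataFileInfoForPermissionIndexing")
                     .setParameter(1, datasetVersionId)
                     .getResultList();
        }
    }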

src/main/java/edu/harvard/iq/dataverse/Dataset.java

Lines changed: 20 additions & 0 deletions
@@ -20,17 +20,20 @@
 import java.util.Objects;
 import java.util.Set;
 import jakarta.persistence.CascadeType;
+import jakarta.persistence.ColumnResult;
 import jakarta.persistence.Entity;
 import jakarta.persistence.Index;
 import jakarta.persistence.JoinColumn;
 import jakarta.persistence.ManyToOne;
+import jakarta.persistence.NamedNativeQuery;
 import jakarta.persistence.NamedQueries;
 import jakarta.persistence.NamedQuery;
 import jakarta.persistence.NamedStoredProcedureQuery;
 import jakarta.persistence.OneToMany;
 import jakarta.persistence.OneToOne;
 import jakarta.persistence.OrderBy;
 import jakarta.persistence.ParameterMode;
+import jakarta.persistence.SqlResultSetMapping;
 import jakarta.persistence.StoredProcedureParameter;
 import jakarta.persistence.Table;
 import jakarta.persistence.Temporal;
@@ -71,6 +74,23 @@
     @NamedQuery(name = "Dataset.countAll",
                 query = "SELECT COUNT(ds) FROM Dataset ds")
 })
+@NamedNativeQuery(
+        name = "Dataset.findAllOrSubsetOrderByFilesOwned",
+        query = "SELECT DISTINCT CAST(o.id AS BIGINT) as id, COUNT(f.id) as numFiles " +
+                "FROM dvobject o " +
+                "LEFT JOIN dvobject f ON f.owner_id = o.id " +
+                "WHERE o.dtype = 'Dataset' " +
+                "AND (? = false OR o.indexTime IS NULL) " +
+                "GROUP BY o.id " +
+                "ORDER BY numfiles ASC, id",
+        resultSetMapping = "DatasetIdMapping"
+)
+@SqlResultSetMapping(
+        name = "DatasetIdMapping",
+        columns = {
+                @ColumnResult(name = "id", type = Long.class)
+        }
+)

 /*
     Below is the database stored procedure for getting a string dataset id.
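
Again as an illustration rather than code from this diff, the query above could be called along the following lines. The EntityManager handle and the method name are invented; the reading of the boolean parameter follows from the SQL shown, where passing true keeps only datasets whose indexTime is NULL.

    import jakarta.persistence.EntityManager;
    import java.util.List;

    class DatasetIndexingOrderExample {

        // Illustrative only: fetch dataset ids ordered by how many files they own
        // (fewest first). Passing true restricts the result to datasets that have
        // not been indexed yet (indexTime IS NULL), per the SQL above.
        @SuppressWarnings("unchecked")
        static List<Long> datasetIdsToIndex(EntityManager em, boolean onlyUnindexed) {
            return em.createNamedQuery("Dataset.findAllOrSubsetOrderByFilesOwned")
                     .setParameter(1, onlyUnindexed)
                     .getResultList();
        }
    }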
