Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion doc/sphinx-guides/source/installation/config.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3093,6 +3093,8 @@ Please note that this setting is experimental.
The Schema.org metadata and OpenAIRE exports and the Schema.org metadata included in DatasetPages try to infer whether each entry in the various fields (e.g. Author, Contributor) is a Person or Organization. If you are sure that
users are following the guidance to add people in the recommended family name, given name order, with a comma, you can set this true to always assume entries without a comma are for Organizations. The default is false.

Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_PERSONORORG_ASSUMECOMMAINPERSONNAME``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, ASSUMECOMMAINPERSONNAME is a bit hard to read. I understand you're keeping it the same as the old setting: dataverse.personOrOrg.assumeCommaInPersonName.

What sort of options do we have? Can we add underscores like ASSUME_COMMA_IN_PERSON_NAME and still have backward compatibility? (I'm not sure if there's backward compatibility with the ASSUMECOMMAINPERSONNAME anyway.)

Also, can we please have a release note snippet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now changed the names of the two settings to kebab case, which is what they should be called according to the naming conventions (https://guides.dataverse.org/en/latest/developers/configuration.html#adding-a-jvm-setting).

This means they can now be set using DATAVERSE_PERSON_OR_ORG_ASSUME_COMMA_IN_PERSON_NAME and DATAVERSE_PERSON_OR_ORG_ORG_PHRASE_ARRAY. (Or asadmin create-jvm-options '-Ddataverse.person-or-org.assume-comma-in-person-name=true and asadmin create-jvm-options '-Ddataverse.person-or-org.org-phrase-array=Org,GmbH'). I've tested to confirm all of these options work.

For backwards compatibility, adding an alias so that the old name asadmin create-jvm-options '-Ddataverse.personOrOrg.assumeCommaInPersonName=true' can still be used was easy.

However, for the orgPhraseArray setting, the expected value format has changed as well. Previously, a list like ["Org","GmbH"] was expected, but now it's a list like Org,GmbH. Afaik it wouldn't be as easy to still support parsing the old value format. So I haven't added an alias for that option.

Is it OK to drop support for the old orgPhraseArray setting and require admins to migrate option name + value format? Considering it's documented as an experimental setting.

Also, I've just added a release note snippet that documents all this as well.


.. _dataverse.personOrOrg.orgPhraseArray:

dataverse.personOrOrg.orgPhraseArray
Expand All @@ -3102,8 +3104,9 @@ Please note that this setting is experimental.

The Schema.org metadata and OpenAIRE exports and the Schema.org metadata included in DatasetPages try to infer whether each entry in the various fields (e.g. Author, Contributor) is a Person or Organization.
If you have examples where an orgization name is being inferred to belong to a person, you can use this setting to force it to be recognized as an organization.
The value is expected to be a JsonArray of strings. Any name that contains one of the strings is assumed to be an organization. For example, "Project" is a word that is not otherwise associated with being an organization.
The value is expected to be a comma-separated list of strings. Any name that contains one of the strings is assumed to be an organization. For example, "Project" is a word that is not otherwise associated with being an organization.

Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_PERSONORORG_ORGPHRASEARRAY``.

.. _dataverse.api.signature-secret:

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -263,6 +263,11 @@ public enum JvmSettings {
//CSL CITATION SETTINGS
SCOPE_CSL(PREFIX, "csl"),
CSL_COMMON_STYLES(SCOPE_CSL, "common-styles"),

// PersonOrOrgUtil SETTINGS
SCOPE_PERSONORORG(PREFIX, "personOrOrg"),
ASSUME_COMMA_IN_PERSON_NAME(SCOPE_PERSONORORG, "assumeCommaInPersonName"),
ORG_PHRASE_ARRAY(SCOPE_PERSONORORG, "orgPhraseArray"),
;

private static final String SCOPE_SEPARATOR = ".";
Expand Down
39 changes: 3 additions & 36 deletions src/main/java/edu/harvard/iq/dataverse/util/PersonOrOrgUtil.java
Original file line number Diff line number Diff line change
@@ -1,15 +1,11 @@
package edu.harvard.iq.dataverse.util;

import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

import jakarta.json.JsonArray;
import edu.harvard.iq.dataverse.settings.JvmSettings;
import jakarta.json.JsonObject;
import jakarta.json.JsonObjectBuilder;
import jakarta.json.JsonString;

import edu.harvard.iq.dataverse.util.json.JsonUtil;
import edu.harvard.iq.dataverse.util.json.NullSafeJsonBuilder;

/**
Expand Down Expand Up @@ -38,14 +34,6 @@ public class PersonOrOrgUtil {

private static final Logger logger = Logger.getLogger(PersonOrOrgUtil.class.getCanonicalName());

static boolean assumeCommaInPersonName = false;
static List<String> orgPhrases;

static {
setAssumeCommaInPersonName(Boolean.parseBoolean(System.getProperty("dataverse.personOrOrg.assumeCommaInPersonName", "false")));
setOrgPhraseArray(System.getProperty("dataverse.personOrOrg.orgPhraseArray", null));
}

/**
* This method tries to determine if a name belongs to a person or an
* organization and, if it is a person, what the given and family names are. The
Expand Down Expand Up @@ -73,6 +61,7 @@ public static JsonObject getPersonOrOrganization(String name, boolean organizati

boolean isOrganization = !isPerson && Organizations.getInstance().isOrganization(name);
if (!isOrganization) {
String[] orgPhrases = JvmSettings.ORG_PHRASE_ARRAY.lookupOptional(String[].class).orElse(new String[]{});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that the shift from a static block to looking these up as needed is just to be able to use the @JvmSetting annotation in tests? Putting the code here means these lookups will be done ~1M times for a /reExportAll command for Harvard Dataverse. Could/should we just call updated versions of the static methods in the tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this PR today and we're still concerted about these ~1 million calls.

@landreev is on vacation for a week or so but we'd like him to take a look when he gets back.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should avoid making these lookups in normal operations, if at all possible. Sorry I missed this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vera - can you revert the change here that and go back to having a static initialization and setX methods for use in testing? I'll move this to In Progress until you're finished.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qqmyers done :)

for (String phrase : orgPhrases) {
if (name.contains(phrase)) {
isOrganization = true;
Expand All @@ -99,6 +88,7 @@ public static JsonObject getPersonOrOrganization(String name, boolean organizati
}

} else {
boolean assumeCommaInPersonName = JvmSettings.ASSUME_COMMA_IN_PERSON_NAME.lookupOptional(Boolean.class).orElse(false);
if (assumeCommaInPersonName && !isPerson) {
isOrganization = true;
} else {
Expand Down Expand Up @@ -135,27 +125,4 @@ public static JsonObject getPersonOrOrganization(String name, boolean organizati
return job.build();

}

// Public for testing
public static void setOrgPhraseArray(String phraseArray) {
orgPhrases = new ArrayList<String>();
if (!StringUtil.isEmpty(phraseArray)) {
try {
JsonArray phrases = JsonUtil.getJsonArray(phraseArray);
phrases.forEach(val -> {
JsonString strVal = (JsonString) val;
orgPhrases.add(strVal.getString());
});
} catch (Exception e) {
logger.warning("Could not parse Org phrase list");
}
}

}

// Public for testing
public static void setAssumeCommaInPersonName(boolean assume) {
assumeCommaInPersonName = assume;
}

}
Original file line number Diff line number Diff line change
@@ -1,13 +1,17 @@
package edu.harvard.iq.dataverse.util;

import edu.harvard.iq.dataverse.settings.JvmSettings;
import edu.harvard.iq.dataverse.util.json.JsonUtil;

import edu.harvard.iq.dataverse.util.testing.JvmSetting;
import edu.harvard.iq.dataverse.util.testing.LocalJvmSettings;
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

import jakarta.json.JsonObject;

@LocalJvmSettings
public class PersonOrOrgUtilTest {

public PersonOrOrgUtilTest() {
Expand All @@ -26,27 +30,39 @@ public void testOrganizationCOMPLEXName() {
verifyIsOrganization("The Ford Foundation");
verifyIsOrganization("United Nations Economic and Social Commission for Asia and the Pacific (UNESCAP)");
verifyIsOrganization("Michael J. Fox Foundation for Parkinson's Research");
// The next example is one known to be asserted to be a Person without an entry
// in the OrgWordArray
// So we test with it in the array and then when the array is empty to verify
// the array works, resetting the array works, and the problem still exists in
// The next examples are known to be asserted to be a Person without an entry in the OrgWordArray
// So we test when no array is set via JvmSetting to verify the problem still exists in
// the underlying algorithm
PersonOrOrgUtil.setOrgPhraseArray("[\"Portable\"]");
verifyIsOrganization("Portable Antiquities of the Netherlands");
PersonOrOrgUtil.setOrgPhraseArray(null);
JsonObject obj = PersonOrOrgUtil.getPersonOrOrganization("Portable Antiquities of the Netherlands", false, false);
assertTrue(obj.getBoolean("isPerson"));
JsonObject obj2 = PersonOrOrgUtil.getPersonOrOrganization("Max Mustermann GmbH", false, false);
assertTrue(obj2.getBoolean("isPerson"));
}

@Test
@JvmSetting(key = JvmSettings.ORG_PHRASE_ARRAY, value = "Portable,GmbH")
public void testOrganizationWithOrgPhraseArray() {
// The next examples are known to be asserted to be a Person without an entry in the OrgWordArray
// So we test with the array set via JvmSetting to verify the array works
verifyIsOrganization("Portable Antiquities of the Netherlands");
verifyIsOrganization("Max Mustermann GmbH");
}

@Test
public void testOrganizationAcademicName() {
verifyIsOrganization("John Smith Center");
verifyIsOrganization("John Smith Group");
// An example the base algorithm doesn't handle:
JsonObject obj = PersonOrOrgUtil.getPersonOrOrganization("John Smith Project", false, false);
assertTrue(obj.getBoolean("isPerson"));
}

verifyIsOrganization("John Smith Center");
verifyIsOrganization("John Smith Group");
//An example the base algorithm doesn't handle:
PersonOrOrgUtil.setAssumeCommaInPersonName(true);
verifyIsOrganization("John Smith Project");
PersonOrOrgUtil.setAssumeCommaInPersonName(false);
@Test
@JvmSetting(key = JvmSettings.ASSUME_COMMA_IN_PERSON_NAME, value = "true")
public void testOrganizationAcademicNameWithAssumeComma() {
verifyIsOrganization("John Smith Center");
verifyIsOrganization("John Smith Group");
verifyIsOrganization("John Smith Project");
}


Expand Down