Now more than ever, the international research community are keen to determine whether their findings replicate across different contexts. For instance, if a researcher discovers a potentially important association between two variables, they may wish to see whether this association is present in other populations (e.g. different countries, or different generations). In an ideal world, this would be achieved by conducting follow-up studies that are harmonised by design. In other words, the exact same methodologies and measures would be used in a new sample, in order to determine whether the findings can be replicated. Such direct replication is often challenging however, with research funders often preferring novel lines of inquiry.
As an alternative to direct replication, researchers may choose to reach out to others in the field who either have access to, or are in the process of collecting, comparable data. Indeed many researchers, particularly those in the life and social sciences, routinely make use of large, ongoing studies that collect a variety of data for multiple purposes (e.g. [longitudinal](/item-harmonisation/harmony-a-free-ai-tool-to-merge-longitudinal-studies) population studies). In practice however, much of our research is designed and carried out in silos – with different research groups tackling similar research questions using widely different designs and measures. Even if a researcher is successful in identifying data that are similar to their original work, minor differences in the design or measures may limit the comparability. What are researchers to do in such situations?
-
One increasingly popular option is retrospective harmonisation. This involves taking existing data from two or more disparate sources, and transforming the data in some way in order to make it directly comparable across sources. Let’s look at a simple, hypothetical example. Say a researcher wants to examine the relationship between level of [education](/data-harmonisation-in-education) and [depression](/harmonisation-validation/promis-depression-subscale), and whether this varies across two datasets, each from a different country. In dataset A, participants were asked to report their highest qualification out of a list of 10 options ranging from “no formal education” to “doctoral education”, whereas in dataset B there was a simple question that asked participants whether they completed a Bachelor’s degree (yes/no). The 10-option question in dataset A could be recoded to match the variable in dataset B, by collapsing all of the categories above and below Bachelor’s level. In many cases, retrospective harmonisation can be applied on an ad-hoc basis, using simple, logical recoding strategies such as this.
+
One increasingly popular option is retrospective [harmonisation](data-harmonisation). This involves taking existing data from two or more disparate sources, and transforming the data in some way in order to make it directly comparable across sources. Let’s look at a simple, hypothetical example. Say a researcher wants to examine the relationship between level of [education](/data-harmonisation-in-education) and [depression](/harmonisation-validation/promis-depression-subscale), and whether this varies across two datasets, each from a different country. In dataset A, participants were asked to report their highest qualification out of a list of 10 options ranging from “no formal education” to “doctoral education”, whereas in dataset B there was a simple question that asked participants whether they completed a Bachelor’s degree (yes/no). The 10-option question in dataset A could be recoded to match the variable in dataset B, by collapsing all of the categories above and below Bachelor’s level. In many cases, retrospective harmonisation can be applied on an ad-hoc basis, using simple, logical recoding strategies such as this.
However, not all constructs can be measured with such simple, categorical questions. Take the above outcome variable (depression) for instance. Depression is a complex, heterogeneous experience, characterized by a multitude of symptoms that can be experienced to various degrees and in different combinations. In large-scale surveys, depression is typically measured with standardized questionnaires – participants are asked to report on a range of symptoms, their responses are assigned numerical values, and these are summed to form a “total depression score” for each individual. Although this remains the most viable and plausible strategy for measuring something as complex as depression, there is no “gold standard” questionnaire that is universally adopted by researchers. Instead, there are well over 200 established depression scales. In a [recent review](https://www.closer.ac.uk/wp-content/uploads/210715-Harmonisation-measurement-properties-mental-health-measures-british-cohorts.pdf) (McElroy et al., 2020), we noted that the content of these questionnaires can differ markedly, e.g. different symptoms are assessed, or different response options are used.
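The recoding described in the education example above can be sketched in a few lines of Python. This is only an illustrative sketch: the numeric codes 1 to 10 and the position of the Bachelor's degree in the list (code 7) are assumptions, not details of either dataset.

```python
# Hypothetical coding of dataset A's 10-option question:
# 1 = "no formal education" ... 10 = "doctoral education".
# The position of the Bachelor's degree is assumed for illustration.
BACHELORS_CODE = 7


def completed_bachelors(highest_qualification):
    """Collapse the 10-option variable into dataset B's yes/no format."""
    if highest_qualification is None:
        return None  # keep missing responses missing
    if not 1 <= highest_qualification <= 10:
        raise ValueError(f"unexpected code: {highest_qualification}")
    # Everything at or above Bachelor's level becomes "yes" (True).
    return highest_qualification >= BACHELORS_CODE


dataset_a = [1, 6, 7, 10, None]
harmonised = [completed_bachelors(code) for code in dataset_a]
print(harmonised)  # [False, False, True, True, None]
```

Collapsing in this direction loses detail from dataset A, which is the usual cost of harmonising to the coarser of two measures.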
content/en/blog/data-harmonisation-examples-business.md
Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ title: "10 Data Harmonisation Examples That Move Businesses and Organisations Fo
description: "Unlock Data Harmonisation with Harmony: Transform Your Research & Analysis. Explore Harmony for seamless data harmonisation. Dive into our guide on using this tool to enhance research, attract collaborations, and drive insights."
date: 2024-02-27
categories: ["data"]
-
image: "/images/01- X Data harmonisationexamples that move businessess and organizations forward.svg"
content/en/blog/data-harmonisation.md
Lines changed: 1 addition & 1 deletion
@@ -130,7 +130,7 @@ Data harmonisation is not a theoretical concept but a practical necessity across
{{< youtube cEZppTBj1NI >}}
-
Tools like Harmony, designed specifically for the retrospective harmonisation of questionnaire items, are invaluable in research and data analysis. They allow for the comparison and combination of data from different studies or time periods, which is crucial in fields like social sciences, healthcare, and market research.
+
Tools like Harmony, designed specifically for the [retrospective harmonisation of questionnaire items](/data-harmonisation/back-to-the-future-retrospectively-harmonising-questionnaire-data/), are invaluable in research and data analysis. They allow for the comparison and combination of data from different studies or time periods, which is crucial in fields like social sciences, healthcare, and market research.
**Perspectives from EPAM and TIBCO**
Companies like EPAM and TIBCO highlight the strategic importance of data harmonisation. They emphasize how it can provide a competitive edge by ensuring data consistency across an organization, improving decision-making, and streamlining operations.
When you input two questionnaires into Harmony, such as the [GAD-7](https://en.wikipedia.org/wiki/Generalized_Anxiety_Disorder_7) and [Beck’s Anxiety Inventory](https://res.cloudinary.com/dpmykpsih/image/upload/great-plains-health-site-358/media/1087/anxiety.pdf), Harmony is able to match similar questions and assign a number to the match. (I have written another blog post on [how we measured Harmony’s performance in terms of AUC](https://harmonydata.ac.uk/measuring-the-performance-of-nlp-algorithms/)).
@@ -120,7 +124,7 @@ With an aim to make our research as accessible to the public as possible, we hav
The questions often come with a set of options such as *definitely not, somewhat anxious*, and the like. These are often a form of [Likert scale](https://en.wikipedia.org/wiki/Likert_scale). We would like to apply the same logic to match the candidate answers in a question, and identify when questions have opposite polarity (*I often feel anxious* vs *I rarely feel anxious*).
content/en/blog/how-far-can-we-go-with-harmony-testing-on-kufungisisa-a-cultural-concept-of-distress-from-zimbabwe.md
Many psychologists believe that mental illnesses can vary across cultures. In 1904, [Emil Kraepelin](https://en.wikipedia.org/wiki/Emil_Kraepelin) initiated the field of comparative psychiatry after studying mental health disorders in Java, writing that _“Die Eigenart eines Volkes wird auch in der Häufigkeit und klinischen Gestaltung seiner Geistesstörungen zum Ausdruck kommen,”_ meaning “The peculiarity of a people [ethnic group] will also be expressed in the frequency and clinical form of its mental disorders.”[1]
@@ -59,7 +59,7 @@ Although English is the best-resource language for [natural language processing]
Above: the text of the Shona symptom questionnaire for the detection of depression and anxiety.
-
A problem I encountered was that the [transformer model](/nlp-semantic-text-matching-with-deep-learning-transformer-models/) didn’t work for both Shona and English (it’s not multilingual, like Harmony’s default transformer model). I Google translated [GHQ-12](/ghq-12-vs-beck-anxiety-inventory) into Shona as a temporary measure.
+
A problem I encountered was that the [transformer model](/nlp-semantic-text-matching/) didn’t work for both Shona and English (it’s not multilingual, like Harmony’s default transformer model). I Google translated [GHQ-12](/ghq-12-vs-beck-anxiety-inventory) into Shona as a temporary measure.
Also, the transformer model did not operate as a sentence transformer, but rather as a token-level transformer, so my sentence vectors were made by averaging token vectors across an input.
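Averaging token vectors into a sentence vector (often called mean pooling) can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the experiment: the function name `mean_pool` and the 3-dimensional vectors are made up, and a real model would return one much longer vector per token.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vector) != dim for vector in token_vectors):
        raise ValueError("token vectors must share one dimension")
    count = len(token_vectors)
    # Average each dimension across all tokens in the input.
    return [sum(vector[i] for vector in token_vectors) / count for i in range(dim)]


# Made-up vectors standing in for a token-level model's outputs.
tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
sentence_vector = mean_pool(tokens)
print(sentence_vector)  # [2.0, 1.0, 1.0]
```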
@@ -74,7 +74,7 @@ Also, when we are using English and Portuguese texts, which has until now been o
## Further reading
-
You may want to read about my [experiments with semantic text matching with deep learning transformer models](/nlp-semantic-text-matching-with-deep-learning-transformer-models/).
+
You may want to read about my [experiments with semantic text matching with deep learning transformer models](/nlp-semantic-text-matching/).
@@ -14,7 +20,7 @@ This quote really highlights the importance of effective questionnaires. They're
This is where the challenge of non-harmonised data comes in, and it truly can be a problem when you have differently formatted surveys with different questions and scales. The questionnaires might not even be in the same language. Analysing the data straight-up is like trying to complete a jigsaw puzzle where the pieces are from different sets (it’s quite literally impossible). So, we need to get our questionnaires to work in harmony with each other.
-
In this guide, you'll discover 10 practical steps – and we've thrown in an extra one for good measure – to assist you in the harmonisation of questionnaire data. These data harmonisation steps are designed to make the process smoother, so that your collected data is not just abundant but also rich in insights and meaning.
+
In this guide, you'll discover 10 practical steps – and we've thrown in an extra one for good measure – to assist you in the harmonisation of questionnaire data. These [data harmonisation](/data-harmonisation/) steps are designed to make the process smoother, so that your collected data is not just abundant but also rich in insights and meaning.
_Harmony was able to reconstruct the matches of the questionnaire harmonisation tool developed by McElroy et al in 2020 with the following AUC scores: childhood **84%**, adulthood **80%**. Harmony was able to match the questions of the English and Portuguese [GAD-7](https://adaa.org/sites/default/files/GAD-7_Anxiety-updated_0.pdf) instruments with AUC **100%** and the Portuguese [CBCL](https://www.apa.org/depression-guideline/child-behavior-checklist.pdf) and [SDQ](/ces-d-vs-sdq) with AUC **89%**. Harmony was also evaluated using a variety of transformer models including MentalBERT, a publicly available pretrained language model for the mental [healthcare](https://fastdatascience.com/the-use-of-ai-in-healthcare) domain._
Semantic text matching is a task in [natural language processing](https://naturallanguageprocessing.com/) involving estimating the semantic [similarity](https://fastdatascience.com/finding-similar-documents-nlp/) between two texts. For example, if we had to quantify the similarity between “I feel nervous” and “I feel anxious”, most people would agree that these are closer together than either sentence is to “I feel happy”. A semantic text matching algorithm would be able to place a number on the similarity, such as 79%.
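One common way to place a number on the similarity of two texts is to represent each as a vector and take the cosine of the angle between them. The sketch below assumes this approach purely for illustration; the three vectors are invented stand-ins for model embeddings, and nothing here is taken from Harmony's implementation.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Invented 3-dimensional "embeddings"; real ones have hundreds of dimensions.
nervous = [0.9, 0.1, 0.3]  # "I feel nervous"
anxious = [0.8, 0.2, 0.4]  # "I feel anxious"
happy = [0.1, 0.9, 0.2]    # "I feel happy"

sim_close = cosine_similarity(nervous, anxious)
sim_far = cosine_similarity(nervous, happy)
print(sim_close > sim_far)  # True
```

With these toy vectors the "nervous" and "anxious" sentences score higher than "nervous" and "happy", mirroring the intuition in the paragraph above.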
@@ -35,4 +35,4 @@ Transformer models have proven to be very effective in semantic text matching. T
## See also
-
*[Harmony on "kufungisisa": a cultural concept of distress from Zimbabwe](/nlp-semantic-text-matching-with-deep-learning-transformer-models/harmony-on-kufungisisa-a-cultural-concept-of-distress-from-zimbabwe/)
+
*[Harmony on "kufungisisa": a cultural concept of distress from Zimbabwe](/nlp-semantic-text-matching/harmony-on-kufungisisa-a-cultural-concept-of-distress-from-zimbabwe/)