Commit 83d0bf0

Add ADR for surfacing Primo CDI records in results
Why these changes are being introduced:

The proposed unified search interface would display records from Primo CDI (via Primo Search API) and Alma (via TIMDEX API) in the same results list. TIMDEX UI does not currently have a means to combine results from multiple APIs in this way.

Relevant ticket(s):

N/A

How this addresses that need:

This adds an ADR that outlines a proposed solution to this problem by introducing a search orchestration layer that will handle API calls and results normalization.

Side effects of this change:

There are additional decisions to be made around the architecture of the search orchestrator, such as how to manage relevance normalization. These decisions are noted in the ADR and will be explored in future ADRs.
1 parent 598e3da commit 83d0bf0

1 file changed, 135 additions and 0 deletions
@@ -0,0 +1,135 @@
# 3. Surface Primo CDI records in results

Date: 2025-08-07

## Status

Accepted

## Context

The Libraries' unified search strategy calls for a discovery interface that surfaces results from
both Primo Central Discovery Index (CDI) and Alma (via TIMDEX), replacing the current [Bento UI](https://github.com/MITLibraries/bento).
In Bento, Alma and CDI results are displayed in separate boxes. The unified interface would
interleave CDI and TIMDEX records in the same results list.

We considered adding a new Primo harvester to our ETL architecture to ingest CDI data into TIMDEX
API. However, this approach is not feasible for several reasons:

- **Cost**: CDI contains over 5 billion records. Harvesting and storing them would be impractical and expensive in terms of both financial and compute resources.
- **Performance**: Expanding TIMDEX API to such a scale is likely to dramatically reduce the efficiency of our OpenSearch index.
- **Data availability**: Because Primo does not expose CDI records via OAI-PMH, we would need to harvest through the Primo Search API, which would make the process needlessly complex and perhaps impossible.
- **Licensing**: Harvesting CDI records for TIMDEX likely has licensing implications. Ex Libris seems to discourage the practice: Primo does not provide OAI-PMH support, and the Search API has a hardcoded limit of 5,000 on the [`offset` parameter](https://developers.exlibrisgroup.com/primo/apis/docs/primoSearch/R0VUIC9wcmltby92MS9zZWFyY2g=/#output:~:text=Note%3A%20The%20Primo%20search%20API%20has%20a%20hardcoded%20offset%20limitation%20parameter%20of%205000.), which prevents paging deep into a result set.

## Decision

We will surface CDI results in TIMDEX UI by querying the Primo Search API directly at runtime and
interleaving results with TIMDEX API results in the unified search interface.

To achieve this, we will implement a search orchestrator that receives a query from TIMDEX UI and
dispatches it in parallel to TIMDEX API and Primo Search API. The orchestrator will normalize and
interleave the results before returning them to the UI.

This approach aligns with the unified search strategy's goal to display all known results from
CDI and TIMDEX in the same interface. It also enables us to add the desired intelligent user
guidance, because we can render search interventions from TACOS and other external systems as
needed.

### Proposed architecture

```mermaid
sequenceDiagram

  participant UI as TIMDEX UI (frontend)
  participant Orchestrator as Search Orchestrator (middleware)
  participant TIMDEX as TIMDEX API (OpenSearch)
  participant Primo as Primo Search API (CDI)
  participant TACOS as TACOS (query enhancer)

  UI-->>Orchestrator: User submits search query
  UI-->>TACOS: Send query to TACOS
  TACOS-->>UI: Return patterns identified in query (e.g., suggested resources, citations, journal titles)
  Orchestrator-->>TIMDEX: Send query to TIMDEX API
  Orchestrator-->>Primo: Send query to Primo CDI API
  TIMDEX-->>Orchestrator: Return TIMDEX results
  Primo-->>Orchestrator: Return CDI results
  Orchestrator->>Orchestrator: Normalize & interleave results
  Orchestrator-->>UI: Return unified result set
  UI->>UI: Render interventions based on TACOS response
  UI->>UI: Render results in a single list
```

Search form submissions will be sent in parallel to the search orchestrator and TACOS (possibly
using Turbo frames, but implementation details are TBD). This will allow us to continue rendering
TACOS interventions rapidly, likely before results are returned to the UI.

The orchestrator will make asynchronous calls to the TIMDEX and Primo Search APIs. Records in each
response will be normalized and interleaved into a unified set of results, then returned to
TIMDEX UI. In addition to record metadata, relevance scores must also be normalized, because the
two sources compute them differently. (See 'Relevance normalization' below for more details.)
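The parallel dispatch described above could be sketched with `asyncio`. This is a minimal sketch, not the planned implementation: `fetch_timdex` and `fetch_primo` are hypothetical stubs standing in for real HTTP calls, and both the timeout value and the choice to treat a failed or slow source as an empty list are assumptions rather than decisions made in this ADR.

```python
import asyncio
from collections.abc import Awaitable


async def fetch_timdex(query: str) -> list[dict]:
    """Hypothetical stand-in for a TIMDEX API request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return [{"source": "timdex", "title": f"TIMDEX result for {query}"}]


async def fetch_primo(query: str) -> list[dict]:
    """Hypothetical stand-in for a Primo Search API request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return [{"source": "primo", "title": f"CDI result for {query}"}]


async def search(query: str, timeout: float = 5.0) -> dict[str, list[dict]]:
    """Send the query to both sources concurrently and collect the responses."""

    async def guarded(call: Awaitable[list[dict]]) -> list[dict]:
        # Assumption: a source that errors or exceeds the timeout contributes
        # no results, so one slow API does not block the other.
        try:
            return await asyncio.wait_for(call, timeout)
        except Exception:
            return []

    timdex, primo = await asyncio.gather(
        guarded(fetch_timdex(query)),
        guarded(fetch_primo(query)),
    )
    return {"timdex": timdex, "primo": primo}


if __name__ == "__main__":
    print(asyncio.run(search("solar energy")))
```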

This architecture moves most of the added complexity into the search orchestrator. The UI
will be responsible only for sending queries to external systems and rendering the returned data.
This separation will improve our discovery environment's maintainability by keeping any single
codebase from becoming excessively complex.
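As an illustration of what the orchestrator might hand back to the UI, the sketch below maps records from both sources onto one shape. The `UnifiedRecord` fields and the input dictionary keys (`id`, `title`, `record_id`, `titles`) are hypothetical; they are not the actual TIMDEX or Primo response schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnifiedRecord:
    """One result in the shape the UI would render, whatever its origin."""

    source: str      # "timdex" or "primo"
    identifier: str
    title: str
    rank: int        # 1-based position in the source's own response


def normalize_timdex(records: list[dict]) -> list[UnifiedRecord]:
    # Assumed input shape: {"id": "...", "title": "..."}
    return [
        UnifiedRecord("timdex", r["id"], r["title"], rank)
        for rank, r in enumerate(records, start=1)
    ]


def normalize_primo(docs: list[dict]) -> list[UnifiedRecord]:
    # Assumed input shape: {"record_id": "...", "titles": ["...", ...]}
    # Falls back to a placeholder when a record has no title.
    return [
        UnifiedRecord("primo", d["record_id"], (d["titles"] or ["[untitled]"])[0], rank)
        for rank, d in enumerate(docs, start=1)
    ]
```

Because every record carries its source and its original rank, the UI can render a single list while the orchestrator retains what it needs to interleave results.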

### Relevance normalization

The interleaving of results from TIMDEX and CDI introduces the problem of relevance normalization.
While it is beyond the scope of this ADR to identify a solution to this problem, it is something we
should consider as an important future step.

Primo uses an opaque, proprietary relevance algorithm. While the algorithm is
[somewhat customizable](https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/020Primo_VE/Primo_VE_(English)/040Search_Configurations/Configuring_the_Ranking_of_Search_Results_in_Primo_VE),
we cannot assume any correlation between Primo scores and the Okapi BM25 scores that TIMDEX returns
from OpenSearch.

Premature optimization is a risk here. If we normalize scores without understanding what results
are actually useful, we might miss an opportunity to improve the search experience. Therefore, we
should avoid implementing relevance normalization until we have useful analytics. These might
include:

- Score distribution from each source
- User interaction data (e.g., do users click on CDI records more than TIMDEX records?)
- Usability testing data

We could begin by implementing rank-based interleaving (i.e., the first two results in the unified
list would be the top result from each source, the next two would be the second result from each,
and so on). While naive, such an algorithm would provide a baseline against which to measure future
normalization attempts.
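Rank-based interleaving amounts to a round-robin merge. A minimal sketch, assuming each source's results arrive as a list already ordered by that source's own relevance; the function name and sample values are illustrative only:

```python
from itertools import chain, zip_longest

_MISSING = object()  # sentinel marking positions where a shorter source ran out


def interleave_by_rank(*sources: list) -> list:
    """Take rank 1 from each source, then rank 2 from each, and so on."""
    rounds = zip_longest(*sources, fillvalue=_MISSING)
    return [item for item in chain.from_iterable(rounds) if item is not _MISSING]


timdex_results = ["timdex-1", "timdex-2", "timdex-3"]
cdi_results = ["cdi-1", "cdi-2"]
print(interleave_by_rank(timdex_results, cdi_results))
# ['timdex-1', 'cdi-1', 'timdex-2', 'cdi-2', 'timdex-3']
```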

Once we have more information, we could then evaluate different normalization strategies. Techniques
like [min-max](https://opensearch.org/blog/how-does-the-rank-normalization-work-in-hybrid-search/#:~:text=3.%20Min%2Dmax%20normalization%20technique)
or [z-score](https://spotintelligence.com/2025/02/14/z-score-normalization/) would be relatively
easy to implement. However, to make scores semantically comparable, it seems likely that we
would need an ML-backed approach that could also help with reranking.
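For reference, both techniques fit in a few lines. The sketch below rescales the scores within a single source's response; the function names and the handling of empty or constant inputs are assumptions. Neither function makes scores from different sources semantically comparable, which is the limitation noted above.

```python
from statistics import mean, pstdev


def min_max(scores: list[float]) -> list[float]:
    """Rescale scores linearly so the lowest maps to 0.0 and the highest to 1.0."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: identical scores carry no ranking signal, so map them all to 1.0.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def z_score(scores: list[float]) -> list[float]:
    """Express each score as its distance from the mean in standard deviations."""
    if not scores:
        return []
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation of this response
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


print(min_max([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```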

To that end, **we should strongly consider writing the search orchestrator in Python**, due to
the greater availability of ML libraries. Alternatively, we could write the orchestrator in Rails
and add the normalization component as a Python microservice.

## Consequences

### Pros

- Avoids duplicating CDI data or violating licensing terms.
- Enables real-time access to CDI content via Primo Search API.
- Supports the unified search vision without overloading TIMDEX API.

### Cons

- Requires runtime integration with Primo Search API, which may introduce latency or complexity. (We can mitigate this by implementing a caching strategy similar to that in Bento.)
- Limits computational access to CDI records (no bulk access via TIMDEX).
- Mixed-source results may confuse end users.
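To make the caching mitigation concrete, here is a minimal in-memory sketch that reuses a response for a fixed time window. It does not describe Bento's actual caching; the `TTLCache` class, its method names, and the 10-minute window are hypothetical, and a real deployment would more likely use a shared cache store.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Keep each value for a fixed number of seconds, then fetch it again."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: dict[str, tuple[float, Any]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the remote call
        value = fetch()
        self._entries[key] = (now, value)
        return value


# Hypothetical usage: cache Primo responses per query string for 10 minutes.
primo_cache = TTLCache(ttl_seconds=600)
results = primo_cache.get_or_fetch("solar energy", lambda: ["cdi-1", "cdi-2"])
```

Repeated searches for the same query within the window would then avoid a second round trip to the Primo Search API, at the cost of serving slightly stale results.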

### Future Considerations

Usability testing and analytics will inform how we refine this feature. Depending on how users
interact with the single-stream UI, we may need visual clarification of each record's source API, or
separate tabs for TIMDEX and Primo records.

Relevance normalization is a critical issue. We can begin with rank-based interleaving, but we
should not assume this to be a long-term solution.

As previously mentioned, this solution does not provide computational access to CDI records via
TIMDEX. We should connect with the MIT research community to determine whether such access would
be useful. If there is a need, we could consider harvesting a subset of CDI records relevant to the
use case.
