Commit f223828

committed
Revisions based on initial feedback
1 parent 83d0bf0 commit f223828

1 file changed

+51
-40
lines changed

docs/architecture-decisions/0003-surface-primo-cdi-records-in-results.md

Lines changed: 51 additions & 40 deletions
@@ -13,22 +13,46 @@ both Primo Central Discovery Index (CDI) and Alma (via TIMDEX), replacing the cu
1313
In Bento, Alma and CDI results are displayed in separate boxes. The unified interface would
1414
interleave CDI and TIMDEX records in the same results list.
1515

16+
## Options considered
17+
18+
### Harvest Primo CDI data
19+
1620
We considered adding a new Primo harvester to our ETL architecture to ingest CDI data into TIMDEX
17-
API. However, this approach is not feasible for many reasons:
21+
API. This would allow us to normalize CDI records as we do with other TIMDEX sources. Querying a
22+
single API for Alma and CDI records would facilitate a single-stream view as desired in the unified
23+
UI. Interleaving would no longer be necessary, as all records would be stored in OpenSearch.
24+
25+
The harvester model would provide value beyond the scope of the TIMDEX UI redesign. By storing CDI records
26+
in TIMDEX API, we could facilitate computational access to a massive corpus of data.
27+
28+
Unfortunately, this approach is not feasible for many reasons:
1829

1930
- **Cost**: CDI contains over 5 billion records. Harvesting and storing these records would be impractical and expensive in terms of both financial and compute resources.
2031
- **Performance**: Expanding TIMDEX API at such a scale is likely to dramatically reduce the efficiency of our OpenSearch index.
2132
- **Data availability**: Because Primo does not expose CDI records in OAI-PMH, we would need to harvest using the Primo Search API, making the process needlessly complex and perhaps impossible.
2233
- **Licensing**: Harvesting CDI records for TIMDEX likely has licensing implications. Ex Libris seems to discourage the practice, as Primo does not provide OAI-PMH support, and the Search API caps records per request at 5,000 via the [`offset` parameter](https://developers.exlibrisgroup.com/primo/apis/docs/primoSearch/R0VUIC9wcmltby92MS9zZWFyY2g=/#output:~:text=Note%3A%20The%20Primo%20search%20API%20has%20a%20hardcoded%20offset%20limitation%20parameter%20of%205000.).
2334

24-
## Decision
35+
### Display separate result streams in tabbed views
36+
37+
This option would essentially be a different take on the Bento design. On the results page, a user
38+
could tab between Alma results (labeled 'Books', 'MIT Catalog', etc.) and CDI results ('Articles').
2539

26-
We will surface CDI results in TIMDEX UI by querying the Primo Search API directly at runtime and
27-
interleaving results with TIMDEX API results in the unified search interface.
40+
While arguably an improvement on Bento, this design does not deliver the combined Alma/CDI results
41+
view as envisioned in the unified UI.
2842

29-
To achieve this, we will implement a search orchestrator that receives a query from TIMDEX UI and
30-
dispatches it in parallel to TIMDEX API and Primo Search API. The orchestrator will normalize and
31-
interleave the results before returning them to the UI.
43+
### Implement external search orchestrator
44+
45+
In this approach, we would surface CDI records in TIMDEX UI by querying the Primo Search API
46+
directly at runtime and interleaving results with TIMDEX API results in the unified search
47+
interface.
48+
49+
To achieve this, we would implement a search orchestrator that receives a query from TIMDEX UI and
50+
dispatches it in parallel to TIMDEX API and Primo Search API. The orchestrator would normalize and
51+
interleave the results before returning them to the UI. This would allow us to display Alma and
52+
CDI results in the same results list, without the feasibility concerns inherent in ingesting CDI
53+
records into TIMDEX API.
54+
55+
## Decision
3256

3357
This approach aligns with the unified search strategy's goal to display all known results from
3458
CDI and TIMDEX in the same interface. It also enables us to add the desired intelligent user
@@ -38,35 +62,22 @@ needed.
3862
### Proposed architecture
3963

4064
```mermaid
41-
sequenceDiagram
42-
43-
participant UI as TIMDEX UI (frontend)
44-
participant Orchestrator as Search Orchestrator (middleware)
45-
participant TIMDEX as TIMDEX API (OpenSearch)
46-
participant Primo as Primo Search API (CDI)
47-
participant TACOS as TACOS (query enhancer)
48-
49-
UI-->>Orchestrator: User submits search query
50-
UI-->>TACOS: Send query to TACOS
51-
TACOS-->>UI: Return patterns identified in query (e.g., suggested resources, citations, journal titles)
52-
Orchestrator-->>TIMDEX: Send query to TIMDEX API
53-
Orchestrator-->>Primo: Send query to Primo CDI API
54-
TIMDEX-->>Orchestrator: Return TIMDEX results
55-
Primo-->>Orchestrator: Return CDI results
56-
Orchestrator->>Orchestrator: Normalize & interleave results
57-
Orchestrator-->>UI: Return unified result set
58-
UI->>UI: Render interventions based on TACOS response
59-
UI->>UI: Render results in a single list
65+
flowchart TD
66+
A[User] -->|Submit search query| B[TIMDEX UI]
67+
B -->|Send query| C[TACOS]
68+
B -->|Send query| D[TIMDEX Search Orchestrator]
69+
D -->|Send query| E[TIMDEX API]
70+
D -->|Send query| F[Primo Search API]
71+
E -->|Return results| D
72+
F -->|Return results| D
73+
D -->|Normalize & interleave results| B
74+
C -->|Return interventions| B
6075
```
6176

62-
Search form submissions will be sent in parallel to the search orchestrator and TACOS (possibly
63-
using Turbo frames, but implementation details are TBD). This will allow us to continue rendering
64-
TACOS interventions rapidly, likely before results are returned to the UI.
65-
66-
The orchestrator will make asynchronous calls to the TIMDEX and Primo Search APIs. Records in each
67-
response will be normalized and interleaved into a unified set of results, then returned back to
68-
TIMDEX UI. In addition to record metadata, relevance scores must also be normalized due to the
69-
disparate sources. (See 'Relevance normalization' below for more details.)
77+
The UI will dispatch the query in parallel to TACOS and the search orchestrator. TACOS responses are
78+
then rendered immediately. The orchestrator waits for both TIMDEX and CDI responses, normalizes and
79+
interleaves them, and returns a unified result set. This separation of concerns allows TACOS to
80+
operate independently while the orchestrator handles result merging.
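
As a rough illustration of the parallel dispatch described above, the following is a minimal sketch using Python's `asyncio`. The functions `search_timdex`, `search_primo`, and `orchestrate` are hypothetical stand-ins, not part of any existing codebase; a real orchestrator would make HTTP requests to the TIMDEX and Primo Search APIs and normalize the responses before merging them.

```python
import asyncio


async def search_timdex(query: str) -> list[dict]:
    # Hypothetical stand-in for a TIMDEX API request.
    await asyncio.sleep(0)
    return [{"source": "timdex", "title": f"{query} (catalog record)"}]


async def search_primo(query: str) -> list[dict]:
    # Hypothetical stand-in for a Primo Search API (CDI) request.
    await asyncio.sleep(0)
    return [{"source": "cdi", "title": f"{query} (article record)"}]


async def orchestrate(query: str) -> dict[str, list[dict]]:
    # Dispatch both searches concurrently and wait for both to finish.
    timdex_results, cdi_results = await asyncio.gather(
        search_timdex(query),
        search_primo(query),
    )
    # Normalization and interleaving would happen here before returning.
    return {"timdex": timdex_results, "cdi": cdi_results}


if __name__ == "__main__":
    print(asyncio.run(orchestrate("climate change")))
```

Because `asyncio.gather` waits for both calls, the total wait is bounded by the slower of the two sources rather than by their sum.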
7081

7182
This architecture abstracts out most of the added complexity to the search orchestrator. The UI
7283
will be responsible only for sending queries to external systems and rendering the returned data.
@@ -94,7 +105,7 @@ include:
94105

95106
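We could begin by implementing rank-based interleaving (i.e., the first two results in the unified
list would be the top-ranked result from each source, followed by the second-ranked result from each, and so on). While naive, such an algorithm would provide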
97-
an heuristic against which to measure future normalization attempts.
108+
a baseline heuristic against which to measure future normalization attempts.
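
To make the baseline concrete, here is a minimal sketch of rank-based interleaving in Python. The function name `interleave_by_rank` and the choice to list the TIMDEX record first at each rank are assumptions made for illustration; the ADR does not specify an ordering between sources or how to handle result lists of unequal length.

```python
from itertools import zip_longest

# Sentinel marking the end of the shorter result list.
_MISSING = object()


def interleave_by_rank(timdex_results: list, cdi_results: list) -> list:
    """Alternate records from each source by rank, ignoring relevance scores."""
    merged = []
    for timdex_record, cdi_record in zip_longest(
        timdex_results, cdi_results, fillvalue=_MISSING
    ):
        # Assumption: the TIMDEX record precedes the CDI record at each rank.
        if timdex_record is not _MISSING:
            merged.append(timdex_record)
        if cdi_record is not _MISSING:
            merged.append(cdi_record)
    return merged
```

When one source returns fewer records than the other, the remaining records from the longer list are appended in their original order.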
98109

99110
Once we have more information, we could then evaluate different normalization strategies. Techniques
100111
like [min-max](https://opensearch.org/blog/how-does-the-rank-normalization-work-in-hybrid-search/#:~:text=3.%20Min%2Dmax%20normalization%20technique)
@@ -117,7 +128,7 @@ tack on the normalization component as a Python microservice.
117128
### Cons
118129

119130
- Requires runtime integration with Primo Search API, which may introduce latency or complexity. (We can mitigate this by implementing a caching strategy similar to that in Bento.)
120-
- Limits computational access to CDI records (no bulk access via TIMDEX).
131+
- Limits computational access to CDI records (no bulk access via TIMDEX). While not a TIMDEX UI concern, this is worthy of consideration in the broader context of the TIMDEX ecosystem.
121132
- Mixed-source results may confuse end users.
122133

123134
### Future Considerations
@@ -129,7 +140,7 @@ separate tabs for TIMDEX and Primo records.
129140
Relevance normalization is a critical issue. We can begin with rank-based interleaving, but we
130141
should not assume this to be a long-term solution.
131142

132-
As previously mentioned, this solution does not provide computational access to CDI records via
133-
TIMDEX. We should connect with the MIT research community to determine whether such access would
134-
be useful. If there is a need, we could consider harvesting a subset of CDI records relevant to the
135-
use case.
143+
We should connect with the MIT research community to determine whether computational access to CDI
144+
would be useful. If there is a need, we could consider harvesting a subset of CDI records relevant
145+
to the use case. This would be a significant undertaking beyond the scope of the unified search
146+
interface, but it aligns with the Libraries' mission, vision, and goals.
