@@ -0,0 +1,149 @@
# 3. Surface Primo CDI records in results

Date: 2025-08-07

## Status

Accepted

## Context

The Libraries' unified search strategy calls for a discovery interface that surfaces results from
both Ex Libris Central Discovery Index (CDI) and Alma (via TIMDEX), replacing the current
[Bento UI](https://github.com/MITLibraries/bento).
In Bento, Alma and CDI results are displayed in separate boxes. The unified interface would
interleave CDI and TIMDEX records in the same results list, providing affordances (likely tabs)
to display CDI or TIMDEX results separately.

## Options considered

### Harvest Primo CDI data

We considered adding a new Primo harvester to our ETL architecture to ingest CDI data into TIMDEX
API. This approach is not feasible for many reasons:

- **Cost**: CDI contains over 5 billion records. Harvesting and storing these records would be impractical and expensive in terms of both financial and compute resources.
- **Performance**: Expanding TIMDEX API at such a scale is likely to dramatically reduce the efficiency of our OpenSearch index.
- **Data availability**: Because Primo does not expose CDI records in OAI-PMH, we would need to harvest using the Primo Search API, making the process needlessly complex and perhaps impossible.
- **Licensing**: Harvesting CDI records for TIMDEX likely has licensing implications. Ex Libris seems to discourage the practice: Primo does not provide OAI-PMH support, and the Search API hardcodes a maximum [`offset` of 5,000](https://developers.exlibrisgroup.com/primo/apis/docs/primoSearch/R0VUIC9wcmltby92MS9zZWFyY2g=/#output:~:text=Note%3A%20The%20Primo%20search%20API%20has%20a%20hardcoded%20offset%20limitation%20parameter%20of%205000.), which limits how far any single query can be paged.

### Display separate result streams in tabbed views
Member

I believe there is a desire from the UX side to have tabs for "primo" and "timdex" even if we figure out how to merge all the results. We may want to update our decision to reflect having multiple tabs: "everything" (which is what I believe the Orchestrator approach would provide), "primo" (probably Alma + CDI), and "timdex" (everything not in the primo tab).


This option would essentially be a different take on the Bento design. On the results page, a user
could tab between Alma results (labeled 'Books', 'MIT Catalog', etc.) and CDI results ('Articles').

While arguably an improvement on Bento, this design does not deliver the combined Alma/CDI results
view as envisioned in the unified UI. A superior design would include an 'Everything' tab as the
default, with TIMDEX and CDI tabs for users who want to refine further.

### Implement external search orchestrator
Member

Not for this document, but we'll likely need a separate ADR for the orchestrator if we agree that is the best path forward as I suspect there are a few solid directions to go (Rails, lambda, lambda feeding OpenSearch, etc)

Contributor Author

Agreed. I talked a bit in Slack about deferred architectural decisions for the orchestrator, but I'm not sure it translated into the ADR.

Member

I hadn't caught up on Slack (still haven't fully!). We don't need to update this ADR to note there will be a different ADR with more details on whatever we choose... I was just saying it out loud :)

Contributor Author

Got you. It just felt useful to me to note it in the ADR, because I think at times we start developing before we're done documenting decisions.

Member

Ah... that's a very good point.


In this approach, we would surface CDI records in TIMDEX UI by querying the Primo Search API
directly at runtime and interleaving results with TIMDEX API results in the unified search
interface.

To achieve this, we would implement a search orchestrator that receives a query from TIMDEX UI and
dispatches it in parallel to TIMDEX API and Primo Search API. The orchestrator would normalize and
interleave the results before returning them to the UI. This would allow us to display Alma and
CDI results in the same results list, without the feasibility concerns inherent in ingesting CDI
records into TIMDEX API.
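
As a rough illustration of the parallel dispatch described above, here is a minimal Python sketch using `asyncio`. The two fetch functions are hypothetical stubs standing in for real TIMDEX API and Primo Search API clients, and the choice to return an empty list when one source fails is an assumption made for the example, not a decision recorded in this ADR.

```python
import asyncio


async def fetch_timdex(query: str) -> list[dict]:
    """Hypothetical stub standing in for a TIMDEX API request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return [{"source": "timdex", "title": f"TIMDEX hit {i} for {query}"} for i in (1, 2)]


async def fetch_primo(query: str) -> list[dict]:
    """Hypothetical stub standing in for a Primo Search API request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return [{"source": "primo", "title": f"CDI hit {i} for {query}"} for i in (1, 2)]


async def orchestrate(query: str) -> dict[str, list[dict]]:
    # Send the same query to both backends concurrently and wait for both.
    timdex, primo = await asyncio.gather(
        fetch_timdex(query),
        fetch_primo(query),
        return_exceptions=True,  # a failure comes back as a value instead of raising
    )
    # Assumption: if one source fails, fall back to an empty list for it.
    return {
        "timdex": [] if isinstance(timdex, BaseException) else timdex,
        "primo": [] if isinstance(primo, BaseException) else primo,
    }
```

Calling `asyncio.run(orchestrate("quantum computing"))` returns both result lists after roughly the latency of the slower backend, rather than the sum of the two.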

## Decision

We will implement an external search orchestrator that interleaves results from CDI and TIMDEX.
This combined results list will become the default display in TIMDEX UI. The UI will also provide
the option to display results from a single source.

This approach aligns with the unified search strategy's goal to display all known results from CDI
and TIMDEX in the same interface. It also enables us to add the desired intelligent user guidance,
because we can render search interventions from TACOS and other external systems as needed.

An overview of the proposed architecture is below, but an additional ADR will be needed to explore
the implementation details.

### Proposed architecture

```mermaid
flowchart TD
A[User] -->|Submit search query| B[TIMDEX UI]
B -->|Send query| C[TACOS]
B -->|Send query| D[TIMDEX Search Orchestrator]
D -->|Send query| E[TIMDEX API]
D -->|Send query| F[Primo Search API]
E -->|Return results| D
F -->|Return results| D
D -->|Normalize & interleave results| B
C -->|Return interventions| B
```

The UI will dispatch the query in parallel to TACOS and the search orchestrator. TACOS responses are
rendered as soon as they arrive. The orchestrator waits for both TIMDEX and CDI responses, normalizes and
interleaves them, and returns a unified result set. This separation of concerns allows TACOS to
operate independently while the orchestrator handles result merging.
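
One way to picture the normalization step is mapping each backend's records into a shared shape before interleaving. The sketch below is illustrative only: `UnifiedRecord` and the input keys (`id`, `record_id`, `title`) are placeholders invented for this example, not the actual TIMDEX or Primo response fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnifiedRecord:
    """Hypothetical common shape the UI would render, regardless of source."""
    source: str
    identifier: str
    title: str


def from_timdex(raw: dict) -> UnifiedRecord:
    # Placeholder keys; the real TIMDEX response fields would be mapped here.
    return UnifiedRecord(source="timdex", identifier=raw["id"], title=raw["title"])


def from_primo(raw: dict) -> UnifiedRecord:
    # Placeholder keys; the real Primo response fields would be mapped here.
    return UnifiedRecord(source="primo", identifier=raw["record_id"], title=raw["title"])
```

Keeping the `source` field on every record would also leave room for the UI to label where each result came from.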

This architecture abstracts out most of the added complexity to the search orchestrator. The UI
will be responsible only for sending queries to external systems and rendering the returned data.
This abstraction will improve our discovery environment's maintainability by avoiding excessively
complex codebases.

### Relevance normalization

The interleaving of results from TIMDEX and CDI introduces the problem of relevance normalization.
While it is beyond the scope of this ADR to identify a solution to this problem, it is something we
should consider as an important future step.

Primo uses an opaque, proprietary relevance algorithm. While the algorithm is
[somewhat customizable](https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/020Primo_VE/Primo_VE_(English)/040Search_Configurations/Configuring_the_Ranking_of_Search_Results_in_Primo_VE),
we cannot assume any correlation between Primo scores and the Okapi BM25 scores from TIMDEX's OpenSearch index.

Premature optimization is a risk here. If we normalize scores without understanding what results
are actually useful, we might miss an opportunity to improve the search experience. Therefore, we
should avoid implementing relevance normalization until we have useful analytics. These might
include:

- Score distribution from each source
- User interaction data (e.g., do users click on CDI records more than TIMDEX records?)
- Usability testing data

We could begin by implementing rank-based interleaving (i.e., the first two results in the unified
list would be the top result from each source). While naive, such an algorithm would provide
a baseline heuristic against which to measure future normalization attempts.
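
A minimal sketch of rank-based interleaving, assuming each source returns a list already ordered by its own relevance ranking. The function name and the decision to list TIMDEX before Primo at each rank are illustrative assumptions.

```python
from itertools import chain, zip_longest

_MISSING = object()  # sentinel so that legitimate None values are not dropped


def interleave_by_rank(*ranked_lists: list) -> list:
    """Take rank 1 from every source, then rank 2 from every source, and so on."""
    by_rank = zip_longest(*ranked_lists, fillvalue=_MISSING)
    return [item for item in chain.from_iterable(by_rank) if item is not _MISSING]
```

For example, `interleave_by_rank(["t1", "t2", "t3"], ["p1"])` yields `["t1", "p1", "t2", "t3"]`: once the shorter list is exhausted, the remaining results from the longer list keep their order.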

Once we have more information, we could then evaluate different normalization strategies. Techniques
like [min-max](https://opensearch.org/blog/how-does-the-rank-normalization-work-in-hybrid-search/#:~:text=3.%20Min%2Dmax%20normalization%20technique)
or [z-score](https://spotintelligence.com/2025/02/14/z-score-normalization/) would be relatively
easy to implement. However, in order to make scores semantically comparable, it seems likely that we
would need an ML-backed approach that could also help with reranking.
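
For reference, both techniques fit in a few lines of Python. This is a sketch under the assumption that each source exposes a numeric score per result; the handling of empty or constant score lists is a choice made for this example.

```python
from statistics import mean, pstdev


def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to the range [0, 1]."""
    if not scores:
        return []
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0] * len(scores)  # assumption: constant scores map to 0.0
    return [(s - low) / (high - low) for s in scores]


def z_score(scores: list[float]) -> list[float]:
    """Express each score as a number of standard deviations from the mean."""
    if not scores:
        return []
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)  # assumption: constant scores map to 0.0
    return [(s - mu) / sigma for s in scores]
```

Both functions only rescale scores within one source's result set. Neither makes a Primo score mean the same thing as a BM25 score, which is why they are unlikely to be sufficient on their own.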

To that end, **we should strongly consider writing the search orchestrator in Python**, due to
greater availability of ML libraries. Alternatively, we can write the orchestrator in Rails and
tack on the normalization component as a Python microservice.

## Consequences

### Pros

- Avoids duplicating CDI data or violating licensing terms.
- Enables real-time access to CDI content via Primo Search API.
- Supports the unified search vision without overloading TIMDEX API.

### Cons

- Requires runtime integration with Primo Search API, which will introduce latency and complexity. (We can mitigate this by implementing a caching strategy similar to that in Bento.)
- Limits computational access to CDI records (no bulk access via TIMDEX). While not a TIMDEX UI concern, this is worthy of consideration in the broader context of the TIMDEX ecosystem.
- Mixed-source results may confuse end users.
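
To illustrate the caching mitigation mentioned in the first con, here is a generic in-memory time-to-live cache. Bento's actual caching approach is not described in this ADR, so this is not a reproduction of it; the class name, its interface, and the injectable clock are assumptions for the sketch.

```python
import time
from typing import Any, Callable, Hashable, Optional


class TTLCache:
    """Hypothetical cache that forgets entries after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[Hashable, tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry
            return None
        return value

    def set(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
```

An orchestrator could check such a cache, keyed on the query string, before calling the Primo Search API, and store the response afterward. A production deployment would more likely use a shared cache store than per-process memory.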

### Future Considerations

Usability testing and analytics will inform how we refine this feature. Depending on how users
interact with the single-stream UI, we may need visual clarification of each record's source API, or
separate tabs for TIMDEX and Primo records.

Relevance normalization is a critical issue. We can begin with rank-based interleaving, but we
should not assume this to be a long-term solution.

We should connect with the MIT research community to determine their needs regarding computational
access to library data. While we cannot harvest CDI data for the aforementioned reasons, there may
be an alternative to CDI that could better support our users. Conducting this research would be a
significant undertaking beyond the scope of the unified search interface, but it aligns with the
Libraries' mission, vision, and goals.