| 1 | +# 3. Surface Primo CDI records in results |
| 2 | + |
| 3 | +Date: 2025-08-07 |
| 4 | + |
| 5 | +## Status |
| 6 | + |
| 7 | +Accepted |
| 8 | + |
| 9 | +## Context |
| 10 | + |
| 11 | +The Libraries' unified search strategy calls for a discovery interface that surfaces results from |
| 12 | +both Primo Central Discovery Index (CDI) and Alma (via TIMDEX), replacing the current [Bento UI](https://github.com/MITLibraries/bento). |
| 13 | +In Bento, Alma and CDI results are displayed in separate boxes. The unified interface would |
| 14 | +interleave CDI and TIMDEX records in the same results list. |
| 15 | + |
| 16 | +We considered adding a new Primo harvester to our ETL architecture to ingest CDI data into TIMDEX |
| 17 | +API. However, this approach is not feasible for several reasons: |
| 18 | + |
| 19 | +- **Cost**: CDI contains over 5 billion records. Harvesting and storing these records would be impractical and expensive in terms of both financial and compute resources. |
| 20 | +- **Performance**: Expanding TIMDEX API at such a scale is likely to dramatically reduce the efficiency of our OpenSearch index. |
| 21 | +- **Data availability**: Because Primo does not expose CDI records via OAI-PMH, we would need to harvest using the Primo Search API, making the process needlessly complex and perhaps impossible. |
| 22 | +- **Licensing**: Harvesting CDI records for TIMDEX likely has licensing implications. Ex Libris seems to discourage the practice: Primo does not provide OAI-PMH support, and the Search API has a hardcoded [`offset` limit of 5,000](https://developers.exlibrisgroup.com/primo/apis/docs/primoSearch/R0VUIC9wcmltby92MS9zZWFyY2g=/#output:~:text=Note%3A%20The%20Primo%20search%20API%20has%20a%20hardcoded%20offset%20limitation%20parameter%20of%205000.), so a query cannot page beyond that point in its result set. |
| 23 | + |
| 24 | +## Decision |
| 25 | + |
| 26 | +We will surface CDI results in TIMDEX UI by querying the Primo Search API directly at runtime and |
| 27 | +interleaving results with TIMDEX API results in the unified search interface. |
| 28 | + |
| 29 | +To achieve this, we will implement a search orchestrator that receives a query from TIMDEX UI and |
| 30 | +dispatches it in parallel to TIMDEX API and Primo Search API. The orchestrator will normalize and |
| 31 | +interleave the results before returning them to the UI. |
| 32 | + |
| 33 | +This approach aligns with the unified search strategy's goal to display all known results from |
| 34 | +CDI and TIMDEX in the same interface. It also enables us to add the desired intelligent user |
| 35 | +guidance, because we can render search interventions from TACOS and other external systems as |
| 36 | +needed. |
| 37 | + |
| 38 | +### Proposed architecture |
| 39 | + |
| 40 | +```mermaid |
| 41 | +sequenceDiagram |
| 42 | + |
| 43 | +participant UI as TIMDEX UI (frontend) |
| 44 | +participant Orchestrator as Search Orchestrator (middleware) |
| 45 | +participant TIMDEX as TIMDEX API (OpenSearch) |
| 46 | +participant Primo as Primo Search API (CDI) |
| 47 | +participant TACOS as TACOS (query enhancer) |
| 48 | + |
| 49 | +UI-->>Orchestrator: User submits search query |
| 50 | +UI-->>TACOS: Send query to TACOS |
| 51 | +TACOS-->>UI: Return patterns identified in query (e.g., suggested resources, citations, journal titles) |
| 52 | +Orchestrator-->>TIMDEX: Send query to TIMDEX API |
| 53 | +Orchestrator-->>Primo: Send query to Primo Search API |
| 54 | +TIMDEX-->>Orchestrator: Return TIMDEX results |
| 55 | +Primo-->>Orchestrator: Return CDI results |
| 56 | +Orchestrator->>Orchestrator: Normalize & interleave results |
| 57 | +Orchestrator-->>UI: Return unified result set |
| 58 | +UI->>UI: Render interventions based on TACOS response |
| 59 | +UI->>UI: Render results in a single list |
| 60 | +``` |
| 61 | + |
| 62 | +Search form submissions will be sent in parallel to the search orchestrator and TACOS (possibly |
| 63 | +using Turbo frames, but implementation details are TBD). This will allow us to continue rendering |
| 64 | +TACOS interventions rapidly, likely before results are returned to the UI. |
| 65 | + |
| 66 | +The orchestrator will make asynchronous calls to the TIMDEX and Primo Search APIs. Records in each |
| 67 | +response will be normalized and interleaved into a unified set of results, then returned to |
| 68 | +TIMDEX UI. In addition to record metadata, relevance scores must also be normalized due to the |
| 69 | +disparate sources. (See 'Relevance normalization' below for more details.) |
| 70 | + |
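The parallel dispatch described above could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: `fetch_timdex`, `fetch_primo`, and `dispatch` are hypothetical names, the fetchers are stubs that make no network calls, and falling back to an empty list when one source fails is an assumption rather than a decision recorded in this ADR.

```python
import asyncio


async def fetch_timdex(query: str) -> list[dict]:
    """Hypothetical stand-in for a TIMDEX API request (no network call)."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return [{"title": f"TIMDEX result for {query}", "source": "timdex"}]


async def fetch_primo(query: str) -> list[dict]:
    """Hypothetical stand-in for a Primo Search API request (no network call)."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return [{"title": f"CDI result for {query}", "source": "primo"}]


async def dispatch(query: str) -> dict[str, list[dict]]:
    """Send the query to both sources concurrently and collect the responses."""
    timdex, primo = await asyncio.gather(
        fetch_timdex(query),
        fetch_primo(query),
        return_exceptions=True,  # one failing source should not sink the other
    )
    # Assumption: a failed source degrades to an empty result list.
    return {
        "timdex": [] if isinstance(timdex, BaseException) else timdex,
        "primo": [] if isinstance(primo, BaseException) else primo,
    }
```

Calling `asyncio.run(dispatch("climate"))` would return a dictionary keyed by source, ready for the normalization and interleaving step.
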
| 71 | +This architecture moves most of the added complexity into the search orchestrator. The UI |
| 72 | +will be responsible only for sending queries to external systems and rendering the returned data. |
| 73 | +This abstraction will improve our discovery environment's maintainability by avoiding excessively |
| 74 | +complex codebases. |
| 75 | + |
| 76 | +### Relevance normalization |
| 77 | + |
| 78 | +The interleaving of results from TIMDEX and CDI introduces the problem of relevance normalization. |
| 79 | +While it is beyond the scope of this ADR to identify a solution to this problem, it is something we |
| 80 | +should consider as an important future step. |
| 81 | + |
| 82 | +Primo uses an opaque, proprietary relevance algorithm. While the algorithm is |
| 83 | +[somewhat customizable](https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/020Primo_VE/Primo_VE_(English)/040Search_Configurations/Configuring_the_Ranking_of_Search_Results_in_Primo_VE), |
| 84 | +we cannot assume any correlation between Primo scores and the Okapi BM25 scores returned by OpenSearch for TIMDEX records. |
| 85 | + |
| 86 | +Premature optimization is a risk here. If we normalize scores without understanding what results |
| 87 | +are actually useful, we might miss an opportunity to improve the search experience. Therefore, we |
| 88 | +should avoid implementing relevance normalization until we have useful analytics. These might |
| 89 | +include: |
| 90 | + |
| 91 | +- Score distribution from each source |
| 92 | +- User interaction data (e.g., do users click on CDI records more than TIMDEX records?) |
| 93 | +- Usability testing data |
| 94 | + |
| 95 | +We could begin by implementing rank-based interleaving (i.e., the first two results in the unified |
| 96 | +list would be the top result from each source). While naive, such an algorithm would provide |
| 97 | +a baseline against which to measure future normalization attempts. |
| 98 | + |
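As a rough illustration of rank-based interleaving, the sketch below alternates between the two ranked lists and appends whatever remains once the shorter list is exhausted. The function name `interleave_by_rank` and the choice to place the TIMDEX record first in each pair are assumptions made for the example only.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking the end of the shorter list


def interleave_by_rank(timdex: list, primo: list) -> list:
    """Alternate records by rank: TIMDEX #1, CDI #1, TIMDEX #2, CDI #2, ..."""
    merged = []
    for timdex_record, primo_record in zip_longest(timdex, primo, fillvalue=_MISSING):
        if timdex_record is not _MISSING:
            merged.append(timdex_record)
        if primo_record is not _MISSING:
            merged.append(primo_record)
    return merged
```

For example, `interleave_by_rank(["t1", "t2", "t3"], ["p1"])` yields `["t1", "p1", "t2", "t3"]`. No scores are consulted, which is what makes this a neutral starting point for later comparison.
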
| 99 | +Once we have more information, we could then evaluate different normalization strategies. Techniques |
| 100 | +like [min-max](https://opensearch.org/blog/how-does-the-rank-normalization-work-in-hybrid-search/#:~:text=3.%20Min%2Dmax%20normalization%20technique) |
| 101 | +or [z-score](https://spotintelligence.com/2025/02/14/z-score-normalization/) would be relatively |
| 102 | +easy to implement. However, in order to make scores semantically comparable, it seems likely that we |
| 103 | +would need an ML-backed approach that could also help with reranking. |
| 104 | + |
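For reference, min-max and z-score normalization can each be sketched in a few lines. This assumes each source exposes a numeric relevance score per record, which this ADR has not confirmed for Primo; the function names and the handling of degenerate inputs (identical scores) are illustrative choices.

```python
from statistics import mean, pstdev


def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to the range [0, 1] within a single result set."""
    if not scores:
        return []
    low, high = min(scores), max(scores)
    if high == low:
        # Assumption: identical scores all map to 1.0.
        return [1.0] * len(scores)
    return [(score - low) / (high - low) for score in scores]


def z_score(scores: list[float]) -> list[float]:
    """Express each score as standard deviations from the result set's mean."""
    if not scores:
        return []
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # Assumption: identical scores all map to 0.0.
        return [0.0] * len(scores)
    return [(score - mu) / sigma for score in scores]
```

Both functions only rescale scores within one source's result set. Neither makes a Primo score semantically comparable to a BM25 score, which is the limitation noted above.
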
| 105 | +To that end, **we should strongly consider writing the search orchestrator in Python**, due to |
| 106 | +greater availability of ML libraries. Alternatively, we can write the orchestrator in Rails and |
| 107 | +tack on the normalization component as a Python microservice. |
| 108 | + |
| 109 | +## Consequences |
| 110 | + |
| 111 | +### Pros |
| 112 | + |
| 113 | +- Avoids duplicating CDI data or violating licensing terms. |
| 114 | +- Enables real-time access to CDI content via Primo Search API. |
| 115 | +- Supports the unified search vision without overloading TIMDEX API. |
| 116 | + |
| 117 | +### Cons |
| 118 | + |
| 119 | +- Requires runtime integration with Primo Search API, which may introduce latency or complexity. (We can mitigate this by implementing a caching strategy similar to that in Bento.) |
| 120 | +- Limits computational access to CDI records (no bulk access via TIMDEX). |
| 121 | +- Mixed-source results may confuse end users. |
| 122 | + |
| 123 | +### Future Considerations |
| 124 | + |
| 125 | +Usability testing and analytics will inform how we refine this feature. Depending on how users |
| 126 | +interact with the single-stream UI, we may need visual clarification of each record's source API, or |
| 127 | +separate tabs for TIMDEX and Primo records. |
| 128 | + |
| 129 | +Relevance normalization is a critical issue. We can begin with rank-based interleaving, but we |
| 130 | +should not assume this to be a long-term solution. |
| 131 | + |
| 132 | +As previously mentioned, this solution does not provide computational access to CDI records via |
| 133 | +TIMDEX. We should connect with the MIT research community to determine whether such access would |
| 134 | +be useful. If there is a need, we could consider harvesting a subset of CDI records relevant to the |
| 135 | +use case. |