
MarcAggArchitecture

Chris Delis edited this page Oct 15, 2015 · 1 revision

This document goes into great detail about the implementation of the MARC Aggregation Service (abbreviated MAS below). See the intro page for a basic overview of the MARC Aggregation Service.


Current status of this document

  • Under review by Dave

TODOs

  • address holdings embedded in bibs (could be multiple)

Rules

* ### Transitive Relation ###
  * if A matches B and B matches C, then A and C are considered matches regardless of whether A and C match on their own.  (see [wikipedia](http://en.wikipedia.org/wiki/Transitive_relation)).
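The transitive rule means that matched records form equivalence groups. A minimal sketch of the idea (an illustration only, not the MAS implementation) using a union-find structure:

```python
class UnionFind:
    """Groups record ids so that matches become transitive."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def same_group(self, a, b):
        return self.find(a) == self.find(b)

uf = UnionFind()
uf.union("A", "B")  # A matches B
uf.union("B", "C")  # B matches C
print(uf.same_group("A", "C"))  # True: A and C match transitively
```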

Processing Steps

* if [in\_process\_record](#in_process_record) is a [bib](GeneralGlossary#bibliographic_record_(bib).md), then:
  * if [in\_process\_record](#in_process_record) is marked as [deleted](http://www.openarchives.org/OAI/openarchivesprotocol.html#DeletedRecords) (according to [oai-pmh](GeneralGlossary#oai-pmh)):
    * if the in\_process\_record's oai-id does not match any of the identifiers in the [processed\_records](#processed_records) set (this is determined by checking the [pred2succ map](#pred2succ_map)), then:
      * discard the in\_process\_record and continue on to the next.
    * otherwise we can infer that an earlier version of this [in\_process\_record](#in_process_record) had previously been processed by the [MAS](#MAS)
      * _note: an [oai-pmh\_harvester](GeneralGlossary#oai-pmh_harvester) will not be informed of this record's change in status until the incoming record is marked as deleted.  It does mean that this record will not be included in any subsequent oai-pmh\_harvests._
      * if the in\_process\_record has not been merged with any other [processed\_records](#processed_records) (determined by checking [merged\_records\_map](#merged_records_map)), then:
        * mark the [existing\_output\_record](#existing_output_record) as deleted
      * otherwise the in\_process\_record must be [unmerged](#Unmerge)
  * otherwise the in\_process\_record is not deleted, but active
    * if in\_process\_record is an update (determined by checking the [pred2succ map](#pred2succ_map)), then:
      * determine what the previous match points were by querying the [match\_point tables](#matchpoint_tracking) in real time
      * [determine matches](#Determine_matches) and [merge](#Merge) those records accordingly
      * if there were prior record matches that no longer match, those records must be [unmerged](#Unmerge)
    * otherwise
      * [determine matches](#Determine_matches) and [merge](#Merge) those records accordingly
    * pull out and preserve the [dynamic\_content](#dynamic_content)
    * if in\_process\_record is record\_of\_source, pull out and preserve [static\_content](#static_content)
* otherwise the [in\_process\_record](#in_process_record) is a [holdings](GeneralGlossary#holdings_record_(hold).md):
  * otherwise this holding
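The deleted-bib branch of the steps above can be sketched as follows. This is an illustrative Python sketch with hypothetical names (`Record`, `actions`), not the actual MAS code; the real service would also update the db, and the holdings branch is left incomplete in the source, so it is not sketched.

```python
from dataclasses import dataclass

@dataclass
class Record:
    oai_id: str
    is_deleted: bool

def process_deleted_bib(record, pred2succ, merged_records_map, actions):
    """pred2succ maps input oai-ids to output record ids; merged_records_map
    maps an output id to the set of input ids merged into it. `actions`
    collects what the service would do, for illustration only."""
    if record.oai_id not in pred2succ:
        actions.append(("discard", record.oai_id))   # never processed: drop it
        return
    output_id = pred2succ[record.oai_id]
    if merged_records_map.get(output_id, {record.oai_id}) == {record.oai_id}:
        actions.append(("mark_deleted", output_id))  # nothing was merged with it
    else:
        actions.append(("unmerge", record.oai_id))   # others must be reprocessed

actions = []
process_deleted_bib(Record("oai:x:1", True), {}, {}, actions)
process_deleted_bib(Record("oai:x:2", True), {"oai:x:2": "out:9"},
                    {"out:9": {"oai:x:2"}}, actions)
process_deleted_bib(Record("oai:x:3", True), {"oai:x:3": "out:9"},
                    {"out:9": {"oai:x:2", "oai:x:3"}}, actions)
print(actions)
# [('discard', 'oai:x:1'), ('mark_deleted', 'out:9'), ('unmerge', 'oai:x:3')]
```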

Determine matches

* Determine all [match points](#match_point) of the [in\_process\_record](#in_process_record).
* For each [match\_point](#match_point) of the [in\_process\_record](#in_process_record), find which [processed\_records](#processed_records) have [match\_points](#match_point) equal to those of the in\_process\_record.  This is determined by checking the [match\_point caches](#match_point_maps).  We'll call this record set [processed\_records\_with\_common\_match\_points](#processed_records_with_common_match_points).
* Run each [match\_rule](#match_rule) to determine if there are any [processed\_records\_with\_common\_match\_points](#processed_records_with_common_match_points) that match the in\_process\_record.  The match\_rules have to be run because having equivalent match points doesn't guarantee a match.  For example, the 010a might match another record, but if the 020a doesn't match as well, then the records are not considered a match.  (see [step 2a](MarcAggMatchPointsAndErrorCases#Step_2A:)).
* if there are [matched\_records](#matched_records), then [merge](#Merge)
  * ensure that the in\_process\_record still matches each record it previously matched.
    * if there are records that no longer match, then they need to be [unmerged](#Unmerge)
* otherwise create a new output\_record
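The two-stage lookup above (cheap cache lookup, then confirmation by the match rules) can be sketched like this. It is a simplified Python illustration with hypothetical data; the real service runs its configured FieldMatchers and match rules instead.

```python
def determine_matches(record_points, match_point_caches, match_rules):
    """Collect candidates from the match_point caches, then confirm with the
    match rules, since sharing one match point does not guarantee a match."""
    candidates = set()
    for field, value in record_points.items():
        cache = match_point_caches.get(field, {})
        candidates.update(cache.get(value, set()))
    return {rid for rid in candidates
            if any(rule(record_points, rid) for rule in match_rules)}

# illustrative processed records and their cached match points
processed = {
    "rec1": {"010a": "lccn1", "020a": "isbn1"},
    "rec2": {"010a": "lccn1", "020a": "isbnX"},
}
caches = {"010a": {"lccn1": {"rec1", "rec2"}},
          "020a": {"isbn1": {"rec1"}, "isbnX": {"rec2"}}}

def rule_010a_and_020a(points, rid):
    # hypothetical rule: both the 010a and the 020a must agree
    other = processed[rid]
    return (points.get("010a") == other.get("010a")
            and points.get("020a") == other.get("020a"))

incoming = {"010a": "lccn1", "020a": "isbn1"}
print(determine_matches(incoming, caches, [rule_010a_and_020a]))
# {'rec1'} — rec2 shares the 010a but fails on the 020a
```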

Merge

Unmerge

  • For records that need to be unmerged, simply delete all remnants of previously matching records that no longer match.
    • delete them out of the in-memory data structures and the db
    • reprocess each record
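The unmerge steps above can be sketched as follows (an illustrative Python sketch with hypothetical names; the real service would also delete the corresponding db rows before reprocessing):

```python
def unmerge(output_id, pred2succ, merged_records_map, reprocess_queue):
    """Forget every record that was merged into the output record, then
    queue each one for reprocessing."""
    members = merged_records_map.pop(output_id, set())
    for input_id in members:
        pred2succ.pop(input_id, None)  # remove the in-memory mapping
        reprocess_queue.append(input_id)

pred2succ = {"a": "out:1", "b": "out:1"}
merged = {"out:1": {"a", "b"}}
queue = []
unmerge("out:1", pred2succ, merged, queue)
print(pred2succ, sorted(queue))  # {} ['a', 'b']
```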

Matching Implementation

The Match Points document describes the rules for determining whether 2 records should be considered a match. Each match_point will implement the FieldMatcher interface. A couple of stub implementations are:

The above classes demonstrate how to determine matches on specific fields. Each of these matchers will be assigned a name in file: custom.properties (the standard config file for the service). These five matchers are currently used:

matchers.value=SystemControlNumber, Lccn, ISBN, ISSN, x024a

The match_rules will also be declared in a config file. Those declarations will look something like this:

file: custom.properties (the standard config file for the service)

match.rules.value=Step1a, Step2abc
  • This means that we have 2 enabled match rules.
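The service loads these as standard Java properties, but the expected shape of such a comma-separated list property can be sketched quickly (illustrative Python, hypothetical helper name):

```python
def parse_list_property(line):
    """Split a `key=a, b, c` properties line into its key and value list."""
    key, _, value = line.partition("=")
    return key.strip(), [item.strip() for item in value.split(",") if item.strip()]

key, rules = parse_list_property("match.rules.value=Step1a, Step2abc")
print(key, rules)  # match.rules.value ['Step1a', 'Step2abc']
```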

Merging Implementation

  • The merging algorithm will follow the logic set out in scenario #2.

Data Structures

(what follows is largely not how it's implemented, but it is preserved here because it could be the source of good ideas).

Match point values are stored in memory as much as possible. Integral values are easy. String fields that are enumerated are easy (e.g. OCLC, NRU). Text fields for exact matching are a little more difficult. Text fields for fuzzy matches are even more difficult, but might be within our grasp if we hit a home run with exact text field matching. See here for generic platform structures.

  • in-memory data structures

each of these will also be persisted so that they can be fully loaded again on successive runs.

* #### pred2succ_map ####
  * purpose: to determine if there is an existing_output_record associated with the in_process_record
  * key: input_record_id
  * value: output_record_id
* #### merged_records_map ####
  * purpose: to determine (combined with the pred2succ_map) what records have been merged with a particular record.
  * key: output_record_id
  * value: input_record_ids
  * implementation: this will be a wrapper of multiple maps
* #### merge_scores ####
  * purpose: this allows us to easily determine whether the in_process_record should be the record_of_source
  * key: output record id
  * value: the existing_output_record's record_of_source merge_score
* #### bibs2holdings ####
  * purpose: when a bib is merged, we need to know if there are any holdings that already point to it. If so, we'll need to update that holding.
  * key: 001 & 003
  * value: output holding id
* #### match_point_maps ####
  * purpose: each FieldMatcher implementation has its own structure because the match point might be a single field, a combination of field/subfield, or some other combination.
  * examples: matchpoints_035, matchpoints_010a
  * key: matchpoint key
  * value: input_record_id
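The "wrapper of multiple maps" note on the merged_records_map can be sketched as follows (illustrative Python, not the actual MST code): two maps are kept in sync so both lookup directions are cheap.

```python
class MergedRecordsMap:
    """Wraps two maps so both lookup directions are O(1)."""

    def __init__(self):
        self.by_output = {}  # output_record_id -> set of input_record_ids
        self.by_input = {}   # input_record_id -> output_record_id

    def add(self, output_id, input_id):
        self.by_output.setdefault(output_id, set()).add(input_id)
        self.by_input[input_id] = output_id

    def inputs_for(self, output_id):
        return self.by_output.get(output_id, set())

    def siblings_of(self, input_id):
        """All other input records merged with this one."""
        output_id = self.by_input.get(input_id)
        return self.by_output.get(output_id, set()) - {input_id}

m = MergedRecordsMap()
m.add("out:1", "in:a")
m.add("out:1", "in:b")
print(m.siblings_of("in:a"))  # {'in:b'}
```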

  • Diagrams

    • see [here](ServicesExplained#Platform_Data_Structures) for generic platform structures.

* ### db tables ###
  * #### merge\_tracking ####
  * #### merged\_records ####
    * purpose: provides a mapping of input records to output records. This allows for 2 paths:

      | given | can be determined |
      |-------|-------------------|
      | an output_record_id | all the input_records that have been merged together to create this output_record |
      | an input_record_id | all the other input_records that have been merged with this input_record and the corresponding output_record |

    * maps to: [pred2succ\_map](#pred2succ_map) and [merged\_records\_map](#merged_records_map)
  * #### merge\_scores (db) ####
    * purpose: this allows us to easily determine whether the in\_process\_record should be the [record\_of\_source](#record_of_source)
    * maps to: [merge\_scores](#merge_scores)
  * #### holdings\_activation ####
  * #### bibs2holdings ####
    * purpose: when a bib is merged, we need to know if there are any holdings that already point to it. If so, we'll need to update that holding.
    * desc: the 001 is most likely always numeric; however, the MAS shouldn't fail if an alpha is present. For this reason, I'm allocating 2 columns of different types for the 001.
  * #### matchpoint\_tracking ####
    * purpose: these tables provide a mapping from various matchpoints to input\_records
  * #### dynamic\_record\_data ####
    * description: these tables comprise the [dynamic\_content](#dynamic_content) of output\_records
  * #### merged\_035 ####
    * purpose: when a merge happens, additional 035s need to be added to the output\_record. This table prevents us from reading, parsing, and re-writing the full record xml when that happens.
  * #### merged\_904 ####
    * purpose: when a merge happens, the holdings that reference an existing\_output\_record that is to be deleted need to be updated. This table prevents us from reading, parsing, and re-writing the full record xml when that happens.
* ### lucene ###
  * purpose: to support the text matching (exact and possibly fuzzy) of the FieldMatchers.
  * as I implement each specific matcher, I'll fill this out in more detail, but for now one example is the 245a (http://www.loc.gov/marc/bibliographic/bd245.html)
  * feasibility
    * first iteration
      * store the full text unadulterated in a lucene index containing only 245a's. I don't know how fast this will be. This is a lookup that needs to be done for every single record, so even 1-2 ms might be too slow. It's worth trying to see how fast it'll be.
    * second iteration (if the first isn't fast enough)
      * I could split each field-specific index into 2 separate indexes:
        1. the first index only indexes the first x chars and can fit entirely in memory. When a match is found and it is of size x, then we need to check the second index.
        2. the second index contains the full text field and can't fit in memory. It will be queried much less frequently, but can answer definitively.
    * third iteration (if the second isn't fast enough)
      * try the same approach with MySQL.
    * fourth iteration (if the third isn't fast enough)
      * go back to the drawing board
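The second-iteration idea (a small in-memory prefix index that rules most candidates out, backed by a full index for the definitive answer) can be sketched with plain dicts standing in for the two Lucene indexes. This is an illustration of the lookup pattern only, with hypothetical names; it is not the proposed Lucene code.

```python
def lookup_245a(text, prefix_index, full_index, prefix_len=20):
    """Exact-match lookup: consult the cheap prefix index first, and only
    confirm against the full index (the slower lookup) on a prefix hit."""
    prefix = text[:prefix_len]
    candidates = prefix_index.get(prefix, set())
    if not candidates:
        return set()  # prefix miss: definitely no exact match
    return {rid for rid in candidates if full_index.get(rid) == text}

# build both "indexes" from illustrative 245a values
full = {"r1": "a very long title that goes on",
        "r2": "a very long title but different"}
prefixes = {}
for rid, t in full.items():
    prefixes.setdefault(t[:20], set()).add(rid)

print(lookup_245a("a very long title that goes on", prefixes, full))  # {'r1'}
```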

Implementation plans

  • Testing correctness of output
    • This depends on getting examples.
    • For the first iteration, having some examples for just a few of the field types would be enough.
  • Testing performance
    • I will test performance as soon as possible to determine the feasibility of the design.

tricky scenarios

  • matchpoints change scenarios
    1. A is processed. A is updated and no longer has matchpoint MP1. How to reflect this in the data structures (remove MP1)?
    • will query the necessary db tables live. Updates do not need to be quite as optimized, and this query should only take 1-2 millis.
  • merge_score scenarios
    1. A and B are merged. B has a higher merge_score than A and is kept as the record_of_source. B is updated and now has a lower merge_score than A.
    • B will remain record_of_source. An input record never stops being the record_of_source unless another merged_record is processed with a higher merge_score.
  • unmerge scenarios
    1. A is processed. B is processed. C is processed and matches A and B and all 3 are merged. C is updated and matches neither A nor B. All 3 records must be unmerged.
    2. A is processed. B is processed and matches A. C is processed and matches B. C is updated and no longer matches B. The above solution works.
    3. A is processed. B is processed and matches A. C is processed and matches B. C is updated and still longer matches B.
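The merge_score retention rule described above can be sketched as a single predicate (illustrative Python with a hypothetical function name; how the stored score is refreshed on update is not specified in this document and is left out):

```python
def becomes_record_of_source(candidate_id, candidate_score,
                             source_id, source_score):
    """The record_of_source changes only when a *different* merged record
    scores strictly higher; a record's own update cannot demote it."""
    return candidate_id != source_id and candidate_score > source_score

# B is record_of_source with score 10; B is updated and now scores 5.
print(becomes_record_of_source("B", 5, "B", 10))   # False — B stays
# A is later processed with score 12 and takes over.
print(becomes_record_of_source("A", 12, "B", 10))  # True
```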

Glossary

  • dynamic_content

In contrast to static_content, dynamic_content is the portion of content in an output record that originates from more than one input record. The obviously necessary fields are those that contain ids and references. In the case of bibs, the MAS moves 001s and 003s to 035as. In the case of holdings, the MAS moves 004s and 014s to 904s. In both of these cases, the fields from all of the input records are preserved as dynamic_content in the associated output record. These id/reference fields plus the keep_fields comprise the dynamic_content.
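The bib case (001/003 moved to 035a) can be sketched as follows. Note the `(003)001` prefix formatting is an assumption here, borrowed from common MARC practice; the exact output format is not specified in this document.

```python
def move_bib_ids_to_035a(fields):
    """Rewrite an input record's 001/003 pair as a single 035 $a value,
    assuming the common (003)001 convention."""
    f001 = fields.get("001")
    f003 = fields.get("003")
    return f"({f003}){f001}" if f003 else f001

print(move_bib_ids_to_035a({"001": "12345", "003": "OCoLC"}))  # (OCoLC)12345
```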

  • held_records

The set of holdings records in the set of processed_records that are still waiting for their referenced bibs to arrive.

  • in_process_record

The input record that is currently being processed by the MAS.

  • keep_fields

dynamic_content that is guaranteed to be preserved in the output record. Besides these fields, the only content guaranteed to be preserved in the output record is the content of the 001 and 003 (for bibs) and the content of the 001, 003, 004, and 014s for holdings. All the other content is typically lost for all merged records other than the record_of_source. keep_fields allow the ability to truly merge the content of multiple input records into one output record.

  • MAS

the MARC Aggregation Service, which is an MST service.

  • match_point

a specific field/subfield in a record which is used as a basis for determining a match. (see this page for more).

  • match_rule

a conditional statement that when true denotes that two records match. A match_rule is made up of one or more match_points. (see Step 2A)

  • matched_records

the set of processed_records that match the in_process_record.

  • merge_score

the computed score of an input record according to this rule (2nd scenario). This computation is made for all matched_records and the one with the highest score becomes the record_of_source.
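The selection itself is a simple maximum over the matched records' scores, sketched here (illustrative Python; the scoring rule that produces the numbers lives in the referenced scenario, not here):

```python
def pick_record_of_source(merge_scores):
    """Return the input record id with the highest merge_score."""
    return max(merge_scores, key=merge_scores.get)

print(pick_record_of_source({"in:a": 3, "in:b": 7, "in:c": 5}))  # in:b
```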

  • output_records

the set of records in the output repository of the service (and available for oai-pmh harvesting).

  • existing_output_record

an output record which is a successor either of the in_process_record (in the case of an update) or a record from the processed_records that matches the in_process_record.

  • processed_records

the set of input records that the MAS has already processed.

  • processed_records_match_points

the set of all match points of all processed_records.

  • processed_records_with_common_match_points

the set of all records that have common match points with the in_process_record (not necessarily matches). The distinction between this set and matched_records is that having equivalent match points doesn't guarantee a match. For example, the 010a might match another record, but if the 020a doesn't match as well, then the records are not considered a match. (see step 2a)

_DL: what is the difference between having common match points, and matches? I think this tripped me up in the other document and I commented there as well. Perhaps you could carefully define these two "states". — Dave, is this clear now?_

  • record_of_source

the particular input record out of the set of matched_records that is used as the basis for the static_content of the output record. This is the record with the highest merge_score out of a particular set of matched_records.

  • static_content

the static content of an output record is the portion of the record_of_source that is copied unadulterated to the output record. For the current implementation, this is the entire record minus the 001, 003, 035a, 004, and 014 fields. If there is content from all merged records that needs to be preserved, it should be done using keep_fields. The static_content is stored in the xml column of the records_xml table (found in the "Repository Core" grouping of tables in this diagram).
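The static/dynamic split described above can be sketched as follows. This simplified Python illustration partitions whole field tags; the real rule operates at the subfield level for the 035a, which is glossed over here.

```python
def split_static_content(record_fields):
    """Partition a record's fields into static_content (copied unadulterated
    from the record_of_source) and the id/reference fields that become
    dynamic_content instead. Simplified to whole tags."""
    dynamic_tags = {"001", "003", "035", "004", "014"}
    static = {t: v for t, v in record_fields.items() if t not in dynamic_tags}
    dynamic = {t: v for t, v in record_fields.items() if t in dynamic_tags}
    return static, dynamic

static, dynamic = split_static_content({"001": "1", "003": "NRU", "245": "Title"})
print(static, dynamic)  # {'245': 'Title'} {'001': '1', '003': 'NRU'}
```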
