CLS General Functions Metadata Normalization

Megan Lohnash edited this page Mar 10, 2021 · 9 revisions

Normalizing and Reviewing metadata

Ensuring that metadata entering, or re-entering, the repository is well-formed and conforms to California Revealed's metadata guidelines is one of the key responsibilities of California Revealed's Staff.

Ingesting or retaining malformed metadata causes a host of downstream issues for Project Staff, Partners, and site users: it can hamper the processing of records into the repository, pollute controlled vocabularies, complicate metadata transmission to vendors, impede upload to the Internet Archive, and cause the public access site to display scrambled text.

To prevent these issues, every time CA-R Staff import records, the Staff member must conduct a close and thorough review of the metadata being ingested, or re-ingested, into the repository.

Working with metadata spreadsheets before ingest:

  • Save the working version of a file in a format other than .csv, such as .xlsx.
    • The .csv file format does not save formatting, highlighting, tabs, or filtering (which may be used to expedite this process) when files are saved and re-opened as .csv.
    • One does not have to worry about this if using GoogleSheets.
  • If starting with exported data in .csv format, do not open the file directly. To preserve any UTF-8 characters present in the file, either 1) work with the data in GoogleSheets or 2) import the data into Excel using its data import wizard.
    • Use GoogleSheets to work on the .csv
      • Upload the .csv file to your google account
      • (suggested) Rename the file something easy to find/remember/related to the task
      • Work on the file in Google Sheets
      • Export as .csv when you are finished
    • Import .csv file into Excel using their “Data” tab or File menu
      • Open Excel
      • Click “File” and “New”
      • Click on the “Data” tab
      • Click “From Text/CSV” and select the CSV file
      • For “File origin”, select “65001 : Unicode (UTF-8)”
      • Click “Transform Data”. This opens Microsoft Power Query, which shows a preview of the data.
      • Click “Close & Load”
      • Save file as .xlsx
  • Before importing into the Repository, or saving as .csv, ensure that spreadsheets do not have any blank rows as this will create blank records (with no title or institution) in the Repository.
  • After normalizing and mapping metadata, save the file as a "CSV UTF-8 (Comma Delimited) (.csv)". The file name should follow our standard file naming convention of MARC_GrantCycle_Stream_filetype_date.extension.
    • For example: car_2018-2019_PT_NominationsImport_2019-11-04.xlsx.
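The blank-row check and the UTF-8 save can also be scripted. The following is a minimal sketch rather than an established CA-R tool: the function names are hypothetical, and it assumes the spreadsheet has already been exported as a UTF-8 .csv file.

```python
import csv
import io


def drop_blank_rows(rows):
    """Return only the rows that contain at least one non-whitespace cell."""
    return [row for row in rows if any(cell.strip() for cell in row)]


def clean_csv_text(text):
    """Parse CSV text, drop blank rows, and return the cleaned CSV text."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(drop_blank_rows(rows))
    return out.getvalue()


def clean_csv_file(src_path, dst_path):
    """Read a UTF-8 CSV, remove blank rows, and write the result as UTF-8."""
    # "utf-8-sig" reads UTF-8 files with or without a leading byte order mark.
    with open(src_path, encoding="utf-8-sig", newline="") as src:
        cleaned = clean_csv_text(src.read())
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(cleaned)
```

A row counts as blank here when every cell is empty or whitespace, which covers both fully empty lines and rows that contain only commas.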

General notes for ingesting metadata spreadsheets:

  • Check that all required fields are complete and correctly formatted according to the Metadata Guidelines.
    • If they are not, records become difficult to edit manually in the repository, and users must backfill missing required fields during processing and/or QC.
    • Unknown is an acceptable placeholder for most required fields.
  • Name of institution provided matches the name of the institution in the Repository.
    • If there is a discrepancy, email the partner to clarify their preference; the default should be the institution name provided in the application.
    • Institution name must match for the records to properly associate with the Partner in the Repository.
  • Call number or Temporary ID present
  • Main or Supplied Title is present and unique.
    • Titles display online as a list, and it is easier for users to search or browse unique titles that are not overly long (i.e., 50 characters or fewer).
    • If titles are not unique and need to be updated/differentiated, the stream manager should request this at the time of Award.
      • The person processing the collection should check whether the request has been met.
      • If the partner has not updated the titles with unique entries, the processor should email the partner with suggestions and/or ask for permission to update titles.
    • If titles are extremely long, request that the partner move dates or other description out of the Title field into other appropriate fields as needed (such as Description, Alternative Title, Series Title, Collection Title, Temporal Coverage, etc.).
    • All newspaper titles must be formatted with serial title and issue date, e.g., Richmond Record Herald 1941-12-07.
      • Newspapers must also have a Series Title that corresponds with the authoritative name of the Newspaper across time. For details, please contact the Newspaper Preservation Manager.
  • The names of Creators, Contributors, Copyright Holders, Publishers, Distributors, and Subject Entities follow the Library of Congress Name Authority File (LCNAF) format.
    • For individuals this is generally constructed as Last Name, First Name. For example: Anderson, Julie.
      • Some names are fairly common, so one may also see a Middle Name following the First Name to help differentiate people with similar names. For example, Anderson, Megan Kirsten vs Anderson, Megan Kristin.
      • If additional specification is needed, one may see a birth/death year attached: Last Name, First Name Middle Name, Birth year-Death year (if known/applicable). For example, Anderson, Robert Lowell, 1955- vs Anderson, Robert Lowell, 1929-2002.
    • LCNAF formatting rules also apply to group entities, such as families, corporations, or gatherings of people.
    • Normalizing these at the time of initial ingest is vital for maintaining the health of our various controlled vocabularies and prevents the need to go back and clean up/merge terms later.
  • Created Date is the only required date field. It follows the Library of Congress Extended Date and Time Format (EDTF) and is subject to our date validation, described below.
  • Ensure that Copyright Statement(s) are complete and generally comply with our Permission Guidelines.
    • "Copyrighted" statements must include the name of the copyright holder. Follow up with partner if needed.
    • "Copyright status unknown" statements must include an institutional/evergreen email address. Follow up with partner if needed.
  • Gauge/Format conforms to our controlled vocabularies. Check the vocabulary lists (AV List; Print List) in the Repository.
  • Extent number of parts is formatted, e.g. 1 Page of 1; 2 Reels of 2; 3 Tapes of 3.
  • Extent (Dimensions) (Print only field) is formatted with "in." for inches and "cm." for centimeters. Use fractions for inches and decimals for centimeters. If unknown, ask the partner for information or a best guess. This field is used to determine estimated pricing for print-based materials.
  • Ensure that controlled vocabulary terms are standardized for subject topics and spatial coverage/publication location according to Library of Congress Subject Headings (LCSH).
    • For extremely local terms that are not in LCSH, format them following LCSH conventions.
    • Normalizing these at the time of initial ingest is important for maintaining the health of our various controlled vocabularies and prevents the need to go back and clean up/merge terms later.
  • Project Note: California Revealed
  • Media Type: Depends on format. Image, Text, Moving Image or Sound
  • Condition present
  • Production Stream: AV, DG, NP, OS, or PT.
    • Default for AV content type items is AV. This will need to be manually set for DG stream items, either via import or record updates.
    • Default for Still Image/Text content type is OS (Onsite). This will need to be manually set for DG, NP (Newspaper), or PT (print items sent offsite) items.
    • If unsure of which production stream designation to use, reach out to the production stream manager(s).
  • Grant Cycle: Current/correct round. See controlled vocabulary.
  • Price Bundle: Based on gauge/format. See controlled vocabulary.
    • Value for all DG collections is: Digital collections handled by CA-R staff.
    • Values for Print streams are determined by format and size as provided by partner.
    • Values for AV stream are determined by format and duration (or assumed duration) as provided by partner.
  • Correct grammatical, spelling, and formatting errors as you encounter them.
  • While we do not require that partners use AACR2 or RDA cataloging rules when entering metadata into the Repository, many of them use these rules when creating internal records, and that is reflected in records entered into our Repository.
    • For uncertain information in fields like Creator or Publisher, the partner may add a "?" to the end of the name to indicate that they are unsure whether the person supplied is correct, as the resource/supporting documentation may not make it clear.
    • One may also encounter names in fields like Creator or Publisher enclosed in brackets ([]). This indicates that the information was supplied by the cataloger, based on knowledge of the material or collection that is not immediately evident in the item submitted.
  • For fields where the final piece of information is/may be an exposed number (like Object ID, Copyright Holder Information, or Subject Entity), one needs to be careful not to accidentally change the number by dragging the value down in spreadsheet software.
    • For example, I have 20 items that I am applying the Subject Entity value of Warren, Earl, 1891-1974 as the only term in the cell. When I drag this value down in Excel the program may try to create a series out of the final date, so instead of 20 instances of Warren, Earl, 1891-1974 I have Warren, Earl, 1891-1974; Warren, Earl, 1891-1975; Warren, Earl, 1891-1976; etc.
    • To prevent this one can look at the copy settings in Excel/GoogleSheets and change the option from "Fill Series" to "Copy Cells"
  • Special Handling: Based on condition. See controlled vocabulary.
  • Be sure to format all date columns as text so Excel does not auto-change the format.
  • All date fields are governed by date validation rules which conform to Library of Congress Extended Date and Time Format.
  • If the values do not conform, the importer will report errors. This will not prevent ingest, but users will not be able to save the record in the repository until the dates are properly formatted. Please correct them as needed in the repository.
  • Publication Date field must be numeric characters only, as this is a controlled field. This field is required for serial publications.
  • Duration is formatted as HH:MM:SS. If unknown, leave blank.
  • Internet Archive URL should have the https prefix rather than http.
  • The Internet Archive URL identifier portion matches the object identifier.
    • There are very few exceptions to this rule (generally DG collections where items are already on IA), and one should know if the collection they are reviewing/processing is one such exception.
  • Reduce acronyms, jargon, and abbreviations. Spell out words and institution names. Abbreviated terms may shift over time or be renamed in the future, and they may not be intuitive for users who are not intimately familiar with the institution or field of study represented by an object.
  • Batch replace ampersands "&" with "and".
  • Batch replace smart quotes (“ ”) with straight quotes ("). To find and replace in Excel:
    • find opening smart quotes: alt + [
    • find closing smart quotes: alt + shift + [
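A few of the checks in this list are mechanical enough to script. The sketch below is illustrative only and is not part of the Repository importer. The function names are hypothetical, it reads the Duration pattern as hours, minutes, and seconds, and it assumes that Internet Archive item URLs take the form https://archive.org/details/ followed by the identifier.

```python
import re

# Curly quotes mapped to their straight equivalents.
SMART_QUOTES = {
    "\u201c": '"', "\u201d": '"',  # opening and closing double quotes
    "\u2018": "'", "\u2019": "'",  # opening and closing single quotes
}

# Two-digit hours, then minutes and seconds in the range 00-59.
DURATION_RE = re.compile(r"\d{2}:[0-5]\d:[0-5]\d")


def normalize_text(value):
    """Replace smart quotes with straight quotes and ampersands with "and"."""
    for smart, straight in SMART_QUOTES.items():
        value = value.replace(smart, straight)
    # Assumption: every ampersand is a stand-alone "and"; review names such
    # as AT&T by hand before applying this to a whole column.
    return re.sub(r"\s*&\s*", " and ", value)


def is_valid_duration(value):
    """Accept a blank (unknown) duration or a zero-padded HH:MM:SS value."""
    return value == "" or DURATION_RE.fullmatch(value) is not None


def ia_url_matches(url, object_id):
    """Check for the https prefix and that the identifier matches the object ID."""
    return url == f"https://archive.org/details/{object_id}"
```

Collections that are known exceptions to the identifier rule would need to skip the last check.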

Export re-ingest specific steps:

  • If using a Print data or AV data export to re-ingest records, ensure that all fields are present and accounted for in the import sheet -- newly emptied, or left out, columns will be overwritten as blank.
  • This means one will have to review fields one does not anticipate editing in order to prevent ingest errors.
  • Check for, and correct, odd characters exported due to UTF-8 export encoding issues.
  • This commonly happens to smart quotes (single and double), en dashes, symbols (e.g. degree signs, ampersands, or copyright symbols), accented letters, and other special characters.
    • If these errors get re-ingested not only do we lose the original character, but it also imports a messy string of characters that may not be noticed for a while.
    • This requires that the staff member skim through titles, descriptions, and other text-heavy fields to ensure that odd characters do not creep (back) into the repository and end up on CaliforniaRevealed.org or on IA.
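A script can flag likely candidates before the manual skim. The following is a heuristic sketch, assuming the odd characters come from UTF-8 text that was decoded as Windows-1252 or Latin-1 at some point. The function name is hypothetical, and the pattern will not catch every case, so it supplements the review rather than replacing it.

```python
import re

# Typical traces of UTF-8 bytes misread as Windows-1252 or Latin-1:
#   U+00C3 or U+00C2 followed by a character in U+0080..U+00BF
#     (for example, an accented "e" turns into two characters),
#   U+00E2 followed by U+20AC or U+0080
#     (the start of a mangled smart quote or dash).
MOJIBAKE_RE = re.compile("[\u00c3\u00c2][\u0080-\u00bf]|\u00e2\u20ac|\u00e2\u0080")


def find_suspect_cells(rows):
    """Return (row number, column number, value) for cells that look garbled."""
    hits = []
    for row_number, row in enumerate(rows, start=1):
        for column_number, cell in enumerate(row, start=1):
            if MOJIBAKE_RE.search(cell):
                hits.append((row_number, column_number, cell))
    return hits
```

Row and column numbers start at 1 so they line up with what one sees in the spreadsheet, assuming the header is passed in as the first row.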

After normalizing and mapping, save the file as a CSV UTF-8 (Comma Delimited) (.csv). The file name should follow our standard file naming convention of MARC_GrantCycle_Stream_filetype_date.extension.

  • For example: car_2018-2019_PT_NominationsImport_2019-11-04.xlsx.
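The naming convention can be assembled programmatically to avoid typos. This is a minimal sketch with a hypothetical function name. It assumes the date is written as YYYY-MM-DD, as in the example above, and that the stream is one of the Production Stream codes listed earlier on this page.

```python
from datetime import date

# Production Stream codes used on this page.
STREAMS = {"AV", "DG", "NP", "OS", "PT"}


def import_file_name(marc, grant_cycle, stream, file_type, day, extension="csv"):
    """Build MARC_GrantCycle_Stream_filetype_date.extension."""
    if stream not in STREAMS:
        raise ValueError(f"Unknown production stream: {stream}")
    return f"{marc}_{grant_cycle}_{stream}_{file_type}_{day.isoformat()}.{extension}"


# Reproduces the example file name given above.
name = import_file_name("car", "2018-2019", "PT", "NominationsImport",
                        date(2019, 11, 4), extension="xlsx")
```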

Once the normalized sheet is created, follow the directions for the specific WFS being worked on.
