Title disambiguation #11

@cverluise

A given title (in title_j or title_m) can appear under different forms in the database. This can be due to typos (e.g. Ibm Tchnical Disclosure Bulletin), abbreviations (e.g. Ibm Tdb), parsing errors (e.g. Ibm Tech-Nical Disclosure Bulletin, Ibm Corp), etc.

Example ⬇️

SELECT
  DISTINCT(title_j)
FROM
  `npl-parsing.patcit.beta`
WHERE
  LOWER(title_j) LIKE "%ibm%"
ORDER BY
  title_j DESC

title_j
Ibme Technical Disclosure Bulletin
Ibm-Tdb
Ibm Tecnical Disclosure Bulletin
Ibm Technical Dosclosure Bulletin
Ibm Technical Document
Ibm Technical Dislosure Bulletin
Ibm Technical Disclusure Bulletin
Ibm Technical Disclosures Bulletin
Ibm Technical Disclosure Bulleting
Ibm Technical Disclosure Bulletin; 'Improved First-In First-Out'
Ibm Technical Disclosure Bulletin, Ref. No. Xp
Ibm Technical Disclosure Bulletin, Nn Corp., Us
Ibm Technical Disclosure Bulletin, Ibm Corp. Ny
Ibm Technical Disclosure Bulletin, Ibm Corp
Ibm Technical Disclosure Bulletin Ibm
Ibm Technical Disclosure Bulletin
Ibm Technical Disclosure Bullentin
Ibm Technical Disclosure Bulle
Ibm Technical Disclossure Bulletin
Ibm Techn.Discl.Mag
Ibm Techn. Discl. Bull
Ibm Tech-Nical Disclosure Bulletin, Ibm Corp
Ibm Tech Disc Bulletin
Ibm Tdb
Ibm Tchnical Disclosure Bulletin
Ibm Disclosure Bulletin

Feature description

Title variables are useful in many use cases, so a clean and transparent disambiguation would be a strong plus.

  • At this point, I have no particular view on which tools or algorithms are most appropriate for the disambiguation process. Anyone should feel free to contribute.
  • Ultimately, we want a correspondence table between a "unique identifier" (e.g. "Ibm Technical Disclosure Bulletin") and all of its related variations.
  • The output of the disambiguation could be used to propagate ISSN(e)s (see issue #6, "Multiple title_j for the same ISSN/ISSNe").
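
To make the idea of a correspondence table concrete, here is a minimal sketch in Python using only the standard library. It is one possible starting point, not a proposed final method. The helper names (normalize, similarity, build_correspondence), the list of canonical titles and the 0.85 threshold are all assumptions made for illustration. The sketch normalizes each variant (lowercasing, dropping whatever follows the first comma or semicolon, removing hyphens), then maps it to the closest canonical title using difflib's character-level similarity ratio.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Crude normalization (assumption): keep the text before the first
    comma/semicolon, lowercase it, remove hyphens, and squeeze whitespace."""
    head = re.split(r"[;,]", title, maxsplit=1)[0]
    head = head.lower().replace("-", "")        # "Tech-Nical" -> "technical"
    head = re.sub(r"[^a-z0-9 ]", " ", head)     # other punctuation -> space
    return " ".join(head.split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def build_correspondence(variants, canonical_titles, threshold=0.85):
    """Map each variant to its closest canonical title, or to None when no
    canonical title reaches the (arbitrary, to-be-tuned) threshold."""
    table = {}
    for variant in variants:
        best = max(canonical_titles, key=lambda c: similarity(variant, c))
        table[variant] = best if similarity(variant, best) >= threshold else None
    return table


# Hypothetical usage: the canonical list would have to come from a trusted
# source, which this sketch does not provide.
canonical = ["Ibm Technical Disclosure Bulletin"]
variants = [
    "Ibm Tchnical Disclosure Bulletin",
    "Ibm Tech-Nical Disclosure Bulletin, Ibm Corp",
    "Ibm Tdb",
]
correspondence = build_correspondence(variants, canonical)
```

With these inputs, the typo and the parsing error are both mapped to "Ibm Technical Disclosure Bulletin", while "Ibm Tdb" is mapped to None. Abbreviations share too few characters with the full title for a character-level ratio to catch them, so they would likely need a separate treatment (for instance a curated abbreviation list).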
