Repeatedly query all known sources for metadata during enrichment

Description

None

Environment

None

Linked work items

relates to

MCR-2305

Enrichment Resolver does not query newly found identifiers

Web links

Commit - Updated mycore commit to '4dd52fb8'

Activity

Silvio Hermann
November 9, 2022 at 8:57 AM

Silvio Hermann mentioned this issue in a commit of ThULB / ansible / ubo on branch main:

Updated mycore commit to '4dd52fb8'

Frank Lützenkirchen
June 29, 2022 at 5:02 PM

Given your configuration: if I start with a Scopus ID, is Scopus queried once again with the DOI it returned itself? And if that Scopus data contains a PubMed ID, is PubMed queried twice, once with the DOI and once with that PMID? As a compromise, why not make it configurable, at least per data source?
MCR.MODS.EnrichmentResolver.DataSource.Scopus.QueryAllIdentifiers=true|false

There could also be a global default:
MCR.MODS.EnrichmentResolver.QueryAllIdentifiers.Default=true|false

You would set this to true and get the behavior you want. To stay backwards compatible, the shipped default could be
MCR.MODS.EnrichmentResolver.QueryAllIdentifiers.Default=false
Others could fine-tune by configuring it for each data source.
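
The lookup order proposed above (per-data-source setting first, then the global default, then false) can be sketched as follows. This is a minimal illustration in Python, not MyCoRe code: the property names are the ones proposed in this comment, while the function name `query_all_identifiers` and the plain dict standing in for the configuration are assumptions made for the example.

```python
# Property names as proposed in the comment above; they are a proposal,
# not necessarily what was finally implemented.
GLOBAL_KEY = "MCR.MODS.EnrichmentResolver.QueryAllIdentifiers.Default"
SOURCE_KEY = "MCR.MODS.EnrichmentResolver.DataSource.{}.QueryAllIdentifiers"


def query_all_identifiers(props, data_source):
    """Return True if the given data source should be queried with all identifiers.

    props is a plain dict of property name -> string value (hypothetical
    stand-in for the real configuration).
    """
    # 1. A setting for this specific data source wins.
    value = props.get(SOURCE_KEY.format(data_source))
    # 2. Otherwise fall back to the global default, which itself
    #    defaults to "false" for backward compatibility.
    if value is None:
        value = props.get(GLOBAL_KEY, "false")
    return value.strip().lower() == "true"


# Example: enable globally, but switch it off for the rate-limited Scopus API.
props = {
    GLOBAL_KEY: "true",
    SOURCE_KEY.format("Scopus"): "false",
}
```

With these example properties, `query_all_identifiers(props, "Scopus")` is false while any data source without its own setting, such as GBV, inherits the global true.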

Kai Brandhorst
June 7, 2022 at 7:34 AM

Of course the enrichment takes longer, but I’ve been using this implementation for more than a year now and haven’t had any complaints so far. My current configuration of data sources is as follows:

MCR.MODS.EnrichmentResolver.DataSource.CrossRef.IdentifierTypes=doi
MCR.MODS.EnrichmentResolver.DataSource.Unpaywall.IdentifierTypes=doi
MCR.MODS.EnrichmentResolver.DataSource.DataCite.IdentifierTypes=doi
MCR.MODS.EnrichmentResolver.DataSource.Scopus.IdentifierTypes=doi scopus
MCR.MODS.EnrichmentResolver.DataSource.PubMed.IdentifierTypes=doi pubmed pubmedcentral
MCR.MODS.EnrichmentResolver.DataSource.PubMedCentral.IdentifierTypes=pubmed pubmedcentral doi
MCR.MODS.EnrichmentResolver.DataSource.GBV.IdentifierTypes=issn isbn doi zdb dnb ppn oclc
MCR.MODS.EnrichmentResolver.DataSource.ZDB.IdentifierTypes=issn zdb dnb
MCR.MODS.EnrichmentResolver.DataSource.IEEE.IdentifierTypes=ieee doi isbn
MCR.MODS.EnrichmentResolver.DataSource.ORCID.IdentifierTypes=doi scopus isbn pubmed pubmedcentral

Since DOIs should be (and usually are) unique per publication, data sources like CrossRef, Unpaywall, DataCite, Scopus, PubMed and PubMedCentral get queried only once anyway. Only GBV, ZDB, IEEE and ORCID are queried repeatedly, if more than one ISSN, ISBN, OCLC, PPN or ZDB identifier is found.

Although I agree that a configuration option like the one you’ve suggested would be the best solution, I still think that all IDs found during enrichment deserve to be queried at least once. This, however, is not the case in the current implementation.

Frank Lützenkirchen
June 7, 2022 at 7:02 AM

We should have this configurable per data source and per identifier, e.g. “enable querying ZDB with ISSN repeatedly”.

I fear the request time and the number of calls would increase massively.

Application developers should be able to choose which data sources and which identifiers…

Kai Brandhorst
June 7, 2022 at 6:53 AM

Well, in my opinion the current implementation of querying data sources only once is a limitation, since journals, for example, usually have more than one ISSN and a whole bunch of OCLC IDs, and books often have more than one ISBN, to name just a few. All of these IDs often link to different records and should thus be queried during enrichment, so data sources need to be queried more than once. Apart from redundant calls, which of course should be avoided for APIs like Scopus due to their rate limits, this is an improvement with no drawbacks, since redundant information is merged anyway.
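
The behavior requested here, querying every data source with every identifier it supports, including identifiers that only turn up during enrichment, while never repeating the same call, can be sketched as a simple fixed-point loop. This is a hedged illustration in Python, not the MyCoRe implementation: `enrich`, `fake_lookup`, the record contents and the identifier values are all hypothetical, and the `sources` mapping only mimics the shape of the `IdentifierTypes` configuration.

```python
def enrich(start_ids, sources, lookup):
    """Query each source once per supported identifier until nothing new appears.

    start_ids: set of (id_type, value) pairs known at the start.
    sources:   dict of source name -> list of supported identifier types.
    lookup:    callable (source, id_type, value) -> set of (id_type, value)
               pairs found in the returned record (hypothetical interface).
    Returns the set of all known identifiers and the list of calls made.
    """
    known = set(start_ids)
    queried = set()   # (source, id_type, value) triples already sent
    calls = []
    progress = True
    while progress:
        progress = False
        for source, types in sources.items():
            # sorted() makes a copy, so 'known' may grow while we iterate
            for id_type, value in sorted(known):
                key = (source, id_type, value)
                if id_type not in types or key in queried:
                    continue  # unsupported type, or redundant call avoided
                queried.add(key)
                calls.append(key)
                known |= lookup(source, id_type, value)
                progress = True
    return known, calls


# Made-up records: a Scopus record that yields a DOI and a PubMed ID.
FAKE = {
    ("Scopus", "scopus", "S1"): {("doi", "10.1/x"), ("pubmed", "P1")},
    ("PubMed", "pubmed", "P1"): {("doi", "10.1/x")},
}


def fake_lookup(source, id_type, value):
    return FAKE.get((source, id_type, value), set())


sources = {"Scopus": ["doi", "scopus"], "PubMed": ["doi", "pubmed"]}
known, calls = enrich({("scopus", "S1")}, sources, fake_lookup)
```

In this made-up run, starting from a single Scopus ID leads to four calls in total: Scopus is queried with the Scopus ID and again with the DOI found, and PubMed is queried with both the DOI and the PubMed ID. No (source, identifier) pair is sent twice.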

Fixed

Details

Assignee

Frank Lützenkirchen

Reporter

Kai Brandhorst

Components

mycore-mods

Fix versions

2021.06.2

2022.06.0

2022.08

Priority

Medium

Created June 3, 2022 at 1:24 PM

Updated November 9, 2022 at 8:57 AM

Resolved September 22, 2022 at 10:43 AM