Repeatedly query all known sources for metadata during enrichment

Activity

Silvio Hermann 
November 9, 2022 at 8:57 AM

Silvio Hermann mentioned this issue in a commit of ThULB / ansible / ubo on branch main:

Updated mycore commit to '4dd52fb8'

Frank Lützenkirchen 
June 29, 2022 at 5:02 PM

Given your configuration: when I start with a Scopus ID, is Scopus queried a second time with the DOI it returned itself? And when that Scopus data contains a PubMed ID, is PubMed queried twice, once with the DOI and once with that PMID? As a compromise, why not make it configurable, at least per data source:
MCR.MODS.EnrichmentResolver.DataSource.Scopus.QueryAllIdentifiers=true|false

There could also be a global default:
MCR.MODS.EnrichmentResolver.QueryAllIdentifiers.Default=true|false

You would set this to true and get the behavior you want.
MCR.MODS.EnrichmentResolver.QueryAllIdentifiers.Default=false
could be the default to be backwards compatible.
Others could fine-tune the behavior by configuring it for each data source.
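
A minimal sketch of how the fallback described above might be resolved, written in Python only for illustration. The property names are the ones proposed in this comment; the function name `query_all_identifiers` and the plain dict standing in for the configuration are assumptions, not existing MyCoRe code.

```python
PREFIX = "MCR.MODS.EnrichmentResolver."


def query_all_identifiers(props, data_source):
    """Return the effective flag for one data source.

    A per-data-source setting wins; otherwise the global default applies;
    if neither is set, fall back to False (the backwards-compatible value).
    """
    specific = props.get(f"{PREFIX}DataSource.{data_source}.QueryAllIdentifiers")
    if specific is not None:
        return specific.strip().lower() == "true"
    default = props.get(f"{PREFIX}QueryAllIdentifiers.Default", "false")
    return default.strip().lower() == "true"


# Hypothetical configuration: global default off, ZDB switched on explicitly.
props = {
    "MCR.MODS.EnrichmentResolver.QueryAllIdentifiers.Default": "false",
    "MCR.MODS.EnrichmentResolver.DataSource.ZDB.QueryAllIdentifiers": "true",
}

zdb_flag = query_all_identifiers(props, "ZDB")        # per-source override
scopus_flag = query_all_identifiers(props, "Scopus")  # falls back to default
```

With this shape, an application that wants every identifier queried sets only the global default to true, while others enable it for selected data sources.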

Kai Brandhorst 
June 7, 2022 at 7:34 AM

Of course enrichment takes longer this way, but I’ve been using this implementation for more than a year now and haven’t had any complaints so far. My current configuration of data sources is as follows:

MCR.MODS.EnrichmentResolver.DataSource.CrossRef.IdentifierTypes=doi
MCR.MODS.EnrichmentResolver.DataSource.Unpaywall.IdentifierTypes=doi
MCR.MODS.EnrichmentResolver.DataSource.DataCite.IdentifierTypes=doi
MCR.MODS.EnrichmentResolver.DataSource.Scopus.IdentifierTypes=doi scopus
MCR.MODS.EnrichmentResolver.DataSource.PubMed.IdentifierTypes=doi pubmed pubmedcentral
MCR.MODS.EnrichmentResolver.DataSource.PubMedCentral.IdentifierTypes=pubmed pubmedcentral doi
MCR.MODS.EnrichmentResolver.DataSource.GBV.IdentifierTypes=issn isbn doi zdb dnb ppn oclc
MCR.MODS.EnrichmentResolver.DataSource.ZDB.IdentifierTypes=issn zdb dnb
MCR.MODS.EnrichmentResolver.DataSource.IEEE.IdentifierTypes=ieee doi isbn
MCR.MODS.EnrichmentResolver.DataSource.ORCID.IdentifierTypes=doi scopus isbn pubmed pubmedcentral

Since DOIs should be (and usually are) unique per publication, data sources like CrossRef, Unpaywall, DataCite, Scopus, PubMed and PubMedCentral get queried only once anyway. Only GBV, ZDB, IEEE and ORCID are queried repeatedly if more than one ISSN, ISBN, OCLC, PPN or ZDB ID is found.

Although I agree that a configuration option like the one you’ve suggested would be the best solution, I still think that all IDs found during enrichment deserve to be queried at least once. This, however, is not the case in the current implementation.
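
As a rough illustration of how the IdentifierTypes settings above translate into calls, the following Python sketch parses a three-line excerpt of that configuration and counts how often each data source would be asked if every known identifier is used once. The identifier values and the helper names `parse_identifier_types` and `calls_per_source` are made up for the example; this is not how MyCoRe itself reads its properties.

```python
# Excerpt of the configuration quoted in this comment.
CONFIG = """\
MCR.MODS.EnrichmentResolver.DataSource.CrossRef.IdentifierTypes=doi
MCR.MODS.EnrichmentResolver.DataSource.GBV.IdentifierTypes=issn isbn doi zdb dnb ppn oclc
MCR.MODS.EnrichmentResolver.DataSource.ZDB.IdentifierTypes=issn zdb dnb
"""

PREFIX = "MCR.MODS.EnrichmentResolver.DataSource."
SUFFIX = ".IdentifierTypes"


def parse_identifier_types(text):
    """Map each data source name to the list of identifier types it accepts."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key.startswith(PREFIX) and key.endswith(SUFFIX):
            name = key[len(PREFIX):-len(SUFFIX)]
            result[name] = value.split()
    return result


def calls_per_source(types_by_source, identifiers):
    """Count one call per (data source, identifier) pair the source supports."""
    return {
        source: sum(1 for id_type, _ in identifiers if id_type in types)
        for source, types in types_by_source.items()
    }


# Hypothetical publication: one DOI and a journal with two ISSNs.
identifiers = [
    ("doi", "10.1000/example"),
    ("issn", "1234-5678"),
    ("issn", "8765-4321"),
]

counts = calls_per_source(parse_identifier_types(CONFIG), identifiers)
print(counts)
```

For this made-up record the DOI-only source is asked once, while the sources that accept ISSNs are asked once per ISSN in addition.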

Frank Lützenkirchen 
June 7, 2022 at 7:02 AM

We should have this configurable per data source and per identifier, e.g. “enable querying ZDB with ISSN repeatedly”.

I fear that the request time and the number of calls will increase massively.

Application developers should be able to choose for which data sources and which identifiers…

Kai Brandhorst 
June 7, 2022 at 6:53 AM

Well, in my opinion the current implementation, which queries each data source only once, is a limitation: journals usually have more than one ISSN and a whole bunch of OCLC IDs, and books often have more than one ISBN, to name just a few cases. These IDs often link to different records and should therefore all be queried during enrichment, which means data sources need to be queried more than once. Apart from redundant calls (which should of course be avoided for APIs like Scopus because of their rate limits), this is an improvement with no drawbacks, since redundant information is merged anyway.
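
To make the idea concrete, here is a small Python sketch of such a loop: every identifier found during enrichment is sent once to every data source that supports its type, and no (data source, identifier) pair is sent twice. The `fetch` callback, the fake responses and all identifier values are hypothetical stand-ins; this is not the actual MyCoRe enrichment code.

```python
def enrich(start_ids, sources, fetch):
    """Query until no new identifiers turn up.

    start_ids: set of (type, value) pairs known at the start
    sources:   dict of data source name -> supported identifier types
    fetch:     callable(source, id_type, value) -> set of (type, value)
               pairs found in the returned record (hypothetical callback)
    """
    known = set(start_ids)
    asked = set()   # (source, type, value) triples already sent
    calls = []      # same triples, in call order
    changed = True
    while changed:
        changed = False
        for source, id_types in sources.items():
            # sorted() takes a snapshot, so adding to `known` below is safe
            for id_type, value in sorted(known):
                key = (source, id_type, value)
                if id_type not in id_types or key in asked:
                    continue  # unsupported type, or redundant call
                asked.add(key)
                calls.append(key)
                found = fetch(source, id_type, value)
                if not found <= known:
                    known |= found
                    changed = True  # new IDs: do another round
    return known, calls


# Fake responses standing in for real APIs.
RESPONSES = {
    ("Scopus", "scopus", "S1"): {("doi", "D1"), ("pubmed", "P1")},
    ("Scopus", "doi", "D1"): {("scopus", "S1"), ("pubmed", "P1")},
    ("PubMed", "doi", "D1"): {("pubmed", "P1")},
    ("PubMed", "pubmed", "P1"): {("doi", "D1")},
}


def fake_fetch(source, id_type, value):
    return RESPONSES.get((source, id_type, value), set())


sources = {"Scopus": ["doi", "scopus"], "PubMed": ["doi", "pubmed"]}
known, calls = enrich({("scopus", "S1")}, sources, fake_fetch)
```

The `asked` set is what keeps rate-limited APIs from being hit twice with the same identifier; distinct identifiers of a supported type still each cause their own call.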

Fixed

Details

Created June 3, 2022 at 1:24 PM
Updated November 9, 2022 at 8:57 AM
Resolved September 22, 2022 at 10:43 AM