Repeatedly query all known sources for metadata during enrichment
Activity
Silvio Hermann November 9, 2022 at 8:57 AM
Frank Lützenkirchen June 29, 2022 at 5:02 PM
Given your configuration: when I start with a Scopus ID, is Scopus queried again with the DOI it returns itself? And when that Scopus data contains a PubMed ID, is PubMed queried twice, once with the DOI and once with that PMID? As a compromise, why not make it configurable, at least per data source?
MCR.MODS.EnrichmentResolver.DataSource.Scopus.QueryAllIdentifiers=true|false
There could also be a global default:
MCR.MODS.EnrichmentResolver.QueryAllIdentifiers.Default=true|false
You would set this to true and get the behavior you want.
MCR.MODS.EnrichmentResolver.QueryAllIdentifiers.Default=false
could be the default, to stay backwards compatible.
Others could fine-tune by configuring it for each data source.
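A minimal sketch of how the lookup order proposed above might work: the per-data-source property wins, then the global default, and false applies if neither is set. The class and method names are hypothetical, and a plain Map stands in for the real MyCoRe configuration access, which is not shown in this issue.

```java
import java.util.Map;

/** Sketch only: resolves the proposed QueryAllIdentifiers flag for one data source. */
class QueryAllIdentifiersConfig {

    static final String PREFIX = "MCR.MODS.EnrichmentResolver.";

    /**
     * Checks the per-data-source property first, then the global default.
     * Falls back to false (the backwards-compatible behavior) if neither is set.
     */
    static boolean queryAllIdentifiers(Map<String, String> props, String dataSourceID) {
        String perSource = props.get(PREFIX + "DataSource." + dataSourceID + ".QueryAllIdentifiers");
        if (perSource != null) {
            return Boolean.parseBoolean(perSource.trim());
        }
        String global = props.get(PREFIX + "QueryAllIdentifiers.Default");
        return global != null && Boolean.parseBoolean(global.trim());
    }
}
```

With the global default set to true and the Scopus property set to false, this would return false for Scopus and true for every other data source.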
Kai Brandhorst June 7, 2022 at 7:34 AM
Of course the time needed for enrichment increases, but I've been using this implementation for more than a year now and haven't had any complaints so far. My current data source configuration is as follows:
MCR.MODS.EnrichmentResolver.DataSource.CrossRef.IdentifierTypes=doi
MCR.MODS.EnrichmentResolver.DataSource.Unpaywall.IdentifierTypes=doi
MCR.MODS.EnrichmentResolver.DataSource.DataCite.IdentifierTypes=doi
MCR.MODS.EnrichmentResolver.DataSource.Scopus.IdentifierTypes=doi scopus
MCR.MODS.EnrichmentResolver.DataSource.PubMed.IdentifierTypes=doi pubmed pubmedcentral
MCR.MODS.EnrichmentResolver.DataSource.PubMedCentral.IdentifierTypes=pubmed pubmedcentral doi
MCR.MODS.EnrichmentResolver.DataSource.GBV.IdentifierTypes=issn isbn doi zdb dnb ppn oclc
MCR.MODS.EnrichmentResolver.DataSource.ZDB.IdentifierTypes=issn zdb dnb
MCR.MODS.EnrichmentResolver.DataSource.IEEE.IdentifierTypes=ieee doi isbn
MCR.MODS.EnrichmentResolver.DataSource.ORCID.IdentifierTypes=doi scopus isbn pubmed pubmedcentral
Since DOIs should be (and usually are) unique per publication, data sources like CrossRef, Unpaywall, DataCite, Scopus, PubMed and PubMedCentral get queried only once anyway. Only GBV, ZDB, IEEE and ORCID are queried repeatedly, and only if more than one ISSN, ISBN, OCLC, PPN or ZDB identifier is found.
Although I agree that a configuration option like the one you've suggested would be the best solution, I still think that all IDs found during enrichment deserve to be queried at least once. This, however, is not the case in the current implementation.
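The behavior argued for here can be sketched roughly as a worklist: every identifier found during enrichment is queued once, and each data source that supports its type is queried with it. This is not the actual MyCoRe enrichment code; the class, the record and the `lookup` function (which stands in for a real data source call and returns the identifiers found in the response) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

/** Sketch only: query every identifier found during enrichment exactly once per data source. */
class QueryAllIdentifiersSketch {

    /** An identifier such as ("issn", "1234-5678"). */
    record Identifier(String type, String value) {}

    /**
     * sources: data source ID -> identifier types it supports (cf. the IdentifierTypes properties).
     * lookup: hypothetical call (dataSourceID, identifier) -> identifiers contained in the returned record.
     * Returns the calls made, in order, formatted as "Source:type=value".
     */
    static List<String> enrich(Map<String, List<String>> sources, Identifier start,
        BiFunction<String, Identifier, Set<Identifier>> lookup) {
        Set<Identifier> known = new HashSet<>();     // every identifier seen so far
        Deque<Identifier> pending = new ArrayDeque<>(); // identifiers not yet used for queries
        List<String> calls = new ArrayList<>();
        known.add(start);
        pending.add(start);
        while (!pending.isEmpty()) {
            Identifier id = pending.poll();
            for (Map.Entry<String, List<String>> source : sources.entrySet()) {
                if (!source.getValue().contains(id.type())) {
                    continue; // this data source cannot be queried with this identifier type
                }
                calls.add(source.getKey() + ":" + id.type() + "=" + id.value());
                for (Identifier found : lookup.apply(source.getKey(), id)) {
                    if (known.add(found)) {
                        pending.add(found); // newly discovered identifier: query it later
                    }
                }
            }
        }
        return calls;
    }
}
```

Because each identifier enters the queue only once, no data source is called twice with the same identifier. A source that supports several identifier types is still called once per identifier, even when those identifiers point to the same record.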
Frank Lützenkirchen June 7, 2022 at 7:02 AM
We should have this configurable per data source and per identifier, e.g. “enable querying ZDB with ISSN repeatedly”.
I fear that the request time and the number of calls would increase massively.
Application developers should have the choice of which data sources and which identifiers…
Kai Brandhorst June 7, 2022 at 6:53 AM
Well, in my opinion the current implementation, which queries each data source only once, is a limitation: journals usually have more than one ISSN and a whole bunch of OCLC IDs, and books often have more than one ISBN, to name just a few examples. These IDs often link to different records and should therefore all be queried during enrichment, which means data sources need to be queried more than once. Apart from making redundant calls (which of course should be avoided for APIs like Scopus because of their rate limits), this is an improvement with no drawbacks, since redundant information is merged anyway.
Silvio Hermann mentioned this issue in a commit of ThULB / ansible / ubo on branch main: