Semantic Scholar "server" improvement
Our Semantic Scholar configuration allows choice of "servers" to import from: these are essentially the major publishers.
We currently have the problem that papers returned by Semantic Scholar don't necessarily name the server that they come from; the "venue" field typically seems to be a journal name, and there's no easy way to determine the publisher/server from the journal name.
There are two possible options, though both may not always be available for a given paper:
- Determine the server from the URI. E.g. for the "Springer" server, URI domains appear to be limited to springer.com, nature.com, biomedcentral.com and scientificamerican.com (each with various subdomains).
- Determine from journal titles: Using Chrome's "Web Scraper" extension I was able in a few minutes to scrape a full list of the SpringerLink journal names. We could load lists of journal titles into memory at time of import, lowercase them and remove diacritics, double-spaces etc, and use these for mapping to server names. This map may be a largish object, so we should ensure it is garbage-collected after import is complete.
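As a rough illustration of the journal-title option, here is a minimal Python sketch of the normalisation and lookup map. The function names (`normalize_title`, `build_title_map`) and the sample titles are made up for this example and are not part of our codebase or of the scraped SpringerLink list.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase a title, strip diacritics and collapse runs of whitespace."""
    # NFKD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"\s+", " ", without_marks).strip().lower()


def build_title_map(titles_by_server: dict[str, list[str]]) -> dict[str, str]:
    """Map each normalized journal title to the server it belongs to."""
    title_map: dict[str, str] = {}
    for server, titles in titles_by_server.items():
        for title in titles:
            title_map[normalize_title(title)] = server
    return title_map


# Hypothetical sample data; real lists would be loaded from the scraped files.
SAMPLE_TITLES = {
    "Springer": ["Journal  für  Beispiele", "Example Journal of Testing"],
}

if __name__ == "__main__":
    title_map = build_title_map(SAMPLE_TITLES)
    # The venue string is normalized the same way before the lookup.
    print(title_map.get(normalize_title("JOURNAL FUR BEISPIELE")))
```

If the map is only ever held in a local variable of the import routine, nothing references it once the import returns, so it becomes eligible for garbage collection without any explicit cleanup.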
Clearly, the first option (determining the server from the URI) is preferable whenever it is possible, not least because we don't want to have to update our lists frequently to account for newly added journals.
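The preference order could look roughly like the following Python sketch: try the URI first, and only fall back to a journal-title lookup when the URI gives no answer. The names `SERVER_DOMAINS`, `server_from_uri` and `resolve_server` are hypothetical, the `title_lookup` callback stands in for whatever title mapping we end up with, and the domain list contains only the Springer domains mentioned above.

```python
from typing import Callable, Optional
from urllib.parse import urlsplit

# Only the Springer domains listed in this issue; other servers would need
# their own entries (assumption: one tuple of base domains per server).
SERVER_DOMAINS = {
    "Springer": (
        "springer.com",
        "nature.com",
        "biomedcentral.com",
        "scientificamerican.com",
    ),
}


def server_from_uri(uri: str) -> Optional[str]:
    """Return the server whose base domain matches the URI's host, if any."""
    try:
        host = (urlsplit(uri).hostname or "").lower()
    except ValueError:
        # Malformed URI: treat it as "no information".
        return None
    for server, domains in SERVER_DOMAINS.items():
        for domain in domains:
            # Match the base domain itself or any of its subdomains,
            # but not look-alikes such as "notspringer.com".
            if host == domain or host.endswith("." + domain):
                return server
    return None


def resolve_server(
    uri: Optional[str],
    venue: Optional[str],
    title_lookup: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Prefer the URI; fall back to the journal-title lookup if provided."""
    if uri:
        server = server_from_uri(uri)
        if server is not None:
            return server
    if venue and title_lookup is not None:
        return title_lookup(venue)
    return None
```

Matching on the host suffix rather than on a substring of the whole URI avoids false positives when a publisher domain merely appears in a path or query string.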