Here's the latest from Marie:
I've just spent some time going through search results, starting with Lesley's string and tweaking to maximize the relevant preprints and minimize the total number it pulls in. In addition to the search having a character limit, the way it works is a little opaque to me and seems to have some idiosyncrasies, so I've come up with something I hope will be useful. The string at the bottom gives me back 197 results starting from the end of March, and includes 13/15 of the preprints that I identified manually during that time period, in addition to several relevant ones that I missed. If you look at the string this seems like kind of a miracle, but it's true.
I imagine if I had access to the full breadth of what you guys can do with the API and more hours I could improve on this, but it doesn't seem too bad to me for now. Let me know if anything is unclear or you'd like me to try again; I've run out of time for now.
"transporter*" "pump*" "gpcr" "gating" "-gated" "-selective" "*-pumping" "protein translocation"
This means the following string is one to start with:
"transporter*" "pump*" "gpcr" "gating" "*-gated" "*-selective" "*-pumping" "protein translocation"
The character limit she mentions is in the bioRxiv UI; the API has no such limit, so we could iterate on this with her.
I get roughly the same results as Marie with this call:
curl --location --request GET 'https://api.biorxiv.org/fulltext?server=all&terms="transporter*" "pump*" "gpcr" "gating" "*-gated" "*-selective" "*-pumping" "protein translocation"&flag=any&date_from=2022-04-01&date_to=2022-04-26'
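The call above puts literal spaces and double quotes in the URL, which is not strictly valid URL syntax. A minimal sketch of the same request that lets curl percent-encode each value, using curl's --get and --data-urlencode options. The endpoint and parameter names are the ones from the call above; the function name biorxiv_fulltext and the CURL override variable are made up for illustration, and it is an assumption that the API treats percent-encoded values the same as the raw ones.

```shell
#!/bin/sh
# Hypothetical helper: same endpoint and parameters as the call above,
# with each value percent-encoded by curl rather than typed into the URL.
# Setting CURL to another command lets you inspect the arguments without
# making a network request.
biorxiv_fulltext() {
  # $1 = search terms, $2 = date_from, $3 = date_to (YYYY-MM-DD)
  "${CURL:-curl}" --silent --show-error --location --get \
    'https://api.biorxiv.org/fulltext' \
    --data-urlencode 'server=all' \
    --data-urlencode "terms=$1" \
    --data-urlencode 'flag=any' \
    --data-urlencode "date_from=$2" \
    --data-urlencode "date_to=$3"
}

# Example (makes a real request, so it is left commented out):
# biorxiv_fulltext '"transporter*" "pump*" "gpcr" "gating"' 2022-04-01 2022-04-26
```

Keeping the search string as a single shell argument should also make it easier to swap in Marie's next attempt without re-editing the URL by hand.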
Having spoken to the team again just now, I've found an easy way for them to "play" with the search terms: the API corresponds closely to the Advanced Search page on bioRxiv here: https://www.biorxiv.org/search
Anything you enter in the "Full Text or Abstract or Title" box corresponds to the API query, and the flags for any/all/phrase correspond too. Lesley found that separating terms by spaces and selecting "any", while omitting some of their search string, actually returned a decent result.
Here is the first attempt, which brings back a good number of relevant papers. It corresponds to this call:
curl --location --request GET 'https://api.biorxiv.org/fulltext?server=all&terms="transporter*" OR "pump*" OR "gpcr*" OR "ligand*" OR "exchange*" OR "uniport*" OR "symport*" OR "antiport*" OR "solute carrier*"&flag=any&date_from=2022-01-01&date_to=2022-04-26'
Marie is currently attempting to find the best string and will send it our way once she has.
The team tell me they expect around 2-3 preprints to be added per day, as their field is relatively small and their search query should be quite specific.
I've been trying to wrangle a good search query for the bioRxiv API, working from the advanced search documentation here:
https://www.biorxiv.org/content/search-tips
It says that simple search does not support parentheses in Boolean expressions, but it appears advanced search might. This previously didn't work (it was returning a PHP error), but I now think that was a server-side problem rather than my query.
I've been trying this in Postman and think I have it working. The collection is attached, with the longer query being the one that seems to work. However, it returns lots of results, which I think is natural as it'll match the full text. Narrowing it by time still returns around 50 per day, which suggests to me that the keywords are not narrowing it enough.
bioRxiv.postman_collection.json
Or in curl:
curl --location --request GET 'https://api.biorxiv.org/fulltext?server=all&terms=("membrane protein*" OR "ion channel*") AND ("transporter*" OR "pump*" OR "gpcr*" OR "ligand*" OR "exchange*" OR "uniport*" OR "symport*" OR "antiport*" OR "solute carrier*" OR "atpase*" OR "atp synthase*" OR "rhodopsin" OR "patch*" OR "voltage*" OR "single-channel*" OR "anion*" OR "cation*" OR "gating" OR "activat*" OR "inactivat*" OR "selectiv*")&flag=any&cursor='
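Since the query above is two OR-lists joined by AND, it may be easier to iterate on if the string is assembled from the term lists instead of edited by hand. A rough sketch of that, with a shortened term list for readability; or_group is a made-up helper name, and whether the API honours parentheses is still the open question noted above.

```shell
#!/bin/sh
# Hypothetical helper: wrap each argument in double quotes, join the
# results with " OR ", and put parentheses around the whole group.
or_group() {
  out=''
  for term in "$@"; do
    if [ -n "$out" ]; then
      out="$out OR "
    fi
    out="$out\"$term\""
  done
  printf '(%s)\n' "$out"
}

# Shortened version of the two groups used in the curl call above.
context=$(or_group 'membrane protein*' 'ion channel*')
topics=$(or_group 'transporter*' 'pump*' 'gpcr*')
terms="$context AND $topics"

printf '%s\n' "$terms"
# prints: ("membrane protein*" OR "ion channel*") AND ("transporter*" OR "pump*" OR "gpcr*")
```

The resulting string can be pasted in as the terms value of the curl call above, so adding or dropping a keyword is a one-word change to the argument list.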