Script: Capture DOI field on PDF and run AutoFill

Post by **DrJJWMac** » Sun Oct 15, 2023 5:12 pm

I often add a PDF to Bookends that contains a DOI link which Bookends does not capture. I wrote the script below to help.

* Select the reference.
* Open the PDF.
* Select the DOI text on the PDF. The selected text must include one of these two "delimiters": doi.org/ or DOI:
* Run the script.

The script will capture the DOI to the clipboard, paste it back to the reference, and run an AutoFill from Internet command.

CAVEATS - The reference in the left pane must not be the active selection when the script runs. It must be highlighted gray, not blue. Otherwise, the script copies the wrong information to the clipboard. I would appreciate any insights on how to "deselect" the left pane and reselect it after copying to the clipboard.

Code:

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
(**
this parses the DOI on a PDF, fills the DOI into the reference, and runs AutoFill
version 2023-10-15
author jjw
**)
on run
	
	tell application "Bookends"
		activate
		-- copy the DOI text that is selected in the PDF
		tell application "System Events" to keystroke "c" using command down
		delay 0.2 -- give the clipboard a moment to update before reading it
		set theText to (the clipboard as text)
		set theReference to the selected publication items of the front library window
		set theDOI to the get_DOI(theText) of me as text
		set the doi of item 1 of theReference to theDOI
		-- send Shift-Command-A to run the AutoFill from Internet command
		tell application "System Events" to keystroke "a" using {shift down, command down}
	end tell
	
end run

on get_DOI(DOItext)
	-- remember the current delimiters so they can be restored before returning
	set theCD to AppleScript's text item delimiters as text
	set AppleScript's text item delimiters to "doi.org/"
	set theItemList to every text item of DOItext as list
	try
		-- everything after "doi.org/"; this errors when the delimiter is absent
		set theReturn to text items 2 thru end of theItemList as text
	on error
		-- fall back to the "DOI:" form (delimiter matching ignores case by default)
		set AppleScript's text item delimiters to "doi:"
		set theItemList to every text item of DOItext as list
		set theReturn to text items 2 thru end of theItemList as text
	end try
	set AppleScript's text item delimiters to theCD
	return theReturn
end get_DOI
```

Post by **Jon** » Sun Oct 15, 2023 5:36 pm

Since you're using System Events, you can tell it to click specific locations in the window. Google returns many hits about that; here's one:

https://stackoverflow.com/questions/570 ... control-if

Please send me one PDF where Bookends can't find the DOI in the text. Note that the DOI must be registered with Crossref or Bookends won't recognize it as legitimate.

Jon
Sonny Software

Post by **iandol** » Sun Oct 15, 2023 8:49 pm

One clear example is Annual Reviews:

[Attachment: Screenshot 2023-10-16 at 08.38.07.png]

The text copies as:

Code:

```
https://doi.org/10.1146/annurev-vision-090721-

110411
```

Yes, this is the reason PDF is such a horrendous archival format: the line break used for layout splits the semantic content of the URL. You can work around that with some sort of heuristic, but there are many possible edge cases (maybe the URL was only on one line and the following text is something else, etc.). For Annual Reviews this is always broken on import, at least for all the PDFs I've added, and line breaks may be a common condition for other journals, but this is all fuzzy.

DrJJWMac -- does your script solve this edge case? It may be easier for a manual script as we can assume the user has correctly delineated the URL content then just remove newlines or other whitespace...

------
EDIT:

Testing with the hexapdf tool (uses Ruby), I can find a complete URL in object 157 of that PDF. So with a PDF parsing library one could at least find page 1 (object 137), then find its annotations and get the URL:

Code:

```
cmd> r 157
<<
  /A <<
    /S /URI
    /URI (https://doi.org/10.1146/annurev-vision-090721-110411)
  >>
  /Border [0 0 0 ]
  /C [0 1 1 ]
  /H /I
  /QuadPoints [45.354 190.752 190.072 190.752 190.072 199.994 45.354 199.994 45.354 181.785 69.054 181.785 69.054 191.027 45.354 191.027 ]
  /Rect [44.357 180.789 191.068 200.99 ]
  /Subtype /Link
  /Type /Annot
```

Post by **DrJJWMac** » Sun Oct 15, 2023 11:25 pm

> Please send me one PDF where Bookends can't find the DOI in the text.

My fault. The articles I have to parse were sent to Bookends on iPadOS via the Share Sheet from Safari. I find articles through my library's search engine, view the PDF in Safari, and share it to Bookends. As a result, the article enters Bookends on iPadOS as a generically named PDF with no other processing. I sync the iPadOS library to macOS and complete the article processing in Bookends on macOS.

> DrJJWMac -- does your script solve this edge case? It may be easier for a manual script as we can assume the user has correctly delineated the URL content then just remove newlines or other whitespace...

This function should handle a case with a carriage return, presuming, as you note, that the user has copied it properly to the clipboard. It might be possible to trim this script down but ... if it works, I'm not going to fuss with it. Substitute it for get_DOI in the main code above.

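Code:

```applescript
on get_DOI(DOItext)
	-- remember the current delimiters so they can be restored before returning
	set theCD to AppleScript's text item delimiters as text
	set AppleScript's text item delimiters to "doi.org/"
	set theItemList to every text item of DOItext as list
	try
		-- everything after "doi.org/"; this errors when the delimiter is absent
		set theReturn to text items 2 thru end of theItemList as text
	on error
		-- fall back to the "DOI:" form (delimiter matching ignores case by default)
		set AppleScript's text item delimiters to "doi:"
		set theItemList to every text item of DOItext as list
		set theReturn to text items 2 thru end of theItemList as text
	end try
	-- split at any line break, then rejoin the pieces with nothing between them
	set AppleScript's text item delimiters to {return, linefeed, return & linefeed}
	set theReturnParsed to text items of theReturn
	set AppleScript's text item delimiters to ""
	set theReturn to theReturnParsed as text
	set AppleScript's text item delimiters to theCD
	return theReturn
end get_DOI
```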
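
The handler above removes returns and linefeeds. A selection can also pick up spaces or tabs, for example the space that often follows "DOI:". Below is a minimal sketch of a companion helper that strips those as well, using the same text item delimiter technique. The name stripWhitespace is hypothetical and is not part of Bookends or of the script above, and whether the DOI field needs this extra cleanup is an assumption.

```applescript
-- Hypothetical helper (not part of Bookends): remove line breaks, tabs,
-- and spaces that can end up inside or around a DOI copied from a PDF.
on stripWhitespace(theText)
	set savedTIDs to AppleScript's text item delimiters
	-- split the text at every whitespace character of interest
	set AppleScript's text item delimiters to {return, linefeed, tab, space}
	set thePieces to text items of theText
	-- rejoin the pieces with nothing between them
	set AppleScript's text item delimiters to ""
	set theResult to thePieces as text
	set AppleScript's text item delimiters to savedTIDs
	return theResult
end stripWhitespace

-- the Annual Reviews DOI as it copies from the PDF, split by two linefeeds
set brokenDOI to "10.1146/annurev-vision-090721-" & linefeed & linefeed & "110411"
stripWhitespace(brokenDOI)
-- result: "10.1146/annurev-vision-090721-110411"
```

Inside a tell block the call would need to be written as stripWhitespace(theDOI) of me, in the same way the main script calls get_DOI.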

Post by **Jon** » Mon Oct 16, 2023 8:04 am

@iandol

Yes, the problem is the internal formatting in the PDF. A DOI cannot contain a newline (or a number of other characters), so as far as Bookends is concerned it is complete at that point. This isn't a problem in the vast majority of cases, but it does happen (and if a publisher consistently does this, of course it happens for all of their articles). When it does happen, one can do an Autocomplete Paper and copy/paste the DOI into the DOI field (of course then deleting the return -- I suppose Bookends could strip out such characters after the paste, or when the search is initiated). The hexapdf tool is obviously reading the raw data (bytes), which I suspect would have its own edge case problems if one were trying to automate DOI detection. I can take a look at that, though, and see if there are immutable rules in such PDFs that could be used.

Jon
Sonny Software
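
Since a DOI cannot contain a newline, a script could run a quick shape check on the captured text before pasting it and starting AutoFill. The sketch below is only an illustration: the handler name looksLikeDOI is hypothetical, and it checks the shape of the text only (the "10." prefix, a slash, no line break). It does not check whether the DOI is registered with Crossref, and it says nothing about how Bookends itself validates a DOI.

```applescript
-- Hypothetical sanity check (not how Bookends validates DOIs):
-- returns true when the text has the basic shape of a DOI.
on looksLikeDOI(theText)
	-- DOIs begin with the directory indicator "10."
	if not (theText starts with "10.") then return false
	-- the prefix and the suffix are separated by a slash
	if theText does not contain "/" then return false
	-- a leftover line break means the copied text is still split in two
	repeat with aBreak in {return, linefeed}
		if theText contains (contents of aBreak) then return false
	end repeat
	return true
end looksLikeDOI

set goodDOI to "10.1146/annurev-vision-090721-110411"
set splitDOI to "10.1146/annurev-vision-090721-" & linefeed & "110411"
{looksLikeDOI(goodDOI), looksLikeDOI(splitDOI)}
-- result: {true, false}
```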
