Corruption of OCR'd PDFs

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

Thanks, edwinc, for confirming that this is happening. Now if only we could get a solution!
Annette

Jon
Site Admin
Posts: 8633
Joined: Tue Jul 13, 2004 6:27 pm
Location: Bethesda, MD

Re: Corruption of OCR'd PDFs

Post by Jon »

Maybe Catalina? Please try when that's out.

Jon
Sonny Software

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

Crossing my fingers for Catalina! In the meantime, I found a temporary fix that keeps more PDFs from losing their text layer. I do not pretend to know why it works; I found it by poking around at random. If I take the documents that I originally OCR'd with Adobe Acrobat and re-OCR them with ABBYY FineReader for Mac, they are protected from corruption. Who knows why, but it does work. There is just the minor problem of now having to buy a licensed version of ABBYY FineReader!
Annette

Jon
Site Admin
Posts: 8633
Joined: Tue Jul 13, 2004 6:27 pm
Location: Bethesda, MD

Re: Corruption of OCR'd PDFs

Post by Jon »

There were serious problems with PDFKit and OCR mangling text layers in previous versions of macOS. Is it possible these PDFs were OCR'd back then? Try OCRing them again now with Adobe Acrobat and see if they're stable.

Jon
Sonny Software

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

I am just reporting back that this problem still persists. I took a corrupted document, saved its pages as JPGs, and OCR'd it using Acrobat Pro 2017. I then highlighted some text in Bookends. The text layer was immediately corrupted. So I have to stick to FineReader for OCRing.

Jon
Site Admin
Posts: 8633
Joined: Tue Jul 13, 2004 6:27 pm
Location: Bethesda, MD

Re: Corruption of OCR'd PDFs

Post by Jon »

Are you saying that if you OCR in FineReader, you can then mark up the PDF in Bookends without a problem? Or did you not try that?

Jon
Sonny Software

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

I did more testing, and it's actually worse than I thought. The problem is not limited to PDFs that I have OCR'd myself. I downloaded an article from a database that was already OCR'd, imported it into Bookends, and then highlighted a passage. Although the text was initially perfectly visible in the notes field, as soon as I moved away from the document and then came back to it, it was corrupted.

To make the change easier to see, I used PDF Expert to compare the two states of the document. The first screenshot shows how things looked when I downloaded the PDF from the database and highlighted it in PDF Expert. The second shows the same document after importing it into Bookends and highlighting it there. The last screenshot shows how these notes look in Bookends.
Attachments
1 downloaded from database.png (138.42 KiB)
2 imported into Bookends & highlighted.png (135.09 KiB)
3 notes view from Bookends.png (13.06 KiB)

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

I am currently doing more testing to see if I can fix this with FineReader, but the fact that I might have to do this for everything I download (since I can't know how the OCRing was performed) represents a very serious problem.

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

Bad news. FineReader does not fix anything. I even saved a PDF as JPGs. Assembling and OCRing in FineReader worked fine, but the text layer immediately got corrupted in Bookends.

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

Yet more testing reveals that if I confine myself to highlighting outside of Bookends, the PDF text layer stays intact. It is clear that the PDFKit problems persist, but I would like to find a way of making sure that PDFs in Bookends are not in danger of being inadvertently corrupted. It only takes one errant mouse click. Would it be possible to add a setting to Bookends that disables saving from within the program until Apple gets its act together on this? Failing that, can anyone tell me whether they are experiencing similar problems in DEVONthink? (A search of their forum, where I will also post a query, has proven inconclusive.) Is it possible to use a combination of DEVONthink and Bookends to get around this problem?
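
For readers worried about the "one errant mouse click" scenario, here is a minimal sketch of one way to notice after the fact that a PDF was rewritten: record a SHA-256 checksum of every PDF in a folder, then compare later. The function names (sha256_of, snapshot, changed_files) are made up for illustration, and the check only reports that a file's bytes changed. It cannot tell a harmless saved highlight from a mangled text layer, so treat a hit as a prompt to inspect the file or restore it from a backup.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(folder: Path) -> dict[str, str]:
    """Map each PDF under `folder` (by relative path) to its checksum."""
    return {
        str(pdf.relative_to(folder)): sha256_of(pdf)
        for pdf in sorted(folder.rglob("*.pdf"))
    }


def changed_files(folder: Path, baseline: dict[str, str]) -> list[str]:
    """List PDFs that were in the baseline and whose bytes differ now."""
    current = snapshot(folder)
    return sorted(
        name
        for name, digest in current.items()
        if name in baseline and baseline[name] != digest
    )
```

A typical use would be to call snapshot() on the attachments folder before a work session, keep the result (for example as JSON), and call changed_files() afterwards to see which PDFs were saved in the meantime.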

Jon
Site Admin
Posts: 8633
Joined: Tue Jul 13, 2004 6:27 pm
Location: Bethesda, MD

Re: Corruption of OCR'd PDFs

Post by Jon »

That's not a good solution. You can lock the PDF in the Finder, in which case I don't believe Bookends can do anything with it.
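
For anyone who wants to script the Finder lock rather than ticking the box file by file, a minimal sketch follows. It assumes the Finder's "Locked" checkbox corresponds to the BSD user-immutable flag (the one `chflags uchg` sets), which Python exposes as stat.UF_IMMUTABLE via os.chflags on macOS. The function names are hypothetical, and on systems without os.chflags the sketch falls back to clearing the write permission bits, which is a weaker protection. Whether Bookends leaves a locked file alone is as stated above, not something this code verifies.

```python
import os
import stat
from pathlib import Path

# os.chflags exists on macOS/BSD builds of Python but not on most Linux builds.
HAS_CHFLAGS = hasattr(os, "chflags")


def lock_file(path: Path) -> None:
    """Mark a file as locked (user-immutable flag, or read-only as a fallback)."""
    info = os.stat(path)
    if HAS_CHFLAGS:
        os.chflags(path, info.st_flags | stat.UF_IMMUTABLE)
    else:
        os.chmod(path, stat.S_IMODE(info.st_mode) & ~0o222)


def unlock_file(path: Path) -> None:
    """Undo lock_file so the file can be edited or deleted again."""
    info = os.stat(path)
    if HAS_CHFLAGS:
        os.chflags(path, info.st_flags & ~stat.UF_IMMUTABLE)
    else:
        os.chmod(path, stat.S_IMODE(info.st_mode) | 0o200)


def is_locked(path: Path) -> bool:
    """Report whether the file is currently locked by lock_file's criteria."""
    info = os.stat(path)
    if HAS_CHFLAGS:
        return bool(info.st_flags & stat.UF_IMMUTABLE)
    return (stat.S_IMODE(info.st_mode) & 0o222) == 0
```

Looping lock_file over the PDFs you only want to read, and calling unlock_file on the few you intend to annotate elsewhere, would keep the rest out of harm's way.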

Jon
Sonny Software

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

OK, that's what I'll do. I also ran another interesting test. I OCR'd a document using ocrmypdf on my Raspberry Pi. Then I imported it into Bookends and highlighted it. This instantly corrupted the text layer.

nickharambee
Posts: 20
Joined: Wed Feb 27, 2019 11:48 am

Re: Corruption of OCR'd PDFs

Post by nickharambee »

Just to say that I am experiencing similar issues with PDFs OCR'd with ABBYY FineReader. I am using the "Highlight Selection and Make Bookends Notecard" feature in Bookends and would like to keep using it.

Is there any application on the Mac that OCRs PDFs in a different way, so that I wouldn't run into this issue?

Thanks,

Nick

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

I have also not yet found a solution to this! Glad to hear I'm not alone, though.
