Re: Corruption of OCR'd PDFs

Posted: Sat Jul 27, 2019 7:00 pm
by atimm
Thanks edwinc for confirming that this is happening. Now if we could only get a solution!
Annette

Re: Corruption of OCR'd PDFs

Posted: Sat Jul 27, 2019 7:21 pm
by Jon
Maybe Catalina? Please try when that's out.

Jon
Sonny Software

Re: Corruption of OCR'd PDFs

Posted: Sat Jul 27, 2019 11:17 pm
by atimm
Crossing my fingers for Catalina! In the meantime, I found a temporary fix to make sure more PDFs don't lose their text layer. I do not pretend to know why this works, and I found it through random poking around. But if I take the documents that I originally OCR'd with Adobe Acrobat and re-OCR them with ABBYY FineReader for Mac, they are protected from corruption. Who knows why, but it does work. There is just the minor problem of now having to buy a licensed version of ABBYY FineReader!
Annette

Re: Corruption of OCR'd PDFs

Posted: Sun Jul 28, 2019 10:31 am
by Jon
There were serious problems with PDFKit and OCR mangling text layers in previous versions of macOS. Is it possible these PDFs were OCR'd back then? Try OCRing them again now with Adobe Acrobat and see if they're stable.

Jon
Sonny Software

Re: Corruption of OCR'd PDFs

Posted: Wed Feb 05, 2020 7:59 pm
by atimm
I am just reporting back that this problem still persists. I just took a corrupted document, saved it as JPGs, and OCR'd it using Acrobat Pro 2017. I then highlighted some text in Bookends. The text layer was immediately corrupted. So I have to stick to FineReader for OCRing.

Re: Corruption of OCR'd PDFs

Posted: Wed Feb 05, 2020 8:10 pm
by Jon
Are you saying that if you OCR in FineReader, you can then mark up the PDF in Bookends without a problem? Or did you not try that?

Jon
Sonny Software

Re: Corruption of OCR'd PDFs

Posted: Thu Feb 06, 2020 8:34 pm
by atimm
I did more testing, and it's actually worse than I thought. It is not only a problem with PDFs that I have OCR'd myself. I downloaded an article from a database that was already OCR'd, imported it into Bookends, and then highlighted a passage. Although the text was initially perfectly visible in the notes field, as soon as I moved away from the document and then came back to it, it was corrupted.

To make things more easily visible, I used PDF Expert to demonstrate the change in the document. The first screenshot shows how things looked when I downloaded the PDF from the database and highlighted it in PDF Expert. The second shows the same document after importing it into Bookends and highlighting it there. The last screenshot shows how these notes look in Bookends.

Re: Corruption of OCR'd PDFs

Posted: Thu Feb 06, 2020 8:35 pm
by atimm
I am currently doing more testing to see if I can fix this with FineReader, but the fact that I might have to do this for everything I download (since I can't know how the OCRing was performed) represents a very serious problem.

Re: Corruption of OCR'd PDFs

Posted: Thu Feb 06, 2020 8:51 pm
by atimm
Bad news. FineReader does not fix anything. I even saved a PDF as JPGs. Assembling and OCRing in FineReader worked fine, but the text layer immediately got corrupted in Bookends.

Re: Corruption of OCR'd PDFs

Posted: Fri Feb 07, 2020 4:18 pm
by atimm
Yet more testing reveals that if I confine myself to highlighting outside of Bookends, the PDF text layer stays intact. It is clear that the PDFKit problems persist, but I would like to find a way of making sure that any PDFs in Bookends are not in danger of being inadvertently corrupted. It only takes one errant mouse click. Would it be possible to add a setting to Bookends that disables saving from within the program until Apple gets its act together on this? Failing that, can anyone tell me whether they are experiencing similar problems in DEVONthink? (A search of their forum, where I will also post a query, has proven inconclusive.) Is it possible to use a combination of DEVONthink and Bookends to get around this problem?
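
In the meantime, one possible safety net is to keep an untouched copy of every PDF and check which files have been rewritten since. The following is only a minimal sketch of that idea in Python, not a fix and not anything Bookends itself provides. The function names and the two folder arguments are placeholders chosen for illustration. Note that it flags any PDF whose bytes have changed, including ones that were highlighted without harm, so it only narrows down which files to inspect and possibly restore.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up_pdfs(library_dir, backup_dir):
    """Copy each PDF that has no backup yet; existing backups are never overwritten."""
    library_dir, backup_dir = Path(library_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for pdf in sorted(library_dir.glob("*.pdf")):
        target = backup_dir / pdf.name
        if not target.exists():
            shutil.copy2(pdf, target)  # copy2 also preserves timestamps
            copied.append(pdf.name)
    return copied


def changed_pdfs(library_dir, backup_dir):
    """List the PDFs whose current bytes differ from their backup copy."""
    library_dir, backup_dir = Path(library_dir), Path(backup_dir)
    changed = []
    for backup in sorted(backup_dir.glob("*.pdf")):
        current = library_dir / backup.name
        if current.exists() and sha256_of(current) != sha256_of(backup):
            changed.append(backup.name)
    return changed
```

Running `back_up_pdfs` right after importing new PDFs, and `changed_pdfs` after a working session, would show which files were saved over, so a damaged one can be replaced from the backup folder.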

Re: Corruption of OCR'd PDFs

Posted: Fri Feb 07, 2020 4:27 pm
by Jon
That's not a good solution. You can lock the PDF in the Finder, in which case I don't believe Bookends can do anything with it.

Jon
Sonny Software
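
For anyone who wants to lock a whole folder of PDFs rather than ticking "Locked" one file at a time in the Finder, here is a minimal Python sketch. It relies on the Finder's "Locked" checkbox corresponding to the user-immutable file flag (`UF_IMMUTABLE`, the same flag `chflags uchg` sets) on macOS. The function names are made up for illustration, and the fallback branch, which merely removes write permission on systems without `chflags`, is an assumption added so the script also runs elsewhere. Whether Bookends leaves a locked PDF alone is the expectation stated above, not something this script verifies.

```python
import os
import stat
from pathlib import Path

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def lock_file(path):
    """Lock one file: set UF_IMMUTABLE where available, else drop write permission."""
    path = Path(path)
    if hasattr(os, "chflags"):
        # macOS/BSD: keep the existing flags and add the user-immutable flag.
        os.chflags(path, os.stat(path).st_flags | stat.UF_IMMUTABLE)
    else:
        # Assumed fallback for systems without chflags: make the file read-only.
        path.chmod(stat.S_IMODE(path.stat().st_mode) & ~WRITE_BITS)


def unlock_file(path):
    """Undo lock_file so the PDF can be annotated or deleted again."""
    path = Path(path)
    if hasattr(os, "chflags"):
        os.chflags(path, os.stat(path).st_flags & ~stat.UF_IMMUTABLE)
    else:
        # Restores write access for the owner only.
        path.chmod(stat.S_IMODE(path.stat().st_mode) | stat.S_IWUSR)


def is_locked(path):
    """Report whether lock_file's protection is currently in place."""
    info = os.stat(path)
    if hasattr(os, "chflags"):
        return bool(info.st_flags & stat.UF_IMMUTABLE)
    return stat.S_IMODE(info.st_mode) & WRITE_BITS == 0


def lock_folder(folder):
    """Lock every PDF directly inside a folder and return the names that were locked."""
    locked = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        if not is_locked(pdf):
            lock_file(pdf)
            locked.append(pdf.name)
    return locked
```

A locked file has to be unlocked again (with `unlock_file` or in the Finder) before it can be highlighted in any application, so this trades convenience for safety.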

Re: Corruption of OCR'd PDFs

Posted: Sat Feb 08, 2020 12:35 am
by atimm
OK, that's what I'll do. I also ran another interesting test. I OCR'd a document using ocrmypdf on my Raspberry Pi. Then I imported it into Bookends and highlighted it. This instantly corrupted the text layer.

Re: Corruption of OCR'd PDFs

Posted: Wed Jun 10, 2020 1:18 pm
by nickharambee
Just to say that I am experiencing similar issues with PDFs OCR'd with ABBYY FineReader. I am using the "Highlight Selection and Make Bookends Notecard" feature in Bookends and would like to keep using it.

Is there any application on the Mac that OCRs PDFs in a different way, so that I wouldn't run into this issue?

Thanks,

Nick

Re: Corruption of OCR'd PDFs

Posted: Tue Jun 30, 2020 12:39 am
by atimm
I have also not yet found a solution to this! Glad to hear I'm not alone, though.

Re: Corruption of OCR'd PDFs

Posted: Mon Mar 29, 2021 11:51 am
by Jon
I'm bumping this thread because I've found that a test PDF of mine that was corrupted by adding highlights in Mojave is not corrupted when the same is done in Catalina. To the posters in this thread: have you tried the same in Catalina (and/or Big Sur)? And do these OS versions fix the problem originally posted here for you?

Jon
Sonny Software