Corruption of OCR'd PDFs

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Corruption of OCR'd PDFs

Post by atimm »

I am using Bookends 13.1.4.

I have a problem that seems somewhat related to the "Damaging changes to PDF after highlighting text" post (viewtopic.php?f=2&t=4682&p=21238&hilit= ... ted#p21238) but also different enough to warrant a separate post.

I often OCR scanned documents (both published sources and archival documents) using first ScanTailor (a nifty shareware program that automates the process of formatting scans) and then Adobe Acrobat Pro 2017. I then import the PDFs into Bookends and use the highlight tool. This works perfectly for a while, nicely putting the text of the highlights into the notes. Then things suddenly go horribly wrong. The text layer in the PDF seems to get corrupted. When I try to keep highlighting, there is only gobbledygook in the note stream:
[Attachment: fig 1.png (no longer available)]
Text copied and pasted into a word processor now looks like this: 􀁃􀃢􀁦􀂝􀃢􀁳􀁸􀁸􀂲􀂗􀃗􀃢􀂇􀂶􀁦􀃁􀁹􀂄􀃇􀂗􀃢􀁠􀁮 etc. (This despite the fact that it did actually render as text when I first started highlighting in Bookends.) Once this change has occurred, the PDF also slows Bookends down to a crawl. It takes minutes (literally) to click away from this entry and open another one. Sometimes the program crashes altogether. If I restart the program, I can eventually open that file again, but it takes forever to open and the corruption remains. Highlights made before the corruption are still there, but any new ones only produce garbage in the note stream.

In some files, I also get the distorted text described in the above-mentioned post, though it's even worse than what wendyerb describes -- completely unreadable.
[Attachment: fig 1.png]
If I open the file in Acrobat, the text looks fine. I can select text but it copies only as gobbledygook. In other words, the text layer is permanently damaged. I should note that I did follow the instructions given to wendyerb about disabling any Adobe plugins. I had absolutely no plugins enabled when these problems occurred.
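For anyone who wants to check copied text for this kind of damage without eyeballing it, here is a minimal Python sketch. It assumes that the garbage consists mostly of Unicode private-use code points, which is what the sample pasted above appears to be; other kinds of corruption would not be caught. The function names and the 0.5 threshold are illustrative choices, not anything Bookends or Acrobat provides.

```python
import unicodedata


def private_use_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that are private-use."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    # Unicode general category "Co" marks private-use code points.
    private = sum(1 for ch in visible if unicodedata.category(ch) == "Co")
    return private / len(visible)


def looks_corrupted(text: str, threshold: float = 0.5) -> bool:
    """Heuristic (assumption): mostly private-use characters means a damaged text layer."""
    return private_use_ratio(text) >= threshold


if __name__ == "__main__":
    # Paste text copied from the PDF between the quotes to test it.
    sample = "This despite the fact that it did actually render as text"
    print(looks_corrupted(sample))
```

Pasting a few lines copied from a suspect PDF into `sample` gives a quick yes/no answer before you invest time in highlighting that file.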

This is quite obviously a deal-breaker for me in terms of being able to use the program, so I'd really appreciate some advice about how to fix it. Perhaps I can save the PDFs with different settings before importing them?

Thanks!

Jon
Site Admin
Posts: 8665
Joined: Tue Jul 13, 2004 6:27 pm
Location: Bethesda, MD

Re: Corruption of OCR'd PDFs

Post by Jon »

Yes, scanned text seems to be particularly prone to problems with PDFKit annotations. I'm afraid there's nothing I can offer as a workaround, since Bookends simply calls the macOS PDFKit APIs to write annotations. I wonder if the same thing would happen with Preview? All I can suggest is that you use the "open in..." function (Shift-Command-O) to open the PDF in your preferred PDFKit-compliant PDF reader/editor when you want to add annotations. The changes will be reflected in the Bookends note stream when the reference is refreshed.

Jon
Sonny Software

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

Thank you for the quick reply. It is a relief to know that I can get notes imported from another PDF reader. I hadn't thought of that simple solution. I'll also keep experimenting, since some scanned files that I have gotten from other sources behave perfectly well. A different OCR process than the one I am currently using seems to work much better. I wonder, for instance, if what used to be called "Clear scan" in Acrobat (and is now bafflingly labeled with various other settings in the most recent Mac version) is causing problems. If I stumble upon an output format that improves the situation, I'll post my findings here.
Annette

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

Despite extensive googling, I continue to struggle with this problem and really need some more specific advice so that I don't keep corrupting my scanned PDFs. I thought I had understood what you said above, but it turns out that I have no idea what you mean by "PDFKit-compliant," so the term has not helped me pick a highlighting app that I can safely use. I have also now educated myself about the problems that Apple created with PDFKit, so I do recognize that your software is not the problem here. (For those following this thread, you can find an explanation here: https://appleinsider.com/articles/17/01 ... -data-loss.) I have tried using "Open in" and then apps like Highlights and PDF Expert to highlight and annotate documents that are stored in Bookends. Everything looks wonderful, and then I go back to find more corruption (first a destroyed text layer and eventually garbled text in the viewer, which is at first temporary and then becomes permanent). I could keep experimenting with different secondary apps, but it would be wonderful to hear whether anyone has found one that works, because every experiment costs me a lot of time and risks doing permanent damage. Do I have to stop storing OCR'd and highlighted PDFs in Bookends until Apple fixes these problems?

Jon
Site Admin
Posts: 8665
Joined: Tue Jul 13, 2004 6:27 pm
Location: Bethesda, MD

Re: Corruption of OCR'd PDFs

Post by Jon »

By PDFKit-compliant I mean any app that uses PDFKit to render and edit PDFs. Most are -- Adobe Reader, of course, as well as Preview and the two apps you tried. Are you saying that just opening them in Bookends (after they were edited in, say, PDF Expert) corrupts the data?

Jon
Sonny Software

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

Sorry for letting this thread drop. Life got busy. Yes, I am saying that it doesn't seem to matter where I do the highlighting (Highlights, Acrobat, or Preview) or whether I have imported an already highlighted text or used "Open in" to do it from Bookends. When I just view that document in Bookends (without doing any more highlighting or commenting) the text layer is already corrupted. The image of the PDF moves back and forth between appearing fine and appearing garbled. But the text layer is definitively gone. Text is selectable, but when I copy and paste it, it is no longer text but random symbols. I probably have not tested every scenario, but I'm surprised that no one else seems to be encountering this. Could it be my use of ScanTailor (http://scantailor.org/) to do the image processing before OCR'ing?

Jon
Site Admin
Posts: 8665
Joined: Tue Jul 13, 2004 6:27 pm
Location: Bethesda, MD

Re: Corruption of OCR'd PDFs

Post by Jon »

If you don't annotate a PDF in Bookends it doesn't save it to disk. That is, if you just view it the file is not touched by Bookends.

Jon
Sonny Software

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

Then I remain baffled. If you have any suggestions for message boards I could visit that might explain this to me, I'd be grateful. My various search terms are not turning up very much.

Jon
Site Admin
Posts: 8665
Joined: Tue Jul 13, 2004 6:27 pm
Location: Bethesda, MD

Re: Corruption of OCR'd PDFs

Post by Jon »

There are known issues with certain OCR'd PDFs and PDFKit. What I'm saying is that just opening a PDF and then closing it in Bookends won't modify it. If you see differently, send me a PDF that is "OK" in Preview but is not after opening it and closing it in Bookends (without annotating it).
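One way to check whether a file on disk really changed between two points in time is to compare checksums. The following is a minimal Python sketch of that idea; the function names are made up for illustration, and it only reports that the bytes of the file differ, not which application changed them or why.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def was_modified(path, earlier_digest):
    """True if the file's contents differ from the digest recorded earlier."""
    return file_sha256(path) != earlier_digest
```

Record `file_sha256(...)` for the PDF while it still looks fine in Preview, open and close it in the app under test without annotating, and then call `was_modified(...)` with the recorded digest. If the result is `False`, the bytes on disk are identical and any garbling seen in the viewer is not coming from a rewritten file.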

Jon
Sonny Software

joao
Posts: 29
Joined: Fri Jun 17, 2016 4:23 am

Re: Corruption of OCR'd PDFs

Post by joao »

Jon wrote:
Thu Nov 29, 2018 3:46 pm
By PDFKit-compliant I mean any app that uses PDFKit to render and edit PDFs. Most are
Hi Jon.
Could you tell us which apps use PDFKit (Google was not very helpful here) and might be prone to ruining the OCR text layer?
You seem to say that PDF Expert and Adobe Reader use it, but I was under the impression they didn't (at least not for annotations). I gather that Dellu, who is a member of this forum, was under the same impression (https://dellu.wordpress.com/2017/03/19/ ... df-expert/).

Either way, for those who are being bitten by this bug, Skim is a good alternative (https://skim-app.sourceforge.io). As long as you don't export over the original file, you can simply save the file while you work and export the annotations at the very end from within Skim. You will run into other issues, however: because the original file remains untouched, only Skim can read the annotations, and you may lose them if you use a sync service that is not Mac-compliant. Skim can also be helpful if you are reading PDFs from archive.org or from certain copiers/scanners that apply heavy compression: since Skim only touches the file's extended attributes, not the file itself, you avoid the other issue that PDFKit suffers from, which is a huge increase in file size after editing.
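Whichever app you try, it may be worth keeping an untouched copy of each scanned PDF before annotating it, so that a damaged text layer does not mean rescanning. Below is a minimal Python sketch of a timestamped backup; the function name and the naming scheme are illustrative assumptions. Note that `shutil.copy2` copies the file contents and basic metadata such as timestamps, but extended attributes (where Skim keeps its notes) may not be carried over on macOS, so treat this as a backup of the PDF itself rather than of Skim annotations.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_pdf(path, backup_dir=None):
    """Copy a PDF to a timestamped backup file and return the new path."""
    source = Path(path)
    # Default to the same folder as the original (an arbitrary choice).
    target_dir = Path(backup_dir) if backup_dir else source.parent
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{source.stem}.backup-{stamp}{source.suffix}"
    # copy2 copies the contents plus timestamps; the original is not modified.
    shutil.copy2(source, target)
    return target
```

Calling `backup_pdf("scan.pdf", "pdf-backups")` (hypothetical file and folder names) before each annotation session leaves a dated copy to fall back on.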

Joao

Jon
Site Admin
Posts: 8665
Joined: Tue Jul 13, 2004 6:27 pm
Location: Bethesda, MD

Re: Corruption of OCR'd PDFs

Post by Jon »

I really don't know more about this issue with PDFKit and certain OCR'd PDFs than you do. Frankly, the person reporting it in this thread is the first one who has contacted me about it with respect to Bookends. I think it's pretty rare, especially with Mojave. But that's all I can say.

Jon
Sonny Software

joao
Posts: 29
Joined: Fri Jun 17, 2016 4:23 am

Re: Corruption of OCR'd PDFs

Post by joao »

Thanks Jon.
Good to know. I'm on High Sierra and haven't had any issues since I updated to it.

atimm
Posts: 17
Joined: Thu Jan 04, 2018 2:44 pm

Re: Corruption of OCR'd PDFs

Post by atimm »

I'll do some testing in the next few days (perhaps a week) and report back on this thread. If I'm the only one with this problem, then I suspect it has something to do with the precise way that I'm getting from JPGs to an OCR'd PDF. I'll have to be more systematic about testing. I agree with joao that it is actually really difficult to figure out which programs use PDFKit.
Annette

edwinc
Posts: 16
Joined: Tue Jun 26, 2018 12:12 am

Re: Corruption of OCR'd PDFs

Post by edwinc »

Hi,

I'm getting a problem similar to the one described above. It has happened a few times. Though I am still unsure how to reproduce it, I can say that it does happen within Bookends. I have just had the issue again, while reading a PDF in Bookends and using the "Make quoted PDF highlight from selection" function. After going back and forth in the PDF, Bookends suddenly doesn't recognize the scanned text anymore and just shows illegible characters. I've tried opening the PDF in Preview and even re-optimizing it in Acrobat, but the text remains unrecognizable. The file becomes corrupted and there is no way to recover the text other than downloading or scanning a new PDF.

Let's see if a solution comes up. I understand it may have to do with PDFKit, but it does affect the performance of Bookends nevertheless.

Thanks

edwinc
Posts: 16
Joined: Tue Jun 26, 2018 12:12 am

Re: Corruption of OCR'd PDFs

Post by edwinc »

Hi,

I was able to pin down the sequence that is corrupting my PDF files. I think it can be of use for others to know what to avoid:

1. Select PDF text using the mouse. Note that the highlighting option is off.
2. Press Ctrl+Cmd+Q.
3. In order to be able to carry on reading, you have to click on the highlight and “confirm” the text in the PDF note. You need to press “Escape”; if you press anything else, the text is replaced by whatever you typed. If you skip this confirmation, e.g. by clicking on another reference, the PDF gets corrupted.

If you remember to always "confirm" the text in the PDF note, then there shouldn't be a problem.

UPDATE: Sometimes, even if you confirm the text in your PDF notes, moving to another reference can corrupt your file. It just seems that "Make Quoted PDF Highlight from Selection" is not reliable enough.

I'm also looking to see if the problem is with the PDFs themselves, but haven't had such issues reading and annotating with Preview (it definitely has other issues, though) and I have been able to reproduce the problem with different PDFs (not all of them).
