Re: Find similar (but not identical) PDFs ?

Posted: Tue Feb 01, 2022 8:40 am
by Jon
Sorry, I don't have an answer for you. I would note that *anything* that involves examining 25K PDFs is going to take a long time. I did a quick search and found lots of ways to compare two PDFs, but few that compare many at once (and they seem to be focused on showing the diffs to the user, not judging whether they're the same PDF). Frankly, what you want is a very intelligent AI designed for this purpose. Maybe there is something out there, but... I did run across a company called CopyLeaks that says something about comparing multiple PDFs, but I doubt it's up to this task. You can look, of course.

One tip that may help: if your references have DOIs, do a Remove Duplicates that matches on the DOI only. Then do NOT remove the dups, but compare the attachments of the ones that share a DOI. The PDFs will either be the same one attached twice, or two different versions of the same paper.
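If you're comfortable with a little scripting, the "same one attached twice" case can be checked outside the app. Here is a rough Python sketch, not a built-in feature: it assumes you have somehow exported a list of (DOI, attachment path) pairs, and the names `file_digest` and `classify_by_doi` are made up for illustration. It only detects byte-for-byte identical files; two copies of the same paper that were saved or stamped differently will show up as "different", so those still need a manual look.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def classify_by_doi(records):
    """Group attachments by DOI and report whether each group is byte-identical.

    records: iterable of (doi, pdf_path) pairs. This export format is an
    assumption for the sketch, not something the application produces as-is.
    Returns {doi: "identical" | "different"} for DOIs with 2+ attachments.
    """
    groups = defaultdict(list)
    for doi, pdf_path in records:
        # DOIs are case-insensitive, so normalize before grouping.
        groups[doi.strip().lower()].append(Path(pdf_path))

    result = {}
    for doi, paths in groups.items():
        if len(paths) < 2:
            continue  # only one attachment for this DOI, nothing to compare
        digests = {file_digest(p) for p in paths}
        result[doi] = "identical" if len(digests) == 1 else "different"
    return result
```

Anything reported as "identical" is the same file attached twice and is safe to treat as a true duplicate. Anything reported as "different" is a candidate for two versions of the same paper.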

Jon
Sonny Software