Find similar (but not identical) PDFs?

Posted: Mon Oct 26, 2020 4:37 pm
by DRNash
I have a large collection of references (>25,000) in my primary Bookends file, most of them with attached PDF files. Inevitably there are some duplicates. Finding and weeding out identical PDFs isn't a problem, but there are also quite a few PDFs of the same reference that are not quite identical (e.g. some library sites add a text stamp with the download date to the margin of the paper, "in-press" PDFs have different page numbers from the published versions, etc.). I have been trying to figure out a way of comparing the ca. 22,000 PDF files to identify those that are very similar (but not identical).

I don't expect to find a solution within Bookends (although it would of course be a nice feature to have, albeit very tricky to code), but I'm sure that I'm not the only one who would like to be able to search through an attachment folder and identify near-duplicates.

So, does anyone have a suggestion for a Mac software solution to this? Ideally something that will produce a similarity score for each pair of files and then present the scores as either a sortable list or something like a cluster diagram.

The semi-working solution I have is to index the files in "Media Pro" (used to be iView multimedia), which produces a thumbnail of each PDF and allows one to find similar thumbnails. This has allowed me to find and deal with some files that are versions of the same reference. Unfortunately it only works with the first page of the PDF, and so is thrown off by any PDFs that have a cover page (these tend to be similar between papers from the same journal, regardless of the content of the paper). I can see that converting the PDF content to text and then comparing the texts is a possibility, but this will take a very long time with this number of PDFs (and most text-comparison programs only work on one pair of files at a time). I'm not interested in what exactly the differences are, just in getting an overall similarity score that will allow near-duplicates to be identified.
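To make the text-comparison idea concrete, here is a rough Python sketch of the kind of scoring I have in mind. It assumes the text has already been extracted from each PDF by some other tool and is held in a dictionary keyed by file name. The function names (`shingles`, `jaccard`, `rank_pairs`) are made up for illustration, and Jaccard similarity on overlapping word sequences is just one plausible measure, not something I know any existing program to use.

```python
import re
from itertools import combinations


def shingles(text, k=5):
    """Return the set of overlapping k-word sequences in text.

    Case and punctuation are ignored, so a margin stamp or changed
    page numbers only disturb a handful of sequences.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items / all items (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_pairs(texts, k=5):
    """Score every pair of documents and return a sortable list.

    texts: dict mapping a file name to its extracted text (assumed input).
    Returns (score, name_a, name_b) tuples, most similar pair first.
    """
    sets = {name: shingles(text, k) for name, text in texts.items()}
    pairs = [
        (jaccard(sets[a], sets[b]), a, b)
        for a, b in combinations(sorted(sets), 2)
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs


if __name__ == "__main__":
    body = ("the quick brown fox jumps over the lazy dog "
            "near the quiet river bank today")
    demo = {
        "paper.pdf": body,
        "paper_stamped.pdf": "downloaded on 2020 10 26 " + body,
        "other.pdf": "completely unrelated text about marine sediment cores",
    }
    for score, a, b in rank_pairs(demo):
        print(f"{score:.2f}  {a}  {b}")
```

Comparing every pair this way is brute force: for ca. 22,000 files it comes to roughly 242 million comparisons, so in practice the all-pairs loop would probably need replacing with an approximate method (MinHash with locality-sensitive hashing is the usual suggestion) that only scores likely candidates.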

Many thanks for any suggestions!