Locate PDFs That Can't Be Searched

Posted: Wed Jan 13, 2021 3:48 pm
by MikeA01730
Hi,

I have a folder that contains several hundred PDFs. Most of them can be searched for text, but a few were scanned before home scanners provided OCR, so those PDFs are not searchable.

I'd like to identify the unsearchable PDFs. I tried searching for !content:"e" but canceled the search after about 4 hours. Any ideas on how to find these files?

Thanks,
Mike

Re: Locate PDFs That Can't Be Searched

Posted: Wed Jan 13, 2021 4:17 pm
by NotNull
Not (easily) doable in Everything.

I would write a script for that (a couple of lines of code).
Once they are identified, what should happen to those files? Create a list? Rename them to something like 'NeedOCR_original filename.pdf'? Move them to a different folder? Something else?

If you answer that question (and also tell me whether the PDFs are all in one folder or also in subfolders of that folder), I will write a script for you that automates all the steps.

Re: Locate PDFs That Can't Be Searched

Posted: Wed Jan 13, 2021 4:42 pm
by froggie
This thread might be a place to start:

https://stackoverflow.com/questions/7740883/how-to-identify-pdf-files-that-need-ocr

Re: Locate PDFs That Can't Be Searched

Posted: Wed Jan 13, 2021 4:43 pm
by NotNull
The script is written (it moves PDFs that need OCR to a different folder). Waiting for your input ...

Re: Locate PDFs That Can't Be Searched

Posted: Wed Jan 13, 2021 4:44 pm
by NotNull
froggie wrote:
Wed Jan 13, 2021 4:42 pm
This thread might be a place to start:

https://stackoverflow.com/questions/7740883/how-to-identify-pdf-files-that-need-ocr
That is what I used! :D (the mentioned tool; not that thread ...)

Re: Locate PDFs That Can't Be Searched

Posted: Wed Jan 13, 2021 5:12 pm
by NotNull
The idea is to use a command-line utility that (temporarily) converts each PDF to text. If the resulting text file is smaller than 25 bytes, the PDF is considered a 'graphic' PDF that needs OCR to become searchable.
Those files are moved to a new subfolder, NeedOCR.
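
For readers who prefer to see the test itself spelled out, here is a minimal Python sketch of the same idea. It is not part of the batch script: the function names are made up for illustration, it assumes the Xpdf build of pdftotext is on the PATH, and it relies on pdftotext sending the text to stdout when the output name is "-". Unlike the batch version, it ignores whitespace and form feeds when counting, which is an extra assumption.

```python
import subprocess

THRESHOLD_BYTES = 25  # same cut-off as the batch script


def extract_text(pdf_path: str) -> bytes:
    """Run pdftotext and return whatever text it extracts.

    Assumes the Xpdf pdftotext binary is on PATH.
    """
    # "-" as the output name sends the text to stdout instead of a file.
    result = subprocess.run(
        ["pdftotext", "-simple", pdf_path, "-"],
        capture_output=True,
        check=False,
    )
    return result.stdout


def needs_ocr(text: bytes, threshold: int = THRESHOLD_BYTES) -> bool:
    """Return True when the extracted text is too small to be a real text layer."""
    # Assumption: whitespace and form feeds carry no searchable
    # content, so they are dropped before counting.
    visible = b"".join(text.split())
    return len(visible) < threshold
```

A call such as `needs_ocr(extract_text("scan.pdf"))` would then answer the question for a single file (the file name is only a placeholder).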


Here's how:
Only run this if all PDFs are in one folder and not in subfolders of that folder.

  • Download the Xpdf command line tools (the archive name is similar to xpdf-tools-win-4.02.zip)
  • Extract pdftotext.exe from the bin64 folder
  • Put it in the folder where all your PDFs are
  • Save the following code as Move_NeedOCR.cmd in that same folder
  • Double-click Move_NeedOCR.cmd in File Explorer
  • Wait (or check how the NeedOCR folder gets filled if you are bored)
  • Done


All 'graphic' PDFs are now in the newly created NeedOCR folder.


Move_NeedOCR.cmd

@echo off
setlocal
rem echo on
pushd "%~dp0"
cls
::____________________________________________________________
::
::				SETTINGS
::____________________________________________________________
::

	set OUT-FOLDER=.\NeedOCR


::____________________________________________________________
::
::				ACTION!
::____________________________________________________________
::

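	rem Create the destination folder on first run.
	if not exist "%OUT-FOLDER%" md "%OUT-FOLDER%"

	for %%X in (*.pdf) do (
		echo.    [%%X]
		rem Extract the text layer to a temporary file.
		pdftotext.exe -simple "%%X" .\checkthis.txt
		rem Less than 25 bytes of text: treat as a scanned PDF and move it.
		for %%C in (checkthis.txt) DO if %%~zC LSS 25 ( move "%%X" "%OUT-FOLDER%" )
		rem Remove the temporary file so it cannot leak into the next check.
		if exist checkthis.txt del checkthis.txt
	)
popd
pause
goto :EOF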
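
If the PDFs are spread over subfolders, which the script above does not handle, the same loop could be sketched in Python along these lines. This is an illustration under assumptions, not a tested replacement: `move_unsearchable` and `pdftotext_extract` are hypothetical names, the extractor is passed in as a parameter so it can be swapped, and files whose names clash in NeedOCR are simply left where they are.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, List


def pdftotext_extract(pdf: Path) -> bytes:
    """Hypothetical extractor: assumes Xpdf's pdftotext is on PATH."""
    done = subprocess.run(
        ["pdftotext", "-simple", str(pdf), "-"],
        capture_output=True,
        check=False,
    )
    return done.stdout


def move_unsearchable(
    root: Path,
    extract: Callable[[Path], bytes] = pdftotext_extract,
    out_name: str = "NeedOCR",
    threshold: int = 25,
) -> List[Path]:
    """Move PDFs under root whose extracted text is below threshold bytes."""
    out_dir = root / out_name
    moved: List[Path] = []
    # Collect the candidates first so files moved during the loop are
    # not visited again, and skip anything already inside NeedOCR.
    candidates = [p for p in root.rglob("*.pdf") if out_dir not in p.parents]
    for pdf in sorted(candidates):
        if len(extract(pdf)) < threshold:
            out_dir.mkdir(exist_ok=True)
            target = out_dir / pdf.name
            if target.exists():
                continue  # leave name clashes in place rather than overwrite
            shutil.move(str(pdf), str(target))
            moved.append(target)
    return moved
```

Calling `move_unsearchable(Path("."))` from the top-level folder would move every PDF with fewer than 25 bytes of extracted text into a single NeedOCR folder at that level.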

Re: Locate PDFs That Can't Be Searched

Posted: Wed Jan 13, 2021 8:48 pm
by MikeA01730
NotNull,

Thanks! I appreciate your making the effort to create that script.

I just gave it a try and it works well. The only issue is that pdftotext.exe sometimes appears to get confused when reading a PDF and issues error messages like these:
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't read xref table
Syntax Warning: PDF file is damaged - attempting to reconstruct xref table...
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

Everything that causes an error is pre-2007, and at least one file is corrupted. Happily, it appears (I haven't fully checked this) that, because of the way your test works, the files that got errors were moved to the NeedOCR folder, which is the right thing for me. I've captured a log of the command file run so I can check them individually.
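
In case it helps anyone else sort damaged files from merely unscanned ones, the result of each pdftotext run could be classified roughly like this. It is a sketch in Python with made-up names and labels, and it assumes that pdftotext reports a non-zero exit status when it cannot process a file and writes its "Syntax Warning" / "Syntax Error" lines to stderr.

```python
from typing import List, NamedTuple


class Verdict(NamedTuple):
    label: str            # "unreadable", "needs_ocr" or "searchable"
    messages: List[str]   # diagnostic lines collected from stderr


def classify(returncode: int, stdout: bytes, stderr: bytes,
             threshold: int = 25) -> Verdict:
    """Label one pdftotext run from its exit status and output.

    The three labels are hypothetical; pick whatever suits your log.
    """
    # Keep the non-empty diagnostic lines so they can be logged per file.
    messages = [
        line for line in stderr.decode("utf-8", "replace").splitlines()
        if line.strip()
    ]
    if returncode != 0:
        # The tool gave up on the file: likely damaged or not a PDF.
        return Verdict("unreadable", messages)
    if len(stdout) < threshold:
        # Same size test as the batch script.
        return Verdict("needs_ocr", messages)
    return Verdict("searchable", messages)
```

Feeding it the `returncode`, `stdout` and `stderr` of a `subprocess.run(..., capture_output=True)` call would give one verdict per file, with the warnings kept alongside for later review.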

Also thanks for pointing me to the Xpdf tools. They could be useful for a lot of things.

Regards,
Mike

Re: Locate PDFs That Can't Be Searched

Posted: Wed Jan 13, 2021 9:11 pm
by NotNull
Thanks for the feedback, Mike.

So, my "job" is done here? No changes to the code needed?

Re: Locate PDFs That Can't Be Searched

Posted: Wed Jan 13, 2021 9:33 pm
by MikeA01730
NotNull,

Job accomplished! I got what I need, and I'm ready to figure out the best procedure for going through the unsearchable PDFs and making them searchable. I have the tools I need, so it's just a matter of figuring it out and freeing up the time.

Thanks again.

Mike