Locate PDFs That Can't Be Searched

If you are experiencing problems with "Everything", post here for assistance.
MikeA01730
Posts: 11
Joined: Mon Aug 17, 2020 7:42 pm

Locate PDFs That Can't Be Searched

Post by MikeA01730 »

Hi,

I have a folder that contains several hundred PDFs. Most of them can be searched for text, but a few were scanned before home scanners provided OCR, so those PDFs are not searchable.

I'd like to identify the unsearchable PDFs. I tried searching for !content:"e" but canceled the search after about 4 hours. Any ideas on how to find these files?

Thanks,
Mike
NotNull
Posts: 5167
Joined: Wed May 24, 2017 9:22 pm

Re: Locate PDFs That Can't Be Searched

Post by NotNull »

Not (easily) doable in Everything.

I would write a script for that (a couple of lines of code).
Once they are identified, what should happen to those files? Create a list? Rename them like 'NeedOCR_original filename.pdf'? Move them to a different folder? Or ...?

If you answer that question (and also tell me whether the PDFs are all in one folder or also in subfolders of that folder), I will write a script for you that automates all the steps.
froggie
Posts: 297
Joined: Wed Jun 12, 2013 10:43 pm

Re: Locate PDFs That Can't Be Searched

Post by froggie »

This thread might be a place to start:

https://stackoverflow.com/questions/7740883/how-to-identify-pdf-files-that-need-ocr
NotNull
Posts: 5167
Joined: Wed May 24, 2017 9:22 pm

Re: Locate PDFs That Can't Be Searched

Post by NotNull »

The script is written (it moves PDFs that need OCR to a different folder). Waiting for your input ...
NotNull
Posts: 5167
Joined: Wed May 24, 2017 9:22 pm

Re: Locate PDFs That Can't Be Searched

Post by NotNull »

froggie wrote: Wed Jan 13, 2021 4:42 pm
This thread might be a place to start:

https://stackoverflow.com/questions/7740883/how-to-identify-pdf-files-that-need-ocr

That is what I used! :D (the mentioned tool; not that thread ...)
NotNull
Posts: 5167
Joined: Wed May 24, 2017 9:22 pm

Re: Locate PDFs That Can't Be Searched

Post by NotNull »

The idea is to use a command-line utility that (temporarily) converts each PDF to text. If the resulting text file is smaller than 25 bytes, the PDF is considered a 'graphic' PDF that needs OCR to become searchable.
Those files are moved to a new subfolder named NeedOCR.
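
For readers who would rather not use a batch file, the check described above can be sketched in Python. This is only an illustration under assumptions that are not in the original post: `pdftotext` is assumed to be on the PATH, the function names (`extract_text`, `needs_ocr`, `find_unsearchable`) are made up for this sketch, and whitespace is stripped before measuring. The stripping is a deliberate deviation: as far as I know, pdftotext writes a form feed after every page, so a long scanned document could otherwise exceed 25 bytes without containing any real text.

```python
import subprocess
from pathlib import Path
from typing import Callable, List

THRESHOLD_BYTES = 25  # same cut-off as described in the post


def extract_text(pdf: Path) -> bytes:
    """Return the text pdftotext extracts from one PDF.

    Assumption: the Xpdf pdftotext binary is on PATH. "-" as the
    output name sends the text to stdout instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", "-simple", str(pdf), "-"],
        capture_output=True,
        check=False,  # damaged PDFs simply yield little or no text
    )
    return result.stdout


def needs_ocr(text: bytes, threshold: int = THRESHOLD_BYTES) -> bool:
    """True when the extracted text is too small to be a real text layer."""
    # Strip whitespace (including per-page form feeds) before measuring.
    return len(text.strip()) < threshold


def find_unsearchable(
    folder: Path, extract: Callable[[Path], bytes] = extract_text
) -> List[Path]:
    """List the PDFs directly inside `folder` that look like pure scans."""
    return sorted(p for p in folder.glob("*.pdf") if needs_ocr(extract(p)))
```

Calling `find_unsearchable(Path("."))` in the PDF folder would return the candidates without moving anything, which makes it easy to review the list first.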


Here's how:
Only run this if all PDFs are in one folder and not in subfolders of that folder.

  • Download the Xpdf command line tools from the Xpdf website (the archive name is similar to xpdf-tools-win-4.02.zip)
  • Extract pdftotext.exe from the bin64 folder
  • Put it in the folder where all your PDFs are
  • Save the following code as Move_NeedOCR.cmd in that same folder
  • Double-click Move_NeedOCR.cmd in File Explorer
  • Wait (or watch the NeedOCR folder fill up if you are bored)
  • Done


All 'graphic' PDFs are now in the newly created NeedOCR folder.


Move_NeedOCR.cmd

@echo off
setlocal
rem Work from the folder that contains this script (and the PDFs).
pushd "%~dp0"
cls
::____________________________________________________________
::
::				SETTINGS
::____________________________________________________________
::

	rem Folder that receives the PDFs without a usable text layer.
	set "OUT-FOLDER=.\NeedOCR"


::____________________________________________________________
::
::				ACTION!
::____________________________________________________________
::

	if not exist "%OUT-FOLDER%" md "%OUT-FOLDER%"

	for %%X in (*.pdf) do (
		echo.    [%%X]
		rem Extract the text of the current PDF to a temporary file.
		pdftotext.exe -simple "%%X" .\checkthis.txt
		rem Less than 25 bytes of text: treat it as a scan and move it.
		for %%C in (checkthis.txt) DO if %%~zC LSS 25 ( move "%%X" "%OUT-FOLDER%" )
		del checkthis.txt
	)
popd
pause
goto :EOF
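
The script above only looks at PDFs directly inside one folder. For a folder tree, a rough Python sketch of the same move step could look like the following. Everything here beyond the 25-byte rule and the NeedOCR folder name is an assumption of this sketch: `pdftotext` is assumed to be on the PATH, the names `text_size` and `move_need_ocr` are hypothetical, whitespace is stripped before measuring, and files from subfolders are renamed with their relative path joined by underscores to avoid name clashes.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, List


def text_size(pdf: Path) -> int:
    """Size in bytes of the whitespace-stripped text pdftotext extracts.

    Assumption: the Xpdf pdftotext binary is on PATH; "-" writes to stdout.
    """
    out = subprocess.run(
        ["pdftotext", "-simple", str(pdf), "-"],
        capture_output=True,
        check=False,
    ).stdout
    return len(out.strip())


def move_need_ocr(
    root: Path,
    size_of: Callable[[Path], int] = text_size,
    threshold: int = 25,
) -> List[Path]:
    """Move PDFs with (almost) no text from `root` and its subfolders."""
    out_folder = root / "NeedOCR"
    moved: List[Path] = []
    # sorted() materialises the listing first, so moved files are not revisited.
    for pdf in sorted(root.rglob("*.pdf")):
        if out_folder in pdf.parents:
            continue  # already sorted out during an earlier run
        if size_of(pdf) < threshold:
            out_folder.mkdir(exist_ok=True)
            # Hypothetical naming scheme: sub\b.pdf becomes sub_b.pdf.
            target = out_folder / "_".join(pdf.relative_to(root).parts)
            shutil.move(str(pdf), str(target))
            moved.append(target)
    return moved
```

Running `move_need_ocr(Path(r"C:\Scans"))` (the path is just an example) would return the new locations of the moved files, which can double as a log.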
MikeA01730
Posts: 11
Joined: Mon Aug 17, 2020 7:42 pm

Re: Locate PDFs That Can't Be Searched

Post by MikeA01730 »

NotNull,

Thanks! I appreciate your making the effort to create that script.

I just gave it a try and it works well. The only issue is that pdftotext.exe sometimes appears to get confused when reading a PDF and issues error messages like these:
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error: Couldn't read xref table
Syntax Warning: PDF file is damaged - attempting to reconstruct xref table...
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

Every file that causes an error is from before 2007, and at least one file is corrupted. Happily, it appears (I haven't fully checked this) that, because of the way your test works, the files that produced errors were moved to the NeedOCR folder, which is the right outcome for me. I've captured a log of the command file run so I can check them individually.
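
For anyone who wants damaged files reported separately from plain scans, here is a small Python sketch. It relies on pdftotext's exit status, which the Xpdf documentation describes as 0 for success and non-zero for failures such as a PDF that cannot be opened. Warnings like the "attempting to reconstruct xref table" line may still end with exit status 0, so the captured messages are returned as well. The names `classify` and `check_pdf`, the three category labels, and the whitespace stripping are assumptions of this sketch, and `pdftotext` is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path
from typing import Tuple


def classify(returncode: int, text: bytes, threshold: int = 25) -> str:
    """Sort one pdftotext result into 'error', 'needs_ocr' or 'searchable'."""
    if returncode != 0:
        return "error"  # pdftotext reported a failure for this file
    if len(text.strip()) < threshold:
        return "needs_ocr"  # opened fine, but (almost) no text inside
    return "searchable"


def check_pdf(pdf: Path) -> Tuple[str, str]:
    """Run pdftotext on one file; return (category, messages from stderr)."""
    result = subprocess.run(
        ["pdftotext", "-simple", str(pdf), "-"],  # "-" sends text to stdout
        capture_output=True,
        check=False,
    )
    messages = result.stderr.decode(errors="replace").strip()
    return classify(result.returncode, result.stdout), messages
```

A file that comes back as `searchable` but with non-empty messages is probably worth a manual look, since pdftotext recovered text from it despite the warnings.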

Also thanks for pointing me to the Xpdf tools. They could be useful for a lot of things.

Regards,
Mike
NotNull
Posts: 5167
Joined: Wed May 24, 2017 9:22 pm

Re: Locate PDFs That Can't Be Searched

Post by NotNull »

Thanks for the feedback, Mike.

So, my "job" is done here? No changes to the code needed?
MikeA01730
Posts: 11
Joined: Mon Aug 17, 2020 7:42 pm

Re: Locate PDFs That Can't Be Searched

Post by MikeA01730 »

NotNull,

Job accomplished! I got what I need, and I'm ready to work out the best procedure for going through the unsearchable PDFs and making them searchable. I have the tools I need, so it's just a matter of figuring it out and freeing up the time.

Thanks again.

Mike