How to OCR a PDF (Make Scanned Documents Searchable)
Scanned documents and photographed pages are images: they look like text, but you can't search, select, or copy them. OCR (Optical Character Recognition) bridges the gap by converting those images into machine-readable text that you can search, edit, and feed to AI assistants.
Two kinds of OCR
True image-based OCR uses a recognition engine such as Tesseract to identify characters in pixel data. It is slow and compute-intensive, but it works on pure image scans. Text-layer extraction reads a hidden text layer that many modern scanners and PDF tools embed alongside the visual scan, usually by running OCR at scan time. It is fast, and it is as accurate as the embedded layer itself.
What PDFPuddle does
PDFPuddle uses PDF.js to extract text layers from PDFs that have them. Most modern scanned PDFs (typically those from a recent multifunction printer, a modern phone scanner app, or a post-2018 PDF tool) include a text layer. PDFPuddle pulls that layer out as plain text in seconds.
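To make text-layer extraction concrete, here is a minimal sketch in Python. It uses the pypdf library as a stand-in for PDF.js, so it is not PDFPuddle's own code, and the function name extract_text_layer is made up for this example. It assumes pypdf is installed.

```python
def extract_text_layer(path):
    """Return the embedded text of each page, one string per page.

    Pages without a text layer come back as empty strings.
    """
    # Imported lazily so this file still loads if pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() only reads text already stored in the PDF;
    # it performs no image recognition.
    return [(page.extract_text() or "").strip() for page in reader.pages]
```

Calling extract_text_layer("scan.pdf") on a file with a text layer should return the page text almost immediately, because no recognition is involved. On a pure image scan it returns a list of empty strings.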
When PDFPuddle returns nothing
If the result is empty or near-empty, the PDF is most likely a pure image scan with no text layer. For those, use a dedicated OCR tool: Google Docs (free, surprisingly accurate), Tesseract (open-source, runs locally), or a commercial OCR engine like ABBYY for production-grade accuracy.
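The "empty or near-empty" check and the Tesseract fallback can be sketched roughly as follows. The names needs_ocr and ocr_pdf and the 20-character threshold are assumptions chosen for illustration, not values PDFPuddle uses. The fallback assumes the pdf2image and pytesseract packages are installed, along with the Poppler and Tesseract programs they wrap.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is an image-only scan.

    page_texts: one extracted string per page.
    min_chars_per_page: hypothetical threshold; tune it for your documents.
    """
    if not page_texts:
        return True
    # Count non-whitespace characters so blank padding does not count as text.
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


def ocr_pdf(path, lang="eng", dpi=300):
    """Rasterize each page and run Tesseract on it, one string per page."""
    # Imported lazily: both are third-party packages.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(path, dpi=dpi)  # list of PIL images
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]
```

The threshold is an average per page, so a long document with one blank page still counts as having a text layer. A stricter check could look at each page separately.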
OCR best practices
Higher source resolution generally means better OCR, so aim for 300 DPI scans. Strong contrast helps too: clean white backgrounds with dark text are recognized more reliably than yellowed or speckled paper. For multi-language documents, make sure your OCR engine has the right language packs loaded.
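These checks can be expressed as a few small helpers. This is a sketch under stated assumptions: the function names are invented for this example, the contrast step assumes the Pillow library is installed, and grayscale plus autocontrast is only one common preprocessing choice that may not help every scan. The "eng+deu" form is the syntax Tesseract accepts for combining installed language packs.

```python
def effective_dpi(pixel_width, page_width_inches):
    """Resolution of a scan, given its pixel width and the paper width."""
    return pixel_width / page_width_inches


def tesseract_languages(codes):
    """Join Tesseract language codes, e.g. ["eng", "deu"] -> "eng+deu"."""
    return "+".join(codes)


def prepare_for_ocr(image):
    """Convert a PIL image to grayscale and stretch its contrast."""
    # Imported lazily: Pillow is a third-party package.
    from PIL import ImageOps

    gray = ImageOps.grayscale(image)
    # cutoff=1 ignores the darkest and lightest 1% of pixels (speckles).
    return ImageOps.autocontrast(gray, cutoff=1)
```

For example, a US Letter page (8.5 inches wide) scanned to 2550 pixels across works out to 300 DPI, while the same page at 1275 pixels is only 150 DPI and is worth rescanning.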