Tutorials · 5 min read

How to OCR a PDF (Make Scanned Documents Searchable)

Scanned documents and photographed pages are images — they look like text but aren't searchable, selectable, or copyable. OCR (Optical Character Recognition) bridges the gap, converting those images into machine-readable text that you can search, edit, and feed to AI assistants.

Two kinds of OCR

True image-based OCR uses a recognition engine such as Tesseract to identify characters in pixel data. It is slower and more compute-intensive, but it works on pure image scans. Text-layer extraction reads the hidden text layer that many modern scanners and PDF tools embed alongside the visual scan, usually because they already ran OCR when the file was created. It is fast and as accurate as the embedded layer, but it only works when that layer is present.

What PDFPuddle does

PDFPuddle uses PDF.js to extract text layers from PDFs that have them. Many modern scanned PDFs (typically those from a recent multifunction printer, a modern phone scanner app, or a post-2018 PDF tool) include a text layer. PDFPuddle pulls that layer out as plain text in seconds.
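To make text-layer extraction concrete, here is a minimal sketch in Python. It is not PDFPuddle's code: PDFPuddle runs PDF.js in the browser, while this sketch swaps in the third-party pypdf library as a stand-in. The helper names `join_text_layer` and `extract_pdf_text` are hypothetical and exist only for illustration.

```python
from typing import Iterable, Protocol


class PageLike(Protocol):
    """Anything with an extract_text() method, e.g. a pypdf page object."""

    def extract_text(self) -> str: ...


def join_text_layer(pages: Iterable[PageLike]) -> str:
    """Concatenate the text layer of each page into one plain-text string."""
    chunks = []
    for page in pages:
        # Image-only pages have no text layer and yield an empty string.
        text = (page.extract_text() or "").strip()
        if text:
            chunks.append(text)
    # Separate pages with a blank line so page boundaries stay visible.
    return "\n\n".join(chunks)


def extract_pdf_text(path: str) -> str:
    """Read a PDF from disk and return its embedded text layer, if any."""
    # Imported lazily so join_text_layer works without pypdf installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    return join_text_layer(PdfReader(path).pages)
```

Calling `extract_pdf_text("scan.pdf")` on a file with a text layer should return its text almost instantly, because nothing is being recognized: the characters are already stored in the file.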

When PDFPuddle returns nothing

If the result is empty or near-empty, the PDF is most likely a pure image scan with no text layer. For those, use a dedicated OCR tool: Google Docs (free and surprisingly accurate; upload the PDF to Google Drive and open it with Google Docs), Tesseract (open-source, runs locally), or a commercial OCR engine like ABBYY for production-grade accuracy.
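The "empty or near-empty" check can be automated. The sketch below is one possible heuristic, not a documented PDFPuddle feature: it averages the non-whitespace characters per page and flags the file for image-based OCR when the average falls under a threshold. The function name and the default of 20 characters per page are assumptions chosen for illustration, so tune the threshold for your own documents.

```python
from typing import Sequence


def needs_image_ocr(page_texts: Sequence[str], min_chars_per_page: int = 20) -> bool:
    """Return True when the extracted text looks empty or near-empty.

    page_texts holds the text extracted from each page, one string per page.
    min_chars_per_page is an assumed cutoff, not a standard value.
    """
    if not page_texts:
        # No pages or no extraction result at all: treat as image-only.
        return True
    # Count only visible characters so stray spaces and newlines don't count.
    visible = sum(len("".join(text.split())) for text in page_texts)
    return visible / len(page_texts) < min_chars_per_page


# A scan with no text layer versus a page with real embedded text.
print(needs_image_ocr(["", "  \n"]))
print(needs_image_ocr(["This page has a real text layer with plenty of words."]))
```

Averaging across pages rather than checking the total keeps a long scan with one stray page number from being mistaken for a searchable document.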

OCR best practices

Higher source resolution means better OCR: aim for 300 DPI scans. Strong contrast helps too: clean white backgrounds with dark text are recognized more reliably than yellowed or speckled paper. For multi-language documents, make sure your OCR engine has the right language packs installed.
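The 300 DPI target translates directly into pixel dimensions: pixels equal inches times dots per inch. The short sketch below works through that arithmetic and also runs it in reverse, which is handy for checking whether an existing scan meets the target. Both function names are hypothetical.

```python
from typing import Tuple


def pixels_for_dpi(width_in: float, height_in: float, dpi: int = 300) -> Tuple[int, int]:
    """Pixel dimensions needed to scan a page of the given size at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixel_width: int, page_width_in: float) -> float:
    """Resolution of an existing scan, from its pixel width and paper width."""
    return pixel_width / page_width_in


# US Letter is 8.5 x 11 inches.
print(pixels_for_dpi(8.5, 11))   # (2550, 3300)
print(effective_dpi(1275, 8.5))  # 150.0
```

A Letter-size scan that is only 1275 pixels wide works out to 150 DPI, half the recommended resolution, so rescanning at a higher setting is likely to improve recognition.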
