
OCR API – Create Searchable PDFs from Images and Scans

The MaraDocs OCR API creates searchable PDFs with an invisible text overlay. Your document stays intact, and its text becomes selectable and searchable instead of being returned only as extracted text.

Martin Kurtz
API, OCR, PDF, Text Recognition, Developer

Scanned documents and photos contain text that isn't selectable or searchable. Many OCR APIs return only extracted text, not your original document with an invisible text layer. You want the same PDF – same layout and appearance – but with selectable, searchable content. That's what a proper OCR API for searchable PDFs should deliver.

Why Building a Searchable PDF OCR Solution Yourself Takes Weeks

If you try to build this yourself, you'll quickly find that Tesseract, EasyOCR, and cloud OCR services typically give you plain text and bounding boxes. To turn that into a searchable PDF, you must place the recognized text invisibly on top of the original image or PDF page. That means reconciling coordinate systems (image pixels versus PDF points), fonts, text encoding, and PDF structure. Different languages, fonts, and layouts add further complexity. A robust OCR API that keeps your document intact takes significant engineering.
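To give a sense of just one of those pieces, here is a minimal sketch of the coordinate step, assuming the OCR engine reports word boxes in image pixels with the origin at the top-left corner, and that the PDF page uses the default user space of 72 points per inch with the origin at the bottom-left. The names PixelBox, PdfRect, and pixel_box_to_pdf_rect are hypothetical and not part of any MaraDocs or OCR library API.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # default PDF user-space unit


@dataclass(frozen=True)
class PixelBox:
    """OCR word box in image pixels, origin at the top-left corner (assumed)."""
    left: float
    top: float
    width: float
    height: float


@dataclass(frozen=True)
class PdfRect:
    """Rectangle in PDF points, origin at the bottom-left corner."""
    x: float
    y: float
    width: float
    height: float


def pixel_box_to_pdf_rect(box: PixelBox, image_height_px: float, dpi: float) -> PdfRect:
    """Scale a pixel box to points and flip the vertical axis."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    scale = POINTS_PER_INCH / dpi
    # The PDF y coordinate is measured from the bottom of the page,
    # so it is derived from the distance between the box bottom and the image bottom.
    bottom_from_page_bottom = image_height_px - (box.top + box.height)
    return PdfRect(
        x=box.left * scale,
        y=bottom_from_page_bottom * scale,
        width=box.width * scale,
        height=box.height * scale,
    )
```

For example, a box whose left edge sits 300 px into a 300 dpi scan maps to x = 72 pt, one inch from the left page edge. A real overlay still has to pick a font size that fits each rectangle and write the text in an invisible rendering mode (mode 3 in the PDF specification), which is where much of the remaining effort goes.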

How the MaraDocs OCR API Solves This in Minutes

The MaraDocs API performs OCR and outputs a PDF with the text invisibly overlaid. You get your original document – layout, images, appearance – with selectable and searchable text. Not a separate text file. Not a stripped-down version. The same document, enhanced.

OCR Workflow: Validate, OCR, Optimize

For images: validate, then img.ocrToPdf. For PDFs: validate, then pdf.ocrToPdf (optionally after pdf.orientation to fix rotated pages first). The high-level flow.ocrImg and flow.ocrPdf combine orientation, OCR, and optimization in one call. Output is always a PDF handle – the same document with an invisible text layer – that you can download or pass to composition, compression, or email workflows. The pipeline stays server-side; no re-upload between steps.
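The low-level flow described above can be sketched as a small planning helper. This is only an illustration: the operation names are taken from this article, while the function plan_ocr_steps and the list of image suffixes are assumptions, not a statement of which formats the API accepts.

```python
from pathlib import Path

# Assumption: an illustrative set of image suffixes, not the API's official list.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def plan_ocr_steps(filename: str, fix_orientation: bool = False) -> list[str]:
    """Return the ordered operations for turning one file into a searchable PDF."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        steps = ["data.upload", "pdf.validate"]
        if fix_orientation:
            # Optional: straighten rotated pages before running OCR.
            steps.append("pdf.orientation")
        steps.append("pdf.ocrToPdf")
    elif suffix in IMAGE_SUFFIXES:
        steps = ["data.upload", "img.validate", "img.ocrToPdf"]
    else:
        raise ValueError(f"unsupported file type: {suffix or filename}")
    # Every branch ends with a PDF handle that can be downloaded or passed on.
    steps.append("data.downloadPdf")
    return steps
```

Calling plan_ocr_steps("scan.pdf", fix_orientation=True) yields the upload, validate, orientation, OCR, and download steps in that order. If you do not need control over the individual steps, flow.ocrImg or flow.ocrPdf replaces the whole list with one call.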

Get your API key in under a minute

Register for a free account and get your API key in under a minute. New accounts come with some developer credits.

Try MaraDocs API now →

Why MaraDocs is Different: Workspaces, Webview, and German Data Privacy

Most document APIs force you to upload, process, download, then re-upload for the next step. With MaraDocs, OCR runs in your workspace. Chain with document extraction, composition, or compression – pass the PDF handle directly to the next operation. No re-upload, fewer round-trips.

When OCR results need manual correction – misread characters, complex layouts, or low-quality scans – open app.maradocs.io for manual review and editing. Your users get full manual control when automation hits an edge case.

All processing runs in Germany (Maramia GmbH), encrypted at rest and in transit. Workspaces expire after 7 days. No data leaves the EU. For GDPR-sensitive OCR workloads, this matters.

TypeScript Code for Creating Searchable PDFs with OCR

API reference: data/upload, img/validate, pdf/validate, img/ocr/to/pdf, pdf/ocr/pdf, data/download/pdf

import { MaraDocsClient } from "@maramia/maradocs-sdk-ts";
import { okImg } from "@maramia/maradocs-sdk-ts/models/img";
import { okPdf } from "@maramia/maradocs-sdk-ts/models/pdf";

// Assumption: workspace_secret, imageFile, and pdfFile are supplied by your application.
const client = new MaraDocsClient({ workspaceSecret: workspace_secret });

// High-level: upload, validate, full pipeline, download
const pdfHandle = await client.flow.ocrImg(imageFile);
const blob = await client.data.downloadPdf({ pdf_handle: pdfHandle });

// Low-level: image – upload, validate, OCR, download
const uploaded = await client.data.upload(imageFile);
const validated = await client.img.validate({ unvalidated_file_handle: uploaded.unvalidated_file_handle });
const imgHandle = okImg(validated);
const ocrPdf = await client.img.ocrToPdf({
  img_handle: imgHandle,
  options: { embed_in_blank_page: { size: { width: 210, height: 297 }, position: "center" } },
});
const blob2 = await client.data.downloadPdf({ pdf_handle: ocrPdf.pdf_handle });

// PDF: upload, validate, ocrToPdf, download
const pdfUploaded = await client.data.upload(pdfFile);
const pdfValidated = await client.pdf.validate({ unvalidated_file_handle: pdfUploaded.unvalidated_file_handle });
const pdfOcr = await client.pdf.ocrToPdf({ pdf_handle: okPdf(pdfValidated) });
const blob3 = await client.data.downloadPdf({ pdf_handle: pdfOcr.pdf_handle });

Python Code for OCR to Searchable PDF

API reference: data/upload, img/validate, img/ocr/to/pdf, pdf/ocr/pdf, data/download/pdf

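import os
import time

import requests

API_URL = "https://api.maradocs.io/v1"
# Assumption: the workspace secret comes from an environment variable; the variable name is illustrative.
WORKSPACE_SECRET = os.environ["MARADOCS_WORKSPACE_SECRET"]
headers = {"Authorization": f"Bearer {WORKSPACE_SECRET}"}

def poll(url, job_id):
    """Poll a job once per second until it reports "complete", then return its payload."""
    while True:
        r = requests.get(f"{url}/{job_id}", headers=headers)
        r.raise_for_status()  # surface HTTP errors instead of looping on them
        job = r.json()
        if job["status"] == "complete":
            return job["response"]["response"]
        time.sleep(1)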

# 1. Upload (the file tuple is elided here; pass your own file name and file object)
upload = requests.post(f"{API_URL}/data/upload", headers=headers, files={"file": (...)}).json()
# 2. Validate image
val = requests.post(f"{API_URL}/img/validate", headers=headers,
    json={"unvalidated_file_handle": upload["unvalidated_file_handle"]}).json()
img_handle = poll(f"{API_URL}/img/validate", val["job_id"])["img_handle"]

# 3. OCR to PDF
ocr = requests.post(f"{API_URL}/img/ocr/to/pdf", headers=headers, json={"img_handle": img_handle}).json()
ocr_data = poll(f"{API_URL}/img/ocr/to/pdf", ocr["job_id"])
pdf_handle = ocr_data["pdf_handle"]

# 4. Download
pdf_resp = requests.get(f"{API_URL}/data/download/pdf", headers=headers, params={"pdf_handle": pdf_handle})
pdf_resp.raise_for_status()  # avoid writing an error body to disk as if it were a PDF
with open("searchable.pdf", "wb") as out:
    out.write(pdf_resp.content)
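The polling loop in the snippet above waits indefinitely if a job never completes. A more defensive variant might bound the wait and back off between requests. The sketch below assumes the same response shape as above (a "status" field and a nested "response" payload); the helper name poll_until_complete and its parameters are hypothetical, and the status fetch is injected as a callable so the logic can be exercised without network access.

```python
import time
from typing import Any, Callable, Dict


def poll_until_complete(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout_s: float = 60.0,
    initial_delay_s: float = 0.5,
    max_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch_status until the job is complete or the time budget is spent."""
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        job = fetch_status()
        if job.get("status") == "complete":
            # Response nesting taken from the example above.
            return job["response"]["response"]
        if clock() + delay > deadline:
            raise TimeoutError(f"job not complete after {timeout_s} seconds")
        sleep(delay)
        # Exponential backoff, capped so long jobs are still checked regularly.
        delay = min(delay * 2, max_delay_s)
```

With requests, fetch_status could be a lambda that performs the GET for one job URL and returns the decoded JSON. How the API reports failed jobs is not covered in this article, so the sketch relies on the timeout alone rather than guessing at failure status values.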

Summary and Next Steps

An OCR API that creates searchable PDFs returns your own document with an invisible text overlay instead of a separate text file. MaraDocs keeps the original layout and adds selectable, searchable text. See Document Scanner, PDF Handling, and Image on Blank Page for more.


Try it: MaraDocs API | TypeScript SDK

