Data Diaries

From Scanned PDFs to Editable Text: My Daily Workflow

From scanned PDFs to clean text: a daily workflow using OCR, LLMs, and manual review to turn raw documents into structured, AI-ready content.

Written by Amatullah Tyba

Published on July 27, 2025


Hi, I’m Shamima, and I’m part of the Senior Data Processing team at LexData Labs. My daily mission is to take scanned, non-editable PDFs (the kind you can’t even copy-paste from) and turn them into clean, structured, editable text.

It’s a mix of OCR automation, large language models, and careful human review. And while it may sound like a technical back-office task, it actually plays a crucial role in powering smarter AI systems and digital archives.

Here’s what a typical day looks like: not just the tech, but also the little things that help me stay focused and motivated.

9:00 AM OCR Time: Kicking Off with Raw Scans

We usually start the day with a new batch of scanned documents: forms, reports, and old printed records, often around 100 pages.

Here’s the breakdown:

📤 Upload the File: We upload each scanned PDF into our OCR processing script.

🧠 Use Google Cloud Vision OCR: The script sends the file to Google Cloud Vision API, which recognizes the characters and extracts the text.

📄 Receive Output: The API converts the extracted data and returns machine-readable text, usually as plain text or JSON.

The OCR engine does a decent job, but it’s far from perfect, especially with Arabic script, where the shape of a letter can vary widely based on context or print quality.
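For readers who want to try the "receive output" step, here is a minimal sketch of pulling per-page text out of the JSON. It assumes the output was saved as JSON with a top-level `responses` list whose items carry `fullTextAnnotation.text`, which is the shape Vision uses for document text detection on files. The function name and the sample data are made up for illustration, and this is not the actual script used at LexData Labs.

```python
import json


def extract_page_texts(ocr_json: str) -> list[str]:
    """Return the recognized text of each page, in page order.

    Assumes one entry per page under "responses". Pages with no
    detected text come back as empty strings so page numbers stay aligned.
    """
    data = json.loads(ocr_json)
    pages = []
    for response in data.get("responses", []):
        annotation = response.get("fullTextAnnotation", {})
        pages.append(annotation.get("text", ""))
    return pages


# Hypothetical two-page output, trimmed to the fields used above.
sample = json.dumps({
    "responses": [
        {"fullTextAnnotation": {"text": "Page one text\n"}},
        {"fullTextAnnotation": {"text": "Page two text\n"}},
    ]
})
page_texts = extract_page_texts(sample)
```

Keeping blank pages as empty strings rather than skipping them makes it easier to line the text up with the original scan during review.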

🎧 Morning soundtrack: I usually pop on a mellow Arabic indie playlist to get into focus mode. Something about having soft music in the background makes the repetitive work smoother for me.

11:30 AM Short Break: Hydrate & Refocus

By mid-morning, I take a quick 15-minute break. I use this time to grab a glass of water and sometimes snack on chips. This mini reset helps me catch a breather and recharge before heading into more complex parts of the day.

1:00 PM LLM Cleanup: Teaching Machines to Write Better Language

Once the raw OCR text is ready, I run it through a large language model (LLM) to clean and structure the content. This is the part I enjoy most, where raw text turns into something polished and readable.

The LLM helps by:

✍️ Reconstructing broken sentences

🧹 Fixing grammar, punctuation, and flow (without losing meaning)

📘 Making the content readable and ready for reuse

It’s like giving the machine a second chance to get it right, with a little human supervision on the side.
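To make the cleanup step concrete, here is a rough sketch of how OCR text could be prepared for an LLM. Everything in it is an assumption for illustration: the prompt wording, the 2,000-character chunk size, and the function names. It stops short of calling any model, since the post doesn't say which one is used.

```python
# Hypothetical instructions that mirror the cleanup goals listed above.
CLEANUP_INSTRUCTIONS = (
    "You are cleaning up OCR output. Reconstruct broken sentences and fix "
    "grammar, punctuation, and flow without changing the meaning. "
    "Do not add or remove information."
)


def split_into_chunks(text: str, max_chars: int = 2000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars where possible.

    A single paragraph longer than max_chars is kept whole rather than
    being cut mid-sentence.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_prompt(chunk: str) -> str:
    """Combine the fixed instructions with one chunk of OCR text."""
    return f"{CLEANUP_INSTRUCTIONS}\n\nOCR text:\n{chunk}"


prompts = [build_prompt(c) for c in split_into_chunks("First para.\n\nSecond para.")]
```

Splitting on paragraph boundaries keeps each request small while avoiding cuts in the middle of a sentence, which would only give the model more broken text to repair.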

🥘 Lunch break: Around 2:00 PM, I take a full hour off. Usually something light like rice and daal, and if the weather’s good, I’ll sit on the balcony for some fresh air. Even 15 minutes of sunshine makes a difference.

4:00 PM Manual Review: The Human Touch Before Delivery

The afternoon is when I need my full concentration. This is the final QA pass.

I carefully review the text against the original scanned file, making sure:

✅ Nothing important got dropped or misread

🐞 Common OCR issues like wrong characters, merged lines, or missed diacritics are caught

🧠 Footnotes, stamps, and headers (those tricky edge cases) are preserved properly

🥜 I keep a stash of roasted peanuts on my desk for this shift, easy to munch while scanning rows of Arabic text line by line.

Sometimes it surprises me how small visual quirks, like a faded stamp or a handwritten note, can confuse an otherwise advanced AI. That’s where my eyes matter most.
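A small script can help point a reviewer at the riskiest spots before the line-by-line read. The sketch below is one possible approach, not the team's actual tooling: it lists raw OCR lines that have no close match in the cleaned text, which may mean something was dropped. The function name and the 0.6 similarity threshold are assumptions, and the threshold would need tuning on real documents.

```python
import difflib


def find_possibly_dropped_lines(raw_text, cleaned_text, threshold=0.6):
    """Return raw OCR lines with no sufficiently similar line in the cleaned text."""
    cleaned_lines = [l.strip() for l in cleaned_text.splitlines() if l.strip()]
    flagged = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # get_close_matches returns an empty list when nothing reaches the cutoff.
        if not difflib.get_close_matches(line, cleaned_lines, n=1, cutoff=threshold):
            flagged.append(line)
    return flagged


raw = "Invoice number 123\nFootnote: see page 4\n"
cleaned = "Invoice number: 123\n"
flagged = find_possibly_dropped_lines(raw, cleaned)
```

Because an LLM often merges broken lines into full sentences, a line-level comparison like this will over-flag. It narrows down where to look; it doesn't replace checking against the scan.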

Balancing Tools with Judgment

OCR and LLMs are powerful, but they don’t understand what’s right; they just predict it. Judgment, logic, and cultural nuance still come from real people, like me.

Scripts like Arabic are especially complex for machines. Some classic errors I catch almost daily:

“ي” getting confused with “ب” depending on scan quality

Headings getting dumped into body text

Tables flattened into jumbled paragraphs

That’s why our work isn’t just about conversion; it’s about transformation. It means making sure the output isn’t just text, but trustworthy text.
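The first error in that list, one letter swapped for a look-alike, is the kind of thing a short script can surface for a reviewer. Here is a minimal sketch, assuming you have both the OCR text and a reference version of the same passage (for example, a line already corrected by hand). The function name is hypothetical, and the confusable set contains only the pair mentioned above.

```python
import difflib

YEH = "\u064a"  # ي
BEH = "\u0628"  # ب

# Only this pair comes from the article; extend the set for your own material.
CONFUSABLE_PAIRS = {frozenset((YEH, BEH))}


def find_confusable_swaps(ocr_text, reference_text):
    """Return (ocr_index, ocr_char, reference_char) for each swapped confusable letter."""
    swaps = []
    matcher = difflib.SequenceMatcher(None, ocr_text, reference_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only same-length replacements can be compared character by character.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for offset in range(i2 - i1):
            a = ocr_text[i1 + offset]
            b = reference_text[j1 + offset]
            if frozenset((a, b)) in CONFUSABLE_PAIRS:
                swaps.append((i1 + offset, a, b))
    return swaps


# "kitab" (book) with the final BEH misread as YEH, written with escapes
# so the example does not depend on right-to-left rendering.
ocr_word = "\u0643\u062a\u0627" + YEH
reference_word = "\u0643\u062a\u0627" + BEH
swaps = find_confusable_swaps(ocr_word, reference_word)
```

A check like this only says where the two versions disagree on a look-alike letter. Deciding which one is right still means going back to the scan.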

Looking Ahead: One Clean Page at a Time

Every scanned PDF we digitize is another step toward faster, smarter document handling for internal databases, language models, or government archives.

Automation is fast, but trust still takes a human eye.

