
Ends in 5 days
Paid on delivery
I have a collection of mixed-content PDFs: some pages are pure scans, the rest contain searchable text. Each file holds key tables I need converted into clean, structured data and then visualised as clear bar charts for quick comparison.

You will:
• Isolate every table across the PDFs, using OCR where the page is only an image.
• Clean and normalise the numbers so columns and units stay consistent.
• Produce bar charts that faithfully reflect the extracted figures (one chart per table unless otherwise noted).

Deliverables
1. A single, tidy dataset (CSV or Excel) with each table clearly identified.
2. High-resolution bar chart images (PNG or SVG) and the source file (Excel, Power BI, or Python notebook) so I can regenerate them later.
3. A short note outlining any assumptions, edge cases handled, and pages that required manual correction.

Acceptance Criteria
• Every numeric value found in the original tables appears in the dataset, spot-checked against the PDF for accuracy.
• Charts label axes, units, and categories clearly, with no truncated text.
• No data lost due to OCR errors; if a cell cannot be resolved it is flagged in the note.

Feel free to choose the tools you’re most comfortable with: Python (Camelot, Tabula-py, Pandas, Tesseract), R, or Excel macros are all fine as long as the final files meet the criteria above.
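
As an illustration of the "clean and normalise the numbers" step, here is a minimal Python sketch. The posting does not say which number formats occur in the PDFs, so the cases handled here (thousands separators, percent signs, accounting-style negatives) and the helper name `normalise_number` are assumptions. Cells that do not parse return `None` so they can be flagged in the note rather than guessed.

```python
import re
from typing import Optional


def normalise_number(cell: str) -> Optional[float]:
    """Convert a raw table cell to a float, or None if it cannot be resolved.

    Hypothetical helper: the handled formats are assumptions, not taken
    from the actual PDFs.
    """
    text = cell.strip()
    if not text:
        return None
    # Accounting style: "(1,234)" is read as -1234
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Drop thousands separators, inner spaces, and a trailing percent sign
    text = text.replace(",", "").replace(" ", "").rstrip("%")
    if not re.fullmatch(r"[-+]?\d+(\.\d+)?", text):
        return None  # unresolved: flag it in the note instead of guessing
    value = float(text)
    return -value if negative else value
```

A cell such as `1O4` (letter O misread by OCR) fails the pattern and comes back as `None`, which matches the requirement that unresolved cells are flagged.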
Project ID: 40384704
15 proposals
Open for bidding
Remote project
Active 7 hours ago
15 freelancers are bidding on average ₹5,824 INR for this job

Hello there, we are a team of full-stack developers and can complete this project quickly. Please send me the complete project details so we can start work. Thanks, Ashish Kumar.
₹7,000 INR in 7 days
4.4

Hi,

This is a well-defined task, but given the mix of scanned and text-based PDFs, the key factor here is ensuring accuracy, especially where OCR is required. I can implement a robust pipeline combining automated table extraction (Camelot/Tabula, Tesseract for OCR) with structured post-processing in Python (pandas), and generate clean, reproducible bar charts (Python or Excel/Power BI).

That said, “no data loss due to OCR” implies manual validation for low-confidence cells. To ensure quality, I approach this as a hybrid process (automation + targeted QA), not purely automated extraction.

Before confirming scope and pricing, I’d need to clarify:
• Number of PDFs and total pages
• Average tables per document
• Sample file (to assess OCR complexity and table structure consistency)

My bid assumes a limited and well-scoped dataset. If volume or OCR complexity increases, we can adjust accordingly.

Deliverables will include:
• Clean, normalized dataset (CSV/Excel)
• High-quality charts + fully reproducible source (notebook or BI file)
• Notes on assumptions, edge cases, and any manual corrections

Let me know if you can share a sample to proceed with an accurate estimate.

Best regards
₹12,000 INR in 7 days
4.2
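
The hybrid process described in the bid above (automation plus targeted QA on low-confidence cells) could look roughly like the following Python sketch. The `OcrCell` structure, the `split_for_review` name, and the threshold of 80 are all assumptions for illustration. Tesseract does report a per-word confidence (for example, the `conf` column in the output of `pytesseract.image_to_data`), but how that is aggregated per cell is left open here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrCell:
    """Hypothetical record for one OCR'd table cell."""
    page: int
    row: int
    col: int
    text: str
    confidence: float  # assumed 0-100 scale, higher is better


def split_for_review(
    cells: List[OcrCell], threshold: float = 80.0
) -> Tuple[List[OcrCell], List[OcrCell]]:
    """Separate cells that pass automatically from those needing a manual check."""
    accepted = [c for c in cells if c.confidence >= threshold]
    flagged = [c for c in cells if c.confidence < threshold]
    return accepted, flagged
```

The flagged list doubles as raw material for the note on pages that required manual correction.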

Hi, I’m Sudhir. I’ve handled similar PDF-to-Excel projects where table structures vary and accuracy is critical, so I understand the importance of getting every value and alignment exactly right. I’ll extract each table (including scanned pages, using OCR), then manually verify and structure the data into a clean, consistent dataset with no missing or misplaced entries. I’ll also create clear, well-labeled bar charts for each table so comparisons are easy and presentation-ready. To ensure quality, I cross-check totals and values against the original PDFs and flag any unclear cells in a short summary. I can process a sample file immediately so you can verify accuracy before we proceed. I’m ready to start and to deliver precise, reliable results quickly.
₹6,000 INR in 7 days
3.2

Hi, Mixed-content PDFs—scanned pages alongside searchable text—make table extraction tricky because you can't just parse text; you need to detect tables visually and apply OCR where the scan lives. I'll handle both: pytesseract for scanned table regions, PyPDF2 for extractable text-based tables, then consolidate into a unified dataset for charting. My approach: I'll build a Python backend that (1) detects and crops table boundaries using image processing, (2) applies OCR to scanned tables, (3) extracts text-based tables programmatically, (4) normalizes formats into clean DataFrames, then (5) pipes into Plotly for interactive charts on your site. This handles variable table layouts without requiring manual annotation. First step: I'll audit 2–3 sample PDFs from your collection to map table formats and confirm OCR accuracy before full implementation. What's the typical number of tables per PDF, and do you have a preference between static (PNG/SVG) or interactive (Plotly) charts? Best regards, Val
₹6,000 INR in 7 days
1.8
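
The bid above splits the work between OCR for scanned pages and programmatic extraction for text-based pages. One simple way to make that routing decision is to check how much text the PDF's text layer yields per page. The sketch below assumes the per-page text has already been extracted by some PDF library; the function names and the 20-character cutoff are hypothetical.

```python
from typing import Dict, List


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Treat a page as a scan when its text layer is empty or nearly empty."""
    return len(extracted_text.strip()) < min_chars


def route_pages(page_texts: List[str]) -> Dict[str, List[int]]:
    """Return 1-based page numbers grouped by extraction route."""
    routes: Dict[str, List[int]] = {"ocr": [], "text": []}
    for number, text in enumerate(page_texts, start=1):
        routes["ocr" if needs_ocr(text) else "text"].append(number)
    return routes
```

A character-count heuristic can misroute pages that carry only a header in the text layer, so sample files would be needed to tune or replace it.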

You need to extract tables from mixed-content PDFs, using OCR for scanned pages, and visualize them as bar charts for comparison.

Table Extraction: I will use Python libraries like Camelot and Tesseract to isolate tables from PDFs, ensuring accurate OCR for scanned pages.
Data Normalization: I will clean and normalize the data to maintain consistent columns and units across all tables.
Charting: Each table will be visualized as a high-resolution bar chart, with clear labels for axes, units, and categories.
Deliverables: A tidy dataset in CSV or Excel format, high-resolution chart images in PNG or SVG, and a Python notebook for regenerating charts.
Documentation: A short note will outline assumptions, edge cases, and any manual corrections needed.
Timeline: 1 day.

Could you confirm the total number of PDFs and tables to be processed?
₹3,360 INR in 1 day
1.8

Hi, I can extract, clean, and structure your PDF tables accurately and convert them into clear, well-labeled bar charts.

What I’ll deliver:
• Extraction of all tables (OCR for scanned pages)
• Cleaned and normalized dataset (CSV/Excel)
• Consistent columns, units, and formatting
• Bar charts for each table (PNG/SVG)
• Source files (Excel / Python notebook for regeneration)
• Notes covering assumptions, OCR fixes, and edge cases

Approach:
• Use OCR + table extraction tools to capture all data
• Manually verify and clean to ensure accuracy
• Generate clear charts with proper labels and units

Experience:
• Strong experience with PDF data extraction and OCR
• Worked with Python (Pandas, Tesseract, Tabula) and Excel
• Focused on accuracy, clean datasets, and clear visualization
₹4,000 INR in 1 day
1.2

Hi, I’m interested in your project and can deliver accurate, structured data extraction with clear visualizations. I have experience working with mixed PDFs (scanned + text-based) and use a reliable workflow combining OCR and table extraction tools to ensure no data is missed. I carefully clean and normalize all values so units, formats, and columns remain consistent across the final dataset.

For this project, I will:
• Extract all tables using OCR where required and validate against the original PDFs
• Clean and structure the data into a well-organized Excel/CSV file with clear identifiers
• Create high-quality bar charts with properly labeled axes, units, and categories
• Provide source files (Excel/Python) so charts can be easily regenerated
• Document any assumptions, edge cases, or OCR corrections clearly

I prioritize accuracy and will cross-check values to ensure they match the source. Any unclear or unresolvable data will be flagged transparently. I’m available to start immediately and can maintain consistent communication throughout the project.

Best regards, Mitali
₹3,000 INR in 1 day
0.9

Hello, I’d be a strong fit for this project: this is exactly the kind of PDF table extraction + OCR cleanup + data visualization workflow I specialize in. I understand the challenge: mixed PDFs often contain both searchable text and scanned pages, which means the job requires more than simple export tools. It needs accurate extraction, normalization, validation, and clear charting.

What I will deliver:

1. Clean Structured Dataset
• Consolidated CSV or Excel workbook
• Every table clearly labeled by file name, page number, and table identifier
• Standardized columns, units, and formatting

2. Accurate Visualizations
• High-resolution bar charts (PNG / SVG)
• One chart per table (or as instructed)
• Clear axis labels, units, legends, and readable categories

3. Editable Source Files
• Excel dashboard / workbook, or
• Python notebook (.ipynb) with reusable code

4. Documentation Note
• OCR/manual corrections made
• Assumptions used
• Unclear cells flagged transparently
• Edge cases handled

My approach:
• Use the best extraction method per page: Camelot / Tabula for text PDFs, OCR (Tesseract) for scanned pages
• Clean and reconcile outputs in Pandas/Excel
• Spot-check values against original PDFs
• Build charts only after data validation
₹1,500 INR in 7 days
0.0
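
The labeling scheme in the bid above (file name, page number, table identifier) maps naturally onto a long-format CSV with one row per value. The sketch below uses only the Python standard library; the column names, the `to_tidy_csv` helper, and the sample rows are made up for illustration and do not come from the client's PDFs.

```python
import csv
import io
from typing import Dict, List

# Assumed column layout: one row per extracted value
FIELDS = ["file_name", "page", "table_id", "category", "value", "unit"]


def to_tidy_csv(rows: List[Dict[str, object]]) -> str:
    """Serialise extracted values as CSV text, one row per value."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows, not real data
example_rows = [
    {"file_name": "report_a.pdf", "page": 3, "table_id": "report_a_p3_t1",
     "category": "North", "value": 120.5, "unit": "kg"},
    {"file_name": "report_a.pdf", "page": 3, "table_id": "report_a_p3_t1",
     "category": "South", "value": 98.0, "unit": "kg"},
]
print(to_tidy_csv(example_rows))
```

Keeping `table_id` on every row lets one chart per table be produced by grouping on that column.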

Dear Sir/Madam,

I am an experienced Python Developer with strong expertise in building scalable backend systems, APIs, automation tools, and full-stack applications. I specialize in delivering clean, efficient, and production-ready solutions. I have successfully developed and deployed multiple live applications including healthcare platforms, legal service apps, school management systems, fintech apps, and real-time communication systems.

My Core Python Expertise
✔ Django & Django REST Framework
✔ FastAPI (High-performance APIs)
✔ Flask
✔ SQLModel / SQLAlchemy
✔ PostgreSQL / MySQL / MongoDB
✔ Supabase Integration
✔ Authentication (JWT, OAuth)
✔ Payment Gateway Integration (PhonePe, Razorpay, Stripe)
✔ Web Scraping (BeautifulSoup, Selenium)
✔ Automation Scripts
✔ WebSocket & Real-time Systems
✔ Docker Deployment
✔ AWS / VPS Deployment
✔ REST API Design & Optimization

What I Can Build For You
• Secure REST APIs
• SaaS backend architecture
• Admin dashboards
• Real-time chat systems
• Payment systems
• Data processing systems
• Microservices architecture
• AI/ML API integration
• Custom business logic systems

Recent Project Experience
• Healthcare booking & wallet system
• Legal consultation backend platform
• School ERP & management API
• Fintech wallet & transaction management
• Real-time chat application (WebSocket + MQTT)
• Location-based services & geo APIs
₹8,000 INR in 6 days
0.0

Hi there,

Your PDFs span both scanned and searchable pages, and I'll handle the full pipeline: Tesseract OCR (with deskew/denoise preprocessing) on scanned pages, Camelot for native PDF tables and Tabula-py as fallback, normalisation in Pandas, and bar charts in Matplotlib exported as PNG + SVG.

Deliverables:
- Single consolidated CSV/Excel with each table labelled by source file and page number
- High-resolution PNG + SVG charts and the Python notebook so you can regenerate everything from scratch
- Methodology note covering toolchain, assumptions, edge cases, and any cells flagged as unresolvable

Timeline: 3 days from file receipt
Revisions: 1 round included

One question: are the tables primarily numeric, or do they contain mixed text and numeric values? That affects which OCR model configuration I'll use for best accuracy.

₹1,500 fixed. Happy to process a sample table from one page upfront so you can verify the output format before I run the full batch.
₹1,500 INR in 3 days
0.0
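
The charting step named in the bid above (Matplotlib bar charts exported as PNG and SVG) might be sketched as follows. This assumes Matplotlib is installed; the `save_bar_chart` helper, the figure size, the 300 dpi setting, and the generic "Category" axis label are illustrative choices, not the bidder's actual code.

```python
from pathlib import Path
from typing import List, Sequence

import matplotlib

matplotlib.use("Agg")  # headless backend, so no display is required
import matplotlib.pyplot as plt


def save_bar_chart(
    categories: Sequence[str],
    values: Sequence[float],
    unit: str,
    title: str,
    out_dir: str,
    stem: str,
) -> List[Path]:
    """Draw one bar chart and save it as both PNG and SVG."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.bar(categories, values)
    ax.set_xlabel("Category")
    ax.set_ylabel(f"Value ({unit})")  # unit shown on the axis, per the brief
    ax.set_title(title)
    ax.tick_params(axis="x", labelrotation=45)  # long labels stay readable
    fig.tight_layout()  # reduces the risk of truncated labels
    paths = []
    for ext in ("png", "svg"):
        path = out / f"{stem}.{ext}"
        fig.savefig(path, dpi=300)
        paths.append(path)
    plt.close(fig)
    return paths
```

Calling the function once per `table_id` would give one chart per table, which is the default the posting asks for.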

⭐PDF TABLE EXTRACTION & DATA VISUALIZATION PIPELINE⭐

Hey,

I’ve reviewed your requirements. You need extraction of mixed PDFs (scanned + text-based), conversion of all tables into clean structured data, and generation of accurate bar charts for each dataset. I have strong experience in OCR-based extraction and data visualization pipelines.

✅ How I will help:
• Extract tables from PDFs using Camelot/Tabula for text-based pages
• Apply OCR (Tesseract) for scanned/image-based tables
• Clean & normalize numeric data (units, formats, missing values)
• Generate bar charts for each table with proper labels
• Ensure data validation against original PDFs

✅ DELIVERABLES:
✔️ Clean CSV/Excel dataset (fully structured tables)
✔️ High-resolution charts (PNG/SVG)
✔️ Reusable Python notebook/script
✔️ Notes on assumptions, corrections & edge cases

✅ TOOLS & APPROACH:
✔️ Python (Pandas, Camelot, Tabula-py)
✔️ OCR (Tesseract) for scanned pages
✔️ Matplotlib/Seaborn for charts
✔️ Excel export automation (openpyxl)

Fixed Price: ₹6,000
Portfolio: https://www.freelancer.pk/u/usmansharif362

Quick Questions:
• Are PDFs consistent in structure or mixed formats?
• Do you want charts embedded in Excel or separate files only?

The goal is accurate data extraction with clean visualization, ready for analysis and reuse.

Regards, Usman Sharif
₹6,000 INR in 7 days
0.0

Delhi, India
Member since Apr 20, 2026