
Open
Posted
•
Ends in 6 days
Paid on delivery
Build a Config-Driven Excel Ingestion & DLT Pipeline (Databricks + Unity Catalog) Overview: ---------------- We need a freelancer to design and implement a config-driven ingestion framework in Databricks (Unity Catalog) that ingests 15 different payor Excel formats from Volumes, performs auditing, cleanup, validation, and standardization, and loads the data through DLT into raw -> partitioned -> clean Delta tables, then consolidates into a summary table. The solution must remove header graphics, handle sheet selection, enforce mandatory columns per payor, and produce alerts for data/format issues. Environment & Assumptions: ----------------------------------------------- Platform: Databricks, Unity Catalog enabled Storage: Volumes (UC) — files arrive at: volumes/care/<State>/<yyyy>/<mm>/<filename>.xlsx Example: volumes/care/TX/2025/08/[login to view URL] Languages: PySpark / Spark SQL / DLT (Delta Live Tables) File types: Excel (.xlsx, some have graphics/banners/first rows to discard) Permissions: You will receive a test workspace and sample files Dest Catalog/Schema: ABC CATALOG/xyz schema High-Level Stages: ------------------------------- -------------------------------- 1. Data Loading Module -------------------------------------- Discover new files under Volumes path (auto-loader style acceptable but not required) Record file path, file name, size, processed/placed datetime (from volume metadata) Maintain a file ingest log to prevent re-processing 2. Audit Module -------------------------- Log source file path, filename, processed date/time, file format, record counts, column counts, and sheet used Capture raw volume-mounted file metadata where possible 3. Clean-up Module -------------------------------- Column standardization: rename columns (case, underscores), handle special characters, trim/blank handling Graphics/header removal: configurable option to discard first N lines/rows per payor; support “detect header row” if needed Sheet selection: configurable preferred sheet or name pattern per payor Remove entirely blank columns and normalize nulls 4. Quality Module -------------------------------- Config-driven validation using a payor config table: Mandatory columns (must exist)** Expected columns (POPULATE NULL if not present)** Data type (should be converted as defined in config table)** Optional checks (date formats, allowed values, uniqueness keys) Validate every incoming file after clean-up; log passes/fails with reasons If mandatory columns missing -> raise an alert + quarantine that file’s data (no promotion downstream) 5. Extraction & Conversion Module -------------------------------------------------------- Extract/standardize mandatory attributes across payors (normalize names & types) Write to Delta (Parquet/CSV intermediate only if justified) Optional: record counts for audit 6. DLT Module ----------------------- Implement DLT pipeline(s) to publish raw → partitioned → clean layers Partitions by State, Payor, Date for performance Idempotent, restart-safe, with expectations/tests inside DLT where useful 7. Summary Module ---------------------------------- Consolidate all clean payor tables into one summary table with harmonized schema Optional: SCD2 (CDC) logic if a payor re-sends corrected data; otherwise daily append with load_date Must-Have Config & Control Tables: ----------------------- config_payor: ----------------------- payor (PK) skip_rows (graphics/header rows to discard) mandatory_columns (ARRAY<STRING>) expected_columns (ARRAY<STRING>) column_renames (MAP<STRING,STRING>) data_types (MAP<STRING,STRING>) date_formats (MAP<STRING,STRING>) primary_keys (ARRAY<STRING>) ----------------------- ingest_file_log: ----------------------- file_path, file_name, bytes, state, discovered_ts, placed_ts, processed_ts, status (NEW/PROCESSED/FAILED/QUARANTINED), payor, hash/signature ----------------------- validation_log: ----------------------- file_name, payor, check_name, severity (ERROR/WARN), details, ts table_record_counts (optional audit) table_name, record_count, as_of_ts, source_file_name -------------------------- Tables to Deliver -------------------------- adhoc_reporting.raw_payor_* (raw) adhoc_reporting.part_payor_* (partitioned/harmonized) adhoc_reporting.clean_payor_* (validated/clean) adhoc_reporting.payor_summary (final consolidated table) ----------------------------- Alerts & Quarantine ------------------------------ If file fails mandatory-column checks or schema rules → write rows to quarantine table + log a validation error Send an alert (email/webhook) with: payor, file_name, issue, first N examples, log link We will provide an endpoint (email/Teams/Slack); you supply code hook Non-Functional Requirements ----------------------------------------------- Idempotent: re-running same files must not duplicate data Config-driven: new payor onboarded via config (minimal code changes) Performance: handle ≥ 200 MB Excel per payor efficiently; provide tuning notes Observability: clear logs/metrics; record counts per stage Security: Unity Catalog permissions; secrets handled via Databricks secrets Cost awareness: efficient cluster usage; partition pruning Deliverables: ----------------------- 1) Design doc (short): architecture, tables, DLT graph, config model, alerting flow 2) Working notebooks / jobs / DLT pipelines (importable): a. Ingestion & audit b. Clean-up & validation c. DLT pipeline(s) and definitions d. Summary builder 3) Config tables with seed rows for 15 payors 4) Sample test runs with logs and counts 5) Runbook: adding payor, adjusting skip rows, fixing data types, reprocessing failed file 5) Unit/expectation tests: missing mandatory col, wrong sheet, bad date format, duplicate keys 6) Handover session (recorded) + final PR/bundle Acceptance Criteria (Checklist): ---------------------------------------------------- New file under volumes/care/State/yyyy/mm/ is detected & logged once File metadata & counts appear in ingest_file_log and table_record_counts Graphics/header rows removed per config; correct sheet selected Mandatory columns enforced per payor; failure → quarantine + alert Successful files flow raw → partitioned → clean via DLT, idempotently Summary table shows harmonized rows across 15 payors with load_date Re-processing same file does not duplicate rows Documentation + runbook enable onboarding of 16th payor by config only
Project ID: 39750368
Open for bidding
Remote project
Active 56 yrs ago
Set your budget and timeframe
Get paid for your work
Outline your proposal
It's free to sign up and bid on jobs

Houston, United States
Payment method verified
Member since Mar 30, 2024
min $50 USD / hour
₹400-750 INR / hour
$2-8 USD / hour
$2-8 USD / hour
min $100000 USD
₹600-3000 INR / hour
min ₹2500 INR / hour
$250-750 AUD
€750-1500 EUR
$8-15 USD / hour
$750-1500 USD
₹1250-2500 INR / hour
$15-25 USD / hour
$2-8 USD / hour
$20-30 NZD / hour
₹500-1000 INR / hour
₹600-1500 INR
₹75000-150000 INR
$30-250 USD