
Closed
Posted
Paid on delivery
I’m upgrading our analytics stack and need an expert who can own the Hadoop side and turn raw, high-volume feeds into analysis-ready datasets. The core objective is to design and build end-to-end data pipelines on a Hadoop cluster; this is where I believe Hadoop will be most valuable for the project.

Here’s what I need from you:
• An architecture that takes terabyte-scale log files, lands them in HDFS, applies basic cleansing, and outputs partitioned Parquet tables queryable from Hive or Spark
• All scripts, configs, and scheduling (Oozie, Airflow, or your preferred orchestrator) committed to Git with clear documentation
• A deployment guide plus a brief hand-over session so I can reproduce the setup on another cluster

Acceptance criteria: the pipeline ingests a 100 GB test set, completes in under 60 minutes on my provided environment, and the resulting tables are fully accessible for downstream analytics.

If you’ve previously worked with MapReduce, Hive, Spark, or similar tools, include code samples or screenshots so I can quickly verify fit. I’m ready to start as soon as the right approach and timeline are laid out.
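For orientation only, here is a minimal PySpark sketch of the flow the brief describes (raw logs landed in HDFS, basic cleansing, partitioned Parquet registered for Hive/Spark). It is an illustration rather than part of the brief; the paths, schema fields, and the event_date partition column are assumptions.

    # Hypothetical sketch only: raw logs in HDFS -> cleansed, partitioned Parquet queryable from Hive/Spark.
    # Paths, schema fields, and the event_date partition column are assumptions, not part of the brief.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("log-cleansing-sketch")
             .enableHiveSupport()          # allows the output table to be queried from Hive as well
             .getOrCreate())

    raw = spark.read.json("hdfs:///data/landing/logs/")           # assumed JSON log feed in the landing zone

    cleansed = (raw
                .dropDuplicates()                                  # basic deduplication
                .filter(F.col("event_ts").isNotNull())             # drop rows with no usable timestamp
                .withColumn("event_date", F.to_date("event_ts")))  # derive the partition key

    (cleansed.write
             .mode("overwrite")
             .partitionBy("event_date")
             .format("parquet")
             .saveAsTable("analytics.clean_logs"))  # partitioned Parquet table in the Hive metastore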
Project ID: 40303547
8 proposals
Remote project
Active 27 days ago
8 freelancers are bidding an average of ₹54,570 INR for this job

Hi there,

I understand you need a robust Hadoop-based pipeline that can ingest terabyte-scale log data, process it efficiently, and deliver clean, analysis-ready datasets for downstream analytics. I have experience designing big data architectures using HDFS, Spark, and Hive, and can build an end-to-end workflow that lands raw logs into HDFS, performs scalable cleansing and transformations, and outputs optimized partitioned Parquet tables for fast querying.

My approach focuses on performance and reliability: using Spark or MapReduce for distributed processing, implementing partitioning strategies that improve query efficiency, and orchestrating the workflow with Airflow or Oozie so ingestion, transformation, and table creation run automatically. All scripts, configurations, and scheduling logic will be version-controlled in Git with clear documentation to ensure the system can be easily maintained or reproduced.

The final pipeline will meet your acceptance criteria by efficiently processing the 100 GB test dataset within the required timeframe while producing Hive/Spark-accessible tables ready for analytics. I will also provide a deployment guide and a walkthrough session so you can confidently replicate and manage the setup on other clusters.

Regards,
Ahmad
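For illustration only (not part of this bid), a minimal Airflow DAG of the kind described above, chaining ingestion, Spark cleansing, and a Hive partition refresh; the DAG id, schedule, script paths, and recent-Airflow-2.x API are all assumptions.

    # Hypothetical Airflow (2.4+) sketch of the ingest -> cleanse -> table-refresh chain described above.
    # DAG id, schedule, and script paths are placeholders, not the bidder's actual deliverable.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="log_pipeline_sketch",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        ingest = BashOperator(
            task_id="ingest_to_hdfs",
            bash_command="hdfs dfs -put -f /staging/logs/*.json /data/landing/logs/",
        )
        cleanse = BashOperator(
            task_id="spark_cleanse",
            bash_command="spark-submit /opt/pipeline/cleanse_logs.py",
        )
        refresh = BashOperator(
            task_id="refresh_hive_partitions",
            bash_command="hive -e 'MSCK REPAIR TABLE analytics.clean_logs'",
        )

        ingest >> cleanse >> refresh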
₹50,300 INR in 7 days
4.3

I see you need a Hadoop Data Pipeline Specialist to upgrade your analytics stack by designing and building end-to-end pipelines on a Hadoop cluster. You want an expert to handle raw, high-volume log files and transform them into analysis-ready datasets with efficient processing and clear documentation. Your project requires landing terabyte-scale logs into HDFS, applying cleansing, and outputting partitioned Parquet tables accessible via Hive or Spark. You also need all scripts, configuration, and scheduling automated with tools like Oozie or Airflow, plus a deployment guide and hand-over session to replicate the setup on another cluster. The pipeline must process a 100 GB test set in under 60 minutes.

I have built similar pipelines using Hadoop with Spark for data cleansing and partitioned Parquet outputs, integrated with Airflow for orchestration and Git for version control. I delivered deployment documentation and training to ensure smooth handover and reproducibility. This experience aligns directly with your requirements for performance, accessibility, and maintainability.

I can complete the pipeline design, implementation, and documentation within 3 weeks. Let’s discuss your environment details so I can tailor the approach to meet your performance goals precisely.
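As a hedged illustration of the "partitioned Parquet accessible via Hive" requirement (not this bidder's actual code), one way to expose a Parquet output directory as a Hive external table from Spark; the database, table, columns, and HDFS location are assumptions.

    # Hypothetical illustration of exposing a partitioned Parquet directory as a Hive external table
    # from Spark; database, table, columns, and HDFS location are assumptions for this sketch only.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.web_logs (
            user_id  STRING,
            action   STRING,
            event_ts TIMESTAMP
        )
        PARTITIONED BY (event_date DATE)
        STORED AS PARQUET
        LOCATION 'hdfs:///data/warehouse/web_logs/'
    """)

    # Register partitions already written by the cleansing job so Hive and Spark can query them.
    spark.sql("MSCK REPAIR TABLE analytics.web_logs")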
₹660 INR in 7 days
2.8

As a full-stack developer with a specialized background in web scraping and data optimization, I believe I would be an ideal fit for your project. I have spent years working in Python and have gained substantial hands-on experience with Hive, Spark, and similar tools in the Hadoop ecosystem. The proficiency I have built in designing and building end-to-end data pipelines on large-scale clusters matches precisely the kind of expertise you require.

To reinforce my qualifications, I have delivered projects involving terabyte-scale log files, covering key aspects such as input tuning, data validation, cleansing, and efficient partitioning for optimized querying from Hive or Spark. Throughout these projects, keeping code and configurations clearly documented and version-controlled was crucial, and those are standards I adhere to scrupulously.

What sets me apart is not only my technical ability but also my work ethic: I routinely put in 10-15 hours a day and am committed to delivering quality work within agreed timelines. If you choose to work with me, you will get an expertly designed pipeline that ingests a 100 GB test set in under 60 minutes while keeping the output fully accessible for downstream analytics. I am eager to get started and to not only meet your expectations but surpass them.
₹50,300 INR in 1 day
2.9

With over a decade of experience, I am a seasoned technologist who lives and breathes code. My proficiency with Python extends to Hadoop ecosystem tools such as MapReduce, Hive, and Spark, which are crucial to delivering exceptional results for your project. Need proof? Here's an example: as a key member of a team that spearheaded the redesign of a major e-commerce platform, I successfully migrated terabytes of log files into analysis-ready datasets using parallel processing on Hadoop, improving efficiency by 40%. I can bring this level of ingenuity and proficiency to your project.

Additionally, my extensive experience deploying applications on AWS and building end-to-end data pipelines will be invaluable in delivering your core objectives. My track record includes creating architectures that handle high data volumes with ease while applying intelligent cleansing processes that guarantee output quality, exactly what you're looking for. I will also ensure all scripts, configs, and scheduling are well documented and committed to Git so you can maintain seamless control over your big data operations.
₹50,300 INR in 30 days
1.3

I've built Hadoop ETL pipelines: log ingestion to HDFS, Spark transformations, Parquet outputs with Hive accessibility. Here's my approach:

Architecture:
- Landing zone in HDFS for raw logs
- PySpark for cleansing (schema enforcement, deduplication, null handling)
- Partitioned Parquet with snappy compression
- Hive external tables with partition pruning
- Airflow DAGs for orchestration with retry logic and failure alerts

Deliverables:
- Git repo: all scripts, configs, and a clear README
- Deployment guide: step-by-step cluster reproduction
- Handover session: walkthrough + Q&A

For the 60-min SLA on 100 GB:
- Tune Spark executor memory and parallelism based on your cluster (see the sizing sketch after this list)
- Optimize the partitioning strategy for your query patterns
- Benchmark incrementally and adjust

Questions:
- Cluster specs? (nodes, cores, memory)
- Log format? (JSON, CSV, other)
- Existing Hive metastore or fresh setup?
- Spark version preference?

Can start immediately once I understand the environment. Ready to discuss architecture before implementation.
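A rough, purely illustrative sizing sketch for the executor tuning mentioned in the SLA bullets above; every value is a placeholder that would have to be benchmarked against the actual cluster specs.

    # Placeholder sizing for the 100 GB / 60-minute target; every number below must be
    # benchmarked against the real cluster (node count, cores, memory) before it means anything.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("benchmark-100gb-sketch")
             .config("spark.executor.instances", "20")        # overall parallelism across the cluster
             .config("spark.executor.cores", "4")
             .config("spark.executor.memory", "8g")
             .config("spark.sql.shuffle.partitions", "400")    # sized to the data volume, not the default 200
             .config("spark.sql.parquet.compression.codec", "snappy")
             .getOrCreate())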
₹60,000 INR in 7 days
0.6

Chennai, India
Member since Feb 5, 2019