
Closed
Posted
Paid on delivery
We are looking for an experienced Python developer to build a robust, local pipeline that processes Binance Futures historical data into an ML-ready dataset. The goal is to ingest public data from Binance Vision (aggTrades, all klines, and bookDepth) and output clean, normalized, lookahead-bias-free features stored in Parquet format or DuckDB.

Scope of Work & Deliverables

1. Ingestion & Database Setup (Core Foundation)
Data Source: Programmatic downloading of historical daily/monthly ZIP files from public [login to view URL] (specifically aggTrades, all klines [1m], and bookDepth for BTCUSDT, ETHUSDT, SOLUSDT, XRPUSDT, BNBUSDT).
Storage Architecture: Set up a local storage solution using DuckDB or Parquet to handle millions of rows without memory issues.
Alignment: Parse and align different frequencies (tick-by-tick trades, order book snapshots, and 1m klines) to a unified timestamp sequence.

2. Core Microstructure Feature Extraction
Implement Python/Polars (or Pandas) scripts to compute the features on the aligned data.

3. Advanced Optimization & ML Readiness
Strict Lookahead Bias Prevention: Ensure all rolling features (e.g., rolling z-scores, Parkinson volatility) are calculated using only parameters available at t−1 to prevent data leakage.
Normalization: Implement rolling z-scores or min-max normalization per symbol to keep features stationary.
Labeling: Implement a basic Triple Barrier Method or directional label generator.
Output: Save clean Parquet files per symbol, free of NaNs and infinite values, structured for immediate model training.
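As an illustration of the "t−1 parameters" requirement, here is a minimal pure-Python sketch of a rolling z-score whose mean and standard deviation come only from the previous `window` observations, never from the current one. The function name `lagged_rolling_zscore` and the choice to emit `None` during warm-up are assumptions made for this example; an actual deliverable would more likely express the same idea in Polars or Pandas with a one-step shift.

```python
import math
from collections import deque


def lagged_rolling_zscore(values, window):
    """Z-score of x[t] against the mean/std of x[t-window .. t-1].

    Hypothetical helper: the statistics never include x[t] itself,
    so the feature at time t uses only information known at t-1.
    Requires window >= 2 (sample standard deviation).
    """
    history = deque(maxlen=window)  # holds only past observations
    out = []
    for x in values:
        if len(history) < window:
            out.append(None)  # warm-up: not enough past data yet
        else:
            mean = sum(history) / window
            var = sum((h - mean) ** 2 for h in history) / (window - 1)
            std = math.sqrt(var)
            # A flat window has zero std; emit None instead of inf.
            out.append((x - mean) / std if std > 0 else None)
        history.append(x)  # x[t] becomes visible only after it is scored
    return out
```

A simple sanity check for leakage is that changing a later value must never change an earlier output. The warm-up `None` rows would have to be dropped before export to meet the "free of NaNs" requirement.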
Project ID: 40488723
7 proposals
Remote project
Active 4 days ago
7 freelancers are bidding on average ₹6,571 INR for this job

Hello,

Building a Binance Futures pipeline isn't just about downloading ZIP files; it's about ensuring point-in-time integrity for high-frequency alpha. I specialize in engineering data systems where memory efficiency meets rigorous financial logic.

My Technical Approach for your 5 Symbols (BTC, ETH, SOL, XRP, BNB):

High-Speed Ingestion: I will implement an asynchronous downloader using aiohttp to pull historical aggTrades, klines, and bookDepth from Binance Vision.
Storage Architecture: I will use DuckDB as the analytical engine with a partitioned Parquet storage layer. This allows querying millions of rows in milliseconds without exhausting your RAM.
The "Zero-Leakage" Alignment: Unlike standard joins, I will implement a strict as-of join (via Polars). This ensures that at any timestamp T, the model only sees order book snapshots and trade flows that occurred at or before T-1ms, eliminating lookahead bias.
Advanced Feature Engineering: Beyond basic moving averages, I will deliver:
- Order Flow Imbalance (OFI) and volumetric indicators.
- Realized volatility using high-frequency trade data.
- Relative strength and z-scores normalized across symbols.
ML Labeling: Implementation of the Triple Barrier Method (profit taking, stop loss, and time-out) to create realistic labels that account for execution costs and volatility.

Best regards, Rafael
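The as-of alignment described in this proposal can be sketched without any dataframe library. The example below is a hedged, pure-Python stand-in for a backward as-of join, not the bidder's implementation: `asof_backward` and its `strict` flag are names invented here. With integer millisecond timestamps, `strict=True` corresponds to "at or before T-1ms".

```python
from bisect import bisect_left, bisect_right


def asof_backward(left_ts, right_ts, right_vals, strict=True):
    """For each left timestamp, pick the most recent right-hand value.

    Hypothetical helper. right_ts must be sorted ascending.
    strict=True  -> only right rows with ts <  t are eligible
    strict=False -> right rows with ts <= t are eligible
    Returns None where no earlier right row exists.
    """
    out = []
    for t in left_ts:
        # bisect_left finds the first index with ts >= t,
        # bisect_right finds the first index with ts > t.
        idx = bisect_left(right_ts, t) if strict else bisect_right(right_ts, t)
        out.append(right_vals[idx - 1] if idx > 0 else None)
    return out


# Example: order book snapshots at 1000, 2000, 3000 ms joined onto bar times.
snapshot_ts = [1000, 2000, 3000]
snapshot_val = ["a", "b", "c"]
print(asof_backward([500, 1000, 2500, 3000], snapshot_ts, snapshot_val))
```

Whether a snapshot stamped exactly at T may be used at T depends on how the timestamps are defined (event time versus receive time), which is why the sketch exposes the choice as a parameter.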
₹6,000 INR in 7 days
3.4

Hello, I can build the local Binance Futures data pipeline in Python to download, process, align, and export ML-ready datasets for BTCUSDT, ETHUSDT, SOLUSDT, XRPUSDT, and BNBUSDT. I understand the main risk in this project is not just handling large Binance Vision files, but producing clean features without lookahead bias. I can create the ingestion layer for aggTrades, 1m klines, and bookDepth ZIP files, store the data efficiently in DuckDB or Parquet, and align trades, candles, and order book snapshots into a unified timestamp structure without loading everything into memory at once. For feature engineering, I can use Polars or pandas to build microstructure features, rolling volatility, z-scores, order flow features, per-symbol normalization, and labels via the Triple Barrier Method or directional labeling. I will make sure all rolling calculations use only past data, remove NaNs and infinite values, and save final Parquet files ready for model training. I can also document the folder structure, run commands, and pipeline steps so you can extend it later with more symbols or timeframes. Best regards, Ankit
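For readers unfamiliar with the Triple Barrier labels mentioned here, the following is a minimal sketch on a plain list of close prices. It is an illustration under stated assumptions, not any bidder's code: the function name, the percentage-based barriers, and returning `None` when the horizon runs past the end of the data are all choices made for this example.

```python
def triple_barrier_label(prices, start, up_pct, down_pct, max_horizon):
    """Label the bar at index `start` by which barrier is touched first.

    Hypothetical helper:
      +1   -> upper (profit-taking) barrier touched first
      -1   -> lower (stop-loss) barrier touched first
       0   -> neither touched within max_horizon bars (time-out)
      None -> horizon extends past the available data, label unknown
    """
    entry = prices[start]
    upper = entry * (1 + up_pct)
    lower = entry * (1 - down_pct)
    last = len(prices) - 1
    end = min(start + max_horizon, last)
    for i in range(start + 1, end + 1):
        if prices[i] >= upper:
            return 1
        if prices[i] <= lower:
            return -1
    # Only call it a time-out if the full horizon was actually observed.
    return 0 if start + max_horizon <= last else None


closes = [100, 101, 103, 99, 95]
print([triple_barrier_label(closes, i, 0.02, 0.02, 3) for i in range(len(closes))])
```

Labels intentionally look into the future because they are training targets; the leakage concern applies to features, which must not share that forward window. Rows labeled `None` would be dropped before export.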
₹5,000 INR in 1 day
2.7

Hi, I am currently building this exact pipeline for another client — Binance Vision ingestion (aggTrades, klines 1m, bookDepth), DuckDB + Parquet storage, microstructure feature extraction with strict t-1 lookahead bias prevention, rolling z-score normalization, and Triple Barrier labeling. I can deliver your full scope cleanly and efficiently.
₹9,000 INR in 2 days
2.1

The core challenge is building a local pipeline that ingests Binance Vision historical data for five futures symbols and outputs a lookahead-bias-free dataset without memory collapse. I would use DuckDB for its columnar storage and ability to handle millions of rows locally, combined with asyncio and aiohttp to parallelize the ZIP downloads across symbols. For parsing the aggTrades, klines, and bookDepth files, Pandas with chunked reading will prevent memory overflow during alignment. A critical scoping question: does the alignment require forward-filling bookDepth snapshots to the nearest kline timestamp, or should the snapshots remain at their native update rate with a separate time index?
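The memory concern raised in this proposal can also be addressed by streaming rows straight out of the ZIP archive instead of chunked Pandas reads. The sketch below uses only the standard library (a swapped-in technique, not the bidder's stated approach) and sums traded quantity per 1-minute bucket. The column names `transact_time` and `quantity` and the presence of a header row are assumptions about the file layout and should be checked against the actual downloads.

```python
import csv
import io
import zipfile
from collections import defaultdict


def minute_volume_from_zip(zip_source, ts_col="transact_time", qty_col="quantity"):
    """Stream every CSV in a ZIP row by row and sum quantity per minute.

    Hypothetical helper. zip_source may be a path or a binary file object.
    Timestamps are assumed to be integer milliseconds; each row is
    assigned to the minute bucket that starts at or before it.
    """
    buckets = defaultdict(float)
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.DictReader(text):
                    minute_start = int(row[ts_col]) // 60_000 * 60_000
                    buckets[minute_start] += float(row[qty_col])
    # Only one small dict of minute totals is held in memory.
    return dict(sorted(buckets.items()))
```

Because rows are consumed one at a time, peak memory depends on the number of minute buckets rather than on the number of trades in the file.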
₹1,500 INR in 3 days
0.4

I see you need a Python developer to build a robust data pipeline for Binance Futures historical data. I’d build this using Python scripts to ingest public data from Binance Vision, normalize features, and prevent lookahead bias. This will allow you to access clean, ML-ready datasets stored in Parquet format, enabling seamless model training. I’ve worked with similar projects, ensuring data integrity and optimized feature extraction. Quick question: Are you open to a quick chat to discuss how we can efficiently map out this solution for your specific needs? Regards, Collen Jr Liebenberg
₹5,000 INR in 7 days
0.0

Hi, I can build this as a clean local Python pipeline, not just a notebook dump. For the handoff I'd set up:
- downloader for Binance Vision daily/monthly ZIPs for BTCUSDT, ETHUSDT, SOLUSDT, XRPUSDT, and BNBUSDT
- DuckDB/Parquet storage so the data can handle millions of rows without memory issues
- parsing/alignment for aggTrades, 1m klines, and bookDepth into timestamp-safe tables
- data-quality checks for missing days, duplicate timestamps, NaN/inf values, and symbol coverage
- core feature scripts using Polars/Pandas, with rolling features shifted so there is no lookahead leakage
- basic label/triple-barrier starter plus README showing how to rerun and update the dataset

The important part here is not just downloading files. Trading datasets usually fail because of timestamp alignment, silent gaps, leakage, memory blowups, or messy replay assumptions. I'll build it as a reproducible foundation you can inspect and extend. One question before I start: do you prefer DuckDB as the main query layer, or Parquet files with DuckDB examples for querying?
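The data-quality checks listed in this proposal are simple to prototype. The following is a rough standard-library sketch of three of them (missing days, duplicate timestamps, NaN/inf values). All function names are hypothetical and chosen for this example; a production version would more likely run the same checks as DuckDB queries or Polars expressions.

```python
import math
from datetime import date, timedelta


def missing_days(present, start, end):
    """Return every calendar day in [start, end] that has no data file."""
    have = set(present)
    gaps, day = [], start
    while day <= end:
        if day not in have:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


def duplicate_timestamps(timestamps):
    """Return the sorted timestamps that occur more than once."""
    seen, dupes = set(), set()
    for t in timestamps:
        (dupes if t in seen else seen).add(t)
    return sorted(dupes)


def non_finite_rows(rows):
    """Return indices of rows that contain NaN or +/- infinity."""
    return [i for i, row in enumerate(rows)
            if any(not math.isfinite(v) for v in row)]


print(missing_days([date(2024, 1, 1), date(2024, 1, 3)],
                   date(2024, 1, 1), date(2024, 1, 4)))
```

Running checks like these before feature extraction makes silent gaps visible early instead of surfacing as unexplained NaNs in the final Parquet files.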
₹12,500 INR in 5 days
0.0

Lucknow, India
Member since Sep 25, 2021