Store Sales Data Analysis: A Data Engineering Capstone Project

Project Overview

This project analyzes global sales data to surface actionable insights into regional sales trends, item popularity, and profitability.

Real-World Implications

• Optimizing Inventory: Identify which items sell well in which regions.

• Sales Strategy: Develop targeted sales strategies for different markets.

Target Audience

• Sales Managers

• Business Analysts

• Data Scientists

Technologies and Tools

• Data Processing: Pandas, Spark

• Query Language: HiveQL (Hive)

• Data Visualization: Matplotlib, Seaborn

• Big Data Technologies: HDFS, YARN

Data Source

The dataset includes:

• Transaction Information: Region, Country, Item Type, Sales Channel, Order Priority, Order Date, Order ID, Ship Date

• Sales Data: Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, Total Profit

Problem Statements

Data Preprocessing

1. Null Value Elimination

2. Date Data Cleaning

3. Categorize Items

4. Sales Data Cleanup

5. Data Type Conversion

6. Seasonal Decomposition: Break down sales data into seasonal, trend, and residual components.

7. Feature Engineering: Create new features like Profit Margin, Sales Velocity.
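Steps 1, 2, 5, and 7 above can be prototyped in Pandas before moving to Spark. The following is a minimal sketch, not a definitive pipeline. It assumes the column names from the Data Source section, assumes dates arrive as month/day/year strings, and assumes Profit Margin means Total Profit divided by Total Revenue. The sample rows are made up for illustration, and Sales Velocity is left out because the outline does not define it.

```python
import pandas as pd

# Made-up sample rows; column names follow the Data Source section.
raw = pd.DataFrame({
    "Region": ["Europe", "Asia", None],
    "Item Type": ["Snacks", "Cereal", "Snacks"],
    "Order Date": ["1/15/2015", "3/2/2016", "7/9/2016"],
    "Ship Date": ["1/20/2015", "3/12/2016", "7/30/2016"],
    "Units Sold": ["100", "250", "75"],
    "Total Revenue": ["1500.00", "5000.00", "1125.00"],
    "Total Profit": ["600.00", "1250.00", "450.00"],
})


def preprocess(df):
    """Drop nulls, parse dates, convert types, and add Profit Margin."""
    # Step 1: remove rows that contain any missing value.
    out = df.dropna().copy()

    # Step 2: parse date strings (assumed month/day/year format).
    for col in ("Order Date", "Ship Date"):
        out[col] = pd.to_datetime(out[col], format="%m/%d/%Y")

    # Step 5: convert numeric columns that were read as strings.
    out["Units Sold"] = out["Units Sold"].astype("int64")
    for col in ("Total Revenue", "Total Profit"):
        out[col] = out[col].astype("float64")

    # Step 7: Profit Margin, assumed here to be profit as a share of revenue.
    out["Profit Margin"] = out["Total Profit"] / out["Total Revenue"]
    return out


clean = preprocess(raw)
print(clean[["Region", "Order Date", "Units Sold", "Profit Margin"]])
```

Dropping rows is the simplest reading of "Null Value Elimination"; imputing missing values instead would be an equally valid choice, depending on how much data is missing.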

Data Analytics (Big Data Analysis with Visualization)

1. Number of Countries (Using Hive)

2. Units Sold by Region (Using Hive)

3. Most Recent Sales (Using Hive)

4. Products with Specific Letters (Using Spark)

5. Top Selling Countries (Using Spark)

6. Item Costs (Using Spark)

7. Sales Yearwise (Using PySpark)

8. Orders per Item (Using PySpark)

9. Country with Highest Sales (Using PySpark)

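10. Customer Segmentation: Use clustering algorithms to identify different customer segments. The fields listed under Data Source include no customer identifier, so segments would need to be built from order-level attributes such as Region, Sales Channel, and Order Priority.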

11. Time Series Forecasting: Predict future sales using ARIMA or LSTM.

12. Anomaly Detection: Identify any anomalies or outliers that could indicate fraudulent activity.

13. Association Rule Mining: Find associations between different products in the data (Using Spark).

14. Price Elasticity: Understand how the demand for a product changes with a change in its price (Using PySpark).

15. Correlation Between Priority and Profit: Analyze whether 'Order Priority' is associated with 'Total Profit'.
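The outline assigns these analyses to Hive and Spark. Before running them at scale, it can help to check the expected results on a small sample. The following is a Pandas prototype of items 1, 2, 7, 9, and 15, not the Hive or PySpark implementation itself. The sample rows and priority codes are made up, and "highest sales" is assumed to mean highest Total Revenue.

```python
import pandas as pd

# Made-up sample rows; column names follow the Data Source section.
sales = pd.DataFrame({
    "Region": ["Europe", "Europe", "Asia", "Asia"],
    "Country": ["France", "Spain", "Japan", "Japan"],
    "Order Priority": ["H", "L", "H", "L"],
    "Order Date": pd.to_datetime(
        ["2015-01-15", "2016-03-02", "2016-07-09", "2016-11-21"]
    ),
    "Units Sold": [100, 250, 75, 125],
    "Total Revenue": [1500.0, 5000.0, 1125.0, 2500.0],
    "Total Profit": [600.0, 1250.0, 450.0, 700.0],
})

# Item 1: number of distinct countries.
country_count = sales["Country"].nunique()

# Item 2: units sold by region.
units_by_region = sales.groupby("Region")["Units Sold"].sum()

# Item 7: sales by year (revenue summed per order year).
revenue_by_year = sales.groupby(sales["Order Date"].dt.year)["Total Revenue"].sum()

# Item 9: country with the highest sales (assumed: highest total revenue).
top_country = sales.groupby("Country")["Total Revenue"].sum().idxmax()

# Item 15: Order Priority is categorical, so compare mean profit per
# priority level rather than computing a numeric correlation coefficient.
mean_profit_by_priority = sales.groupby("Order Priority")["Total Profit"].mean()

print(country_count)
print(units_by_region)
print(revenue_by_year)
print(top_country)
print(mean_profit_by_priority)
```

Each of these is a single group-and-aggregate step, so it maps directly onto a GROUP BY query in Hive or a groupBy aggregation in PySpark.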

Data Visualization

1. Regional Sales Distribution

2. Top 10 Items Pie Chart

3. Sales Time Series

4. Profit Distribution

5. Sales by Item

6. Heatmap: Show the correlation between different numerical features like Unit Price, Unit Cost, and Total Profit.

7. Interactive Dashboard: Create an interactive dashboard where users can filter data by year, region, or item.

8. Geographic Heatmap with Time Slider: Show how sales in different regions have evolved over time.

9. Cohort Analysis: Visualize customer retention over time (this requires a customer identifier, which the fields under Data Source do not include).

10. Bubble Chart: Display Units Sold, Total Revenue, and Total Profit in a bubble chart, with two measures on the axes and the third mapped to bubble size.
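As an illustration of item 10, the following is a minimal Matplotlib sketch of a bubble chart. The mapping is an assumption: Units Sold on the x-axis, Total Revenue on the y-axis, and Total Profit as bubble area. The function name bubble_chart, the scaling constant, and the sample values are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs headless
import matplotlib.pyplot as plt


def bubble_chart(units, revenue, profit, labels, max_area=1000.0):
    """Plot units vs. revenue, with bubble area proportional to profit.

    Assumes all profit values are positive; losses would need separate
    handling (for example, a different color) because area cannot be negative.
    """
    largest = max(profit)
    # Scale areas so the most profitable item gets max_area points^2.
    areas = [max_area * p / largest for p in profit]

    fig, ax = plt.subplots()
    ax.scatter(units, revenue, s=areas, alpha=0.5)
    for x, y, label in zip(units, revenue, labels):
        ax.annotate(label, (x, y))
    ax.set_xlabel("Units Sold")
    ax.set_ylabel("Total Revenue")
    ax.set_title("Units Sold vs. Total Revenue (bubble area = Total Profit)")
    return fig, areas


# Made-up values for three item types.
fig, areas = bubble_chart(
    units=[100, 250, 75],
    revenue=[1500.0, 5000.0, 1125.0],
    profit=[600.0, 1250.0, 450.0],
    labels=["Snacks", "Cereal", "Fruits"],
)
fig.savefig("bubble_chart.png")
```

Mapping profit to area rather than radius keeps the visual comparison roughly proportional, since readers tend to judge bubbles by area.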

Performance Metrics

1. Spark Job Metrics

2. Query Latency in Hive

3. HDFS Storage Utilization

4. Data Skew Detection

5. Resource Utilization with YARN

6. Task Failure Rates: Monitor and minimize the failure rates of tasks in Spark or Hive jobs.

7. Data Replication Metrics in HDFS: Track and optimize data replication times and success rates.

8. Data Ingestion Latency: Measure the latency of data ingestion from different sources into HDFS.
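For item 4, one common heuristic is to compare the largest partition with the average partition size. The sketch below uses plain Python and assumes the per-partition row counts have already been collected. The function names and the threshold of 2.0 are hypothetical choices, not values from this project.

```python
from statistics import mean


def skew_ratio(partition_sizes):
    """Return max partition size divided by mean partition size.

    A ratio near 1.0 means evenly sized partitions; larger values mean
    one partition holds a disproportionate share of the rows.
    """
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    average = mean(partition_sizes)
    if average == 0:
        return 0.0  # all partitions are empty, so there is nothing to skew
    return max(partition_sizes) / average


def skewed_partitions(partition_sizes, threshold=2.0):
    """Return indices of partitions larger than threshold times the mean."""
    if not partition_sizes:
        return []
    average = mean(partition_sizes)
    return [
        index
        for index, size in enumerate(partition_sizes)
        if average > 0 and size > threshold * average
    ]


# Made-up row counts for four partitions; the last one is oversized.
sizes = [100, 110, 90, 700]
print(skew_ratio(sizes))
print(skewed_partitions(sizes))
```

A skewed partition usually points to a hot key, for example a single Region or Country that accounts for most of the rows, which is worth checking before tuning cluster resources.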

