NFL Stats ETL Pipeline

Apache Airflow Orchestrated Data Processing System

Project Overview

Objective: Build a comprehensive ETL pipeline for collecting, processing, and analyzing NFL player statistics using Apache Airflow for workflow orchestration. The system scrapes data from NFL.com, processes player statistics across multiple categories, and provides analytics capabilities including machine learning model evaluation.

Innovation: A historical year parameter (2000-2025) lets the pipeline collect any supported season, which enables time-series analysis and year-over-year performance comparisons. The pipeline is tested and deployed through CI/CD.

  • 26 Years of Data
  • 5 Stat Categories
  • 7 Pipeline Tasks
  • 100% Automated

Airflow DAG Architecture

Pipeline diagram (Apache Airflow DAG):

  • Source: NFL.com, with a year parameter (2000-2025)
  • Extract: Scrape Passing, Scrape Rushing, Scrape Receiving, Scrape Scoring, Scrape Tackles, Player Data
  • Transform: Data Cleaning, Normalization, Data Aggregation, Feature Creation, ML Model Preparation
  • Load: AWS S3 Storage, Model Evaluation, Report Generation
  • CI/CD: GitHub Actions, Test & Deploy

๐Ÿ” Extract Phase

  • Web Scraping: Selenium-based data collection
  • Multi-Category: Passing, Rushing, Receiving, Scoring, Tackles
  • Year Selection: Historical data from 2000-2025
  • Async Processing: Concurrent data retrieval
  • Error Handling: Robust retry mechanisms

โš™๏ธ Transform Phase

  • Data Cleaning: Standardized player names
  • Normalization: Consistent statistical formats
  • Aggregation: Unified dataset creation
  • Feature Engineering: ML-ready data preparation
  • Validation: Data quality assurance
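
As a rough illustration of the cleaning and normalization steps above, the sketch below standardizes a player name and converts scraped stat strings to numbers. It uses plain Python rather than Pandas so that it stays self-contained. The column names (`player`, `yards`, `avg`, `td`), the missing-value markers, and the rule that every non-player column is numeric are assumptions made for the example.

```python
import re
import unicodedata

MISSING_MARKERS = {"", "-", "--"}  # assumed placeholders for an absent stat


def clean_player_name(raw):
    """Strip accents and collapse whitespace so the same player always matches."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def parse_stat(raw):
    """Convert a scraped stat string to int or float; return None if missing.

    Raises ValueError for text that is neither missing nor numeric.
    """
    text = raw.strip().replace(",", "")  # "1,234" -> "1234"
    if text in MISSING_MARKERS:
        return None
    try:
        return int(text)
    except ValueError:
        return float(text)


def clean_row(row):
    """Clean one scraped row: the 'player' column is a name, the rest are stats."""
    return {
        key: clean_player_name(value) if key == "player" else parse_stat(value)
        for key, value in row.items()
    }


if __name__ == "__main__":
    raw = {"player": "  José   Ramírez ", "yards": "1,234", "avg": "4.5", "td": "--"}
    print(clean_row(raw))
```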

Load Phase

  • AWS S3 Storage: Scalable cloud storage
  • Model Evaluation: ML performance metrics
  • Report Generation: Analytics summaries
  • Data Cataloging: Metadata management
  • Version Control: Data lineage tracking

Historical Year Parameter Feature

Innovation Highlight: The pipeline supports collecting NFL statistics for any year from 2000 to 2025, enabling comprehensive historical analysis and trend identification.

Usage Examples

  • python main.py --year 2022
  • python main.py --category rushing --year 2020
  • Airflow DAG parameter configuration
  • Environment variable control
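
One way the `--year` and `--category` flags shown above could be parsed and validated is sketched below with argparse. The source does not show how `main.py` is written, so treat this as an assumption-laden example: `NFL_STATS_YEAR` is a hypothetical environment variable name, and falling back to the latest supported year when neither the flag nor the variable is set is also an assumption.

```python
import argparse
import os

MIN_YEAR, MAX_YEAR = 2000, 2025
CATEGORIES = ["passing", "rushing", "receiving", "scoring", "tackles"]


def year_in_range(value):
    """argparse type: accept only integer years inside the supported range."""
    try:
        year = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer year")
    if not MIN_YEAR <= year <= MAX_YEAR:
        raise argparse.ArgumentTypeError(
            f"year must be between {MIN_YEAR} and {MAX_YEAR}, got {year}"
        )
    return year


def build_parser(env=None):
    """Build the CLI parser; env defaults to the process environment."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Collect NFL player statistics")
    parser.add_argument(
        "--year",
        type=year_in_range,
        # Hypothetical variable name. argparse runs string defaults through
        # `type`, so a value taken from the environment is validated as well.
        default=env.get("NFL_STATS_YEAR", str(MAX_YEAR)),
        help=f"season to collect ({MIN_YEAR}-{MAX_YEAR})",
    )
    parser.add_argument(
        "--category",
        choices=CATEGORIES,
        default=None,  # None is taken to mean "all categories"
        help="single stat category to collect",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--category", "rushing", "--year", "2020"])
    print(args.year, args.category)
```

Validating the year in the parser means an out-of-range value is rejected with a usage error before any scraping starts.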

Analysis Capabilities

  • Year-over-year performance trends
  • Historical player comparisons
  • Era-based statistical analysis
  • Longitudinal career tracking
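
A year-over-year trend of the kind listed above reduces to comparing each season's total with the previous season's. The sketch below shows that calculation in plain Python; `year_over_year` is a hypothetical helper, and the totals in the usage example are made-up placeholder numbers, not real player statistics.

```python
def year_over_year(season_totals):
    """Return {year: (absolute_change, percent_change)} for consecutive seasons.

    season_totals maps a season year to a numeric total. A year is skipped
    when the season immediately before it is missing from the data, and the
    percent change is None when the previous total is zero.
    """
    changes = {}
    for year in sorted(season_totals):
        previous = season_totals.get(year - 1)
        if previous is None:
            continue  # no previous season to compare against
        delta = season_totals[year] - previous
        percent = None if previous == 0 else round(100 * delta / previous, 1)
        changes[year] = (delta, percent)
    return changes


if __name__ == "__main__":
    # Placeholder totals for illustration only.
    totals = {2020: 1000, 2021: 1200, 2022: 900}
    for year, (delta, percent) in year_over_year(totals).items():
        print(year, delta, percent)
```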

Orchestration

  • Apache Airflow 2.7.3+: Workflow orchestration
  • Python Operators: Custom task execution
  • DAG Scheduling: Automated pipeline runs
  • Task Dependencies: Proper execution order
  • Monitoring: Web UI and logging
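
To make the task-dependency idea concrete, the sketch below models an extract, transform, load ordering with the standard library's `graphlib` instead of Airflow, so it runs without an Airflow installation. The task names are assumptions based on the architecture described earlier; they are not taken from the project's DAG file.

```python
from graphlib import TopologicalSorter

SCRAPE_TASKS = [
    f"scrape_{category}"
    for category in ("passing", "rushing", "receiving", "scoring", "tackles")
]

# Map each task to the set of tasks that must finish before it can start.
DEPENDENCIES = {
    **{task: set() for task in SCRAPE_TASKS},  # scrapes have no prerequisites
    "transform": set(SCRAPE_TASKS),            # waits for every scrape
    "load": {"transform"},                     # waits for the transform
}


def execution_waves(dependencies):
    """Group tasks into waves; every task in a wave has all prerequisites met."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock dependants
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(DEPENDENCIES), start=1):
        print(number, wave)
```

The five scrape tasks land in the same wave because none depends on another, which is what allows a scheduler to run them in parallel before the transform step.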

Data Processing

  • Python 3.11+: Core processing language
  • Pandas & NumPy: Data manipulation
  • Selenium: Web scraping automation
  • Async Processing: Concurrent operations
  • Machine Learning: Scikit-learn evaluation

Storage & Cloud

  • AWS S3: Scalable object storage
  • AWS CloudWatch: Monitoring and logging
  • JSON/CSV: Structured data formats
  • Compression: Optimized storage
  • Versioning: Data lineage tracking
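
The CSV format and compression items above could be combined as in the following sketch, which serializes rows to gzip-compressed CSV and builds a year-partitioned object key. The key layout, the `nfl-stats` prefix, and both function names are assumptions; the actual upload is left as a comment so the example runs without AWS credentials.

```python
import csv
import gzip
import io


def s3_key(category, year, prefix="nfl-stats"):
    """Build a year-partitioned object key (hypothetical layout)."""
    return f"{prefix}/year={year}/{category}.csv.gz"


def to_gzipped_csv(rows):
    """Serialize a non-empty list of dicts with identical keys to gzipped CSV bytes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return gzip.compress(buffer.getvalue().encode("utf-8"))


if __name__ == "__main__":
    rows = [{"player": "Example Player", "yards": 100}]  # placeholder data
    key = s3_key("rushing", 2022)
    payload = to_gzipped_csv(rows)
    print(key, len(payload), "bytes")
    # With boto3 installed and credentials configured, an upload might look
    # like this (the bucket name is a placeholder):
    #   boto3.client("s3").put_object(Bucket="my-bucket", Key=key, Body=payload)
```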

CI/CD & Testing

  • GitHub Actions: Automated testing
  • Pytest: Comprehensive test suite
  • Chrome/Selenium: Browser automation testing
  • Code Coverage: Quality metrics
  • Artifact Generation: Test reports

Security & SSL

SSL-secure data collection with proper certificate handling and secure AWS credential management.

Scalable Architecture

Modular design supports easy addition of new statistical categories and data sources.

Error Recovery

Comprehensive error handling with automatic retries and detailed logging for troubleshooting.

Analytics Ready

Machine learning evaluation framework built-in for predictive analytics and performance modeling.

โฑ๏ธ Time Series Support

Historical data collection enables trend analysis and comparative studies across multiple seasons.

Test Coverage

Extensive testing suite with automated CI/CD pipeline ensuring code quality and reliability.

Project Impact

The automated NFL data collection pipeline enables comprehensive statistical analysis across 26 years of historical data (2000-2025). The system supports real-time analytics, predictive modeling, and longitudinal performance studies with robust cloud integration and monitoring capabilities.

Technical Achievement: Successfully integrated modern data engineering practices with sports analytics, creating a scalable and maintainable solution for complex data processing workflows.