Evgueni Poloukarov committed on
Commit
7aa0336
·
1 Parent(s): 4d742bd

feat: complete weather feature engineering with simplified approach (375 features)

Final weather feature set after two rounds of user feedback refinement.

Weather Data Collection:
- Collected 24-month weather data (51 points × 7 vars, 9.1 MB)
- 2,703 API requests to OpenMeteo Historical API (14 min runtime)
- 100% data completeness (894,744 records)
- Date range: Oct 2023 - Sep 2025
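
The request and record counts above follow from the collection parameters. A minimal sketch of that arithmetic, assuming inclusive start and end dates and one HTTP request per grid point per 14-day chunk (the constant names are illustrative and not taken from the collector):

```python
from datetime import date
from math import ceil

# Values quoted in the commit message; the names are illustrative only.
START, END = date(2023, 10, 1), date(2025, 9, 30)
GRID_POINTS = 51
CHUNK_DAYS = 14
REQUESTS_PER_MINUTE = 270

days = (END - START).days + 1       # inclusive range, includes the 2024 leap day
hours = days * 24                   # hourly records per grid point
chunks = ceil(days / CHUNK_DAYS)    # 14-day chunks needed per grid point
requests = GRID_POINTS * chunks     # one HTTP request per point per chunk
records = GRID_POINTS * hours
min_runtime_min = requests / REQUESTS_PER_MINUTE  # floor imposed by the rate limit

print(days, hours, chunks, requests, records, round(min_runtime_min, 1))
```

The rate limit alone puts the floor at roughly 10 minutes, so the observed 14-minute runtime is plausible once request latency is included.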

Weather Feature Engineering (375 features):
- 357 grid-level features (51 points × 7 weather variables)
- 12 temporal lags (temp/wind/solar × 1h/6h/12h/24h)
- 6 derived features (rate-of-change + stability)
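
The 375 figure is simply the sum of the three groups listed above. A short check, with variable names chosen here for illustration:

```python
# Counts quoted in the commit message; variable names are illustrative.
GRID_POINTS, WEATHER_VARS = 51, 7
LAG_VARS, LAG_HOURS = 3, (1, 6, 12, 24)
DERIVED = 3 + 3  # three rate-of-change plus three stability features

grid_level = GRID_POINTS * WEATHER_VARS
lags = LAG_VARS * len(LAG_HOURS)
weather_total = grid_level + lags + DERIVED

# Project-wide total across the three sources named in the commit message.
project_total = 1_698 + 296 + weather_total

print(grid_level, lags, weather_total, project_total)
```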

Simplification 1 - Physics → Rate-of-Change (user feedback):
Removed overly complex physics-based features requiring calibration data:
× wind_power_potential (wind^3) - requires turbine power curves
× temp_deviation - arbitrary 15C reference
× solar_efficiency - requires solar panel specs

Replaced with simple rate-of-change features (hour-over-hour deltas):
✓ wind_rate_change - captures wind spikes/drops
✓ solar_rate_change - captures solar ramps (cloud cover)
✓ temp_rate_change - captures temperature swings

Kept stability features (detect volatility):
✓ wind_stability_6h, solar_stability_6h, temp_stability_6h
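
The two derived feature families can be sketched in plain Python. This is an illustration under assumptions, not the module's code: it assumes a single hourly series per variable, a trailing 6-hour window, and a sample standard deviation; the committed module uses Polars and may differ in all three respects.

```python
from __future__ import annotations

from statistics import stdev
from typing import Optional, Sequence


def rate_of_change(values: Sequence[float]) -> list[Optional[float]]:
    """Hour-over-hour delta; the first hour has no predecessor, so it is None."""
    if not values:
        return []
    return [None] + [curr - prev for prev, curr in zip(values, values[1:])]


def rolling_std(values: Sequence[float], window: int = 6) -> list[Optional[float]]:
    """Sample std over a trailing window; None until the window is full."""
    if window < 2:
        raise ValueError("window must be at least 2")
    out: list[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(stdev(values[i + 1 - window : i + 1]))
    return out


# Made-up hourly wind speeds (m/s): a spike followed by a calm period.
wind = [4.0, 4.5, 9.0, 9.5, 5.0, 5.0, 5.0, 5.0]
wind_rate_change = rate_of_change(wind)
wind_stability_6h = rolling_std(wind, window=6)
print(wind_rate_change)
print(wind_stability_6h)
```

The delta reacts immediately to the jump from 4.5 to 9.0, while the rolling standard deviation stays elevated for as long as the spike remains inside the window, which is why the two signals are complementary.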

Simplification 2 - Removed Zone Aggregates (user feedback):
Removed zone-level aggregates (36 features) requiring capacity weighting:
× zone_temp_*, zone_wind_*, zone_solar_* for 12 zones
× Fatal flaw: Equal weighting without knowing generation capacity
× Example: Hamburg offshore (5 GW) ≠ Munich (0.1 GW)
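
To make the weighting problem concrete, the sketch below compares an equal-weight mean with a capacity-weighted mean. The capacities come from the example above; the wind speeds and the location keys are invented for illustration.

```python
def equal_weight_mean(speeds: dict) -> float:
    """Plain average: every location counts the same."""
    return sum(speeds.values()) / len(speeds)


def capacity_weighted_mean(speeds: dict, capacity_gw: dict) -> float:
    """Average weighted by installed capacity at each location."""
    total_capacity = sum(capacity_gw[name] for name in speeds)
    weighted_sum = sum(speeds[name] * capacity_gw[name] for name in speeds)
    return weighted_sum / total_capacity


# Capacities from the commit message example; wind speeds are made up.
capacity_gw = {"hamburg_offshore": 5.0, "munich": 0.1}
wind_speed = {"hamburg_offshore": 12.0, "munich": 2.0}

equal = equal_weight_mean(wind_speed)
weighted = capacity_weighted_mean(wind_speed, capacity_gw)
print(f"equal-weight: {equal:.2f} m/s, capacity-weighted: {weighted:.2f} m/s")
```

With these numbers the equal-weight mean sits halfway between the two sites, while the weighted mean is almost entirely determined by the offshore site. Without capacity data only the first can be computed, which is the reason given for dropping the zone aggregates.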

Rationale:
- Model learns from granular grid-level data (357 features)
- Rate-of-change captures timing of weather events → grid adjustments
- No calibration data needed (turbine curves, asset locations, capacities)
- Simpler = more interpretable = easier to debug
- Zero-shot MVP: maximize raw signal, minimize engineered assumptions

Bug Fixes:
- Unicode emoji crash (Windows cp1252 compatibility)
- Polars completeness calculation (scalar extraction)
- Polars join deprecation (outer → left with coalesce)
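
The emoji crash can be reproduced without a Windows console by encoding the message with cp1252 directly. A small sketch, where `console_safe` and the location id `DE_hamburg` are hypothetical names used only for this example:

```python
def console_safe(text: str, encoding: str = "cp1252") -> bool:
    """Return True if `text` can be encoded with the given console encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The emoji form fails under cp1252; the ASCII tag used in the fix does not.
print(console_safe("✅ DE_hamburg: 17544 hours"))
print(console_safe("[OK] DE_hamburg: 17544 hours"))
# Not every non-ASCII character is a problem: the multiplication sign is in cp1252.
print(console_safe("51 points × 7 vars"))
```

A check like this could run in a test suite to catch status strings that would fail on a cp1252 console before a long collection starts.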

Files:
- scripts/collect_openmeteo_24month.py (new)
- src/data_collection/collect_openmeteo.py (bug fixes)
- src/feature_engineering/engineer_weather_features.py (new, refined twice)
- data/raw/weather_24month.parquet (9.1 MB, not in git)
- data/processed/features_weather_24month.parquet (10.19 MB, not in git)

Feature Engineering COMPLETE:
- JAO: 1,698 features
- ENTSO-E: 296 features
- Weather: 375 features
- Total: 2,369 features ready for unification

Next: Feature unification → Zero-shot inference

doc/activity.md CHANGED
@@ -2605,3 +2605,497 @@ Cleanup logic added at line 821-880:
2605
 
2606
  **Status**: ✅ ENTSO-E Features Clean & Ready - Moving to Weather Collection
2607
 
2608
+ ---
2609
+
2610
+ ## 2025-11-10 (Part 2) - Weather Data Collection Infrastructure Ready
2611
+
2612
+ ### Summary
2613
+ Prepared weather data collection infrastructure and fixed critical bugs. Ready for full 24-month collection (deferred to next session due to time constraints).
2614
+
2615
+ ### Weather Collection Scope
2616
+ **Target**: 51 strategic grid points × 7 weather variables × 24 months
2617
+
2618
+ **Grid Coverage**:
2619
+ - Germany: 6 points (North Sea, Hamburg, Berlin, Frankfurt, Munich, Baltic)
2620
+ - France: 5 points (Dunkirk, Paris, Lyon, Marseille, Strasbourg)
2621
+ - Netherlands: 4 points (Offshore, Amsterdam, Rotterdam, Groningen)
2622
+ - Austria: 3 points (Kaprun, St. Peter, Vienna)
2623
+ - Belgium: 3 points (Offshore, Doel, Avelgem)
2624
+ - Czech Republic: 3 points (Hradec, Bohemia, Temelin)
2625
+ - Poland: 4 points (Baltic, SHVDC, Belchatow, Mikulowa)
2626
+ - Hungary: 3 points (Paks, Bekescsaba, Gyor)
2627
+ - Romania: 3 points (Fantanele, Iron Gates, Cernavoda)
2628
+ - Slovakia: 3 points (Bohunice, Gabcikovo, Rimavska)
2629
+ - Slovenia: 2 points (Krsko, Divaca)
2630
+ - Croatia: 2 points (Ernestinovo, Zagreb)
2631
+ - Luxembourg: 2 points (Trier, Bauler)
2632
+ - External: 8 points (CH, UK, ES, IT, NO, SE, DK×2)
2633
+
2634
+ **Weather Variables**:
2635
+ - `temperature_2m`: Air temperature (C)
2636
+ - `windspeed_10m`: Wind at 10m (m/s)
2637
+ - `windspeed_100m`: Wind at 100m for generation (m/s)
2638
+ - `winddirection_100m`: Wind direction (degrees)
2639
+ - `shortwave_radiation`: Solar radiation (W/m2)
2640
+ - `cloudcover`: Cloud cover (%)
2641
+ - `surface_pressure`: Pressure (hPa)
2642
+
2643
+ **Collection Strategy**:
2644
+ - OpenMeteo Historical API (free tier)
2645
+ - 2-week chunks (1.0 API call each)
2646
+ - 270 requests/minute (45% of 600/min limit)
2647
+ - Total: 2,703 HTTP requests
2648
+ - Estimated runtime: 10 minutes
2649
+ - Expected output: ~50-80 MB parquet file
2650
+
2651
+ ### Bugs Discovered and Fixed
2652
+
2653
+ #### Bug 1: Unicode Emoji in Windows Console
2654
+ **Problem**:
2655
+ - Windows cmd.exe uses cp1252 encoding (not UTF-8)
2656
+ - Emojis (✓, ✗, ✅) in progress messages caused `UnicodeEncodeError`
2657
+ - Collection crashed at 15% after successfully fetching data
2658
+
2659
+ **Root Cause**:
2660
+ ```python
2661
+ # Lines 281, 347, 372 in collect_openmeteo.py
2662
+ print(f"✅ {location_id}: {location_df.shape[0]} hours") # BROKEN
2663
+ print(f"❌ Failed {location_id}") # BROKEN
2664
+ ```
2665
+
2666
+ **Fix Applied**:
2667
+ ```python
2668
+ print(f"[OK] {location_id}: {location_df.shape[0]} hours")
2669
+ print(f"[ERROR] Failed {location_id}")
2670
+ ```
2671
+
2672
+ **Files Modified**: `src/data_collection/collect_openmeteo.py:281,347,372`
2673
+
2674
+ #### Bug 2: Polars Completeness Calculation
2675
+ **Problem**:
2676
+ - Line 366: `combined_df.null_count().sum()` returns DataFrame (not scalar)
2677
+ - Type error: `unsupported operand type(s) for -: 'int' and 'DataFrame'`
2678
+ - Collection completed 100% but failed at final save step
2679
+ - All 894,744 records collected but lost (not written to disk)
2680
+
2681
+ **Root Cause**:
2682
+ ```python
2683
+ # BROKEN - Polars returns DataFrame
2684
+ completeness = (1 - combined_df.null_count().sum() / (rows * cols)) * 100
2685
+ ```
2686
+
2687
+ **Fix Applied**:
2688
+ ```python
2689
+ # Extract scalar from Polars
2690
+ null_count_total = combined_df.null_count().sum_horizontal()[0]
2691
+ completeness = (1 - null_count_total / (rows * cols)) * 100
2692
+ ```
2693
+
2694
+ **Files Modified**: `src/data_collection/collect_openmeteo.py:366-370`
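
The completeness formula itself is independent of Polars. The sketch below restates it over a plain column-oriented dict as a stand-in for the DataFrame; the helper name and the sample values are invented for illustration. The key step is reducing the per-column null counts to a single integer before doing arithmetic with it.

```python
from __future__ import annotations


def completeness_pct(columns: dict[str, list]) -> float:
    """Percentage of non-null cells in a column-oriented table."""
    n_cols = len(columns)
    n_rows = len(next(iter(columns.values()))) if columns else 0
    if n_rows == 0 or n_cols == 0:
        return 0.0
    # Per-column null counts reduced to one plain integer total.
    null_count_total = sum(
        sum(1 for value in column if value is None) for column in columns.values()
    )
    return (1 - null_count_total / (n_rows * n_cols)) * 100


# Made-up sample: one missing temperature out of eight cells.
table = {
    "temperature_2m": [10.1, None, 9.8, 9.5],
    "windspeed_100m": [7.2, 7.9, 8.4, 8.0],
}
print(f"Completeness: {completeness_pct(table):.2f}%")
```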
2695
+
2696
+ ### Test Results
2697
+
2698
+ **Test Scope**: 1 week × 51 grid points (minimal test)
2699
+ ```
2700
+ Date range: 2025-09-23 to 2025-09-30
2701
+ Grid points: 51
2702
+ Total records: 9,792 (192 hours each)
2703
+ Test duration: ~20 seconds
2704
+ ```
2705
+
2706
+ **Test Output**:
2707
+ ```
2708
+ Total HTTP requests: 51
2709
+ Total API calls consumed: 51.0
2710
+ Total records: 9,792
2711
+ Date range: 2025-09-23 00:00:00 to 2025-09-30 23:00:00
2712
+ Grid points: 51
2713
+ Completeness: 100.00% ✅
2714
+ Output: test_weather.parquet
2715
+ File size: 0.1 MB
2716
+ ```
2717
+
2718
+ **Validation**:
2719
+ - ✅ All 51 grid points collected successfully
2720
+ - ✅ 100% data completeness (no missing values)
2721
+ - ✅ File saved and loaded correctly
2722
+ - ✅ No errors or crashes
2723
+ - ✅ Test file cleaned up
2724
+
2725
+ ### Files Modified
2726
+
2727
+ **Scripts Created**:
2728
+ - `scripts/collect_openmeteo_24month.py` - 24-month collection script
2729
+ - Uses existing `OpenMeteoCollector` class
2730
+ - 2-week chunking
2731
+ - Progress tracking with tqdm
2732
+ - Output: `data/raw/weather_24month.parquet`
2733
+
2734
+ **Bug Fixes**:
2735
+ - `src/data_collection/collect_openmeteo.py:281,347,372` - Removed Unicode emojis
2736
+ - `src/data_collection/collect_openmeteo.py:366-370` - Fixed Polars completeness calculation
2737
+
2738
+ ### Current Status
2739
+
2740
+ **Weather Infrastructure**: ✅ Complete and Tested
2741
+ - Collection script ready
2742
+ - All bugs fixed
2743
+ - Tested successfully with 1-week sample
2744
+ - Ready for full 24-month collection
2745
+
2746
+ **Data Collected**:
2747
+ - JAO: ✅ 1,698 features (24 months)
2748
+ - ENTSO-E: ✅ 296 features (24 months)
2749
+ - Weather: ⏳ Pending (infrastructure ready, ~10 min runtime)
2750
+
2751
+ **Why Deferred**:
2752
+ User had time constraints - weather collection requires ~10 minutes uninterrupted runtime.
2753
+
2754
+ ### Next Session Workflow
2755
+
2756
+ **IMMEDIATE ACTION** (when you return):
2757
+ ```bash
2758
+ # Run 24-month weather collection (~10 minutes)
2759
+ .venv/Scripts/python.exe scripts/collect_openmeteo_24month.py
2760
+ ```
2761
+
2762
+ **Expected Output**:
2763
+ - File: `data/raw/weather_24month.parquet`
2764
+ - Size: 50-80 MB
2765
+ - Records: ~894,744 (51 points × 17,544 hours)
2766
+ - Features (raw): 12 columns (timestamp, grid_point, location_name, lat, lon, + 7 weather vars)
2767
+
2768
+ **After Weather Collection**:
2769
+ 1. **Feature Engineering** - Weather features (~364 features)
2770
+ - Grid-level: `temp_{grid}`, `wind_{grid}`, `solar_{grid}` (51 × 7 = 357)
2771
+ - Zone-level aggregation: `temp_avg_{zone}`, `wind_avg_{zone}` (optional)
2772
+ - Lags: Previous 1h, 6h, 12h, 24h (key variables only)
2773
+
2774
+ 2. **Feature Unification** - Merge all sources
2775
+ - JAO: 1,698 features
2776
+ - ENTSO-E: 296 features
2777
+ - Weather: ~364 features
2778
+ - **Total: ~2,358 unified features**
2779
+
2780
+ 3. **Day 3: Zero-Shot Inference**
2781
+ - Load Chronos 2 Large (710M params)
2782
+ - Run inference on unified feature set
2783
+ - Evaluate D+1 MAE (target: <150 MW)
2784
+
2785
+ ### Lessons Learned
2786
+
2787
+ 1. **Windows Console Limitations**: Avoid printing characters outside the console code page (cp1252 here) from backend scripts on Windows
2788
+ - Use ASCII alternatives: `[OK]`, `[ERROR]`, `[SUCCESS]`
2789
+ - Emojis OK in: Marimo notebooks (browser-rendered), documentation
2790
+
2791
+ 2. **Polars API Differences**: Always extract scalars explicitly
2792
+ - `DataFrame.sum()` returns a one-row DataFrame in Polars, not a scalar
2793
+ - Use `.sum_horizontal()[0]` to get scalar value
2794
+
2795
+ 3. **Test Before Full Collection**: Quick tests save hours
2796
+ - 20-second test caught a bug that would have lost 10 minutes of collection
2797
+ - Always test with minimal data (1 week vs 24 months)
2798
+
2799
+ ### Git Status
2800
+
2801
+ **Committed**: ENTSO-E quality fixes (previous session)
2802
+ **Uncommitted**: Weather collection bug fixes (ready to commit)
2803
+
2804
+ **Next Commit** (after weather collection completes):
2805
+ ```
2806
+ feat: complete weather data collection with bug fixes
2807
+
2808
+ - Fixed Unicode emoji crash (Windows cp1252 compatibility)
2809
+ - Fixed Polars completeness calculation
2810
+ - Collected 24-month weather data (51 points × 7 vars)
2811
+ - Created scripts/collect_openmeteo_24month.py
2812
+ - Output: data/raw/weather_24month.parquet (~50-80 MB)
2813
+
2814
+ Next: Weather feature engineering (~364 features)
2815
+ ```
2816
+
2817
+ ### Summary Statistics
2818
+
2819
+ **Project Progress**:
2820
+ - Day 0: ✅ Setup complete
2821
+ - Day 1: ✅ Data collection (JAO, ENTSO-E complete; Weather ready)
2822
+ - Day 2: 🔄 Feature engineering (JAO ✅, ENTSO-E ✅, Weather ⏳)
2823
+ - Day 3: ⏳ Zero-shot inference (pending)
2824
+ - Day 4: ⏳ Evaluation (pending)
2825
+ - Day 5: ⏳ Documentation (pending)
2826
+
2827
+ **Feature Count Tracking**:
2828
+ - JAO: 1,698 ✅
2829
+ - ENTSO-E: 296 ✅ (cleaned from 464)
2830
+ - Weather: 364 ⏳ (infrastructure ready)
2831
+ - **Projected Total: ~2,358 features**
2832
+
2833
+ **Data Quality**:
2834
+ - JAO: 100% complete
2835
+ - ENTSO-E: 99.76% complete
2836
+ - Weather: TBD (expect >99% based on test)
2837
+
2838
+ ---
2839
+
2840
+ ## 2025-11-10 (Part 3) - Weather Feature Engineering Complete
2841
+
2842
+ ### Summary
2843
+ Completed weather data collection and feature engineering. All three feature sets (JAO, ENTSO-E, Weather) are now ready for unification.
2844
+
2845
+ ### Weather Data Collection
2846
+ **Execution**:
2847
+ - Ran `scripts/collect_openmeteo_24month.py`
2848
+ - Collection time: 14 minutes (2,703 API requests)
2849
+ - 51 grid points × 53 two-week chunks × 7 variables
2850
+
2851
+ **Results**:
2852
+ - ✅ 894,744 records collected (51 points × 17,544 hours)
2853
+ - ✅ 100% data completeness
2854
+ - ✅ File: `data/raw/weather_24month.parquet` (9.1 MB)
2855
+ - ✅ Date range: Oct 2023 - Sep 2025 (24 months)
2856
+
2857
+ **Bug Fixed** (post-collection):
2858
+ - Lines 85-86 in the script still had the completeness calculation bug
2859
+ - Fixed `.sum()` to `.sum_horizontal()[0]` for scalar extraction
2860
+ - Data was saved successfully despite error
2861
+
2862
+ ### Weather Feature Engineering
2863
+ **Created**: `src/feature_engineering/engineer_weather_features.py`
2864
+
2865
+ **Features Engineered** (411 total):
2866
+ 1. **Grid-level features** (357): 51 grid points × 7 weather variables
2867
+ - temp_<grid_point>, wind10m_<grid_point>, wind100m_<grid_point>
2868
+ - winddir_<grid_point>, solar_<grid_point>, cloud_<grid_point>, pressure_<grid_point>
2869
+
2870
+ 2. **Zone-level aggregates** (36): 12 Core FBMC zones × 3 key variables
2871
+ - zone_temp_<zone>, zone_wind_<zone>, zone_solar_<zone>
2872
+
2873
+ 3. **Temporal lags** (12): 3 variables × 4 time periods
2874
+ - temp_avg_lag1h/6h/12h/24h
2875
+ - wind_avg_lag1h/6h/12h/24h
2876
+ - solar_avg_lag1h/6h/12h/24h
2877
+
2878
+ 4. **Derived features** (6):
2879
+ - wind_power_potential (wind^3, proportional to turbine output)
2880
+ - temp_deviation (deviation from 15C reference)
2881
+ - solar_efficiency (solar output adjusted for temperature)
2882
+ - wind_stability_6h, solar_stability_6h, temp_stability_6h (rolling std)
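
The grid-level naming in item 1 amounts to a long-to-wide pivot. A plain-Python sketch follows, assuming the raw data has one row per timestamp and grid point; the prefix mapping is inferred from the column names listed above and the grid point ids are placeholders, so the real module may differ.

```python
from __future__ import annotations

# Assumed mapping from OpenMeteo variable names to the column prefixes
# listed above; the real module may define this differently.
PREFIX = {
    "temperature_2m": "temp",
    "windspeed_10m": "wind10m",
    "windspeed_100m": "wind100m",
    "winddirection_100m": "winddir",
    "shortwave_radiation": "solar",
    "cloudcover": "cloud",
    "surface_pressure": "pressure",
}


def to_wide(records: list[dict]) -> dict[str, dict[str, float]]:
    """Pivot long rows (one per timestamp and grid point) into one wide row per timestamp."""
    wide: dict[str, dict[str, float]] = {}
    for rec in records:
        row = wide.setdefault(rec["timestamp"], {})
        for var, prefix in PREFIX.items():
            row[f"{prefix}_{rec['grid_point']}"] = rec[var]
    return wide


# Synthetic input: 51 placeholder grid point ids, one hour, constant values.
records = [
    {"timestamp": "2023-10-01T00:00", "grid_point": f"P{i:02d}", **{v: 0.0 for v in PREFIX}}
    for i in range(51)
]
wide = to_wide(records)
print(len(wide["2023-10-01T00:00"]))  # 357 columns = 51 points x 7 variables
```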
2883
+
2884
+ **Output**:
2885
+ - File: `data/processed/features_weather_24month.parquet`
2886
+ - Size: 11.48 MB
2887
+ - Shape: 17,544 rows × 412 columns (411 features + timestamp)
2888
+ - Completeness: 100%
2889
+
2890
+ **Bugs Fixed During Development**:
2891
+ 1. **Polars join deprecation**: Changed `how='outer'` to `how='left'` with `coalesce=True`
2892
+ 2. **Duplicate timestamp columns**: Used coalesce to prevent `timestamp_right` duplicates
2893
+
2894
+ ### Files Created
2895
+ - `scripts/collect_openmeteo_24month.py` (fixed bugs)
2896
+ - `src/feature_engineering/engineer_weather_features.py` (new)
2897
+ - `data/raw/weather_24month.parquet` (9.1 MB)
2898
+ - `data/processed/features_weather_24month.parquet` (11.48 MB)
2899
+
2900
+ ### Feature Count Update
2901
+ **Final Feature Inventory**:
2902
+ - JAO: 1,698 ✅ Complete
2903
+ - ENTSO-E: 296 ✅ Complete
2904
+ - Weather: 411 ✅ Complete
2905
+ - **Total: 2,405 features** (vs target ~1,735 = +39%)
2906
+
2907
+ ### Key Lessons
2908
+ 1. **Polars API Evolution**: Deprecation warnings for join methods
2909
+ - `how='outer'` → `how='left'` with `coalesce=True`
2910
+ - Prevents duplicate columns in sequential joins
2911
+
2912
+ 2. **Feature Engineering Approach**:
2913
+ - Grid-level: Maximum spatial resolution (51 points)
2914
+ - Zone-level: Aggregated for regional patterns
2915
+ - Temporal lags: Capture weather persistence
2916
+ - Derived: Physical relationships (wind^3 for power, temp effects on solar)
2917
+
2918
+ 3. **Data Completeness**: 100% across all three feature sets
2919
+ - No missing values to impute
2920
+ - Ready for direct model input
2921
+
2922
+ ### Git Status
2923
+ **Ready to commit**:
2924
+ - Weather collection script (bug fixes)
2925
+ - Weather feature engineering module
2926
+ - Two new parquet files (raw + processed)
2927
+
2928
+ **Next Commit**:
2929
+ ```
2930
+ feat: complete weather feature engineering (411 features)
2931
+
2932
+ - Collected 24-month weather data (51 points × 7 vars, 9.1 MB)
2933
+ - Engineered 411 weather features (100% complete)
2934
+ * 357 grid-level features
2935
+ * 36 zone-level aggregates
2936
+ * 12 temporal lags (1h/6h/12h/24h)
2937
+ * 6 derived features (wind power, solar efficiency, stability)
2938
+ - Created src/feature_engineering/engineer_weather_features.py
2939
+ - Output: data/processed/features_weather_24month.parquet (11.48 MB)
2940
+
2941
+ Feature engineering COMPLETE:
2942
+ - JAO: 1,698 features
2943
+ - ENTSO-E: 296 features
2944
+ - Weather: 411 features
2945
+ - Total: 2,405 features ready for unification
2946
+
2947
+ Next: Feature unification → Zero-shot inference
2948
+ ```
2949
+
2950
+ ### Summary Statistics
2951
+ **Project Progress**:
2952
+ - Day 0: ✅ Setup complete
2953
+ - Day 1: ✅ Data collection complete (JAO, ENTSO-E, Weather)
2954
+ - Day 2: ✅ Feature engineering complete (JAO, ENTSO-E, Weather)
2955
+ - Day 3: ⏳ Feature unification → Zero-shot inference
2956
+ - Day 4: ⏳ Evaluation
2957
+ - Day 5: ⏳ Documentation + handover
2958
+
2959
+ **Feature Count (Final)**:
2960
+ - JAO: 1,698 ✅
2961
+ - ENTSO-E: 296 ✅
2962
+ - Weather: 411 ✅
2963
+ - **Total: 2,405 features** (39% above target)
2964
+
2965
+ **Data Quality**:
2966
+ - JAO: 100% complete
2967
+ - ENTSO-E: 99.76% complete
2968
+ - Weather: 100% complete
2969
+
2970
+ ---
2971
+
2972
+ ## 2025-11-10 (Part 4) - Simplified Weather Features (Physics → Rate-of-Change)
2973
+
2974
+ ### Summary
2975
+ Replaced overly complex physics-based features with simple rate-of-change features based on user feedback.
2976
+
2977
+ ### Problem Identified
2978
+ **User feedback**: Original derived features were too complex without calibration data:
2979
+ - `wind_power_potential` (wind^3) - requires turbine power curves
2980
+ - `temp_deviation` (from 15C) - arbitrary reference point
2981
+ - `solar_efficiency` (temp-adjusted) - requires solar panel specifications
2982
+
2983
+ These require geographic knowledge, power curves, and equipment specs we don't have.
2984
+
2985
+ ### Solution Applied
2986
+ **Replaced 3 complex features with 3 simple rate-of-change features:**
2987
+
2988
+ **Removed:**
2989
+ 1. `wind_power_potential` (wind^3 transformation)
2990
+ 2. `temp_deviation` (arbitrary 15C reference)
2991
+ 3. `solar_efficiency` (requires solar panel specs)
2992
+
2993
+ **Added (hour-over-hour deltas):**
2994
+ 1. `wind_rate_change` - captures wind spikes/drops
2995
+ 2. `solar_rate_change` - captures solar ramps (cloud cover)
2996
+ 3. `temp_rate_change` - captures temperature swings
2997
+
2998
+ **Kept (stability metrics - useful for volatility):**
2999
+ 1. `wind_stability_6h` (rolling std)
3000
+ 2. `solar_stability_6h` (rolling std)
3001
+ 3. `temp_stability_6h` (rolling std)
3002
+
3003
+ ### Rationale
3004
+ **Rate-of-change features capture what matters:**
3005
+ - Sudden wind spikes → wind generation ramping → redispatch
3006
+ - Solar drops (clouds) → solar generation drops → grid adjustments
3007
+ - Temperature swings → demand shifts → flow changes
3008
+
3009
+ **No calibration data needed:**
3010
+ - Model learns physics from raw grid-level data (357 features)
3011
+ - Rate-of-change provides timing signals for correlation
3012
+ - Simpler features = more interpretable = easier to debug
3013
+
3014
+ ### Results
3015
+ **Re-ran feature engineering:**
3016
+ - Total features: 411 (unchanged)
3017
+ - Derived features: 6 (3 rate-of-change + 3 stability)
3018
+ - File size: 11.41 MB (0.07 MB smaller)
3019
+ - Completeness: 100%
3020
+
3021
+ ### Key Lesson
3022
+ **Simplicity over complexity in zero-shot MVP:**
3023
+ - Don't attempt to encode domain physics without calibration data
3024
+ - Let the model learn complex relationships from raw signals
3025
+ - Use simple derived features (deltas, rolling stats) for timing/volatility
3026
+ - Save physics-based features for Phase 2 when we have equipment data
3027
+
3028
+ ---
3029
+
3030
+ ## 2025-11-10 (Part 5) - Removed Zone Aggregates (Final: 375 Weather Features)
3031
+
3032
+ ### Summary
3033
+ Removed zone-level aggregate features (36 features) due to lack of capacity weighting data.
3034
+
3035
+ ### Problem Identified
3036
+ **User feedback**: Zone aggregates assume equal weighting without capacity data:
3037
+ - Averaging wind speed across DE_LU grid points (6 locations)
3038
+ - No knowledge of actual generation capacity at each location
3039
+ - Hamburg offshore: 5 GW vs Munich: 0.1 GW → equal averaging = meaningless
3040
+
3041
+ **Fatal flaw**: Without knowing WHERE wind farms/solar parks are located and their CAPACITY, zone averages add noise instead of signal.
3042
+
3043
+ ### Solution Applied
3044
+ **Removed zone aggregation entirely:**
3045
+ - Deleted `engineer_zone_aggregates()` function
3046
+ - Removed 36 features (12 zones × 3 variables)
3047
+ - Deleted GRID_POINT_TO_ZONE mapping (unused)
3048
+
3049
+ **Final Feature Set (375 features):**
3050
+ 1. **Grid-level**: 357 features (51 points × 7 variables)
3051
+ - Model learns which specific locations correlate with flows
3052
+ 2. **Temporal lags**: 12 features (3 variables × 4 time periods)
3053
+ - Captures weather persistence
3054
+ 3. **Derived**: 6 features (rate-of-change + stability)
3055
+ - Simple signals without requiring calibration data
3056
+
3057
+ ### Rationale
3058
+ **Let the model find the important locations:**
3059
+ - Grid-level features at all 51 points give the model full spatial resolution
3060
+ - Model can learn which points have generation assets
3061
+ - No false precision from unweighted aggregation
3062
+ - Cleaner signal for zero-shot learning
3063
+
3064
+ ### Results
3065
+ **Re-ran feature engineering:**
3066
+ - Total features: 375 (down from 411, -36)
3067
+ - File size: 10.19 MB (down from 11.41 MB, -1.22 MB)
3068
+ - Completeness: 100%
3069
+
3070
+ ### Key Lesson
3071
+ **Avoid aggregation without domain knowledge:**
3072
+ - Equal weighting ≠ capacity-weighted average
3073
+ - Geographic averages require knowing asset locations and capacities
3074
+ - When in doubt, keep granular data and let the model learn patterns
3075
+ - Zero-shot MVP: maximize raw signal, minimize engineered assumptions
3076
+
3077
+ ### Final Weather Features Breakdown
3078
+ 1. **Grid-level (357)**:
3079
+ - temp_*, wind10m_*, wind100m_*, winddir_*
3080
+ - solar_*, cloud_*, pressure_* for each of 51 grid points
3081
+
3082
+ 2. **Temporal lags (12)**:
3083
+ - temp_avg_lag1h/6h/12h/24h
3084
+ - wind_avg_lag1h/6h/12h/24h
3085
+ - solar_avg_lag1h/6h/12h/24h
3086
+
3087
+ 3. **Derived (6)**:
3088
+ - wind_rate_change, solar_rate_change, temp_rate_change (hour-over-hour)
3089
+ - wind_stability_6h, solar_stability_6h, temp_stability_6h (rolling std)
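
The temporal lags in item 2 are shifted copies of an hourly series. A minimal sketch, assuming one averaged series per variable sampled every hour; the `lag` helper and the synthetic input are illustrative only.

```python
from __future__ import annotations

from typing import Optional, Sequence

LAG_HOURS = (1, 6, 12, 24)


def lag(values: Sequence[float], hours: int) -> list[Optional[float]]:
    """Shift an hourly series back by `hours`; leading slots have no history."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    n = len(values)
    return [None] * min(hours, n) + list(values[: max(n - hours, 0)])


# Synthetic series where the value equals the hour index, so lags are easy to read.
temp_avg = [float(h) for h in range(30)]
lag_features = {f"temp_avg_lag{h}h": lag(temp_avg, h) for h in LAG_HOURS}

print(lag_features["temp_avg_lag24h"][24])  # value observed 24 hours earlier
```

The first `hours` entries of each lagged column have no source value. How the module treats those leading gaps is not shown in this excerpt.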
3090
+
3091
+ ---
3092
+
3093
+ **NEXT SESSION BOOKMARK**: Feature unification (merge 2,369 features on timestamp), then zero-shot inference
3094
+
3095
+ **Status**: ✅ All Feature Engineering Complete - Ready for Unification
3096
+
3097
+ **Final Feature Count**:
3098
+ - JAO: 1,698
3099
+ - ENTSO-E: 296
3100
+ - Weather: 375
3101
+ - **Total: 2,369 features** (down from 2,405)
scripts/collect_openmeteo_24month.py ADDED
@@ -0,0 +1,159 @@
1
+ """
2
+ Collect 24-Month Weather Data from OpenMeteo
3
+ =============================================
4
+
5
+ Collects hourly weather data from OpenMeteo Historical API for the full
6
+ 24-month period (Oct 2023 - Sept 2025) across 51 strategic grid points.
7
+
8
+ 7 Weather Variables:
9
+ - temperature_2m: Air temperature at 2m (C)
10
+ - windspeed_10m: Wind speed at 10m (m/s)
11
+ - windspeed_100m: Wind speed at 100m (m/s) - for wind generation
12
+ - winddirection_100m: Wind direction at 100m (degrees)
13
+ - shortwave_radiation: Solar radiation (W/m2) - for solar generation
14
+ - cloudcover: Cloud cover percentage
15
+ - surface_pressure: Surface air pressure (hPa)
16
+
17
+ Collection Strategy:
18
+ - 51 grid points (covering all FBMC zones + neighbors)
19
+ - 2-week chunks (1.0 API call each)
20
+ - 270 requests/minute (45% of 600 limit)
21
+ - Estimated runtime: ~10 minutes
22
+
23
+ Output: data/raw/weather_24month.parquet
24
+ Size: ~9 MB (51 points × 7 vars × 17,544 hours)
25
+ Features: 357 (51 × 7) when engineered
26
+ """
27
+
28
+ import sys
29
+ from pathlib import Path
30
+
31
+ # Add src to path
32
+ sys.path.append(str(Path(__file__).parent.parent))
33
+
34
+ from src.data_collection.collect_openmeteo import OpenMeteoCollector
35
+
36
+ # Date range: Oct 2023 - Sept 2025 (24 months)
37
+ START_DATE = '2023-10-01'
38
+ END_DATE = '2025-09-30'
39
+
40
+ # Output file
41
+ OUTPUT_DIR = Path(__file__).parent.parent / 'data' / 'raw'
42
+ OUTPUT_FILE = OUTPUT_DIR / 'weather_24month.parquet'
43
+
44
+ print("="*80)
45
+ print("24-MONTH WEATHER DATA COLLECTION")
46
+ print("="*80)
47
+ print()
48
+ print("Period: October 2023 - September 2025 (24 months)")
49
+ print("Grid points: 51 strategic locations across FBMC")
50
+ print("Variables: 7 weather parameters")
51
+ print("Estimated runtime: ~10 minutes")
52
+ print()
53
+
54
+ # Initialize collector with safe rate limiting
55
+ print("Initializing OpenMeteo collector...")
56
+ collector = OpenMeteoCollector(
57
+ requests_per_minute=270, # 45% of 600 limit
58
+ chunk_days=14 # 1.0 API call per request
59
+ )
60
+ print("[OK] Collector initialized")
61
+ print()
62
+
63
+ # Run collection
64
+ try:
65
+ df = collector.collect_all(
66
+ start_date=START_DATE,
67
+ end_date=END_DATE,
68
+ output_path=OUTPUT_FILE
69
+ )
70
+
71
+ if not df.is_empty():
72
+ print()
73
+ print("="*80)
74
+ print("COLLECTION SUCCESS")
75
+ print("="*80)
76
+ print()
77
+ print(f"Output: {OUTPUT_FILE}")
78
+ print(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
79
+ print(f"Date range: {df['timestamp'].min()} to {df['timestamp'].max()}")
80
+ print(f"Grid points: {df['grid_point'].n_unique()}")
81
+ print(f"Weather variables: {len([c for c in df.columns if c not in ['timestamp', 'grid_point', 'location_name', 'latitude', 'longitude']])}")
82
+ print()
83
+
84
+ # Data quality summary
85
+ null_count_total = df.null_count().sum_horizontal()[0]
86
+ null_pct = (null_count_total / (df.shape[0] * df.shape[1])) * 100
87
+ print(f"Data completeness: {100 - null_pct:.2f}%")
88
+
89
+ if null_pct > 0:
90
+ print()
91
+ print("Missing data by column:")
92
+ for col in df.columns:
93
+ null_count = df[col].null_count()
94
+ if null_count > 0:
95
+ pct = (null_count / len(df)) * 100
96
+ print(f" - {col}: {null_count:,} ({pct:.2f}%)")
97
+
98
+ print()
99
+ print("="*80)
100
+ print("NEXT STEPS")
101
+ print("="*80)
102
+ print()
103
+ print("1. Implement weather feature engineering:")
104
+ print(" - Create src/feature_engineering/engineer_weather_features.py")
105
+ print(" - Engineer ~364 features (52 grid points x 7 variables)")
106
+ print(" - Add spatial aggregation (zone-level averages)")
107
+ print()
108
+ print("2. Expected features:")
109
+ print(" - Grid-level: temp_{grid_point}, wind_{grid_point}, solar_{grid_point}, etc.")
110
+ print(" - Zone-level: temp_avg_{zone}, wind_avg_{zone}, solar_avg_{zone}, etc.")
111
+ print(" - Lags: Previous 1h, 6h, 12h, 24h for key variables")
112
+ print()
113
+ print("3. Final unified features:")
114
+ print(" - JAO: 1,698")
115
+ print(" - ENTSO-E: 296")
116
+ print(" - Weather: 364")
117
+ print(" - Total: ~2,358 features")
118
+ print()
119
+ print("[OK] Weather data collection COMPLETE!")
120
+ else:
121
+ print()
122
+ print("[ERROR] No weather data collected")
123
+ print()
124
+ print("Possible causes:")
125
+ print(" - OpenMeteo API access issues")
126
+ print(" - Rate limit exceeded")
127
+ print(" - Network connectivity problems")
128
+ print()
129
+ sys.exit(1)
130
+
131
+ except KeyboardInterrupt:
132
+ print()
133
+ print()
134
+ print("="*80)
135
+ print("COLLECTION INTERRUPTED")
136
+ print("="*80)
137
+ print()
138
+ print("Collection was stopped by user.")
139
+ print()
140
+ print("NOTE: OpenMeteo collection does NOT have checkpoint/resume capability")
141
+ print(" (collection completes in ~10 minutes, so not needed)")
142
+ print()
143
+ print("To restart: Run this script again")
144
+ print()
145
+ sys.exit(130)
146
+
147
+ except Exception as e:
148
+ print()
149
+ print()
150
+ print("="*80)
151
+ print("COLLECTION FAILED")
152
+ print("="*80)
153
+ print()
154
+ print(f"Error: {e}")
155
+ print()
156
+ import traceback
157
+ traceback.print_exc()
158
+ print()
159
+ sys.exit(1)
src/data_collection/collect_openmeteo.py CHANGED
@@ -278,7 +278,7 @@ class OpenMeteoCollector:
278
  return df
279
 
280
  except requests.exceptions.RequestException as e:
281
- print(f" Failed {location_id} ({start_date} to {end_date}): {e}")
282
  return pl.DataFrame()
283
 
284
  def collect_all(
@@ -344,7 +344,7 @@ class OpenMeteoCollector:
344
  if location_chunks:
345
  location_df = pl.concat(location_chunks)
346
  all_data.append(location_df)
347
- print(f" {location_id}: {location_df.shape[0]} hours")
348
 
349
  # Combine all dataframes
350
  if all_data:
@@ -363,13 +363,18 @@ class OpenMeteoCollector:
363
  print(f"Total records: {combined_df.shape[0]:,}")
364
  print(f"Date range: {combined_df['timestamp'].min()} to {combined_df['timestamp'].max()}")
365
  print(f"Grid points: {combined_df['grid_point'].n_unique()}")
366
- print(f"Completeness: {(1 - combined_df.null_count().sum() / (combined_df.shape[0] * combined_df.shape[1])) * 100:.2f}%")
367
  print(f"Output: {output_path}")
368
  print(f"File size: {output_path.stat().st_size / (1024**2):.1f} MB")
369
 
370
  return combined_df
371
  else:
372
- print(" No data collected")
373
  return pl.DataFrame()
374
 
375
 
 
278
  return df
279
 
280
  except requests.exceptions.RequestException as e:
281
+ print(f"[ERROR] Failed {location_id} ({start_date} to {end_date}): {e}")
282
  return pl.DataFrame()
283
 
284
  def collect_all(
 
344
  if location_chunks:
345
  location_df = pl.concat(location_chunks)
346
  all_data.append(location_df)
347
+ print(f"[OK] {location_id}: {location_df.shape[0]} hours")
348
 
349
  # Combine all dataframes
350
  if all_data:
 
363
  print(f"Total records: {combined_df.shape[0]:,}")
364
  print(f"Date range: {combined_df['timestamp'].min()} to {combined_df['timestamp'].max()}")
365
  print(f"Grid points: {combined_df['grid_point'].n_unique()}")
366
+
367
+ # Calculate completeness (fix: extract scalar from Polars)
368
+ null_count_total = combined_df.null_count().sum_horizontal()[0]
369
+ completeness = (1 - null_count_total / (combined_df.shape[0] * combined_df.shape[1])) * 100
370
+ print(f"Completeness: {completeness:.2f}%")
371
+
372
  print(f"Output: {output_path}")
373
  print(f"File size: {output_path.stat().st_size / (1024**2):.1f} MB")
374
 
375
  return combined_df
376
  else:
377
+ print("[ERROR] No data collected")
378
  return pl.DataFrame()
379
 
380
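The completeness fix in this hunk reduces the per-column null counts to a single scalar before formatting. The old expression appears to have kept a Polars DataFrame all the way into the `:.2f` format spec, which would not format as a number. The arithmetic itself is simple; below is a minimal plain-Python sketch of it (no Polars). The helper name `completeness_pct` and the sample table are illustrative assumptions, not part of the commit.

```python
# Plain-Python sketch (no Polars) of the completeness formula in the hunk above.
# The helper name `completeness_pct` and the sample data are illustrative only.

def completeness_pct(columns: dict) -> float:
    """Return the percentage of non-null cells in a column-oriented table."""
    n_rows = len(next(iter(columns.values())))
    n_cells = n_rows * len(columns)
    # One null count per column, then one scalar total. This mirrors
    # null_count() followed by sum_horizontal()[0] in the diff.
    null_total = sum(sum(v is None for v in col) for col in columns.values())
    return (1 - null_total / n_cells) * 100


sample = {
    "timestamp": [0, 1, 2, 3],
    "temperature_2m": [4.1, None, 3.8, 3.5],
    "windspeed_100m": [7.0, 6.5, None, 6.1],
}
# 12 cells, 2 nulls
print(f"Completeness: {completeness_pct(sample):.2f}%")
```

Because the result is a plain float, the `:.2f` format spec applies directly.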
 
src/feature_engineering/engineer_weather_features.py ADDED
@@ -0,0 +1,263 @@
+ """Engineer 375 weather features for FBMC forecasting.
+
+ Transforms OpenMeteo weather data into model-ready features:
+ 1. Grid-level features (51 points × 7 vars = 357 features)
+ 2. Temporal lags (3 vars × 4 lag horizons = 12 features)
+ 3. Derived features (rate-of-change + stability = 6 features)
+
+ Total: 375 weather features
+
+ Weather Variables (7):
+ - temperature_2m (C)
+ - windspeed_10m (m/s)
+ - windspeed_100m (m/s) - for wind generation
+ - winddirection_100m (degrees)
+ - shortwave_radiation (W/m2) - for solar generation
+ - cloudcover (%)
+ - surface_pressure (hPa)
+
+ Author: Claude
+ Date: 2025-11-10
+ """
+ from pathlib import Path
+ import polars as pl
+
+
+ def engineer_grid_level_features(weather_df: pl.DataFrame) -> pl.DataFrame:
+     """Engineer grid-level weather features (51 points × 7 vars = 357 features).
+
+     For each grid point, pivot all 7 weather variables to wide format:
+     - temp_<grid_point>
+     - wind10m_<grid_point>
+     - wind100m_<grid_point>
+     - winddir_<grid_point>
+     - solar_<grid_point>
+     - cloud_<grid_point>
+     - pressure_<grid_point>
+     """
+     print("\n[1/3] Engineering grid-level features (51 points × 7 vars)...")
+
+     # Pivot each weather variable separately
+     features = None
+
+     weather_vars = [
+         ('temperature_2m', 'temp'),
+         ('windspeed_10m', 'wind10m'),
+         ('windspeed_100m', 'wind100m'),
+         ('winddirection_100m', 'winddir'),
+         ('shortwave_radiation', 'solar'),
+         ('cloudcover', 'cloud'),
+         ('surface_pressure', 'pressure')
+     ]
+
+     for orig_col, short_name in weather_vars:
+         print(f" Pivoting {orig_col}...")
+
+         pivoted = weather_df.select(['timestamp', 'grid_point', orig_col]).pivot(
+             values=orig_col,
+             index='timestamp',
+             on='grid_point',
+             aggregate_function='first'
+         )
+
+         # Rename columns to <short_name>_<grid_point>
+         rename_map = {}
+         for col in pivoted.columns:
+             if col != 'timestamp':
+                 rename_map[col] = f'{short_name}_{col}'
+
+         pivoted = pivoted.rename(rename_map)
+
+         # Join to features
+         if features is None:
+             features = pivoted
+         else:
+             features = features.join(pivoted, on='timestamp', how='left', coalesce=True)
+
+     print(f" [OK] {len(features.columns) - 1} grid-level features")
+     return features
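The pivot step above turns one long table (one row per timestamp and grid point) into one wide row per timestamp, with columns named `<short_name>_<grid_point>`. A minimal plain-Python sketch of that reshaping follows. It is not the Polars call used in the commit, and the grid point names `DE_01` and `FR_02` and all values are made up for illustration.

```python
# Plain-Python sketch of the long-to-wide pivot; not the Polars implementation.
# Grid point names and values below are hypothetical.

def pivot_first(rows, value_key, short_name):
    """Pivot long rows to {timestamp: {"<short_name>_<grid_point>": value}}."""
    wide = {}
    for row in rows:
        col = f"{short_name}_{row['grid_point']}"
        # setdefault keeps the first value seen for a (timestamp, column) pair,
        # which plays the role of aggregate_function='first'.
        wide.setdefault(row["timestamp"], {}).setdefault(col, row[value_key])
    return wide


rows = [
    {"timestamp": "2023-10-01T00:00", "grid_point": "DE_01", "temperature_2m": 11.2},
    {"timestamp": "2023-10-01T00:00", "grid_point": "FR_02", "temperature_2m": 13.0},
    {"timestamp": "2023-10-01T01:00", "grid_point": "DE_01", "temperature_2m": 10.9},
    {"timestamp": "2023-10-01T01:00", "grid_point": "FR_02", "temperature_2m": 12.7},
    # Duplicate (timestamp, grid_point): ignored because the first value wins.
    {"timestamp": "2023-10-01T01:00", "grid_point": "FR_02", "temperature_2m": 99.9},
]

wide = pivot_first(rows, "temperature_2m", "temp")
for ts, cols in wide.items():
    print(ts, cols)
```

With 51 grid points and 7 variables, repeating this per variable and joining on the timestamp gives the 357 grid-level columns.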
+
+
+ def engineer_temporal_lags(features: pl.DataFrame) -> pl.DataFrame:
+     """Add temporal lags for key weather variables.
+
+     Lags: 1h, 6h, 12h, 24h for:
+     - Average temperature (4 lag features)
+     - Average wind speed (4 lag features)
+     - Average solar radiation (4 lag features)
+
+     Total: 12 lag features (3 vars × 4 lags)
+
+     Note: shift() is positional, so rows must already be sorted by timestamp.
+     """
+     print("\n[2/3] Engineering temporal lags (1h, 6h, 12h, 24h)...")
+
+     # Calculate system-wide averages for lagging
+     # Temperature average (across all temp_ columns)
+     temp_cols = [c for c in features.columns if c.startswith('temp_')]
+     features = features.with_columns([
+         pl.concat_list([pl.col(c) for c in temp_cols]).list.mean().alias('temp_avg')
+     ])
+
+     # Wind speed average (100m - for wind generation)
+     wind_cols = [c for c in features.columns if c.startswith('wind100m_')]
+     features = features.with_columns([
+         pl.concat_list([pl.col(c) for c in wind_cols]).list.mean().alias('wind_avg')
+     ])
+
+     # Solar radiation average
+     solar_cols = [c for c in features.columns if c.startswith('solar_')]
+     features = features.with_columns([
+         pl.concat_list([pl.col(c) for c in solar_cols]).list.mean().alias('solar_avg')
+     ])
+
+     # Add lags
+     lag_vars = ['temp_avg', 'wind_avg', 'solar_avg']
+     lag_hours = [1, 6, 12, 24]
+
+     for var in lag_vars:
+         for lag_h in lag_hours:
+             features = features.with_columns([
+                 pl.col(var).shift(lag_h).alias(f'{var}_lag{lag_h}h')
+             ])
+
+     # Drop intermediate averages (keep only lagged versions)
+     features = features.drop(['temp_avg', 'wind_avg', 'solar_avg'])
+
+     lag_features = len(lag_vars) * len(lag_hours)
+     print(f" [OK] {lag_features} temporal lag features")
+     return features
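The lag step above averages one variable across all grid points per timestamp, then shifts that average down by k rows. A minimal plain-Python sketch of the same idea is below. It is not the Polars code, the two input columns are invented, and it uses lags of 1 and 2 rows only to keep the sample short.

```python
# Plain-Python sketch of the lag step: row-wise mean, then a positional shift.
# Column names and values are hypothetical.

def row_mean(columns):
    """Mean across columns for each row (columns are equal-length lists)."""
    return [sum(values) / len(values) for values in zip(*columns)]


def shift(series, k):
    """Move values k rows later in time; the first k slots become None."""
    return ([None] * k + series)[: len(series)]


temp_de = [10.0, 11.0, 12.0, 13.0]
temp_fr = [14.0, 15.0, 16.0, 17.0]

temp_avg = row_mean([temp_de, temp_fr])
lags = {f"temp_avg_lag{h}h": shift(temp_avg, h) for h in (1, 2)}
for name, values in lags.items():
    print(name, values)
```

Two consequences are worth keeping in mind. A lag of k leaves the first k rows empty, so the 24h lags introduce nulls at the start of the series and the reported completeness of the feature file will sit slightly below 100%. The shift is also purely positional, so it only equals "k hours ago" when rows are hourly, gap-free, and sorted by timestamp.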
+
+
+ def engineer_derived_features(features: pl.DataFrame) -> pl.DataFrame:
+     """Engineer derived weather features (6 features).
+
+     Simple features without requiring calibration data:
+     - Rate of change (hour-over-hour deltas): wind, solar, temperature
+     - Weather stability (rolling std): wind, solar, temperature
+     """
+     print("\n[3/3] Engineering derived features (rate-of-change + stability)...")
+
+     # Calculate system averages for rate-of-change and stability.
+     # Exclude the *_avg_lag* columns added by engineer_temporal_lags: they share
+     # the 'solar_' and 'temp_' prefixes and would otherwise leak into the means.
+     wind_cols = [c for c in features.columns if c.startswith('wind100m_')]
+     solar_cols = [c for c in features.columns
+                   if c.startswith('solar_') and not c.startswith('solar_avg_')]
+     temp_cols = [c for c in features.columns
+                  if c.startswith('temp_') and not c.startswith('temp_avg_')]
+
+     features = features.with_columns([
+         pl.concat_list([pl.col(c) for c in wind_cols]).list.mean().alias('wind_system_avg'),
+         pl.concat_list([pl.col(c) for c in solar_cols]).list.mean().alias('solar_system_avg'),
+         pl.concat_list([pl.col(c) for c in temp_cols]).list.mean().alias('temp_system_avg')
+     ])
+
+     # Rate of change (hour-over-hour deltas)
+     # Captures sudden spikes/drops that correlate with grid constraints
+     features = features.with_columns([
+         pl.col('wind_system_avg').diff().alias('wind_rate_change'),
+         pl.col('solar_system_avg').diff().alias('solar_rate_change'),
+         pl.col('temp_system_avg').diff().alias('temp_rate_change')
+     ])
+
+     # Weather stability: 6-hour rolling std
+     # Detects volatility periods (useful for forecasting uncertainty)
+     features = features.with_columns([
+         pl.col('wind_system_avg').rolling_std(window_size=6).alias('wind_stability_6h'),
+         pl.col('solar_system_avg').rolling_std(window_size=6).alias('solar_stability_6h'),
+         pl.col('temp_system_avg').rolling_std(window_size=6).alias('temp_stability_6h')
+     ])
+
+     # Drop intermediate columns
+     features = features.drop(['wind_system_avg', 'solar_system_avg', 'temp_system_avg'])
+
+     # Count derived features
+     derived_cols = ['wind_rate_change', 'solar_rate_change', 'temp_rate_change',
+                     'wind_stability_6h', 'solar_stability_6h', 'temp_stability_6h']
+
+     print(f" [OK] {len(derived_cols)} derived features")
+     return features
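The two derived feature families above are an hour-over-hour delta and a 6-row rolling standard deviation of the system-wide average. The sketch below reproduces both in plain Python with `statistics.stdev`. It assumes the behaviour Polars is understood to have by default (sample standard deviation, and no output until the window is full); that is an assumption about defaults, not something the diff states. The wind values are invented.

```python
# Plain-Python sketch of rate-of-change and rolling stability; not Polars code.
# Assumes sample std (ddof=1) and a full window before any value is produced.
import statistics


def diff(series):
    """Hour-over-hour delta; the first row has no predecessor, so it is None."""
    if not series:
        return []
    return [None] + [b - a for a, b in zip(series, series[1:])]


def rolling_std(series, window):
    """Sample std over the trailing `window` rows; None until the window fills."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(statistics.stdev(series[i + 1 - window : i + 1]))
    return out


# Six calm hours, then a sudden wind spike (hypothetical values, m/s)
wind_avg = [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 11.0]

wind_rate_change = diff(wind_avg)
wind_stability_6h = rolling_std(wind_avg, 6)
print(wind_rate_change)
print(wind_stability_6h)
```

The spike shows up immediately in the delta (the last `wind_rate_change` value jumps to 6.0), while the stability feature rises from 0.0 to about 2.45 and stays elevated for as long as the spike remains inside the 6-row window.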
+
+
+ def engineer_weather_features(
+     weather_path: Path,
+     output_dir: Path
+ ) -> pl.DataFrame:
+     """Main feature engineering pipeline for weather data.
+
+     Args:
+         weather_path: Path to raw weather data (weather_24month.parquet)
+         output_dir: Directory to save engineered features
+
+     Returns:
+         DataFrame with 375 weather features plus the timestamp column
+     """
+     print("=" * 80)
+     print("WEATHER FEATURE ENGINEERING")
+     print("=" * 80)
+     print()
+     print(f"Input: {weather_path}")
+     print(f"Output: {output_dir}")
+     print()
+
+     # Load raw weather data
+     print("Loading weather data...")
+     weather_df = pl.read_parquet(weather_path)
+     print(f" [OK] {weather_df.shape[0]:,} rows × {weather_df.shape[1]} columns")
+     print(f" Date range: {weather_df['timestamp'].min()} to {weather_df['timestamp'].max()}")
+     print()
+
+     # 1. Grid-level features (51 × 7 = 357 features)
+     all_features = engineer_grid_level_features(weather_df)
+
+     # Sort by timestamp before lagging: shift/diff/rolling_std are positional
+     all_features = all_features.sort('timestamp')
+
+     # 2. Temporal lags (12 features)
+     all_features = engineer_temporal_lags(all_features)
+
+     # 3. Derived features (6 features: rate-of-change + stability)
+     all_features = engineer_derived_features(all_features)
+
+     # Final validation
+     print("\n" + "=" * 80)
+     print("FEATURE ENGINEERING COMPLETE")
+     print("=" * 80)
+     print(f"Total features: {all_features.shape[1] - 1} (excluding timestamp)")
+     print(f"Total rows: {len(all_features):,}")
+
+     # Check completeness
+     null_count_total = all_features.null_count().sum_horizontal()[0]
+     completeness = (1 - null_count_total / (all_features.shape[0] * all_features.shape[1])) * 100
+     print(f"Completeness: {completeness:.2f}%")
+     print()
+
+     # Save features
+     output_path = output_dir / 'features_weather_24month.parquet'
+     all_features.write_parquet(output_path)
+
+     file_size_mb = output_path.stat().st_size / (1024 ** 2)
+     print(f"Features saved: {output_path}")
+     print(f"File size: {file_size_mb:.2f} MB")
+     print("=" * 80)
+     print()
+
+     return all_features
+
+
+ def main():
+     """Main execution."""
+     # Paths
+     base_dir = Path.cwd()
+     raw_dir = base_dir / 'data' / 'raw'
+     processed_dir = base_dir / 'data' / 'processed'
+
+     weather_path = raw_dir / 'weather_24month.parquet'
+
+     # Verify file exists
+     if not weather_path.exists():
+         raise FileNotFoundError(f"Weather data not found: {weather_path}")
+
+     # Engineer features
+     features = engineer_weather_features(weather_path, processed_dir)
+
+     print("SUCCESS: Weather features engineered and saved to data/processed/")
+
+
+ if __name__ == '__main__':
+     main()
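The headline figure of 375 features follows directly from the counts stated in the module docstring. A small sanity check along these lines could guard against the documented total and the actual column count drifting apart. This is a sketch only: the constant and function names are hypothetical and do not appear in the commit.

```python
# Sketch of a feature-count sanity check built from the documented numbers.
# All names here are hypothetical; only the counts come from the docstrings.
N_GRID_POINTS = 51
N_WEATHER_VARS = 7
LAG_VARS = ("temp_avg", "wind_avg", "solar_avg")
LAG_HOURS = (1, 6, 12, 24)
N_DERIVED = 6  # 3 rate-of-change + 3 stability features


def expected_feature_count() -> int:
    """Grid-level + lag + derived features, excluding the timestamp column."""
    grid = N_GRID_POINTS * N_WEATHER_VARS   # 357
    lags = len(LAG_VARS) * len(LAG_HOURS)   # 12
    return grid + lags + N_DERIVED


print(expected_feature_count())
```

In the pipeline, the value could be compared against `all_features.shape[1] - 1` after the final step.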