Evgueni Poloukarov, Claude committed
Commit b8daa7e · 1 parent: 3b607e3

fix: move GPU cache clearing to START of border loop


Fixed critical bug in GPU memory management:
- Previous: Cache clearing AFTER line 235 (never reached due to OOM at line 196)
- Current: Cache clearing BEFORE each border iteration (line 183)

Why this matters:
- The first border succeeds (clean GPU state)
- Every subsequent border failed because the previous border's 17.71 GB cache was never released
- The cache must be cleared BEFORE inference, not only after a successful run

Validation:
- All 38 borders failed with an identical OOM at line 196 (predict_df)
- The error is raised INSIDE the try block, so the cache clearing at line 241 was skipped
- Moving the clearing to the start of the loop guarantees cleanup before EVERY border

Expected result: Multi-border forecasts should now succeed
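
The fixed loop structure can be sketched as a minimal, GPU-optional version (the `run_forecasts` helper, the stub `predict_fn`, and the border names are illustrative only; the real pipeline calls `predict_df` on the Chronos model):

```python
# Minimal sketch of the fix: clear the CUDA cache at the TOP of each
# iteration, so an OOM raised inside the try block cannot skip cleanup.
# Guarded so the sketch also runs on machines without torch/CUDA.
try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:
    HAVE_CUDA = False

def run_forecasts(borders, predict_fn):
    """Run one forecast per border, releasing cached GPU memory before
    each iteration (no-op on the first border and on CPU-only hosts)."""
    results = {}
    for i, border in enumerate(borders, 1):
        if i > 1 and HAVE_CUDA:
            # Releases cached allocator blocks from the previous border;
            # model weights stay loaded, accuracy is unaffected.
            torch.cuda.empty_cache()
        try:
            results[border] = predict_fn(border)
        except Exception as e:
            # Mirrors the pipeline's error capture: record and continue.
            results[border] = f"{type(e).__name__}: {e}"
    return results

out = run_forecasts(["DE-FR", "DE-NL"], lambda b: f"forecast[{b}]")
```

An alternative design would be `try/finally` (or clearing the cache in the `except` handler), which also guarantees cleanup after a failed border; clearing at loop start has the same effect while keeping the happy path untouched.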

Co-Authored-By: Claude <[email protected]>

src/forecasting/chronos_inference.py CHANGED
@@ -176,6 +176,13 @@ class ChronosInferencePipeline:
         print(f"  Features: weather per zone, generation per zone, CNEC outages, LTA, load forecasts")
 
         for i, border in enumerate(forecast_borders, 1):
+            # Clear GPU cache BEFORE each border to prevent memory accumulation
+            # This releases tensors from previous border (no-op on first iteration)
+            # Does NOT affect model weights (710M params stay loaded)
+            # Does NOT affect forecast accuracy (each border is independent)
+            if i > 1:  # Skip on first border (clean GPU state)
+                torch.cuda.empty_cache()
+
             border_start = time.time()
             print(f"\n  [{i}/{len(forecast_borders)}] {border}...", flush=True)
 
@@ -234,12 +241,6 @@ class ChronosInferencePipeline:
 
                 print(f"  [OK] Complete in {inference_time:.1f}s (WITH {len(future_data.columns)-2} covariates)", flush=True)
 
-                # Release GPU memory cache before processing next border
-                # This prevents memory accumulation across sequential forecasts
-                # Does NOT affect model weights (710M params stay loaded)
-                # Does NOT affect forecast accuracy (each border is independent)
-                torch.cuda.empty_cache()
-
             except Exception as e:
                 import traceback
                 error_msg = f"{type(e).__name__}: {str(e)}"