Evgueni Poloukarov, Claude committed
Commit b8daa7e · 1 parent: 3b607e3

fix: move GPU cache clearing to START of border loop


Fixed critical bug in GPU memory management:
- Previous: Cache clearing AFTER line 235 (never reached due to OOM at line 196)
- Current: Cache clearing BEFORE each border iteration (line 183)

Why this matters:
- The first border succeeds (clean GPU state)
- Every subsequent border failed because the previous border's 17.71 GB cache was never released
- The cache must be cleared BEFORE inference, not only after a successful run

Validation:
- All 38 borders failed with an identical OOM at line 196 (predict_df)
- The error is raised INSIDE the try block, so the cache clearing at line 241 was skipped
- Moving the clearing to the start of the loop guarantees cleanup before EVERY border

Expected result: Multi-border forecasts should now succeed
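
The fixed loop structure can be sketched as a minimal, GPU-optional version (the `run_forecasts` helper, the stub `predict_fn`, and the border names are illustrative only; the real pipeline calls `predict_df` on the Chronos model):

```python
# Minimal sketch of the fix: clear the CUDA cache at the TOP of each
# iteration, so an OOM raised inside the try block cannot skip cleanup.
# Guarded so the sketch also runs on machines without torch/CUDA.
try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:
    HAVE_CUDA = False

def run_forecasts(borders, predict_fn):
    """Run one forecast per border, releasing cached GPU memory before
    each iteration (no-op on the first border and on CPU-only hosts)."""
    results = {}
    for i, border in enumerate(borders, 1):
        if i > 1 and HAVE_CUDA:
            # Releases cached allocator blocks from the previous border;
            # model weights stay loaded, accuracy is unaffected.
            torch.cuda.empty_cache()
        try:
            results[border] = predict_fn(border)
        except Exception as e:
            # Mirrors the pipeline's error capture: record and continue.
            results[border] = f"{type(e).__name__}: {e}"
    return results

out = run_forecasts(["DE-FR", "DE-NL"], lambda b: f"forecast[{b}]")
```

An alternative design would be `try/finally` (or clearing the cache in the `except` handler), which also guarantees cleanup after a failed border; clearing at loop start has the same effect while keeping the happy path untouched.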

Co-Authored-By: Claude <[email protected]>

src/forecasting/chronos_inference.py CHANGED
@@ -176,6 +176,13 @@ class ChronosInferencePipeline:
         print(f"  Features: weather per zone, generation per zone, CNEC outages, LTA, load forecasts")
 
         for i, border in enumerate(forecast_borders, 1):
+            # Clear GPU cache BEFORE each border to prevent memory accumulation
+            # This releases tensors from previous border (no-op on first iteration)
+            # Does NOT affect model weights (710M params stay loaded)
+            # Does NOT affect forecast accuracy (each border is independent)
+            if i > 1:  # Skip on first border (clean GPU state)
+                torch.cuda.empty_cache()
+
             border_start = time.time()
             print(f"\n  [{i}/{len(forecast_borders)}] {border}...", flush=True)
 
@@ -234,12 +241,6 @@ class ChronosInferencePipeline:
 
                 print(f"  [OK] Complete in {inference_time:.1f}s (WITH {len(future_data.columns)-2} covariates)", flush=True)
 
-                # Release GPU memory cache before processing next border
-                # This prevents memory accumulation across sequential forecasts
-                # Does NOT affect model weights (710M params stay loaded)
-                # Does NOT affect forecast accuracy (each border is independent)
-                torch.cuda.empty_cache()
-
             except Exception as e:
                 import traceback
                 error_msg = f"{type(e).__name__}: {str(e)}"