Final Domain Collection Research
Summary of Findings
Available Methods in jao-py
The JaoPublicationToolPandasClient class provides three domain query methods:
- `query_final_domain(mtu, presolved, cne, co, use_mirror)` (line 233): Final Computation, the final FB parameters following LTN
  - Published: 10:30 D-1
  - Most complete dataset (recommended for Phase 2)
- `query_prefinal_domain(mtu, presolved, cne, co, use_mirror)` (line 248): Pre-Final (EarlyPub), the pre-final FB parameters before LTN
  - Published: 08:00 D-1
  - Earlier publication time, but before LTN application
- `query_initial_domain(mtu, presolved, cne, co)` (line 264): Initial Computation (Virgin Domain), the initial flow-based parameters
  - Published: early in D-1
  - Before any adjustments
Method Parameters
```python
def query_final_domain(
    mtu: pd.Timestamp,        # Market Time Unit (1 hour, timezone-aware)
    presolved: bool = None,   # Filter: True=binding, False=non-binding, None=ALL
    cne: str = None,          # CNEC name keyword filter (NOT EIC-based!)
    co: str = None,           # Contingency keyword filter
    use_mirror: bool = False  # Use mirror.flowbased.eu for faster bulk download
) -> pd.DataFrame
```
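For illustration, a valid `mtu` is a single timezone-aware hour; the actual query is shown only as a comment since it needs a live client (`client` would be a `JaoPublicationToolPandasClient` instance):

```python
import pandas as pd

# One Market Time Unit: a single hour, timezone-aware
mtu = pd.Timestamp("2025-09-23 10:00", tz="Europe/Amsterdam")

# The real call needs a live client, so it is left commented out:
# df = client.query_final_domain(mtu=mtu, presolved=None)  # ALL CNECs for this hour

print(mtu.isoformat())  # 2025-09-23T10:00:00+02:00
```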
Key Findings
DENSE Data Acquisition:
- Set `presolved=None` to get ALL CNECs (binding + non-binding)
- This provides the DENSE format needed for Phase 2 feature engineering
Filtering Limitations:
- ❌ NO EIC-based filtering on the server side
- ✅ Only keyword-based filters (`cne`, `co`) available
- Solution: download all CNECs, then filter locally by EIC codes
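Since there is no server-side EIC filter, the local filtering step can be sketched on toy data (only the `cnec_eic` column name is taken from the parsed output; the values are illustrative):

```python
import pandas as pd

# Toy stand-in for one hour of parsed final-domain data
df = pd.DataFrame({
    "cnec_eic": ["EIC-A", "EIC-B", "EIC-C", "EIC-A"],
    "ram": [820.0, 455.0, 990.0, 760.0],
})

# EIC codes selected in Phase 1 (illustrative values)
target_cnec_eics = ["EIC-A", "EIC-C"]

# Client-side filtering: keep only the target CNECs
df_filtered = df[df["cnec_eic"].isin(target_cnec_eics)]
print(len(df_filtered))  # 3
```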
Query Granularity:
- Method queries 1 hour at a time (mtu = Market Time Unit)
- For 24 months: 17,520 API calls are needed (1 per hour)
- Alternative: use `use_mirror=True` for whole-day downloads
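The 17,520 figure can be checked directly from an hourly range (the dates here are an illustrative non-leap 24-month window, not the project's actual window):

```python
import pandas as pd

# Hourly MTUs over an illustrative 24-month, non-leap window
hours = pd.date_range("2022-01-01 00:00", "2023-12-31 23:00", freq="h")
print(len(hours))  # 17520 hourly API calls without the mirror
```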
Mirror Option (Recommended for bulk collection):
- URL: `https://mirror.flowbased.eu/dacc/final_domain/YYYY-MM-DD`
- Returns a full day (24 hours) as CSV in a ZIP file
- Much faster than hourly API calls
- Set `use_mirror=True` OR set the env var `JAO_USE_MIRROR=1`
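Building the daily mirror URL from a date is a one-liner, following the pattern listed above:

```python
from datetime import date

# Daily mirror URL for the final domain (pattern from the list above)
d = date(2025, 9, 23)
url = f"https://mirror.flowbased.eu/dacc/final_domain/{d:%Y-%m-%d}"
print(url)  # https://mirror.flowbased.eu/dacc/final_domain/2025-09-23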
Data Structure (from `parse_final_domain()`):
- Returns a pandas DataFrame with columns:
  - Identifiers: `mtu` (timestamp), `tso`, `cnec_name`, `cnec_eic`, `direction`
  - Contingency: `contingency_*` fields (nested structure flattened)
  - Presolved field: indicates whether the CNEC is binding (True) or redundant (False)
  - RAM breakdown: `ram`, `fmax`, `imax`, `frm`, `fuaf`, `amr`, `lta_margin`, etc.
  - PTDFs: `ptdf_AT`, `ptdf_BE`, ..., `ptdf_SK` (12 Core zones)
- Timestamps converted to Europe/Amsterdam timezone
- snake_case column names (except PTDFs)
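Given these naming conventions, the PTDF columns can be selected by prefix; a toy sketch (only three of the 12 zone columns shown):

```python
import pandas as pd

# Toy one-row frame following the column conventions above
df = pd.DataFrame({
    "cnec_eic": ["EIC-A"],
    "ram": [820.0],
    "ptdf_AT": [0.12],
    "ptdf_BE": [-0.05],
    "ptdf_SK": [0.03],
})

# PTDF columns keep the zone suffix; everything else is snake_case
ptdf_cols = [c for c in df.columns if c.startswith("ptdf_")]
print(ptdf_cols)  # ['ptdf_AT', 'ptdf_BE', 'ptdf_SK']
```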
Recommended Implementation for Phase 2
Option A: Mirror-based (FASTEST):
```python
from pathlib import Path

import pandas as pd
import polars as pl


def collect_final_domain_sample(
    start_date: str,
    end_date: str,
    target_cnec_eics: list[str],  # 200 EIC codes from Phase 1
    output_path: Path,
) -> pl.DataFrame:
    """Collect DENSE CNEC data for specific CNECs using the mirror."""
    client = JAOClient()  # project wrapper (in collect_jao.py) around jao-py's client
    all_data = []
    for date in pd.date_range(start_date, end_date):
        # Query the full day (all CNECs) via the mirror
        df_day = client.query_final_domain(
            mtu=pd.Timestamp(date, tz='Europe/Amsterdam'),
            presolved=None,   # ALL CNECs (DENSE!)
            use_mirror=True,  # fast bulk download
        )
        # Filter to the target CNECs only
        df_filtered = df_day[df_day['cnec_eic'].isin(target_cnec_eics)]
        all_data.append(df_filtered)
    # Combine and save as Parquet
    df_full = pd.concat(all_data, ignore_index=True)
    pl_df = pl.from_pandas(df_full)
    pl_df.write_parquet(output_path)
    return pl_df
```
Option B: Hourly API calls (SLOWER, but more granular):
```python
from pathlib import Path

import pandas as pd
import polars as pl


def collect_final_domain_hourly(
    start_date: str,
    end_date: str,
    target_cnec_eics: list[str],
    output_path: Path,
) -> pl.DataFrame:
    """Collect DENSE CNEC data hour-by-hour."""
    client = JAOClient()  # project wrapper around jao-py's client
    all_data = []
    for date in pd.date_range(start_date, end_date, freq='h'):
        try:
            df_hour = client.query_final_domain(
                mtu=pd.Timestamp(date, tz='Europe/Amsterdam'),
                presolved=None,  # ALL CNECs
            )
            df_filtered = df_hour[df_hour['cnec_eic'].isin(target_cnec_eics)]
            all_data.append(df_filtered)
        except NoMatchingDataError:  # raised by jao-py when an hour has no data
            continue
    df_full = pd.concat(all_data, ignore_index=True)
    pl_df = pl.from_pandas(df_full)
    pl_df.write_parquet(output_path)
    return pl_df
```
Data Volume Estimates
Full Download (all ~20K CNECs):
- 20,000 CNECs × 17,520 hours = 350M records
- 350M records × ~27 columns × 8 bytes/value ≈ 75 GB uncompressed
- Parquet compression: ~10-20 GB
Filtered (200 target CNECs):
- 200 CNECs × 17,520 hours = 3.5M records
- 3.5M records × ~27 columns × 8 bytes/value ≈ 750 MB uncompressed
- Parquet compression: ~100-150 MB
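These estimates follow from simple arithmetic (records × ~27 columns × 8 bytes per float64 value); a quick check:

```python
# Records x columns x bytes per float64 value
hours = 24 * 730                 # 24 months of hourly MTUs = 17,520
cols, bytes_per_value = 27, 8

full_gb = 20_000 * hours * cols * bytes_per_value / 1e9
filtered_mb = 200 * hours * cols * bytes_per_value / 1e6

print(f"{full_gb:.0f} GB uncompressed (all CNECs)")      # 76 GB
print(f"{filtered_mb:.0f} MB uncompressed (200 CNECs)")  # 757 MB
```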
Implementation Strategy
- Phase 1 complete: identify the top 200 CNECs from SPARSE data
- Extract EIC codes: save to `data/processed/critical_cnecs_eic_codes.csv`
- Test on 1 week: validate DENSE collection with the mirror
  - Test window: 2025-09-23 to 2025-09-30 (8 days)
  - Expected: 200 CNECs × 192 hours = 38,400 records
- Collect 24 months: using the mirror for speed
- Validate the DENSE structure:

```python
unique_cnecs = df['cnec_eic'].n_unique()
unique_hours = df['mtu'].n_unique()
expected = unique_cnecs * unique_hours
actual = len(df)
assert actual == expected, f"Not DENSE! {actual} != {expected}"
```
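The same DENSE invariant, demonstrated on a toy pandas frame (pandas spells the method `nunique`, corresponding to polars' `n_unique`):

```python
import pandas as pd

# Toy DENSE frame: 2 CNECs x 3 hours -> exactly one row per (CNEC, hour)
df = pd.DataFrame({
    "cnec_eic": ["EIC-A", "EIC-B"] * 3,
    "mtu": ["h1", "h1", "h2", "h2", "h3", "h3"],
})

# pandas: .nunique(); polars: .n_unique()
expected = df["cnec_eic"].nunique() * df["mtu"].nunique()
assert len(df) == expected, f"Not DENSE! {len(df)} != {expected}"
print(expected)  # 6
```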
Advantages of Mirror Method
- ✅ Faster: 1 request/day vs 24 requests/day
- ✅ Rate limit friendly: 730 requests vs 17,520 requests
- ✅ More reliable: Less chance of timeout/connection errors
- ✅ Complete days: Guarantees all 24 hours present
Next Steps
- Add a `collect_final_domain_dense()` method to `collect_jao.py`
- Test on a 1-week sample with the target EIC codes
- Validate DENSE structure and data quality
- Run 24-month collection after Phase 1 complete
- Use DENSE data for Tier 1 & Tier 2 feature engineering
Research completed: 2025-11-05
jao-py version: 0.6.2
Source: C:\Users\evgue\projects\fbmc_chronos2\.venv\Lib\site-packages\jao\jao.py