Spaces:

davanstrien
/

huggingface-datasets-search-v2

davanstrien HF Staff Claude Opus 4.6 (1M context) committed on Apr 6

Commit

21a271a

1 Parent(s): 756e837

Migrate off persistent storage to bucket-mounted ChromaDB

- Simplify main.py: remove setup_database() and all indexing logic.
Space now reads pre-built ChromaDB from mounted storage bucket.
- Add build_chroma_index.py: standalone uv script that builds the
ChromaDB index as an HF Job on GPU (much faster than CPU).
- Update generate_summaries_uv.py: support mounted volumes for model
and input data, pin transformers<4.52, fix vllm version, reduce
content truncation to 3000 chars to avoid exceeding model max length.
- Update HFJOBS_COMMANDS.md: correct output repo names, add index
build command, use hf jobs uv run with volume mounts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
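
The "support mounted volumes for model and input data" change comes down to one decision: if the model argument is an existing directory, use it directly; otherwise treat it as a Hub repo id and download it. A minimal sketch of that decision, assuming a hypothetical `resolve_model_path` helper and a stand-in `fake_download` in place of `snapshot_download` so it runs offline (neither name is from the commit):

```python
import os
import tempfile
from typing import Callable


def resolve_model_path(model_id: str, download: Callable[[str], str]) -> str:
    """Return a local directory for the model.

    If model_id is an existing directory (e.g. a mounted volume such as
    /model), use it as-is. Otherwise treat it as a Hub repo id and call
    the supplied download function.
    """
    if os.path.isdir(model_id):
        return model_id
    return download(model_id)


def fake_download(repo_id: str) -> str:
    # Hypothetical stand-in for snapshot_download; returns a made-up cache path.
    return os.path.join(tempfile.gettempdir(), repo_id.replace("/", "--"))


mounted_dir = tempfile.mkdtemp()  # plays the role of the /model mount
from_mount = resolve_model_path(mounted_dir, fake_download)
from_hub = resolve_model_path("davanstrien/Smol-Hub-tldr", fake_download)

print(from_mount)
print(from_hub)
```

Passing the downloader as an argument is only a convenience for the sketch; the script in this commit calls `snapshot_download` inline.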

Files changed (4)

HFJOBS_COMMANDS.md +134 -0
build_chroma_index.py +293 -0
generate_summaries_uv.py +43 -18
main.py +14 -295

HFJOBS_COMMANDS.md ADDED

	@@ -0,0 +1,134 @@

+# HFJobs Commands for Summary Generation
+This document contains the `hf jobs` commands for running the summary generation and index build pipeline.
+## Performance Optimizations
+For batch inference workloads (processing thousands of short summaries), consider these vLLM optimizations:
+### Memory and Throughput Settings
+1. **GPU Memory Utilization** (`gpu_memory_utilization`)
+   - Default: 0.9 (90%)
+   - Recommended: 0.95 or 0.98 for batch workloads
+   - Allocates more GPU memory for KV cache, allowing more concurrent sequences
+2. **Chunked Prefill** (`enable_chunked_prefill`)
+   - Set to `True` for many short requests
+   - Interleaves prefill and decode phases more efficiently
+   - Particularly beneficial for uniform, short outputs like summaries
+3. **Max Batched Tokens** (`max_num_batched_tokens`)
+   - Default: 512
+   - Recommended: 4096 or 8192 for better throughput
+   - Controls tokens processed together in a single iteration
+4. **Max Number of Sequences** (`max_num_seqs`)
+   - Increase to 256 or 512 for batch workloads
+   - More concurrent sequences = better throughput
+   - L4 GPU (24GB) can handle aggressive settings
+### Example Optimized Configuration
+```python
+llm = LLM(
+    model=local_model_path,
+    max_model_len=4096,
+    gpu_memory_utilization=0.95,  # Use 95% of GPU memory
+    enable_chunked_prefill=True,   # Better for short requests
+    max_num_batched_tokens=8192,   # High throughput batching
+    max_num_seqs=256,              # Many concurrent sequences
+)
+```
+## Summary Generation (hf jobs uv run)
+Uses `generate_summaries_uv.py` with volume mounts for fast startup (no download step).
+### Dataset Summaries
+```bash
+hf jobs uv run --flavor l4x1 \
+  -v hf://datasets/librarian-bots/dataset_cards_with_metadata:/input:ro \
+  -v hf://davanstrien/Smol-Hub-tldr:/model:ro \
+  -s HF_TOKEN \
+  --timeout 2h \
+  generate_summaries_uv.py \
+    /model \
+    librarian-bots/dataset_cards_with_metadata \
+    davanstrien/datasets_with_metadata_and_summaries \
+    --card-type dataset \
+    --input-path /input \
+    --batch-size 2000
+```
+### Model Summaries
+```bash
+hf jobs uv run --flavor l4x1 \
+  -v hf://datasets/librarian-bots/model_cards_with_metadata:/input:ro \
+  -v hf://davanstrien/SmolLM2-135M-tldr-sft-2025-03-12_19-02:/model:ro \
+  -s HF_TOKEN \
+  --timeout 2h \
+  generate_summaries_uv.py \
+    /model \
+    librarian-bots/model_cards_with_metadata \
+    davanstrien/models_with_metadata_and_summaries \
+    --card-type model \
+    --min-likes 5 \
+    --min-downloads 1000 \
+    --input-path /input \
+    --batch-size 2000
+```
+### Without volume mounts (downloads data instead)
+If volumes aren't available, the script falls back to downloading:
+```bash
+hf jobs uv run --flavor l4x1 \
+  -s HF_TOKEN \
+  --timeout 2h \
+  generate_summaries_uv.py \
+    davanstrien/Smol-Hub-tldr \
+    librarian-bots/dataset_cards_with_metadata \
+    davanstrien/datasets_with_metadata_and_summaries \
+    --card-type dataset \
+    --batch-size 2000
+```
+## ChromaDB Index Build
+Builds/updates the ChromaDB vector index from the summary datasets. Must run after summary generation to update search results. Writes to a Storage Bucket mounted as a volume.
+```bash
+hf jobs uv run --flavor l4x1 \
+  -v hf://buckets/davanstrien/search-v2-chroma:/data \
+  -s HF_TOKEN \
+  https://huggingface.co/spaces/davanstrien/huggingface-datasets-search-v2/raw/main/build_chroma_index.py
+```
+For a full rebuild (delete existing collections first):
+```bash
+hf jobs uv run --flavor l4x1 \
+  -v hf://buckets/davanstrien/search-v2-chroma:/data \
+  -s HF_TOKEN \
+  https://huggingface.co/spaces/davanstrien/huggingface-datasets-search-v2/raw/main/build_chroma_index.py \
+  --full-rebuild
+```
+### Full Pipeline (summaries → index)
+Run summary generation first, then rebuild the index:
+1. Generate dataset summaries (see Dataset Summaries above)
+2. Generate model summaries (see Model Summaries above)
+3. Build the ChromaDB index (this section)
+## Notes
+- The vLLM Docker image approach is preferred over the uv:debian image because it includes all necessary system dependencies (Python headers, CUDA libraries, etc.)
+- The script is run directly from the Hugging Face Space URL using `uv run`
+- Adjust `--batch-size` based on available GPU memory
+- For models, adjust `--min-likes` and `--min-downloads` thresholds as needed
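
The three-step ordering in "Full Pipeline" (dataset summaries, model summaries, then index build) can be sketched as a small script that assembles the commands in order. This only builds and prints the argument lists; it does not submit anything. The helper names `summary_job` and `index_job` are hypothetical, and all flags and repo names are copied from the commands in this file:

```python
import shlex

BUCKET = "hf://buckets/davanstrien/search-v2-chroma"
INDEX_SCRIPT = (
    "https://huggingface.co/spaces/davanstrien/"
    "huggingface-datasets-search-v2/raw/main/build_chroma_index.py"
)


def summary_job(card_type, input_repo, model_repo, output_repo, extra=()):
    """Assemble the `hf jobs uv run` argv for one summary-generation job."""
    return [
        "hf", "jobs", "uv", "run", "--flavor", "l4x1",
        "-v", f"hf://datasets/{input_repo}:/input:ro",
        "-v", f"hf://{model_repo}:/model:ro",
        "-s", "HF_TOKEN", "--timeout", "2h",
        "generate_summaries_uv.py", "/model", input_repo, output_repo,
        "--card-type", card_type, *extra,
        "--input-path", "/input", "--batch-size", "2000",
    ]


def index_job(full_rebuild=False):
    """Assemble the argv for the ChromaDB index build job."""
    cmd = [
        "hf", "jobs", "uv", "run", "--flavor", "l4x1",
        "-v", f"{BUCKET}:/data", "-s", "HF_TOKEN", INDEX_SCRIPT,
    ]
    if full_rebuild:
        cmd.append("--full-rebuild")
    return cmd


# Order matters: the index is built from the summary datasets.
pipeline = [
    summary_job(
        "dataset",
        "librarian-bots/dataset_cards_with_metadata",
        "davanstrien/Smol-Hub-tldr",
        "davanstrien/datasets_with_metadata_and_summaries",
    ),
    summary_job(
        "model",
        "librarian-bots/model_cards_with_metadata",
        "davanstrien/SmolLM2-135M-tldr-sft-2025-03-12_19-02",
        "davanstrien/models_with_metadata_and_summaries",
        extra=("--min-likes", "5", "--min-downloads", "1000"),
    ),
    index_job(),
]

for cmd in pipeline:
    print(shlex.join(cmd))
```

Each job would need to finish before the next starts, since the index build reads the datasets the summary jobs write.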

build_chroma_index.py ADDED

	@@ -0,0 +1,293 @@

+# /// script
+# requires-python = ">=3.11"
+# dependencies = [
+#     "chromadb==1.0.12",
+#     "hf-transfer",
+#     "hf-xet",
+#     "huggingface-hub",
+#     "polars",
+#     "python-dateutil",
+#     "sentence-transformers",
+#     "torch",
+# ]
+# ///
+"""
+Build ChromaDB index for the datasets-search-v2 Space.
+Reads summary parquets from the Hub, embeds them with Qwen3-Embedding-0.6B,
+and writes the ChromaDB index to a mounted Storage Bucket.
+Usage (via hf jobs):
+    hf jobs uv run \
+      --flavor l4x1 \
+      -v hf://buckets/davanstrien/search-v2-chroma:/data \
+      -s HF_TOKEN \
+      build_chroma_index.py
+Local usage:
+    uv run build_chroma_index.py --data-dir ./data
+"""
+import argparse
+import logging
+import os
+import sys
+os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
+import chromadb
+import dateutil.parser
+import polars as pl
+import torch
+from chromadb.config import Settings
+from chromadb.utils import embedding_functions
+from huggingface_hub import login
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s - %(levelname)s - %(message)s",
+    datefmt="%Y-%m-%d %H:%M:%S",
+)
+logger = logging.getLogger(__name__)
+EMBEDDING_MODEL = "Qwen/Qwen3-Embedding-0.6B"
+BATCH_SIZE = 2000
+DATASET_SOURCE = "davanstrien/datasets_with_metadata_and_summaries"
+MODEL_SOURCE = "davanstrien/models_with_metadata_and_summaries"
+def get_device():
+    if torch.cuda.is_available():
+        return "cuda"
+    elif torch.backends.mps.is_available():
+        return "mps"
+    return "cpu"
+def get_embedding_function(device):
+    logger.info(f"Loading embedding model {EMBEDDING_MODEL} on {device}")
+    return embedding_functions.SentenceTransformerEmbeddingFunction(
+        model_name=EMBEDDING_MODEL, device=device
+    )
+def build_dataset_collection(client, embedding_function):
+    """Build/update the dataset_cards collection."""
+    logger.info("=== Building dataset collection ===")
+    collection = client.get_or_create_collection(
+        embedding_function=embedding_function,
+        name="dataset_cards",
+        metadata={"hnsw:space": "cosine"},
+    )
+    df = pl.scan_parquet(
+        f"hf://datasets/{DATASET_SOURCE}/data/train-*.parquet"
+    )
+    df = df.filter(
+        pl.col("datasetId").str.contains_any(["open-llm-leaderboard-old/"]).not_()
+    )
+    df = df.filter(
+        pl.col("datasetId")
+        .str.contains_any(["gemma-2-2B-it-thinking-function_calling-V0"])
+        .not_()
+    )
+    # Check for incremental update
+    latest_update = None
+    if collection.count() > 0:
+        metadata = collection.get(include=["metadatas"]).get("metadatas")
+        logger.info(f"Found {len(metadata)} existing records in collection")
+        last_modifieds = [
+            dateutil.parser.parse(m.get("last_modified")) for m in metadata
+        ]
+        latest_update = max(last_modifieds)
+        logger.info(f"Most recent record in DB from: {latest_update}")
+    df = df.select(["datasetId", "summary", "likes", "downloads", "last_modified"])
+    total_incoming = df.select(pl.len()).collect().item()
+    logger.info(f"Total incoming records from source: {total_incoming}")
+    if latest_update:
+        logger.info(f"Filtering records newer than {latest_update}")
+        df = df.with_columns(pl.col("last_modified").str.to_datetime())
+        df = df.filter(pl.col("last_modified") > latest_update)
+        filtered_count = df.select(pl.len()).collect().item()
+        logger.info(f"Found {filtered_count} records to update after filtering")
+    df = df.collect()
+    total_rows = len(df)
+    if total_rows > 0:
+        logger.info(f"Updating dataset collection with {total_rows} new records")
+        for i in range(0, total_rows, BATCH_SIZE):
+            batch_df = df.slice(i, min(BATCH_SIZE, total_rows - i))
+            batch_size = len(batch_df)
+            collection.upsert(
+                ids=batch_df.select(["datasetId"]).to_series().to_list(),
+                documents=batch_df.select(["summary"]).to_series().to_list(),
+                metadatas=[
+                    {
+                        "likes": int(likes),
+                        "downloads": int(downloads),
+                        "last_modified": str(last_modified),
+                    }
+                    for likes, downloads, last_modified in zip(
+                        batch_df.select(["likes"]).to_series().to_list(),
+                        batch_df.select(["downloads"]).to_series().to_list(),
+                        batch_df.select(["last_modified"]).to_series().to_list(),
+                    )
+                ],
+            )
+            logger.info(f"Processed {i + batch_size:,} / {total_rows:,} dataset records")
+    else:
+        logger.info("No new dataset records to update")
+    final_count = collection.count()
+    logger.info(f"Dataset collection: {final_count:,} total records")
+def build_model_collection(client, embedding_function):
+    """Build/update the model_cards collection."""
+    logger.info("=== Building model collection ===")
+    collection = client.get_or_create_collection(
+        embedding_function=embedding_function,
+        name="model_cards",
+        metadata={"hnsw:space": "cosine"},
+    )
+    model_lazy_df = pl.scan_parquet(
+        f"hf://datasets/{MODEL_SOURCE}/data/train-*.parquet"
+    )
+    # Check for incremental update
+    model_latest_update = None
+    if collection.count() > 0:
+        model_metadata = collection.get(include=["metadatas"]).get("metadatas")
+        logger.info(f"Found {len(model_metadata)} existing model records in collection")
+        model_last_modifieds = [
+            dateutil.parser.parse(m.get("last_modified")) for m in model_metadata
+        ]
+        model_latest_update = max(model_last_modifieds)
+        logger.info(f"Most recent model record in DB from: {model_latest_update}")
+    # Set up columns to select
+    schema = model_lazy_df.collect_schema()
+    select_columns = ["modelId", "summary", "likes", "downloads", "last_modified"]
+    if "param_count" in schema:
+        logger.info("Found 'param_count' column in model data schema.")
+        select_columns.append("param_count")
+    else:
+        logger.warning("'param_count' column not found. Will add with null values.")
+    model_df = model_lazy_df.select(select_columns)
+    model_row_count = model_df.select(pl.len()).collect().item()
+    logger.info(f"Total model records in source: {model_row_count}")
+    if model_latest_update:
+        logger.info(f"Filtering model records newer than {model_latest_update}")
+        model_df = model_df.with_columns(pl.col("last_modified").str.to_datetime())
+        model_df = model_df.filter(pl.col("last_modified") > model_latest_update)
+        model_filtered_count = model_df.select(pl.len()).collect().item()
+        logger.info(f"Found {model_filtered_count} model records to update")
+    else:
+        model_filtered_count = model_df.select(pl.len()).collect().item()
+        logger.info(f"Initial model load: processing all {model_filtered_count} records")
+    if model_filtered_count > 0:
+        model_df = model_df.collect()
+        if "param_count" not in model_df.columns:
+            model_df = model_df.with_columns(
+                pl.lit(None).cast(pl.Int64).alias("param_count")
+            )
+        total_rows = len(model_df)
+        logger.info(f"Updating model collection with {total_rows} new records")
+        for i in range(0, total_rows, BATCH_SIZE):
+            batch_df = model_df.slice(i, min(BATCH_SIZE, total_rows - i))
+            collection.upsert(
+                ids=batch_df.select(["modelId"]).to_series().to_list(),
+                documents=batch_df.select(["summary"]).to_series().to_list(),
+                metadatas=[
+                    {
+                        "likes": int(likes),
+                        "downloads": int(downloads),
+                        "last_modified": str(last_modified),
+                        "param_count": int(param_count)
+                        if param_count is not None
+                        else 0,
+                    }
+                    for likes, downloads, last_modified, param_count in zip(
+                        batch_df.select(["likes"]).to_series().to_list(),
+                        batch_df.select(["downloads"]).to_series().to_list(),
+                        batch_df.select(["last_modified"]).to_series().to_list(),
+                        batch_df.select(["param_count"]).to_series().to_list(),
+                    )
+                ],
+            )
+            logger.info(
+                f"Processed {i + len(batch_df):,} / {total_rows:,} model records"
+            )
+    else:
+        logger.info("No new model records to update")
+    logger.info(f"Model collection: {collection.count():,} total records")
+def main():
+    parser = argparse.ArgumentParser(
+        description="Build ChromaDB index for datasets-search-v2"
+    )
+    parser.add_argument(
+        "--data-dir",
+        default="/data",
+        help="Path to write ChromaDB data (default: /data, the bucket mount point)",
+    )
+    parser.add_argument(
+        "--full-rebuild",
+        action="store_true",
+        help="Delete existing collections and rebuild from scratch",
+    )
+    args = parser.parse_args()
+    # Login
+    HF_TOKEN = os.environ.get("HF_TOKEN")
+    if HF_TOKEN:
+        login(token=HF_TOKEN)
+    chroma_path = os.path.join(args.data_dir, "chroma")
+    logger.info(f"ChromaDB path: {chroma_path}")
+    logger.info(f"ChromaDB version: {chromadb.__version__}")
+    client = chromadb.PersistentClient(
+        path=chroma_path,
+        settings=Settings(anonymized_telemetry=False, is_persistent=True),
+    )
+    if args.full_rebuild:
+        logger.info("Full rebuild requested — deleting existing collections")
+        for name in ["dataset_cards", "model_cards"]:
+            try:
+                client.delete_collection(name)
+                logger.info(f"Deleted collection: {name}")
+            except Exception:
+                pass
+    device = get_device()
+    logger.info(f"Using device: {device}")
+    embedding_function = get_embedding_function(device)
+    build_dataset_collection(client, embedding_function)
+    build_model_collection(client, embedding_function)
+    logger.info("=== Index build complete ===")
+if __name__ == "__main__":
+    main()
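
The incremental-update logic in both `build_*_collection` functions follows the same pattern: take the newest `last_modified` already stored, keep only source records strictly newer than it, and upsert them in fixed-size batches. A standard-library sketch of that pattern on toy data, assuming ISO-formatted timestamps and using `datetime.fromisoformat` in place of `dateutil.parser.parse` (the helper names and sample records are hypothetical):

```python
from datetime import datetime

BATCH_SIZE = 2  # tiny value for illustration; the script uses 2000

# Metadata as it might come back from collection.get(include=["metadatas"]).
existing_metadatas = [
    {"last_modified": "2025-03-01 10:00:00+00:00"},
    {"last_modified": "2025-03-05 08:30:00+00:00"},
]

# Rows as they might arrive from the source parquet files.
incoming = [
    {"id": "org/a", "last_modified": "2025-03-04T12:00:00+00:00"},
    {"id": "org/b", "last_modified": "2025-03-05T08:30:00+00:00"},
    {"id": "org/c", "last_modified": "2025-03-06T09:00:00+00:00"},
    {"id": "org/d", "last_modified": "2025-03-07T09:00:00+00:00"},
    {"id": "org/e", "last_modified": "2025-03-08T09:00:00+00:00"},
]


def latest_timestamp(metadatas):
    """Newest last_modified in the collection, or None if it is empty."""
    if not metadatas:
        return None
    return max(datetime.fromisoformat(m["last_modified"]) for m in metadatas)


def newer_than(records, cutoff):
    """Keep records strictly newer than cutoff; keep everything if no cutoff."""
    if cutoff is None:
        return list(records)
    return [
        r for r in records
        if datetime.fromisoformat(r["last_modified"]) > cutoff
    ]


def batches(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


cutoff = latest_timestamp(existing_metadatas)
to_upsert = newer_than(incoming, cutoff)
batch_ids = [[r["id"] for r in batch] for batch in batches(to_upsert, BATCH_SIZE)]
print(batch_ids)
```

Because the comparison is strict, a record whose timestamp equals the cutoff (`org/b` here) is skipped. The same applies to the `>` filter in the script.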

generate_summaries_uv.py CHANGED

@@ -6,13 +6,11 @@
 #     "huggingface-hub[hf_xet]",
 #     "polars",
 #     "stamina",
-#     "transformers",
-#     "vllm",
 #     "tqdm",
 #     "setuptools",
-#     "flashinfer-python",
 # ]
-#
 # ///
 import argparse
 import logging
@@ -54,12 +52,17 @@ logger.info(f"PyTorch version: {torch.__version__}")
 logger.info(f"vLLM version: {vllm.__version__}")
-def format_prompt(content: str, card_type: str, tokenizer) -> str:
-    """Format content as a prompt for the model."""
     if card_type == "model":
-        messages = [{"role": "user", "content": f"<MODEL_CARD>{content[:4000]}"}]
     else:
-        messages = [{"role": "user", "content": f"<DATASET_CARD>{content[:4000]}"}]
     return tokenizer.apply_chat_template(
         messages, add_generation_prompt=True, tokenize=False
@@ -67,12 +70,21 @@ def format_prompt(content: str, card_type: str, tokenizer) -> str:
 def load_and_filter_data(
-    dataset_id: str, card_type: str, min_likes: int = 1, min_downloads: int = 1
 ) -> pl.DataFrame:
-    """Load and filter dataset/model data."""
-    logger.info(f"Loading data from {dataset_id}")
-    ds = load_dataset(dataset_id, split="train")
-    df = ds.to_polars().lazy()
     # Extract content after YAML frontmatter
     df = df.with_columns(
@@ -108,6 +120,7 @@ def generate_summaries(
     min_likes: int = 1,
     min_downloads: int = 1,
     hf_token: Optional[str] = None,
 ):
     """Main function to generate summaries."""
@@ -118,13 +131,19 @@ def generate_summaries(
     # Load and filter data
     df_filtered = load_and_filter_data(
-        input_dataset_id, card_type, min_likes, min_downloads
     )
-    # Download model to local directory first
-    logger.info(f"Downloading model {model_id} to local directory...")
-    local_model_path = snapshot_download(repo_id=model_id, resume_download=True)
-    logger.info(f"Model downloaded to: {local_model_path}")
     # Initialize model and tokenizer from local path
     logger.info(f"Initializing vLLM model from local path: {local_model_path}")
@@ -229,6 +248,11 @@ def main():
     parser.add_argument(
         "--hf-token", help="Hugging Face token (uses HF_TOKEN env var if not provided)"
     )
     args = parser.parse_args()
@@ -243,6 +267,7 @@ def main():
         min_likes=args.min_likes,
         min_downloads=args.min_downloads,
         hf_token=args.hf_token,
     )

 #     "huggingface-hub[hf_xet]",
 #     "polars",
 #     "stamina",
+#     "transformers<4.52",
+#     "vllm>=0.8",
 #     "tqdm",
 #     "setuptools",
 # ]
 # ///
 import argparse
 import logging
 logger.info(f"vLLM version: {vllm.__version__}")
+def format_prompt(content: str, card_type: str, tokenizer, max_content_chars: int = 3000) -> str:
+    """Format content as a prompt for the model.
+    Truncates content to max_content_chars (default 3000) to stay safely
+    under the model's max sequence length after tokenization.
+    """
+    truncated = content[:max_content_chars]
     if card_type == "model":
+        messages = [{"role": "user", "content": f"<MODEL_CARD>{truncated}"}]
     else:
+        messages = [{"role": "user", "content": f"<DATASET_CARD>{truncated}"}]
     return tokenizer.apply_chat_template(
         messages, add_generation_prompt=True, tokenize=False
 def load_and_filter_data(
+    dataset_id: str, card_type: str, min_likes: int = 1, min_downloads: int = 1,
+    local_path: Optional[str] = None,
 ) -> pl.DataFrame:
+    """Load and filter dataset/model data.
+    If local_path is provided (e.g. a mounted volume), reads parquet files
+    directly from disk instead of downloading from the Hub.
+    """
+    if local_path:
+        logger.info(f"Loading data from local path: {local_path}")
+        df = pl.scan_parquet(os.path.join(local_path, "data", "train-*.parquet"))
+    else:
+        logger.info(f"Loading data from {dataset_id}")
+        ds = load_dataset(dataset_id, split="train")
+        df = ds.to_polars().lazy()
     # Extract content after YAML frontmatter
     df = df.with_columns(
     min_likes: int = 1,
     min_downloads: int = 1,
     hf_token: Optional[str] = None,
+    input_path: Optional[str] = None,
 ):
     """Main function to generate summaries."""
     # Load and filter data
     df_filtered = load_and_filter_data(
+        input_dataset_id, card_type, min_likes, min_downloads,
+        local_path=input_path,
     )
+    # Use model_id directly if it's a local path (e.g. mounted volume),
+    # otherwise download from the Hub
+    if os.path.isdir(model_id):
+        local_model_path = model_id
+        logger.info(f"Using model from local/mounted path: {local_model_path}")
+    else:
+        logger.info(f"Downloading model {model_id} to local directory...")
+        local_model_path = snapshot_download(repo_id=model_id, resume_download=True)
+        logger.info(f"Model downloaded to: {local_model_path}")
     # Initialize model and tokenizer from local path
     logger.info(f"Initializing vLLM model from local path: {local_model_path}")
     parser.add_argument(
         "--hf-token", help="Hugging Face token (uses HF_TOKEN env var if not provided)"
     )
+    parser.add_argument(
+        "--input-path",
+        help="Local/mounted path to input dataset (skips download). "
+             "E.g. /input when using -v hf://datasets/org/dataset:/input",
+    )
     args = parser.parse_args()
         min_likes=args.min_likes,
         min_downloads=args.min_downloads,
         hf_token=args.hf_token,
+        input_path=args.input_path,
     )
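
The truncation change in `format_prompt` can be illustrated without a tokenizer. The sketch below builds only the chat message, using a hypothetical `build_messages` helper that mirrors the tag-plus-truncated-content step, and checks the resulting character count. Character count is only a proxy here: the number of tokens depends on the tokenizer and the chat template, which this sketch does not model.

```python
def build_messages(content: str, card_type: str, max_content_chars: int = 3000):
    """Build the single-turn chat message with content cut to a character budget."""
    truncated = content[:max_content_chars]
    tag = "<MODEL_CARD>" if card_type == "model" else "<DATASET_CARD>"
    return [{"role": "user", "content": f"{tag}{truncated}"}]


long_card = "x" * 10_000  # stand-in for a very long README
messages = build_messages(long_card, "dataset")
prompt_chars = len(messages[0]["content"])

short_messages = build_messages("tiny card", "model")

print(prompt_chars)
print(short_messages[0]["content"])
```

Content shorter than the budget passes through unchanged, so the limit only affects long cards.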

main.py CHANGED

@@ -3,32 +3,27 @@ import logging
 import os
 import sys
 from contextlib import asynccontextmanager
-from datetime import datetime
 from typing import List, Optional
 import chromadb
-import dateutil.parser
 import httpx
-import polars as pl
 import torch
 from cashews import cache
 from chromadb.utils import embedding_functions
 from fastapi import FastAPI, HTTPException, Query
 from fastapi.middleware.cors import CORSMiddleware
 from pydantic import BaseModel
-from transformers import AutoTokenizer
 from dotenv import load_dotenv
 from huggingface_hub import login
 load_dotenv(override=True)
 HF_TOKEN = os.getenv("HF_TOKEN")
 login(token=HF_TOKEN)
 # Configuration constants
-MODEL_NAME = "davanstrien/Smol-Hub-tldr"
 EMBEDDING_MODEL = "Qwen/Qwen3-Embedding-0.6B"
-BATCH_SIZE = 2000
 CACHE_TTL = "24h"
-TRENDING_CACHE_TTL = "1h"  # 15 minutes cache for trending data
 if torch.cuda.is_available():
     DEVICE = "cuda"
@@ -37,34 +32,34 @@ elif torch.backends.mps.is_available():
 else:
     DEVICE = "cpu"
-tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
-os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # turn on HF_TRANSFER
-# Set up logging
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
-LOCAL = False
-if sys.platform == "darwin":
-    LOCAL = True
 DATA_DIR = "data" if LOCAL else "/data"
 # Configure cache
 cache.setup("mem://", size_limit="8gb")
-# Initialize ChromaDB client
 client = chromadb.PersistentClient(path=f"{DATA_DIR}/chroma")
 # Initialize FastAPI app
 @asynccontextmanager
 async def lifespan(app: FastAPI):
-    # Setup
-    setup_database()
     yield
-    # Cleanup
     await cache.close()
@@ -92,282 +87,6 @@ def get_embedding_function():
     )
-def setup_database():
-    try:
-        embedding_function = get_embedding_function()
-        dataset_collection = client.get_or_create_collection(
-            embedding_function=embedding_function,
-            name="dataset_cards",
-            metadata={"hnsw:space": "cosine"},
-        )
-        model_collection = client.get_or_create_collection(
-            embedding_function=embedding_function,
-            name="model_cards",
-            metadata={"hnsw:space": "cosine"},
-        )
-        # Load dataset data
-        df = pl.scan_parquet(
-            "hf://datasets/davanstrien/datasets_with_metadata_and_summaries/data/train-*.parquet"
-        )
-        df = df.filter(
-            pl.col("datasetId").str.contains_any(["open-llm-leaderboard-old/"]).not_()
-        )
-        df = df.filter(
-            pl.col("datasetId")
-            .str.contains_any(
-                ["gemma-2-2B-it-thinking-function_calling-V0"]
-            )  # course model that's not useful for retrieving
-            .not_()
-        )
-        # Get the most recent last_modified date from the collection
-        latest_update = None
-        if dataset_collection.count() > 0:
-            metadata = dataset_collection.get(include=["metadatas"]).get("metadatas")
-            logger.info(f"Found {len(metadata)} existing records in collection")
-            last_modifieds = [
-                dateutil.parser.parse(m.get("last_modified")) for m in metadata
-            ]
-            latest_update = max(last_modifieds)
-            logger.info(f"Most recent record in DB from: {latest_update}")
-            logger.info(f"Oldest record in DB from: {min(last_modifieds)}")
-            # Log sample of existing timestamps for debugging
-            sample_timestamps = sorted(last_modifieds, reverse=True)[:5]
-            logger.info(f"Sample of most recent DB timestamps: {sample_timestamps}")
-        # Filter and process only newer records
-        df = df.select(["datasetId", "summary", "likes", "downloads", "last_modified"])
-        # Log some stats about the incoming data BEFORE collecting
-        total_incoming = df.select(pl.len()).collect().item()
-        logger.info(f"Total incoming records from source: {total_incoming}")
-        # Get sample of dates to understand the data
-        sample_df = (
-            df.select(["datasetId", "last_modified"])
-            .sort("last_modified", descending=True)
-            .limit(5)
-            .collect()
-        )
-        logger.info(f"Sample of most recent incoming records: {sample_df.rows()[:3]}")
-        if latest_update:
-            logger.info(f"Filtering records newer than {latest_update}")
-            logger.info(f"Latest update type: {type(latest_update)}")
-            # Get date range before filtering
-            date_stats = df.select(
-                [
-                    pl.col("last_modified").min().alias("min_date"),
-                    pl.col("last_modified").max().alias("max_date"),
-                ]
-            ).collect()
-            logger.info(f"Incoming data date range: {date_stats.row(0)}")
-            # Ensure last_modified is datetime before comparison
-            df = df.with_columns(pl.col("last_modified").str.to_datetime())
-            df = df.filter(pl.col("last_modified") > latest_update)
-            filtered_count = df.select(pl.len()).collect().item()
-            logger.info(f"Found {filtered_count} records to update after filtering")
-            if filtered_count == 0:
-                logger.warning(
-                    "No new records found after filtering! This might indicate a problem."
-                )
-                # Log a few records that were just below the cutoff
-                just_before = (
-                    df.select(["datasetId", "last_modified"])
-                    .filter(pl.col("last_modified") <= latest_update)
-                    .sort("last_modified", descending=True)
-                    .limit(3)
-                    .collect()
-                )
-                if len(just_before) > 0:
-                    logger.info(f"Records just before cutoff: {just_before.rows()}")
-        df = df.collect()
-        total_rows = len(df)
-        if total_rows > 0:
-            logger.info(f"Updating dataset collection with {total_rows} new records")
-            logger.info(
-                f"Date range of updates: {df['last_modified'].min()} to {df['last_modified'].max()}"
-            )
-            for i in range(0, total_rows, BATCH_SIZE):
-                batch_df = df.slice(i, min(BATCH_SIZE, total_rows - i))
-                batch_size = len(batch_df)
-                logger.info(
-                    f"Processing batch {i // BATCH_SIZE + 1}: {batch_size} records "
-                    f"({batch_df['last_modified'].min()} to {batch_df['last_modified'].max()})"
-                )
-                ids_to_upsert = batch_df.select(["datasetId"]).to_series().to_list()
-                # Log progress for every batch
-                if i == 0 or (i // BATCH_SIZE + 1) % 5 == 0:  # Log every 5th batch
-                    logger.info(f"Upserting batch {i // BATCH_SIZE + 1} (sample IDs: {ids_to_upsert[:3]})")
-                # Check if any of these already exist (sample only)
-                if i == 0:  # Only log for first batch to reduce noise
-                    existing_check = dataset_collection.get(
-                        ids=ids_to_upsert[:3], include=["metadatas"]
-                    )
-                    if existing_check["ids"]:
-                        logger.info(
-                            f"Sample: {len(existing_check['ids'])} existing records being updated"
-                        )
-                dataset_collection.upsert(
-                    ids=ids_to_upsert,
-                    documents=batch_df.select(["summary"]).to_series().to_list(),
-                    metadatas=[
-                        {
-                            "likes": int(likes),
-                            "downloads": int(downloads),
-                            "last_modified": str(last_modified),
-                        }
-                        for likes, downloads, last_modified in zip(
-                            batch_df.select(["likes"]).to_series().to_list(),
-                            batch_df.select(["downloads"]).to_series().to_list(),
-                            batch_df.select(["last_modified"]).to_series().to_list(),
-                        )
-                    ],
-                )
-                logger.info(f"Processed {i + batch_size:,} / {total_rows:,} records")
-        # Final validation
-        final_count = dataset_collection.count()
-        logger.info(f"Database initialized with {final_count:,} total rows")
-        # Verify the update worked by checking latest records
-        if final_count > 0:
-            # Get ALL metadata to find the true latest timestamp (not just 5 records)
-            final_metadata = dataset_collection.get(include=["metadatas"])
-            final_timestamps = [
-                dateutil.parser.parse(m.get("last_modified"))
-                for m in final_metadata.get("metadatas")
-            ]
-            if final_timestamps:
-                latest_after_update = max(final_timestamps)
-                logger.info(f"Latest record after update: {latest_after_update}")
-                if latest_update and latest_after_update <= latest_update:
-                    logger.error(
-                        "WARNING: No new records were added! Latest timestamp hasn't changed."
-                    )
-                elif latest_update:
-                    logger.info(
-                        f"Successfully added records from {latest_update} to {latest_after_update}"
-                    )
-                else:
-                    logger.info(f"Initial database setup completed. Latest record: {latest_after_update}")
-        # Load model data
-        model_lazy_df = pl.scan_parquet(
-            "hf://datasets/davanstrien/models_with_metadata_and_summaries/data/train-*.parquet"
-        )
-        model_row_count = model_lazy_df.select(pl.len()).collect().item()
-        logger.info(f"Total model records in source: {model_row_count}")
-        # Get the most recent last_modified date from the model collection
-        model_latest_update = None
-        if model_collection.count() > 0:
-            model_metadata = model_collection.get(include=["metadatas"]).get(
-                "metadatas"
-            )
-            logger.info(
-                f"Found {len(model_metadata)} existing model records in collection"
-            )
-            model_last_modifieds = [
-                dateutil.parser.parse(m.get("last_modified")) for m in model_metadata
-            ]
-            model_latest_update = max(model_last_modifieds)
-            logger.info(f"Most recent model record in DB from: {model_latest_update}")
-        # Set up model schema columns
-        schema = model_lazy_df.collect_schema()
-        select_columns = [
-            "modelId",
-            "summary",
-            "likes",
-            "downloads",
-            "last_modified",
-        ]
-        if "param_count" in schema:
-            logger.info("Found 'param_count' column in model data schema.")
-            select_columns.append("param_count")
-        else:
-            logger.warning(
-                "'param_count' column not found in model data schema. Will add it with null values."
-            )
-        # Filter and process only newer model records
-        model_df = model_lazy_df.select(select_columns)
-        # Apply timestamp filtering like we do for datasets
-        if model_latest_update:
-            logger.info(f"Filtering model records newer than {model_latest_update}")
-            model_df = model_df.with_columns(pl.col("last_modified").str.to_datetime())
-            model_df = model_df.filter(pl.col("last_modified") > model_latest_update)
-            model_filtered_count = model_df.select(pl.len()).collect().item()
-            logger.info(f"Found {model_filtered_count} model records to update after filtering")
-        else:
-            model_filtered_count = model_df.select(pl.len()).collect().item()
-            logger.info(f"Initial model load: processing all {model_filtered_count} model records")
-        if model_filtered_count > 0:
-            model_df = model_df.collect()
-            # If param_count was not in the original schema, add it now to the collected DataFrame
-            if "param_count" not in model_df.columns:
-                model_df = model_df.with_columns(
-                    pl.lit(None).cast(pl.Int64).alias("param_count")
-                )
-            total_rows = len(model_df)
-            logger.info(f"Updating model collection with {total_rows} new records")
-            for i in range(0, total_rows, BATCH_SIZE):
-                batch_df = model_df.slice(i, min(BATCH_SIZE, total_rows - i))
-                model_collection.upsert(
-                    ids=batch_df.select(["modelId"]).to_series().to_list(),
-                    documents=batch_df.select(["summary"]).to_series().to_list(),
-                    metadatas=[
-                        {
-                            "likes": int(likes),
-                            "downloads": int(downloads),
-                            "last_modified": str(last_modified),
-                            "param_count": int(param_count)
-                            if param_count is not None
-                            else 0,
-                        }
-                        for likes, downloads, last_modified, param_count in zip(
-                            batch_df.select(["likes"]).to_series().to_list(),
-                            batch_df.select(["downloads"]).to_series().to_list(),
-                            batch_df.select(["last_modified"]).to_series().to_list(),
-                            batch_df.select(["param_count"]).to_series().to_list(),
-                        )
-                    ],
-                )
-                logger.info(
-                    f"Processed {i + len(batch_df):,} / {total_rows:,} model rows"
-                )
-        logger.info(
-            f"Model database initialized with {model_collection.count():,} rows"
-        )
-    except Exception as e:
-        logger.error(f"Setup error: {e}")
-# Setup database is called in lifespan, not here
 class QueryResult(BaseModel):
     dataset_id: str
     similarity: float

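The batched upsert removed from main.py above follows a common pattern: slice the rows into fixed-size batches and build one metadata dict per row, storing 0 when `param_count` is missing. The commit message says index building now lives in build_chroma_index.py; that script's actual implementation is not shown here. What follows is only a minimal sketch of the pattern in plain Python, with no ChromaDB or Polars dependency. `iter_batches` and `build_metadata` are hypothetical helper names, not functions from this repository, and the sample rows and batch size are made up for illustration.

```python
from typing import Any, Dict, Iterator, List, Optional


def iter_batches(
    rows: List[Dict[str, Any]], batch_size: int
) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def build_metadata(row: Dict[str, Any]) -> Dict[str, Any]:
    """Build a flat metadata dict; a missing param_count is stored as 0."""
    param_count: Optional[int] = row.get("param_count")
    return {
        "likes": int(row["likes"]),
        "downloads": int(row["downloads"]),
        "last_modified": str(row["last_modified"]),
        "param_count": int(param_count) if param_count is not None else 0,
    }


if __name__ == "__main__":
    # Made-up sample rows; odd-numbered rows have no param_count.
    rows = [
        {
            "modelId": f"org/model-{i}",
            "likes": i,
            "downloads": 10 * i,
            "last_modified": "2025-01-01",
            "param_count": None if i % 2 else 1000,
        }
        for i in range(5)
    ]
    for batch in iter_batches(rows, batch_size=2):
        ids = [r["modelId"] for r in batch]
        metadatas = [build_metadata(r) for r in batch]
        # The removed code passed lists like these to model_collection.upsert(...).
        print(ids, metadatas)
```

Keeping the metadata values as plain ints and strings mirrors the removed code, which cast every field explicitly before handing it to the collection.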
 import os
 import sys
 from contextlib import asynccontextmanager
 from typing import List, Optional
 import chromadb
 import httpx
 import torch
 from cashews import cache
 from chromadb.utils import embedding_functions
 from fastapi import FastAPI, HTTPException, Query
 from fastapi.middleware.cors import CORSMiddleware
 from pydantic import BaseModel
 from dotenv import load_dotenv
 from huggingface_hub import login
 load_dotenv(override=True)
 HF_TOKEN = os.getenv("HF_TOKEN")
 login(token=HF_TOKEN)
 # Configuration constants
 EMBEDDING_MODEL = "Qwen/Qwen3-Embedding-0.6B"
 CACHE_TTL = "24h"
+TRENDING_CACHE_TTL = "1h"
 if torch.cuda.is_available():
     DEVICE = "cuda"
 else:
     DEVICE = "cpu"
+os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
+LOCAL = sys.platform == "darwin"
 DATA_DIR = "data" if LOCAL else "/data"
 # Configure cache
 cache.setup("mem://", size_limit="8gb")
+# Initialize ChromaDB client (index is pre-built by build_chroma_index.py Job)
 client = chromadb.PersistentClient(path=f"{DATA_DIR}/chroma")
 # Initialize FastAPI app
 @asynccontextmanager
 async def lifespan(app: FastAPI):
+    # Index is pre-built by build_chroma_index.py Job — no setup needed
+    logger.info(f"ChromaDB path: {DATA_DIR}/chroma")
+    try:
+        dc = client.get_collection("dataset_cards")
+        mc = client.get_collection("model_cards")
+        logger.info(f"dataset_cards: {dc.count():,} records, model_cards: {mc.count():,} records")
+    except Exception as e:
+        logger.error(f"Failed to read collections — is the bucket mounted at {DATA_DIR}? {e}")
     yield
     await cache.close()
     )
 class QueryResult(BaseModel):
     dataset_id: str
     similarity: float