---
title: HuggingFace EDA MCP Server
short_description: MCP server to explore and analyze HuggingFace datasets
emoji: 📊
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.0
app_file: src/app.py
pinned: false
license: apache-2.0
app_port: 7860
tags:
  - building-mcp-track-enterprise
  - building-mcp-track-consumer
---

# 📊 HuggingFace EDA MCP Server

> 🎉 Submission for the [Gradio MCP 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)

An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.

Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of the workflow. This server automates the tedious parts, such as fetching metadata, sampling data, and computing statistics, so you can focus on what matters: finding and understanding the right data for your task.

**Use cases:**

- **Dataset discovery**:
  - Inspect metadata, schemas, and samples to evaluate datasets before use
  - Combine it with the HuggingFace MCP `search_dataset` tool for even more powerful dataset discovery
- **Exploratory data analysis**:
  - Analyze feature distributions, detect missing values, and review statistics
  - Ask your AI assistant to build reports and visualizations
- **Content search**: Find specific examples in datasets using text search

Demo Video · LinkedIn Post · HF Space

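Once connected, an MCP client invokes the server's tools over JSON-RPC. As an illustration, a request for dataset metadata looks roughly like the sketch below — the `tools/call` method and the `name`/`arguments` shape come from the MCP specification, not from this server's code, so treat it as a convention sketch rather than a wire-format guarantee:

```python
import json

# Illustrative sketch: the JSON-RPC payload an MCP client sends to call
# the get_dataset_metadata tool. The "tools/call" method and params shape
# follow the MCP specification; only dataset_id is required by this tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_dataset_metadata",
        "arguments": {"dataset_id": "imdb"},
    },
}
print(json.dumps(request, indent=2))
```

In practice your MCP client builds and sends these payloads for you; the configuration below is all that's needed.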
## MCP Client Configuration

Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.

**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`

### With URL

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
      "headers": {
        "hf-api-token": "hf_xxx"
      }
    }
  }
}
```

### With mcp-remote

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
        "--transport", "streamable-http",
        "--header", "hf-api-token: hf_xxx"
      ]
    }
  }
}
```

## Available Tools

### `get_dataset_metadata`

Retrieve comprehensive metadata about a HuggingFace dataset.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier (e.g., `imdb`, `squad`, `glue`) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |

**Returns:** Dataset size, features schema, split info, configurations, download stats, tags, download size, description, and more.

---

### `get_dataset_sample`

Retrieve sample rows from a dataset for quick exploration.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to sample from |
| `num_samples` | int | ❌ | `10` | Number of samples to retrieve (max: 10,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
| `streaming` | bool | ❌ | `True` | Use streaming mode for efficient loading |

**Returns:** Sample data rows with schema information and sampling metadata.

---

### `analyze_dataset_features`

Perform exploratory data analysis on dataset features with automatic optimization.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to analyze |
| `sample_size` | int | ❌ | `1000` | Number of samples for analysis (max: 50,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |

**Returns:** Feature types, statistics (mean, std, min, max for numerical features), distributions, histograms, and missing-value analysis. Supports numerical, categorical, text, image, and audio data types.

---

### `search_text_in_dataset`

Search for text in dataset columns using the Dataset Viewer API.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | Full dataset identifier (e.g., `stanfordnlp/imdb`) |
| `config_name` | string | ✅ | - | Configuration name |
| `split` | string | ✅ | - | Split name |
| `query` | string | ✅ | - | Search query |
| `offset` | int | ❌ | `0` | Pagination offset |
| `length` | int | ❌ | `10` | Number of results to return |

**Returns:** Matching rows with highlighted search results. Only works on parquet datasets with text columns.

---

## How It Works

### API Integrations

The server leverages multiple HuggingFace APIs:

| API | Used For |
|-----|----------|
| **[Hub API](https://huggingface.co/docs/huggingface_hub/guides/hf_api)** | Dataset metadata, repository info, download stats |
| **[Dataset Viewer API](https://huggingface.co/docs/dataset-viewer)** | Full dataset statistics, text search, parquet row access |
| **[datasets library](https://huggingface.co/docs/datasets)** | Streaming data loading, sample extraction |

### Data Loading Strategy

- **Streaming mode** (default): Uses `datasets.load_dataset(..., streaming=True)` to avoid downloading entire datasets. Samples are taken from an iterator, minimizing the memory footprint.
- **Statistics API**: For parquet datasets, `analyze_dataset_features` first attempts to fetch pre-computed statistics from the Dataset Viewer API (`/statistics` endpoint), providing full-dataset coverage without sampling.
- **Fallback**: If statistics aren't available, analysis falls back to sample-based computation.

### Caching

Results are cached locally to reduce API calls:

| Cache Type | TTL | Location |
|------------|-----|----------|
| Metadata | 1 hour | `~/.cache/hf_eda_mcp/metadata/` |
| Samples | 1 hour | `~/.cache/hf_eda_mcp/samples/` |
| Statistics | 1 hour | `~/.cache/hf_eda_mcp/statistics/` |

### Parquet Requirements

Some features require datasets with `builder_name="parquet"`:

- **Text search** (`search_text_in_dataset`): Only parquet datasets are searchable
- **Full statistics**: Pre-computed stats are only available for parquet datasets

### Error Handling

- Automatic retry with exponential backoff for transient network errors
- Graceful fallback from the statistics API to sample-based analysis
- Descriptive error messages with suggestions for common issues

## Project Structure

```
src/hf_eda_mcp/
├── server.py              # Gradio app with MCP server setup
├── config.py              # Server configuration (env vars, defaults)
├── validation.py          # Input validation for all tools
├── error_handling.py      # Retry logic, error formatting
├── tools/                 # MCP tools (exposed via Gradio)
│   ├── metadata.py        # get_dataset_metadata
│   ├── sampling.py        # get_dataset_sample
│   ├── analysis.py        # analyze_dataset_features
│   └── search.py          # search_text_in_dataset
├── services/              # Business logic layer
│   └── dataset_service.py # Caching, data loading, statistics
└── integrations/
    ├── dataset_viewer_adapter.py  # Dataset Viewer API client
    └── hf_client.py       # HuggingFace Hub API wrapper (HfApi)
```

## Local Development

### Setup

```bash
# Install pdm
brew install pdm

# Clone the repository
git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
cd hf-eda-mcp

# Install dependencies
pdm install

# Set your HuggingFace token
export HF_TOKEN=hf_xxx
# or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)

# Run the server
pdm run hf-eda-mcp
```

The server starts at `http://localhost:7860` with the MCP endpoint at `/gradio_api/mcp/`.

## License

Apache License 2.0
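For illustration, the retry-with-exponential-backoff behaviour described under Error Handling can be sketched as below. This is a simplified stand-in, not the server's actual code: the real logic lives in `src/hf_eda_mcp/error_handling.py` and may differ in attempt counts, delays, and the exception types it catches.

```python
import time

# Simplified sketch of retry with exponential backoff for transient
# network errors (hypothetical helper; not the server's actual code).
def with_retries(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on ConnectionError with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Usage: simulate a call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

delays = []
result = with_retries(flaky, sleep=delays.append)
print(result, delays)  # → ok [0.5, 1.0]
```

Injecting the `sleep` callable keeps the sketch testable without actually waiting between attempts.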