---
title: HuggingFace EDA MCP Server
short_description: MCP server to explore and analyze HuggingFace datasets
emoji: 📊
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.0
app_file: src/app.py
pinned: false
license: apache-2.0
app_port: 7860
tags:
- building-mcp-track-enterprise
- building-mcp-track-consumer
---
# 📊 HuggingFace EDA MCP Server
> 🎉 Submission for the [Gradio MCP 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)
An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.
Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of the workflow. This server automates the tedious parts, such as fetching metadata, sampling data, and computing statistics, so you can focus on what matters: finding and understanding the right data for your task.
**Use cases:**
- **Dataset discovery**:
- Inspect metadata, schemas, and samples to evaluate datasets before use
- Use it in conjunction with HuggingFace MCP `search_dataset` for even more powerful dataset discovery
- **Exploratory data analysis**:
- Analyze feature distributions, detect missing values, and review statistics
- Ask your AI assistant to build reports and visualizations
- **Content search**: Find specific examples in datasets using text search
## MCP Client Configuration
Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API; paste it into the empty `hf-api-token` value in the configs below.
**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
### With URL
```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
      "headers": {
        "hf-api-token": ""
      }
    }
  }
}
```
### With mcp-remote
```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
        "--transport",
        "streamable-http",
        "--header",
        "hf-api-token: "
      ]
    }
  }
}
```
## Available Tools
### `get_dataset_metadata`
Retrieve comprehensive metadata about a HuggingFace dataset.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier (e.g., `imdb`, `squad`, `glue`) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
**Returns:** Dataset size, feature schema, split info, configurations, download stats, tags, download size, description, and more.
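Because the server is an ordinary Gradio app, the tools can also be called directly with `gradio_client`. A minimal sketch, assuming the endpoint name mirrors the tool name (check the app's API page for the exact `api_name`):
```python
from gradio_client import Client

# Hypothetical direct call; the api_name is assumed to match the tool name.
client = Client("https://mcp-1st-birthday-hf-eda-mcp.hf.space/")
meta = client.predict(dataset_id="stanfordnlp/imdb", api_name="/get_dataset_metadata")
print(meta)
```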
---
### `get_dataset_sample`
Retrieve sample rows from a dataset for quick exploration.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to sample from |
| `num_samples` | int | ❌ | `10` | Number of samples to retrieve (max: 10,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
| `streaming` | bool | ❌ | `True` | Use streaming mode for efficient loading |
**Returns:** Sample data rows with schema information and sampling metadata.
---
### `analyze_dataset_features`
Perform exploratory data analysis on dataset features with automatic optimization.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to analyze |
| `sample_size` | int | ❌ | `1000` | Number of samples for analysis (max: 50,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
**Returns:** Feature types, statistics (mean, std, min, max for numerical), distributions, histograms, and missing value analysis. Supports numerical, categorical, text, image, and audio data types.
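When a dataset is parquet-backed, this tool first tries the Dataset Viewer's pre-computed statistics (see How It Works below). A sketch of that underlying call against the documented `/statistics` endpoint; the `plain_text` config is specific to `stanfordnlp/imdb`:
```python
import requests

resp = requests.get(
    "https://datasets-server.huggingface.co/statistics",
    params={"dataset": "stanfordnlp/imdb", "config": "plain_text", "split": "train"},
    timeout=30,
)
resp.raise_for_status()
for col in resp.json()["statistics"]:  # one entry per column
    print(col["column_name"], col["column_type"])
```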
---
### `search_text_in_dataset`
Search for text in dataset columns using the Dataset Viewer API.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | Full dataset identifier (e.g., `stanfordnlp/imdb`) |
| `config_name` | string | ✅ | - | Configuration name |
| `split` | string | ✅ | - | Split name |
| `query` | string | ✅ | - | Search query |
| `offset` | int | ❌ | `0` | Pagination offset |
| `length` | int | ❌ | `10` | Number of results to return |
**Returns:** Matching rows with highlighted search results. Only works on parquet datasets with text columns.
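Under the hood this wraps the Dataset Viewer's `/search` endpoint; an equivalent direct call (public dataset, no token) looks roughly like:
```python
import requests

resp = requests.get(
    "https://datasets-server.huggingface.co/search",
    params={
        "dataset": "stanfordnlp/imdb",
        "config": "plain_text",
        "split": "train",
        "query": "masterpiece",
        "offset": 0,
        "length": 10,
    },
    timeout=30,
)
resp.raise_for_status()
for row in resp.json()["rows"]:  # each entry carries row_idx and the row itself
    print(row["row_idx"], str(row["row"])[:80])
```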
---
## How It Works
### API Integrations
The server leverages multiple HuggingFace APIs:
| API | Used For |
|-----|----------|
| **[Hub API](https://huggingface.co/docs/huggingface_hub/guides/hf_api)** | Dataset metadata, repository info, download stats |
| **[Dataset Viewer API](https://huggingface.co/docs/dataset-viewer)** | Full dataset statistics, text search, parquet row access |
| **[datasets library](https://huggingface.co/docs/datasets)** | Streaming data loading, sample extraction |
### Data Loading Strategy
- **Streaming mode** (default): Uses `datasets.load_dataset(..., streaming=True)` to avoid downloading entire datasets. Samples are taken from an iterator, minimizing the memory footprint (see the sketch after this list).
- **Statistics API**: For parquet datasets, `analyze_dataset_features` first attempts to fetch pre-computed statistics from the Dataset Viewer API (`/statistics` endpoint), providing full dataset coverage without sampling.
- **Fallback**: If statistics aren't available, analysis falls back to sample-based computation.
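A minimal sketch of the streaming pattern described above, using the public `datasets` API:
```python
from itertools import islice
from datasets import load_dataset

# streaming=True yields an IterableDataset: rows arrive lazily over HTTP,
# so taking a handful of samples never downloads the full dataset.
ds = load_dataset("stanfordnlp/imdb", split="train", streaming=True)
samples = list(islice(ds, 10))
print(samples[0].keys())
```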
### Caching
Results are cached locally to reduce API calls:
| Cache Type | TTL | Location |
|------------|-----|----------|
| Metadata | 1 hour | `~/.cache/hf_eda_mcp/metadata/` |
| Samples | 1 hour | `~/.cache/hf_eda_mcp/samples/` |
| Statistics | 1 hour | `~/.cache/hf_eda_mcp/statistics/` |
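Illustratively, the cache amounts to a TTL check on files under `~/.cache/hf_eda_mcp/` (a sketch, not the exact implementation):
```python
import json
import time
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "hf_eda_mcp" / "metadata"
TTL_SECONDS = 3600  # 1 hour, matching the table above

def cached_get(key: str, fetch):
    """Return the cached JSON value if still fresh, else fetch and store it."""
    path = CACHE_DIR / f"{key}.json"
    if path.exists() and time.time() - path.stat().st_mtime < TTL_SECONDS:
        return json.loads(path.read_text())
    value = fetch()
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(value))
    return value
```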
### Parquet Requirements
Some features require datasets with `builder_name="parquet"` (a quick check is sketched after this list):
- **Text search** (`search_text_in_dataset`): Only parquet datasets are searchable
- **Full statistics**: Pre-computed stats are only available for parquet datasets
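One way to check whether a dataset qualifies is the Dataset Viewer's `/info` endpoint; the response shape below is an assumption based on its docs:
```python
import requests

info = requests.get(
    "https://datasets-server.huggingface.co/info",
    params={"dataset": "stanfordnlp/imdb"},
    timeout=30,
).json()
for config, details in info["dataset_info"].items():
    print(config, details.get("builder_name"))  # builder per config
```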
### Error Handling
- Automatic retry with exponential backoff for transient network errors (sketched below)
- Graceful fallback from statistics API to sample-based analysis
- Descriptive error messages with suggestions for common issues
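The retry behavior amounts to something like the following sketch (illustrative, not the exact implementation):
```python
import random
import time

def with_retry(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry fn on transient network errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise
            # waits ~1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * 2**attempt + random.random())
```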
## Project Structure
```
src/hf_eda_mcp/
├── server.py               # Gradio app with MCP server setup
├── config.py               # Server configuration (env vars, defaults)
├── validation.py           # Input validation for all tools
├── error_handling.py       # Retry logic, error formatting
├── tools/                  # MCP tools (exposed via Gradio)
│   ├── metadata.py         # get_dataset_metadata
│   ├── sampling.py         # get_dataset_sample
│   ├── analysis.py         # analyze_dataset_features
│   └── search.py           # search_text_in_dataset
├── services/               # Business logic layer
│   └── dataset_service.py  # Caching, data loading, statistics
└── integrations/
    ├── dataset_viewer_adapter.py # Dataset Viewer API client
    └── hf_client.py        # HuggingFace Hub API wrapper (HfApi)
```
## Local Development
### Setup
```bash
# Install pdm (macOS via Homebrew; on other platforms: pipx install pdm)
brew install pdm

# Clone the repository
git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
cd hf-eda-mcp

# Install dependencies
pdm install

# Set your HuggingFace token
export HF_TOKEN=hf_xxx
# or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)

# Run the server
pdm run hf-eda-mcp
```
The server starts at `http://localhost:7860`, with the MCP endpoint at `/gradio_api/mcp/`.
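Once it's running, a quick sanity check is to fetch the MCP tool schema that Gradio serves (path per Gradio's MCP docs):
```python
import requests

# Prints the schema of the MCP tools exposed by the local server.
resp = requests.get("http://localhost:7860/gradio_api/mcp/schema", timeout=10)
print(resp.json())
```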
## License
Apache License 2.0