# Scripts Documentation 🚀

Automated scripts for HeoCare Chatbot setup and maintenance.

## 📋 Quick Start

### One-Command Setup (Recommended)

```bash
# Run everything in one command
bash scripts/setup_rag.sh
```

**What it does:**
1. ✅ Checks Python & dependencies
2. ✅ Installs required packages
3. ✅ Downloads 6 medical datasets from HuggingFace
4. ✅ Builds ChromaDB vector stores (~160 MB)
5. ✅ Generates training data (200 conversations)
6. ✅ Optional: fine-tunes agents

**Time:** ~15-20 minutes (depends on internet speed)

---

## 📜 Available Scripts

### 1. `setup_rag.sh` ⭐ Main Setup

```bash
bash scripts/setup_rag.sh
```

**Features:**
- Downloads 6 datasets from HuggingFace:
  - ViMedical (603 diseases)
  - MentalChat16K (16K conversations)
  - Nutrition recommendations
  - Vietnamese food nutrition
  - Fitness exercises (1.66K)
  - Medical Q&A (9.3K pairs)
- Builds ChromaDB vector stores
- Generates training data
- Optional fine-tuning

**Existing databases are skipped automatically!**

---

### 2. `generate_training_data.py` - Training Data

```bash
python scripts/generate_training_data.py
```

**What it does:**
- Generates 200 synthetic conversations
- 50 scenarios per agent (nutrition, symptom, exercise, mental_health)
- Uses GPT-4o-mini
- Output: `fine_tuning/training_data/*.jsonl`

**Cost:** ~$0.50 (OpenAI API)

---

### 3. `auto_finetune.py` - Batch Fine-tuning

```bash
python scripts/auto_finetune.py
```

**What it does:**
- Fine-tunes all 4 agents automatically
- Uploads training files
- Creates fine-tuning jobs
- Tracks progress
- Updates the model config

**Requirements:** official OpenAI API (custom API endpoints are not supported)

---

### 4. `fine_tune_agent.py` - Single Agent Fine-tuning

```bash
python scripts/fine_tune_agent.py nutrition_agent
```

**What it does:**
- Fine-tunes one specific agent
- Gives manual control over the process
- Alternative to `auto_finetune.py`

**Agents:** `nutrition_agent`, `symptom_agent`, `exercise_agent`, `mental_health_agent`

---
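Each generated `*.jsonl` file holds one conversation per line in the chat-format JSONL that OpenAI's fine-tuning endpoint expects: a JSON object with a `messages` array. A sketch of a single record (the prompt and reply text are invented for illustration, not taken from the actual generator):

```json
{"messages": [{"role": "system", "content": "You are a nutrition assistant."}, {"role": "user", "content": "What should I eat after a workout?"}, {"role": "assistant", "content": "A mix of protein and carbohydrates, e.g. chicken with rice."}]}
```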
### 5. `check_rag_status.py` - Diagnostic Tool

```bash
python scripts/check_rag_status.py
```

**What it checks:**
- ✅ ChromaDB folders exist
- 📊 Database sizes
- 📚 Document counts
- 🧪 Test queries

**Note:** May need updates for new vector store paths

---

## 📁 Directory Structure

```
scripts/
├── setup_rag.sh                 # ⭐ Main setup script
├── generate_training_data.py    # Generate synthetic data
├── auto_finetune.py             # Batch fine-tuning
├── fine_tune_agent.py           # Single agent fine-tuning
├── check_rag_status.py          # Diagnostic tool
└── README.md                    # This file

data_mining/                     # Dataset downloaders
├── mining_vimedical.py          # ViMedical diseases
├── mining_mentalchat.py         # Mental health conversations
├── mining_nutrition.py          # Nutrition recommendations
├── mining_vietnamese_food.py    # Vietnamese food data
├── mining_fitness.py            # Fitness exercises
└── mining_medical_qa.py         # Medical Q&A pairs

rag/vector_store/                # ChromaDB (NOT committed)
├── medical_diseases/            # ViMedical (603 diseases)
├── mental_health/               # MentalChat (16K conversations)
├── nutrition/                   # Nutrition plans
├── vietnamese_nutrition/        # Vietnamese foods (73)
├── fitness/                     # Exercises (1.66K)
├── symptom_qa/                  # Medical Q&A
└── general_health_qa/           # General health Q&A

fine_tuning/training_data/       # Generated data (NOT committed)
├── nutrition_training.jsonl
├── symptom_training.jsonl
├── exercise_training.jsonl
└── mental_health_training.jsonl
```

---

## 🔄 Team Workflow

### First Time Setup (New Team Member)

```bash
# 1. Clone repo
git clone <repo-url>
cd heocare-chatbot

# 2. Create .env file
cp .env.example .env
# Add your OPENAI_API_KEY

# 3. Setup everything (one command)
bash scripts/setup_rag.sh

# 4. Run app
python app.py
```

**Time:** ~15-20 minutes

---

### Daily Development

```bash
# Pull latest code
git pull

# If setup_rag.sh was updated, run it again
# (it will skip existing databases automatically)
bash scripts/setup_rag.sh

# Run app
python app.py
```

---

### Regenerate Training Data

```bash
# If you updated agent prompts or scenarios
python scripts/generate_training_data.py

# Optional: fine-tune with the new data
python scripts/auto_finetune.py
```

---

### Reset Everything

```bash
# Delete all generated data
rm -rf rag/vector_store/*
rm -rf fine_tuning/training_data/*
rm -rf data_mining/datasets/*
rm -rf data_mining/output/*

# Set up from scratch
bash scripts/setup_rag.sh
```

---

## 🐛 Troubleshooting

### Setup Failed

```bash
# Check Python version (3.8+ required)
python --version

# Check dependencies
pip install -r requirements.txt

# Check API key
echo $OPENAI_API_KEY
```

---

### Dataset Download Failed

```bash
# Check internet connection
ping huggingface.co

# Try a manual download for a specific dataset
python data_mining/mining_vimedical.py
python data_mining/mining_mentalchat.py
```

---

### ChromaDB Issues

```bash
# Check status
python scripts/check_rag_status.py

# Delete and rebuild a specific database
rm -rf rag/vector_store/medical_diseases
python data_mining/mining_vimedical.py

# Move it to the correct location
mkdir -p rag/vector_store
mv data_mining/output/medical_chroma rag/vector_store/medical_diseases
```

---

### Fine-tuning 404 Error

```
Error: 404 - {'detail': 'Not Found'}
```

**Cause:** The custom API endpoint doesn't support fine-tuning.

**Solution:**
1. Use the official OpenAI API for fine-tuning
2. Or skip fine-tuning (the app works fine with the base model + RAG)

```bash
# Option 1: Update .env to use the official API
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-proj-your-official-key

# Option 2: Skip fine-tuning
# Just run the app without fine-tuning
python app.py
```

---

## 📊 Performance

| Task | Time | Size |
|------|------|------|
| Download datasets | ~5-8 min | ~50 MB |
| Build ChromaDB | ~5-7 min | ~160 MB |
| Generate training data | ~2-3 min | ~500 KB |
| Fine-tuning (optional) | ~30-60 min | - |
| **Total setup** | **~15-20 min** | **~160 MB** |

---

## 🆘 Support

If you encounter issues:
1. Run `python scripts/check_rag_status.py` for diagnostics
2. Check console logs for errors
3. Verify `.gitignore` is correct
4. Try deleting and rebuilding specific databases
5. Check that `.env` has a valid API key

---

**Happy Coding! 🚀**
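Before uploading training files (whether via `auto_finetune.py` or manually), it can help to sanity-check the generated JSONL. A minimal stdlib-only sketch; the file names follow the layout above, and the shape it checks (a `messages` list of `{"role", "content"}` dicts) is OpenAI's chat fine-tuning format:

```python
import json
from pathlib import Path

def validate_jsonl(path: str) -> int:
    """Return the number of valid chat-format records in a JSONL file.

    Raises ValueError on the first malformed line.
    """
    count = 0
    for i, line in enumerate(Path(path).read_text(encoding="utf-8").splitlines(), 1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        messages = record.get("messages")
        if not isinstance(messages, list) or not messages:
            raise ValueError(f"{path}:{i}: missing or empty 'messages'")
        for msg in messages:
            if msg.get("role") not in {"system", "user", "assistant"}:
                raise ValueError(f"{path}:{i}: bad role {msg.get('role')!r}")
            if not isinstance(msg.get("content"), str):
                raise ValueError(f"{path}:{i}: 'content' must be a string")
        count += 1
    return count

if __name__ == "__main__":
    # Hypothetical usage over the files generated above
    for name in ["nutrition", "symptom", "exercise", "mental_health"]:
        path = f"fine_tuning/training_data/{name}_training.jsonl"
        if Path(path).exists():
            print(f"{path}: {validate_jsonl(path)} records")
```

This only checks structure, not content quality; a record can pass validation and still be a poor training example.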