---
base_model:
- Qwen/Qwen2.5-7B-Instruct
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- data-analysis
- code-generation
- qwen
---

This repository contains the **DataMind-Qwen2.5-7B** model, presented in the paper [Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study](https://huggingface.co/papers/2506.19794).

**Paper Abstract:**

Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.

For more details, visit the official [DataMind GitHub repository](https://github.com/zjunlp/DataMind).

# ✨ DataMind

## 🔧 Installation

#### 🔩 Manual Environment Configuration

Conda virtual environments offer a lightweight and flexible setup.

**Prerequisites**

- Anaconda Installation
- GPU support (recommended CUDA version: 12.4)

**Configure Steps**

1. Clone the repository:

   ```bash
   git clone https://github.com/zjunlp/DataMind.git
   ```

2. Enter the working directory; all subsequent commands should be executed in this directory.

   ```bash
   cd DataMind/eval
   ```

3. Create a virtual environment using `Anaconda`:

   ```bash
   conda create -n DataMind python=3.10
   conda activate DataMind
   ```

4. Install all required Python packages:

   ```bash
   pip install -r requirements.txt
   ```

## Usage (Text Generation for Data Analysis)

You can use this model with the Hugging Face `transformers` library for text generation, particularly for data analysis and code generation tasks.

First, ensure you have the `transformers` library installed:

```bash
pip install transformers torch
```

Then you can load and use the model as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "zjunlp/DataMind-Qwen2.5-7B"  # or zjunlp/DataMind-Qwen2.5-14B

# Load the model and tokenizer.
# Use torch_dtype=torch.bfloat16 for better performance on compatible GPUs.
# Use device_map="auto" to automatically distribute the model across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Example: generate Python code for data analysis
messages = [
    {"role": "user", "content": "I have a CSV file named 'sales_data.csv' with columns 'Date', 'Product', 'Quantity', 'Price'. Write Python code using pandas to calculate the total revenue for each product and save it to a new CSV file named 'product_revenue.csv'."}
]

# Apply the chat template for Qwen models
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id,  # ensure generation stops at the EOS token
)

# Strip the prompt tokens, then decode and print only the generated text
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## 🧐 Evaluation

> Note:
>
> - **Ensure** that your working directory is set to the **`eval`** folder in a virtual environment.
> - If you have further questions, feel free to open an issue.
> - If you want to use a local model, deploy it first according to **(Optional) `local_model.sh`**.

**Step 1: Prepare the parameter configuration**

The evaluation datasets we used are from [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench). The script expects the data to be at `data/QRData/benchmark/data/*.csv` and `data/DiscoveryBench/*.csv` (a quick check is sketched after the config example below).

You can also download our SFT models directly from Hugging Face: [DataMind-Qwen2.5-7B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-7B) and [DataMind-Qwen2.5-14B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-14B).

Here is an example:

**`config.yaml`**

```yaml
api_key: your_api_key # Your API key for a model served via API. Not needed for open-source models.
data_root: /path/to/your/project/DataMind/eval/data # Root directory for data (absolute path).
```
**`run_eval.sh`**

```bash
python do_generate.py \
  --model_name DataMind-Qwen2.5-7B \  # Model name to use.
  --check_model gpt-4o-mini \         # Check model to use.
  --output results \                  # Output directory path.
  --dataset_name QRData \             # Dataset to use: QRData or DiscoveryBench.
  --max_round 25 \                    # Maximum number of steps.
  --api_port 8000 \                   # API port number; required if a local model is used.
  --bidx 0 \                          # Begin index (inclusive); `None` means no restriction.
  --eidx None \                       # End index (exclusive); `None` means no restriction.
  --temperature 0.0 \                 # Temperature for sampling.
  --top_p 1 \                         # Top p for sampling.
  --add_random False                  # Whether to add random files.
```

**(Optional) `local_model.sh`**

```bash
CUDA_VISIBLE_DEVICES=$i python -m vllm.entrypoints.openai.api_server \
  --model $MODEL_PATH \               # Local model path.
  --served-model-name $MODEL_NAME \   # The model name you specify.
  --tensor-parallel-size $i \         # Tensor parallel size.
  --port $port                        # API port; must match `api_port` above.
```

**Step 2: Run the shell script**

**(Optional)** Deploy the local model if needed:

```bash
bash local_model.sh
```

Run the shell script to start the process:

```bash
bash run_eval.sh
```

## ✍️ Citation

If you find our work helpful, please use the following citation.

```bibtex
@article{zhu2025open,
  title={Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study},
  author={Zhu, Yuqi and Zhong, Yi and Zhang, Jintian and Zhang, Ziheng and Qiao, Shuofei and Luo, Yujie and Du, Lun and Zheng, Da and Chen, Huajun and Zhang, Ningyu},
  journal={arXiv preprint arXiv:2506.19794},
  year={2025}
}
```