How to Install Qwen-2.5 Model on a Local Server Using Hugging Face
Free Installation Guide - Step by Step Instructions Inside!
Problem
We want to install and run the Qwen-2.5 model on our local server using Hugging Face, but are unsure how to properly set up the environment, manage dependencies, and execute a prompt.
Solution
We will go through the step-by-step process of:
- Setting up the local server with required dependencies.
- Installing Hugging Face Transformers & PyTorch for model inference.
- Downloading and loading the Qwen-2.5 model for text generation.
- Running the model locally and testing an AI-generated response.
1. System Requirements
Before installation, ensure that the local server has the following (a quick way to check these is shown right after the list):
- Operating System: Ubuntu 22.04 (or similar)
- GPU Support (Optional but Recommended): NVIDIA GPU with CUDA support
- RAM: At least 16GB (32GB+ recommended for large models)
- Disk Space: At least 50GB free for model storage
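If you are not sure whether an existing machine meets these requirements, a few lines of standard-library Python can check them (Ubuntu 22.04 ships with Python 3, so this works even before step 2; the GPU check only succeeds once the NVIDIA driver is installed):

import shutil
import subprocess

# Free disk space on the root filesystem (the model needs roughly 50 GB)
total, used, free = shutil.disk_usage("/")
print(f"Free disk space: {free / 1e9:.1f} GB")

# Total RAM (Linux-specific: read from /proc/meminfo)
with open("/proc/meminfo") as f:
    mem_total_kb = int(f.readline().split()[1])
print(f"Total RAM: {mem_total_kb / 1e6:.1f} GB")

# List NVIDIA GPUs (requires the driver installed in step 2)
try:
    print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)
except FileNotFoundError:
    print("nvidia-smi not found - no NVIDIA driver installed yet")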
2. Install System Dependencies
Start by updating the system and installing required packages:
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-pip git
For NVIDIA GPU, install CUDA & cuDNN:
sudo apt install -y nvidia-driver-525
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Verify GPU installation:
nvidia-smi
If you see GPU details, it's installed correctly.
3. Set Up a Virtual Environment (Recommended)
To isolate dependencies, create and activate a virtual environment:
python3 -m venv qwen_env
source qwen_env/bin/activate
4. Installing Hugging Face Transformers & Dependencies
Now, install Hugging Face Transformers, PyTorch, and other required libraries:
pip install torch transformers accelerate
pip install sentencepiece
Confirm installation:
python -c "import torch; print(torch.cuda.is_available())"
If it prints True, CUDA is enabled for GPU acceleration.
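You can also confirm that Transformers imported correctly and note which version was installed (Qwen-2.5 needs a reasonably recent 4.x release):

python -c "import transformers; print(transformers.__version__)"

If this prints a version number without errors, the library is ready to use.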
5. Download the Qwen-2.5 Model from Hugging Face
Install the Hugging Face Hub client and log in if needed:
pip install huggingface_hub
huggingface-cli login  # Optional, required for some models
Then, download and load the Qwen-2.5 model in Python:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # Change to Qwen2.5-14B if needed

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

print("Model loaded successfully!")
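Note that from_pretrained downloads and caches the weights automatically on first use. If you would rather pre-fetch them (for example, before taking the server offline), here is a minimal sketch using huggingface_hub's snapshot_download; the local_dir path is only an example:

from huggingface_hub import snapshot_download

# Pre-download the full Qwen2.5-7B repository; omit local_dir to use the default HF cache
snapshot_download(repo_id="Qwen/Qwen2.5-7B", local_dir="/opt/models/qwen2.5-7b")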
6. Running Qwen-2.5 Locally & Executing a Prompt
Now, let’s test text generation using Qwen-2.5:
def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
    output = model.generate(**inputs, max_length=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage
print(generate_text("What is the meaning of life?"))
If the setup is correct, we should see an AI-generated response.
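Qwen/Qwen2.5-7B is the base (non-chat) checkpoint. If you load the instruct variant, Qwen/Qwen2.5-7B-Instruct, you will usually get better answers by wrapping the prompt in the model's chat template. A minimal sketch, assuming the tokenizer and model were loaded from the instruct checkpoint in step 5:

# Assumes model_name = "Qwen/Qwen2.5-7B-Instruct" was used in step 5
messages = [{"role": "user", "content": "What is the meaning of life?"}]

# Format the conversation the way the instruct model expects
chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(chat_prompt, return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))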
7. Optimizing Performance (For Large Models)
Enable Half-Precision (FP16) for Faster Inference
Modify the model loading to use torch_dtype=torch.float16:
import torch

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
Use DeepSpeed or BitsAndBytes for Memory Efficiency
Install additional tools for better memory usage:
pip install bitsandbytes deepspeed
Then, modify model loading:
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
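If 8-bit loading still does not fit in memory, bitsandbytes also supports 4-bit quantization through the same config object. A sketch of the 4-bit variant; the quantization type and compute dtype below are common choices, not requirements:

import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization reduces memory further than 8-bit, at some quality cost
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config_4bit, device_map="auto")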
8. Running Qwen-2.5 as an API (Optional)
To access Qwen-2.5 via an API, use FastAPI:
pip install fastapi uvicorn
Create a simple API (app.py):
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Request body schema, so the prompt is read from the JSON payload
class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
async def generate(request: GenerateRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
    output = model.generate(**inputs, max_length=200)
    return {"response": tokenizer.decode(output[0], skip_special_tokens=True)}

# Run API
# uvicorn app:app --host 0.0.0.0 --port 8000
This allows you to send prompts via HTTP requests:
curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{"prompt": "Tell me about quantum physics"}'
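The same request can be made from Python; here is a small client sketch using the requests library (assuming the API is running on localhost:8000 as above):

import requests

# Send a prompt to the local Qwen-2.5 API and print the generated text
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Tell me about quantum physics"},
)
resp.raise_for_status()
print(resp.json()["response"])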
Conclusion
Hosting Qwen-2.5 on a local server provides:
- Full control over deployment and performance tuning
- Lower long-term costs vs. cloud-hosted models
- Better security since no data leaves your server
For better performance, enable FP16, quantization, or DeepSpeed optimizations.
Ready to transform your business with our technology solutions? Contact us today to leverage our AI/ML expertise.