Run 35B LLMs on Dual Pascal GPUs with QLoRA

Hi HN,

  I built a system to run 35B parameter language models on older Pascal GPUs (P100 +
  GTX 1080 Ti) using multi-GPU memory spillover.

  Problem: Most LLM inference tools (Ollama, LM Studio) are limited to a single GPU's
  VRAM (roughly 13B models max on a 16GB card). If you have multiple older GPUs, the
  second one sits idle.

  Solution: Multi-GPU + CPU memory spillover with QLoRA 4-bit quantization. The system
  automatically distributes layers across GPU0 → GPU1 → CPU RAM, enabling 35B models on
  hardware that normally tops out around 13B.
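
  Under the hood this is standard HuggingFace tooling: bitsandbytes handles the 4-bit
  NF4 quantization and Accelerate builds the layer-to-device map. A minimal sketch of
  the idea (not the repo's exact code; the model name and per-device memory caps below
  are illustrative):

    # Sketch: 4-bit NF4 load with spillover across GPU0 -> GPU1 -> CPU RAM.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_name = "Qwen/Qwen2.5-14B-Instruct"

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # NF4 data type, as in QLoRA
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    )

    # Cap each device; whatever doesn't fit spills to the next one.
    # Values are rough guesses for a P100 (16GB) + GTX 1080 Ti (11GB) box.
    max_memory = {0: "15GiB", 1: "10GiB", "cpu": "48GiB"}

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",                      # Accelerate assigns layers to devices
        max_memory=max_memory,
    )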

  Benchmarks (P100 16GB + GTX 1080 Ti 11GB):
  - Qwen-14B: 13.7 tokens/sec (9.4GB VRAM)
  - OPT-30B: 5.4 tokens/sec (15.2GB VRAM)
  - CodeLlama-34B: 0.8 tokens/sec (16.7GB VRAM)
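
  As a rough sanity check on these numbers: 4-bit weights take about 0.5 bytes per
  parameter, so a 34B model needs roughly 17 GB for weights alone, more than fits on
  either card by itself; that gap is what the spillover covers.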

  Quick start:
    docker pull rickeshtn/large-model-international_release:latest
    docker run -it --rm --runtime=nvidia --gpus all --ipc=host \
      --ulimit memlock=-1 --ulimit stack=268435456 \
      -v $(pwd):/workspace -e HF_HOME=/workspace/model_cache \
      rickeshtn/large-model-international_release:latest \
      python /app/interactive_chat.py --model-name Qwen/Qwen2.5-14B-Instruct

  Technical details:
  - 4-bit NF4 quantization (as used in QLoRA; ~75% memory reduction vs FP16)
  - HuggingFace Transformers + Accelerate + bitsandbytes
  - Automatic device mapping with CPU offload
  - Interactive chat with conversation persistence (see the sketch below)
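
  A rough illustration of that last bullet, conversation persistence (this is not the
  repo's interactive_chat.py, just a sketch of the idea; the history-file path is made
  up):

    # Illustrative chat loop that persists the running conversation to JSON.
    import json
    from pathlib import Path

    HISTORY_FILE = Path("conversation.json")    # hypothetical location

    def chat(model, tokenizer, max_new_tokens=256):
        history = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else []
        while True:
            user = input("you> ")
            if user.strip().lower() in {"exit", "quit"}:
                break
            history.append({"role": "user", "content": user})
            # Apply the model's chat template to the full history and generate a reply.
            inputs = tokenizer.apply_chat_template(
                history, add_generation_prompt=True, return_tensors="pt"
            ).to(model.device)
            output = model.generate(inputs, max_new_tokens=max_new_tokens)
            reply = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
            print(reply)
            history.append({"role": "assistant", "content": reply})
            HISTORY_FILE.write_text(json.dumps(history, indent=2))   # persist every turn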

  GitHub: https://github.com/rickeshtn/locallm-pascal
  Docker Hub: https://hub.docker.com/r/rickeshtn/large-model-international_release

  34 users are already running it. Happy to answer technical questions!


Comments URL: https://news.ycombinator.com/item?id=45498552
