this post was submitted on 11 Apr 2024
47 points (79.7% liked)

Technology

59132 readers
2951 users here now

This is a most excellent place for technology news and articles.


Our Rules


  1. Follow the lemmy.world rules.
  2. Only tech related content.
  3. Be excellent to each another!
  4. Mod approved content bots can post up to 10 articles per day.
  5. Threads asking for personal tech support may be deleted.
  6. Politics threads may be removed.
  7. No memes allowed as posts, OK to post as comments.
  8. Only approved bots from the list below, to ask if your bot can be added please contact us.
  9. Check for duplicates before posting, duplicates may be removed

Approved Bots


founded 1 year ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
[–] [email protected] 1 points 6 months ago

Well that's a loaded question.

There are probably some websites that let you try out the model while they run it on their own equipment (or have it rented out through Amazon, etc.). But the biggest advantage to these models is being able to run it locally if you have the hardware to handle it (beefy GPU for quicker responses and a lot of RAM).

To quickly answer your question, you can download the model from here:
https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF
I would recommend Q5_K_M.

But you'll also need some software to run it.

A large number of users are using "Text-Generation-WebUI" https://github.com/oobabooga/text-generation-webui
There's also "LM Studio" https://lmstudio.ai/
Ollama https://github.com/ollama/ollama
And more.

I know that LM Studio supports Both NVIDIA and AMD GPUs.
Text-Generation-WebUI can support AMD GPUs as well, it just requires some additional setup to get it working.

Some things to keep in mind...
Hardware requirements:
- RAM is the biggest limiting factor with which model you can run while your GPU/CPU will decide how quickly the LLM can respond.
- If you can fit the entire model inside of your GPU's VRAM you'll get the most speed. In this case I would suggest using a GPTQ model instead of GGUF https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GPTQ
- Even the newest consumer grade GPUs only have 24GB of VRAM right now (RTX 4090, RTX 3090, and RX 7900 XTX). And the next generation of Consumer GPUs are looking like they will be capped at 24GB of VRAM as well unless AMD decides this is their way of competing with NVIDIA.
GGUF models let you compensate for VRAM limitations by loading the model first in VRAM and anything leftover will get loaded into system RAM.

Context Length: Think of an LLM like something that only has a fixed amount of short term memory. The bigger you set the context length, the more short term memory you can give it (the maximum length you can set depends on the model you're using and setting it to the max also requires more RAM). Mixtral 8x7B models have a Max context length of 32k.