Build for a local LLM with 70B parameters

Axelq · 9 January

Happy new year, everyone.

I want to build a system that can handle running an LLM with 70B parameters on files from my work that I don't want to upload online.

I have tried Llama 3 with 8B parameters on an RTX 4060, but it isn't all that "smart".

The goal is to be able to run Llama 3 70B (4-bit quantized model) with 1-2 RTX 3090 24GB or 4 refurbished Nvidia P40 (24GB) cards.

All suggestions are welcome, especially if you have built similar systems.

Edited 9 January by Axelq

Sheogorath · 9 January

Good evening. I have looked into this quite a bit and watched countless videos. I also built a low-cost server recently and am now looking at upgrading the GPUs.
The most economical solution is several RTX 4060 Ti 16GB cards. It scales up to 4, and up to 3 makes sense. 24GB of VRAM is very marginal. You want 32GB or more, and with 48GB from 3 4060 Tis you are fine. As a next step you go to 3 4070 Ti Supers.
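As a rough sanity check on these VRAM figures, here is a minimal sketch that estimates the weight footprint of a 70B model at 4 bits and compares it with the card combinations mentioned in this thread. It assumes exactly 70e9 parameters and ignores quantization metadata, the context window and runtime buffers, so real usage will be higher. The function name is only illustrative.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Card combinations mentioned in the thread: (label, total VRAM in GB).
configs = [
    ("1x RTX 3090", 24),
    ("2x RTX 3090", 48),
    ("3x RTX 4060 Ti 16GB", 48),
    ("4x Tesla P40", 96),
]

need = weights_gb(70, 4)  # 4-bit quantized 70B model
for label, vram in configs:
    left = vram - need
    status = "fits" if left > 0 else "does not fit"
    print(f"{label}: {vram} GB total, weights ~{need:.0f} GB, {status} ({left:+.0f} GB)")
```

Whatever is left over after the weights has to hold the context window, which is why a single 24GB card is not enough and 48GB is only comfortable for short contexts.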

My own "server" is in the photos. Have a look at the cooling hacks if you want ideas.

On the basis that you want plenty of PCI Express slots for 3-4 cards, I suggest the following.

https://www.skroutz.gr/s/40205887/ASRock-WRX80-Creator-R2-0-Wi-Fi-Motherboard-Extended-ATX-me-AMD-SP3-Socket.html
https://www.skroutz.gr/s/54359039/AMD-Ryzen-Threadripper-Pro-3945WX-4GHz-Epexergastis-12-Pyrinon-gia-Socket-sWRX8-Tray.html
https://www.skroutz.gr/s/20591891/G-Skill-Ripjaws-V-64GB-DDR4-RAM-me-4-Modules-4x16GB-kai-Tachytita-3600-gia-Desktop-F4-3600C18Q-64GVK.html

https://www.skroutz.gr/s/56618123/Corsair-RM1000x-1000W-Mayro-Trofodotiko-Ypologisti-Full-Modular.html

3x https://www.skroutz.gr/s/46260438/MSI-GeForce-RTX-4060-Ti-16GB-GDDR6-Ventus-2X-Black-OC-Karta-Grafikon-V517-005R.html Because they are 2-slot cards, 4 fit in total.

On the other hand, if you find 2 cheap 3090s or 4090s, I'm with you. You then go to the newer AM5 socket; you just need dual x8 slots, and the motherboard is again 450+ euros.

Edited 9 January by Sheogorath

Axelq · 9 January

I wanted to find out what is going on with Intel Arc, and whether it makes sense to wait for the B580 with 24GB, which is rumored to come at a fairly affordable price.

Sheogorath · 9 January

To be honest, I wouldn't go with Intel for DL/AI. It's still early.

hawkpilot · 10 January

How many t/s do the 4060 Tis give you on a 70B LLM (and which model are you using, if I may ask)?

Sheogorath · 10 January

2 hours ago, hawkpilot said:

How many t/s do the 4060 Tis give you on a 70B LLM (and which model are you using, if I may ask)?

I haven't done it myself. I have a video showing the scaling further up. I haven't managed to run a 70B model on my own machine, and I will probably do it with two old Teslas if I can find them, because I don't have 2 grand to spend on cards.

panatha1369 · 10 January

I was thinking about it too, but with a 3060 with 12GB of VRAM. From what I have been reading, though, you can also play around with AMD cards.

Sheogorath · 10 January

6 minutes ago, panatha1369 said:

I was thinking about it too, but with a 3060 with 12GB of VRAM. From what I have been reading, though, you can also play around with AMD cards.

Officially, though, only with the 6000 series and above. The 5000 series is unfortunately not officially supported.

I hope to try a 5500 XT soon, and I will report back.

daemonix · 20 April

On 10/01/2025 at 06:30, hawkpilot said:

How many t/s do the 4060 Tis give you on a 70B LLM (and which model are you using, if I may ask)?

I hope replying in English is OK, as writing in Greek is a problem for me. I can read Greek though.

I have built and run LLMs on a number of systems with various combinations of hardware (even on that weird NV GB200 ARM CPU/GPU thingy...).

1) You should be looking at VRAM bandwidth; it is the number one predictor of inference speed (t/s). At the moment 3090s are really good, even compared to some entry-level 50XX cards, as they have around 900 GB/s of memory bandwidth.

2) How big are your local files? a) Do you need to load them into the context window in one go? b) Are you looking to build a RAG solution?

This point is important for calculating the additional VRAM needed for the context window. A 70B model can easily use double the "4-bit Q" VRAM if you need to load 32K tokens into memory, for example.
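To put a number on the context-window cost, here is a minimal sketch of the usual KV-cache estimate. The defaults are my assumption of the Llama 3 70B layout (80 layers, 8 key/value heads with grouped-query attention, head dimension 128) with a 16-bit cache. It counts only the cache itself, not the compute buffers a runtime allocates on top, and models without grouped-query attention need several times more.

```python
def kv_cache_gb(tokens, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache size in GB: one key and one value vector per layer per cached token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 1e9

for ctx in (1_000, 6_000, 32_000):
    print(f"{ctx:>6} tokens -> {kv_cache_gb(ctx):.2f} GB of KV cache")
```

Add the result to the size of the quantized weights to get a lower bound on total VRAM for a given context length.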

Sheogorath · 20 April

Hello!

Nice to have someone here who has experience with this. How do the requirements for RAG differ from plain inference, VRAM-wise? Does having Optane memory make a difference when reading a local dataset?

hawkpilot · 20 April

Nice of you to participate in this post.

1) I currently have a 3090 Ti and intend to replace it with a 5090. The 3090 has decent VRAM bandwidth (around 1 TB/s if I'm not mistaken).

2) a) My local files are medium to large in size and I need them in one go, b) yes.

Thank you

@topos · 20 April

Might multiple 5060 Ti 16GB cards be better than the 4060 Ti 16GB because of GDDR7?

They probably also come out cheaper now, bought new. (From nbb you can find them for ~450 euros, and with a business tax ID they come to ~380 euros.)

daemonix · 20 April

1 hour ago, Sheogorath said:

Hello!

Nice to have someone here who has experience with this. How do the requirements for RAG differ from plain inference, VRAM-wise? Does having Optane memory make a difference when reading a local dataset?

1) With "full knowledge" in memory (the context window), you feed your full text file(s) and your question in one go. So you need a context window (or max context length) big enough for the file. (You need to count your file in tokens; I only know terminal tools for this.)

2) RAG is another story completely (and way more complex too). You build a knowledge database from all your files and then use "small parts" of the files when needed to "help" the LLM remember, or to "augment" (enhance, maybe) the answer.

You only care about VRAM size and bandwidth. A simple SSD, even a SATA one, is OK.
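The difference between the two approaches can be sketched in a few lines. This is only a toy illustration of the retrieval step of RAG: a real setup would use an embedding model and a vector database, whereas here plain word-count cosine similarity stands in for both. All function names and sample sentences are made up for the example.

```python
import math
import re
from collections import Counter

def chunk(text: str, size: int = 40) -> list[str]:
    """Split a long text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def vectorize(text: str) -> Counter:
    """Bag-of-words vector: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    return sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)[:k]

# Stand-in "knowledge database"; for real files, call chunk() on each one first.
docs = [
    "The RTX 3090 has 24 GB of VRAM.",
    "Invoices are archived every quarter.",
    "NVLink is optional for inference.",
]
question = "How much VRAM does the 3090 have?"
context = retrieve(question, docs, k=1)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
print(prompt)
```

Only the retrieved chunks go into the prompt, so the context window (and the VRAM behind it) stays small no matter how large the file collection is.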

39 minutes ago, hawkpilot said:

Nice of you to participate in this post.

1) I currently have a 3090 Ti and intend to replace it with a 5090. The 3090 has decent VRAM bandwidth (around 1 TB/s if I'm not mistaken).

2) a) My local files are medium to large in size and I need them in one go, b) yes.

Thank you

It might be better to get a second 3090 Ti. VRAM is way more important for LLMs. 48GB is not much at all.
Asking "why is the sky blue" is very easy, as you can do it with a 1000-token context window. So total VRAM usage is roughly the "size of the model @ 4-bit".
If you drop in a 1000-line text file and then ask a question... let's say 6000 tokens for the file. You need 5 more GB for the management of the window (depending on the size of the model, in billions of parameters).

2) Find a tool for your OS that counts tokens, so you can test your files.
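If no tokenizer tool is at hand, a crude estimate is still useful for sizing. The sketch below assumes roughly 4 characters per token, a common rule of thumb for English text. That ratio is an assumption, not a property of any particular model, and other languages or source code can differ a lot, so use the model's own tokenizer when you need an exact count.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate from the character count, rounded up."""
    return math.ceil(len(text) / chars_per_token)

def estimate_file_tokens(path: str) -> int:
    """Same estimate for a text file on disk."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return estimate_tokens(f.read())

sample = "Why is the sky blue? " * 100
print(f"~{estimate_tokens(sample)} tokens (estimate)")
```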

35 minutes ago, @topos said:

Might multiple 5060 Ti 16GB cards be better than the 4060 Ti 16GB because of GDDR7?

They probably also come out cheaper now, bought new. (From nbb you can find them for ~450 euros, and with a business tax ID they come to ~380 euros.)

No, I don't think so! If you check the memory bandwidth on Wikipedia for 30XX, 40XX, etc., you will see. In the last couple of days a guy on Reddit did a test with a 5070, I think, and the 3090 is still the best option.

It might sound weird, but "running" (I mean inference, not training) is mostly a matter of memory size and speed. The 3090 is the hot topic in all the local LLM groups right now.
EDIT: The 5070 Ti has good speed, but 16GB is nothing. Even 24GB can hardly serve an OK model. The small models are really not meant to do anything smart. Do test which level of intelligence works for your specific work!
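A back-of-the-envelope way to see why bandwidth dominates: generating each token requires reading essentially all the weights once, so memory bandwidth divided by model size gives a ceiling on t/s. The bandwidth values below are approximate spec-sheet figures as I remember them, so treat them as assumptions and check them before buying. The 35 GB model size is 70B parameters at 4 bits with no overhead.

```python
# Approximate memory bandwidth in GB/s (assumed spec-sheet values).
BANDWIDTH_GBPS = {
    "RTX 4060 Ti 16GB": 288,
    "RTX 5060 Ti 16GB": 448,
    "RTX 5070 Ti": 896,
    "RTX 3090": 936,
    "RTX 3090 Ti": 1008,
}

def max_tokens_per_s(bandwidth_gbps: float, model_gb: float) -> float:
    """Ceiling on generation speed if every token reads all weights once."""
    return bandwidth_gbps / model_gb

MODEL_GB = 35  # 70B parameters at 4 bits, ignoring overhead
for card, bw in sorted(BANDWIDTH_GBPS.items(), key=lambda kv: kv[1]):
    print(f"{card:<18} {bw:>5} GB/s  <= {max_tokens_per_s(bw, MODEL_GB):.1f} t/s")
```

When the layers are split across several identical cards, they are processed one card after another, so the ceiling stays close to the single-card figure. Real throughput is lower than this bound.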

Edited 20 April by daemonix

hawkpilot · 20 April

14 minutes ago, daemonix said:

...

It might be better to get a second 3090 Ti. VRAM is way more important for LLMs. 48GB is not much at all.
Asking "why is the sky blue" is very easy, as you can do it with a 1000-token context window. So total VRAM usage is roughly the "size of the model @ 4-bit".
If you drop in a 1000-line text file and then ask a question... let's say 6000 tokens for the file. You need 5 more GB for the management of the window (depending on the size of the model, in billions of parameters).

2) Find a tool for your OS that counts tokens, so you can test your files.

Yeah, that's what I have read and what I think too. Getting another 3090 and using NVLink is probably the best option, aside from the power requirements.

daemonix · 20 April

9 minutes ago, hawkpilot said:

Yeah, that's what I have read and what I think too. Getting another 3090 and using NVLink is probably the best option, aside from the power requirements.

BTW, NVLink is not required. Communication between the cards over PCIe is more than OK for LLMs. The model layers are split between the GPUs, and not much data flows between them.

I have an 8-way system that only has NVLink per pair of GPUs (4 pairs of 2 GPUs). Almost zero difference with NVLink off.
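To illustrate why so little data crosses the bus, here is a minimal sketch of a layer split: each GPU gets a contiguous block of layers in proportion to its VRAM, and only the activations at each block boundary travel between cards. The 80-layer count is my assumption for a Llama 3 70B class model, VRAM sizes are given as whole GB, and actual runtimes use their own placement logic.

```python
def split_layers(n_layers: int, vram_gb: list[int]) -> list[range]:
    """Assign contiguous layer ranges to GPUs in proportion to their VRAM."""
    total = sum(vram_gb)
    counts = [n_layers * v // total for v in vram_gb]
    # Hand out layers lost to rounding down, one per GPU from the front.
    for i in range(n_layers - sum(counts)):
        counts[i % len(counts)] += 1
    ranges, start = [], 0
    for c in counts:
        ranges.append(range(start, start + c))
        start += c
    return ranges

for gpu, layers in enumerate(split_layers(80, [24, 24])):
    print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")
```

With two cards there is a single boundary, so per generated token only one hidden-state vector has to be handed from the first GPU to the second.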

EDIT: You can run a 3090 at a 90% power limit with almost no loss. If you google it, there are posts from people running 50-60 watts lower with a 5% speed loss for LLMs.
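For reference, a power cap like this is normally set with `nvidia-smi`. The sketch below only builds and prints the command rather than running it, since applying it needs administrator rights. The 350 W default is an assumption based on the reference 3090 board power, so check your own card's default limit first.

```python
def power_limit_command(gpu_index: int, default_watts: float, fraction: float) -> list[str]:
    """Build an nvidia-smi command that caps one GPU at a fraction of its default power."""
    watts = round(default_watts * fraction)
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]

# Assumed 350 W default for a reference RTX 3090, capped to 90%.
cmd = power_limit_command(gpu_index=0, default_watts=350, fraction=0.90)
print(" ".join(cmd))
```

The limit resets on reboot, so it has to be reapplied at startup if you want it to stick.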

Edited 20 April by daemonix
