2 | 2 | <feed xmlns="http://www.w3.org/2005/Atom"> |
3 | 3 | <id>/r/LocalLLaMA/.rss</id> |
4 | 4 | <title>LocalLlama</title> |
5 | | - <updated>2026-03-28T16:28:56+00:00</updated> |
| 5 | + <updated>2026-03-28T16:48:53+00:00</updated> |
6 | 6 | <link href="https://old.reddit.com/r/LocalLLaMA/" rel="alternate"/> |
7 | 7 | <generator uri="https://lkiesow.github.io/python-feedgen" version="1.0.0">python-feedgen</generator> |
8 | 8 | <icon>https://www.redditstatic.com/icon.png/</icon> |
|
34 | 34 | <published>2026-03-27T20:43:24+00:00</published> |
35 | 35 | </entry> |
36 | 36 | <entry> |
37 | | - <id>t3_1s64eux</id> |
38 | | - <title>Which is better : one highly capable LLM (100+B) or many smaller LLMs (>20B)</title> |
39 | | - <updated>2026-03-28T16:08:30+00:00</updated> |
| 37 | + <id>t3_1s65aif</id> |
| 38 | + <title>I started extracting system prompts at 16; now the repo is one of GitHub’s 50 most-starred.</title> |
| 39 | + <updated>2026-03-28T16:42:07+00:00</updated> |
40 | 40 | <author> |
41 | | - <name>/u/More_Chemistry3746</name> |
42 | | - <uri>https://old.reddit.com/user/More_Chemistry3746</uri> |
| 41 | + <name>/u/Independent-Box-898</name> |
| 42 | + <uri>https://old.reddit.com/user/Independent-Box-898</uri> |
43 | 43 | </author> |
44 | | - <content type="html"><!-- SC_OFF --><div class="md"><p>I'm thinking about either having multiple PCs that run smaller models, or one powerful machine that can run a large model. Let's assume both the small and large models run in Q4 with sufficient memory and good performance</p> </div><!-- SC_ON --> &#32; submitted by &#32; <a href="https://old.reddit.com/user/More_Chemistry3746"> /u/More_Chemistry3746 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1s64eux/which_is_better_one_highly_capable_llm_100b_or/">[link]</a></span> &#32; <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1s64eux/which_is_better_one_highly_capable_llm_100b_or/">[comments]</a></span></content> |
45 | | - <link href="https://old.reddit.com/r/LocalLLaMA/comments/1s64eux/which_is_better_one_highly_capable_llm_100b_or/"/> |
| 44 | + <content type="html"><!-- SC_OFF --><div class="md"><p>Hi!</p> <p>I just published a new post about how my repo, system-prompts-and-models-of-ai-tools, ended up in GitHub’s top 50 most-starred projects of all time.</p> <p>What started as a curiosity project turned into a repo of system prompts from major AI products like Cursor, Devin, Windsurf, Claude Code, GitHub Copilot, v0, Lovable, Replit, Perplexity, Manus, Trae, and others.</p> <p>The post isn’t really about the star count itself, it’s more about what I learned from reading hundreds of system prompts and watching how people started using the repo.</p> <p>A few of the main takeaways:</p> <ul> <li>most major AI tools are much more similar under the hood than they look from the outside</li> <li>prompts aren’t just “wrappers” anymore: in many cases they’re part of the actual product logic</li> <li>the industry’s prompt-level defenses are starting to converge in ways that raise real security questions</li> </ul> <p>I also touch on the ethics of publishing these prompts, the weird maintenance challenge of keeping a repo like this updated, and why I think prompt-level security matters a lot more now that agents can use tools and take real actions.</p> <p>If you’re interested in AI agents, prompt engineering, transparency, or just how these systems are actually put together, I think you’ll find it interesting.</p> <p>Links:</p> <ul> <li>Blog post: <a href="https://medium.com/@lucknite/how-a-niche-ai-repo-ended-up-in-githubs-top-50-d66e4338b380">How a Niche AI Repo Ended Up in GitHub’s Top 50</a></li> <li>Repo: <a href="https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools">https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools</a></li> </ul> </div><!-- SC_ON --> &#32; submitted by &#32; <a href="https://old.reddit.com/user/Independent-Box-898"> /u/Independent-Box-898 </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1s65aif/i_started_extracting_system_prompts_at_16_now_the/">[link]</a></span> &#32; <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1s65aif/i_started_extracting_system_prompts_at_16_now_the/">[comments]</a></span></content> |
| 45 | + <link href="https://old.reddit.com/r/LocalLLaMA/comments/1s65aif/i_started_extracting_system_prompts_at_16_now_the/"/> |
46 | 46 | <category term="LocalLLaMA" label="r/LocalLLaMA"/> |
47 | | - <published>2026-03-28T16:08:30+00:00</published> |
| 47 | + <published>2026-03-28T16:42:07+00:00</published> |
48 | 48 | </entry> |
49 | 49 | <entry> |
50 | 50 | <id>t3_1s57ky1</id> |
|
293 | 293 | <category term="LocalLLaMA" label="r/LocalLLaMA"/> |
294 | 294 | <published>2026-03-28T14:47:31+00:00</published> |
295 | 295 | </entry> |
296 | | - <entry> |
297 | | - <id>t3_1s62g5v</id> |
298 | | - <title>A simple explanation of the key idea behind TurboQuant</title> |
299 | | - <updated>2026-03-28T14:53:13+00:00</updated> |
300 | | - <author> |
301 | | - <name>/u/-p-e-w-</name> |
302 | | - <uri>https://old.reddit.com/user/-p-e-w-</uri> |
303 | | - </author> |
304 | | - <content type="html"><!-- SC_OFF --><div class="md"><p>TurboQuant (<a href="https://arxiv.org/abs/2504.19874">Zandieh et al. 2025</a>) has been all the rage in the past two days, and I've seen lots of comments here attempting to explain the magic behind it. Many of those comments boil down to &quot;dude, it's polar coordinates!!!&quot;, and that's really misleading. The most important part has nothing to do with polar coordinates (although they are emphasized in Google's blog post, so the confusion is understandable).</p> <p>TurboQuant is a vector quantization algorithm. It turns a vector of numbers into another vector of numbers that takes up less memory.</p> <p>Quantization is a fairly basic operation. If you have an <em>n</em>-dimensional vector that looks like this:</p> <pre><code>0.2374623 0.7237428 0.5434738 0.1001233 ... </code></pre> <p>Then a quantized version of that vector may look like this:</p> <pre><code>0.237 0.723 0.543 0.100 ... </code></pre> <p>Notice how I simply shaved off the last four digits of each number? That's already an example of a crude quantization process. Obviously, there are far more sophisticated schemes, including grouping coefficients in blocks, adaptive thresholds, calibrated precision based on experimental data etc., but at its core, quantization always involves reducing coefficient precision.</p> <p>Here is the key idea behind TurboQuant: <strong>Before quantizing a vector, we randomly rotate it in the <em>n</em>-dimensional space it resides in.</strong> The corresponding counter-rotation is applied during dequantization.</p> <p>That's it.</p> <p>Now you probably feel that I must have left out an important detail. Surely the rotation can't be <em>completely</em> random? Maybe it's sampled from a particular distribution, or somehow input-dependent? Or perhaps there is another operation that goes hand in hand with it?</p> <p>Nope. I didn't leave anything out. <em>Just applying a random rotation to the vector dramatically improves quantization performance.</em></p> <h2>But why?</h2> <p>Because <strong>the magnitudes of the coefficients of state vectors in language models aren't distributed uniformly among the vector dimensions.</strong> It's very common to see vectors that look like this:</p> <pre><code>0.0000023 0.9999428 &lt;-- !!! 0.0000738 0.0000003 ... </code></pre> <p>This phenomenon has many names, and it shows up everywhere in transformer research. You can read about &quot;massive activations&quot; (<a href="https://arxiv.org/abs/2402.17762">Sun et al. 2024</a>) and &quot;attention sinks&quot; (e.g. <a href="https://arxiv.org/abs/2410.10781">Gu et al. 2024</a>) for a deeper analysis.</p> <p>What matters for the purposes of this explanation is: <strong>Vectors with this type of quasi-sparse structure are terrible targets for component quantization.</strong> Reducing precision in such a vector effectively turns the massive component into 1 (assuming the vector is normalized), and all other components into 0. That is, quantization &quot;snaps&quot; the vector to its nearest cardinal direction. This collapses the information content of the vector, as identifying a cardinal direction takes only <em>log2(2n)</em> bits, whereas the quantized vector can hold <em>kn</em> bits (assuming <em>k</em> bits per component).</p> <p>And that's where the random rotation comes in! Since most directions aren't near a cardinal direction (and this only becomes more true as the number of dimensions increases), a random rotation almost surely results in a vector that distributes the coefficient weight evenly across all components, meaning that quantization doesn't cause information loss beyond that expected from precision reduction.</p> <p>The TurboQuant paper proves this mathematically, and gives an exact description of the distribution behavior, but the intuitive understanding is much more straightforward than that.</p> <p>This idea isn't new in principle (QuIP is another quantization method that employs a similar trick), but TurboQuant combines it with a second step that eliminates biases that arise when quantized vectors that are optimal in a certain sense (MSE) are used to compute inner products, which is what happens in attention blocks. See the paper if you're interested in the details.</p> </div><!-- SC_ON --> &#32; submitted by &#32; <a href="https://old.reddit.com/user/-p-e-w-"> /u/-p-e-w- </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1s62g5v/a_simple_explanation_of_the_key_idea_behind/">[link]</a></span> &#32; <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1s62g5v/a_simple_explanation_of_the_key_idea_behind/">[comments]</a></span></content> |
305 | | - <link href="https://old.reddit.com/r/LocalLLaMA/comments/1s62g5v/a_simple_explanation_of_the_key_idea_behind/"/> |
306 | | - <category term="LocalLLaMA" label="r/LocalLLaMA"/> |
307 | | - <published>2026-03-28T14:53:13+00:00</published> |
308 | | - </entry> |
309 | 296 | <entry> |
310 | 297 | <id>t3_1s60wel</id> |
311 | 298 | <title>Me waiting for TurboQuant be like</title> |
|
319 | 306 | <category term="LocalLLaMA" label="r/LocalLLaMA"/> |
320 | 307 | <published>2026-03-28T13:51:00+00:00</published> |
321 | 308 | </entry> |
| 309 | + <entry> |
| 310 | + <id>t3_1s62g5v</id> |
| 311 | + <title>A simple explanation of the key idea behind TurboQuant</title> |
| 312 | + <updated>2026-03-28T14:53:13+00:00</updated> |
| 313 | + <author> |
| 314 | + <name>/u/-p-e-w-</name> |
| 315 | + <uri>https://old.reddit.com/user/-p-e-w-</uri> |
| 316 | + </author> |
| 317 | + <content type="html"><!-- SC_OFF --><div class="md"><p>TurboQuant (<a href="https://arxiv.org/abs/2504.19874">Zandieh et al. 2025</a>) has been all the rage in the past two days, and I've seen lots of comments here attempting to explain the magic behind it. Many of those comments boil down to &quot;dude, it's polar coordinates!!!&quot;, and that's really misleading. The most important part has nothing to do with polar coordinates (although they are emphasized in Google's blog post, so the confusion is understandable).</p> <p>TurboQuant is a vector quantization algorithm. It turns a vector of numbers into another vector of numbers that takes up less memory.</p> <p>Quantization is a fairly basic operation. If you have an <em>n</em>-dimensional vector that looks like this:</p> <pre><code>0.2374623 0.7237428 0.5434738 0.1001233 ... </code></pre> <p>Then a quantized version of that vector may look like this:</p> <pre><code>0.237 0.723 0.543 0.100 ... </code></pre> <p>Notice how I simply shaved off the last four digits of each number? That's already an example of a crude quantization process. Obviously, there are far more sophisticated schemes, including grouping coefficients in blocks, adaptive thresholds, calibrated precision based on experimental data etc., but at its core, quantization always involves reducing coefficient precision.</p> <p>Here is the key idea behind TurboQuant: <strong>Before quantizing a vector, we randomly rotate it in the <em>n</em>-dimensional space it resides in.</strong> The corresponding counter-rotation is applied during dequantization.</p> <p>That's it.</p> <p>Now you probably feel that I must have left out an important detail. Surely the rotation can't be <em>completely</em> random? Maybe it's sampled from a particular distribution, or somehow input-dependent? Or perhaps there is another operation that goes hand in hand with it?</p> <p>Nope. I didn't leave anything out. <em>Just applying a random rotation to the vector dramatically improves quantization performance.</em></p> <h2>But why?</h2> <p>Because <strong>the magnitudes of the coefficients of state vectors in language models aren't distributed uniformly among the vector dimensions.</strong> It's very common to see vectors that look like this:</p> <pre><code>0.0000023 0.9999428 &lt;-- !!! 0.0000738 0.0000003 ... </code></pre> <p>This phenomenon has many names, and it shows up everywhere in transformer research. You can read about &quot;massive activations&quot; (<a href="https://arxiv.org/abs/2402.17762">Sun et al. 2024</a>) and &quot;attention sinks&quot; (e.g. <a href="https://arxiv.org/abs/2410.10781">Gu et al. 2024</a>) for a deeper analysis.</p> <p>What matters for the purposes of this explanation is: <strong>Vectors with this type of quasi-sparse structure are terrible targets for component quantization.</strong> Reducing precision in such a vector effectively turns the massive component into 1 (assuming the vector is normalized), and all other components into 0. That is, quantization &quot;snaps&quot; the vector to its nearest cardinal direction. This collapses the information content of the vector, as identifying a cardinal direction takes only <em>log2(2n)</em> bits, whereas the quantized vector can hold <em>kn</em> bits (assuming <em>k</em> bits per component).</p> <p>And that's where the random rotation comes in! Since most directions aren't near a cardinal direction (and this only becomes more true as the number of dimensions increases), a random rotation almost surely results in a vector that distributes the coefficient weight evenly across all components, meaning that quantization doesn't cause information loss beyond that expected from precision reduction.</p> <p>The TurboQuant paper proves this mathematically, and gives an exact description of the distribution behavior, but the intuitive understanding is much more straightforward than that.</p> <p>This idea isn't new in principle (QuIP is another quantization method that employs a similar trick), but TurboQuant combines it with a second step that eliminates biases that arise when quantized vectors that are optimal in a certain sense (MSE) are used to compute inner products, which is what happens in attention blocks. See the paper if you're interested in the details.</p> </div><!-- SC_ON --> &#32; submitted by &#32; <a href="https://old.reddit.com/user/-p-e-w-"> /u/-p-e-w- </a> <br /> <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1s62g5v/a_simple_explanation_of_the_key_idea_behind/">[link]</a></span> &#32; <span><a href="https://old.reddit.com/r/LocalLLaMA/comments/1s62g5v/a_simple_explanation_of_the_key_idea_behind/">[comments]</a></span></content> |
| 318 | + <link href="https://old.reddit.com/r/LocalLLaMA/comments/1s62g5v/a_simple_explanation_of_the_key_idea_behind/"/> |
| 319 | + <category term="LocalLLaMA" label="r/LocalLLaMA"/> |
| 320 | + <published>2026-03-28T14:53:13+00:00</published> |
| 321 | + </entry> |
322 | 322 | <entry> |
323 | 323 | <id>t3_1mpk2va</id> |
324 | 324 | <title>Announcing LocalLlama discord server & bot!</title> |
|