Gemini 2.5 Pro Rate-Limited on Free Tier After 10–15 Prompts (Also via API) #2436
-
Hey guys, on the free tier Gemini gives access to 2.5 Pro for a few prompts (10–15), but then quickly rate-limits and switches to Flash. This happens even via the API. Is this intended behavior? Can we get clarity on, or control over, model access? Thanks.
Replies: 11 comments 12 replies
-
Likely intended; there are some tickets here about it. It's a very reasonable limitation, business-wise.
-
I can understand if these limitations are due to the huge load from the influx of users. What I cannot understand is the complete lack of transparency or specifics on this issue. Why is it that in all the issues where people ask about this, there are zero comments from the developers or from Google, while the issues suggesting transparency improvements are simply closed without anyone addressing the core of the problem?

Meanwhile, the official documentation still says there are 1,000 requests per day, but it's unclear what these limits actually apply to. Is it the Pro model? Flash? Are search queries also limited to 15 requests...?

I was thrilled when I first saw the tool, thinking this was a real game changer from Google. But now it's starting to look more and more like just another marketing campaign with blatant deception...
-
From my brief testing, I think the way Gemini CLI works can sometimes include huge amounts of data as context, and so quickly consume the free limit for 2.5 Pro. When using it with a large existing codebase, I noticed its "context left" indicator went from 99% to less than 40% without having received all that much output. Since 40% left would mean around 600,000 tokens of context are in use, I can only assume it was attaching the entire codebase as context for each prompt, but maybe I'm mistaken. When I used it as my planning assistant (only one markdown document in the current directory), context usage dropped to only one or two percent, and I could keep using 2.5 Pro for much longer. Hopefully someone can provide more information, but my assumption is that either they are aiming for a fairly costly agent with very good overall "awareness" of a given codebase (because the full thing sits in its context window), or they are still working on making context management more efficient.
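For anyone who wants to sanity-check that arithmetic, here's a minimal sketch, assuming the often-cited 1M-token context window (the exact window size is an assumption on my part):

```python
# Back-of-envelope check: how many tokens does a given "context left"
# percentage imply? The 1,000,000-token window is an assumption here,
# not an official figure.
CONTEXT_WINDOW = 1_000_000

def tokens_used(percent_left: float) -> int:
    """Convert the CLI's 'context left' percentage into tokens consumed."""
    return round(CONTEXT_WINDOW * (1 - percent_left / 100))

print(tokens_used(99))  # 10000  -> a nearly fresh session
print(tokens_used(40))  # 600000 -> matches the ~600,000 estimate above
```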
-
I discovered what the issue was for me. I had an incomplete .gitignore file in the project where I experienced the rapid exhaustion of my free limit. By default, gemini-cli uses this file to set its own ignore pattern for files that should be excluded from its context. For me, the culprit was my node_modules folder, which was VERY large and explains the huge number of input tokens I was using. Have a look at your .gitignore to rule this out as a possible cause of your issue. Cheers.
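If you want to check for this quickly, a small sketch like the one below can flag heavy directories missing from your .gitignore (the directory list is just my guess at common offenders, and the matching is deliberately crude):

```python
# Quick check: warn about common heavy directories that exist on disk
# but are not mentioned in .gitignore. The directory names here are
# illustrative guesses, not an exhaustive list; adapt to your project.
from pathlib import Path

HEAVY_DIRS = ["node_modules", "dist", "build", ".venv", "target"]

def missing_ignores(project_root: str = ".") -> list[str]:
    root = Path(project_root)
    gitignore = root / ".gitignore"
    patterns = gitignore.read_text().splitlines() if gitignore.exists() else []
    return [
        d for d in HEAVY_DIRS
        if (root / d).is_dir() and not any(d in line for line in patterns)
    ]

if __name__ == "__main__":
    for d in missing_ignores():
        print(f"warning: {d}/ exists but is not in .gitignore")
```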
-
I was very enthusiastic about Gemini CLI. I tried it with my personal Google account, and everything looked good and promising.
-
Howdy, folks. 👋 Ryan here from the Gemini CLI team. Hopefully, I can demystify the behavior.

For those devs using the free tier by logging in with your Google account, our goal is to deliver the best possible experience at the keyboard – ideally, one where you never have to stop work because you hit a limit. To do that, we have to balance model choice with capacity. Thus, the free tier uses a blend of Gemini 2.5 Pro and Flash.

For example, we might use Flash to determine the complexity of a request before routing it to the model for the "official" response. After all, Pro is overkill for a lot of really simple steps (e.g. "start the npm server") better routed to Flash. Pro is better suited to big, complex tasks that require reasoning (e.g. "write integration tests for …"). We also fall back from Pro to Flash when there are two or more slow responses.

Because of the (frankly, overwhelming 🤗) developer response in the first week of availability, our service has returned more slow response times than we'd like. But we're working to add capacity quickly. Our error rate is now well below 1%.

We're at the beginning of our release journey. There are still a lot of improvements we can make to planning and orchestration. If we get it right, you won't have to think about which model is being used. But if you want to use a specific model, you can always use an API key.
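Purely to illustrate the routing idea described above (this is not Gemini CLI source code; classify_complexity, call_model, the slowness threshold, and the model names are all hypothetical stand-ins), the logic might look roughly like this sketch:

```python
# Illustrative sketch only -- NOT Gemini CLI source code. It mimics the
# behavior described above: a cheap triage step decides the model, and
# two or more slow responses trigger a fallback from Pro to Flash.
import time

SLOW_THRESHOLD_S = 10.0   # assumption: what counts as a "slow" response
MAX_SLOW_RESPONSES = 2    # "two or more slow responses" per the post
slow_count = 0

def classify_complexity(prompt: str) -> str:
    # Hypothetical stand-in for a Flash-based triage call; here it is
    # just a naive length heuristic.
    return "complex" if len(prompt.split()) > 20 else "simple"

def call_model(model: str, prompt: str) -> tuple[str, float]:
    # Hypothetical stand-in for a real model call; returns a reply and
    # the elapsed time in seconds.
    start = time.monotonic()
    reply = f"[{model}] response to: {prompt[:40]}"
    return reply, time.monotonic() - start

def route(prompt: str) -> str:
    if slow_count >= MAX_SLOW_RESPONSES:
        return "gemini-2.5-flash"  # fall back after repeated slowness
    is_complex = classify_complexity(prompt) == "complex"
    return "gemini-2.5-pro" if is_complex else "gemini-2.5-flash"

def handle(prompt: str) -> str:
    global slow_count
    model = route(prompt)
    reply, elapsed = call_model(model, prompt)
    if elapsed > SLOW_THRESHOLD_S:
        slow_count += 1
    return reply

print(handle("start the npm server"))  # simple step -> routed to Flash
```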
-
Thank you for the clarification. The plan sounds good and reasonable, assuming the Flash model works well. In practice, however, the Flash model is, frankly speaking, unusable. It forgets or ignores its own context, doesn't understand synonyms, lies, and can't handle elementary tasks like replacing text in a file: it either doesn't realize it has already made the change and gets stuck in an endless loop repeating the same action, or it fails to learn from its own mistakes, even though all the necessary context is present. I can't understand where you got the 1% figure from; it's at least ten times higher, if not more.

The only thing that helps is to write a detailed "poem" describing exactly what I want the model to do. But by the time I've written this "poem," taking into account all the model's previous mistakes, it would be easier to write a script to solve the problem, or just spend 20 minutes doing the task manually rather than 30 minutes crafting and testing a prompt while hoping the model won't destroy my data. Even the very first public models from OpenAI or Llama weren't this dumb.

With the Pro model there are far fewer such errors (though they still happen), and I'm willing to spend an extra five minutes with Pro rather than boxing with Flash. And if you want to train your models on user data, Flash will only create even more problems. It would be best to disable Flash altogether, if financially possible, or at the very least not switch users to the Flash model under the pretext of slow responses.
-
That is a very elegant summary of our experiences: #2436 (comment), thanks. BTW, speaking of loops (#2923): it has not happened to me today, after the CLI client update. A very informal tip: with Flash-level Gemini AIs I do what I have done for years with e.g. https://github.com/OpenInterpreter/open-interpreter and similar tools: more hand-holding (putting myself in THEIR shoes mentally, i.e. orchestrating) via an "anti-goldfish-memory-syndrome" set of artefacts or heuristics. E.g. this one, also informal for now, works reasonably well: #2386 (comment). But indeed, the user must remind the AIs to check these now and then (i.e. an active read_file function call), as otherwise the AIs tend to treat them as ornamentation, some faux QA bumf (a sin of many human junior PMs, too...).
-
When will the Google AI Ultra subscription be available to link with Gemini CLI, allowing much more use of the Pro version, similar to Claude Code and the Max plan? Until then, this is unusable for me!
-
BTW, still on the free tier (but using a couple of old accounts now and then), I figured out why the tokens may be consumed too fast. In a fresh session (new day), Gemini makes this mistake: in short, it tries to ingest (read in full) all the files as soon as possible, and of course that burns through the token budget quickly.
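To see how fast that read-everything behavior adds up, here's a rough sketch that estimates the token cost of ingesting a whole directory (the 4-characters-per-token ratio is a common rule of thumb, not an exact tokenizer count):

```python
# Rough estimate of what it costs, in tokens, to ingest every file
# under a directory in full. The ~4 chars/token ratio is a rule-of-
# thumb assumption, not an exact figure for any particular tokenizer.
from pathlib import Path

CHARS_PER_TOKEN = 4  # rule-of-thumb assumption

def estimated_ingest_tokens(root: str = ".") -> int:
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file():
            try:
                total_chars += len(path.read_text(errors="ignore"))
            except OSError:
                continue  # skip unreadable files
    return total_chars // CHARS_PER_TOKEN

print(f"~{estimated_ingest_tokens():,} tokens to read everything in full")
```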
-
The free-tier implementation seems to have a flaw in the CLI. Take a day or more off, then send a prompt with over 85% of the context window left, and it immediately outputs unhelpful messaging, staying in a poor user experience with a repeated loop of messages for over 30 minutes at a time. The request should simply fail the first time, terminate the process, and direct the user to try again in X time. I have had a request open for over 30 minutes of runtime.
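What's being asked for here is essentially a fail-fast pattern: report once, exit, and say when to retry. A minimal sketch of that behavior (RateLimitError and its retry_after_s field are hypothetical stand-ins, since the real client's error shape isn't documented in this thread):

```python
# Sketch of the fail-fast behavior suggested above: on a rate-limit
# error, stop immediately and tell the user when to retry, instead of
# looping for 30+ minutes. RateLimitError and retry_after_s are
# hypothetical stand-ins for whatever the real client library raises.
import sys

class RateLimitError(Exception):
    def __init__(self, retry_after_s: int):
        super().__init__(f"rate limited, retry after {retry_after_s}s")
        self.retry_after_s = retry_after_s

def send_prompt(prompt: str) -> str:
    # Stand-in for a real request; always rate-limited in this demo.
    raise RateLimitError(retry_after_s=3600)

try:
    print(send_prompt("hello"))
except RateLimitError as e:
    # Fail fast: one clear message, then exit -- no silent retry loop.
    sys.exit(f"Rate limited. Try again in ~{e.retry_after_s // 60} minutes.")
```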