How to decrease time to generate first token? #297
              
                Unanswered
              
          
                  
                    
VenkatLohithDasari asked this question in Q&A
              
            Replies: 0 comments
  
I copied the code from example_ws.py to enable text streaming. It works well, but there is one big problem: it takes a long time to generate the first token. The rest of the tokens are generated quickly, at around 15 t/s. Is there any way to fix this?
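For reference, this is roughly how the two numbers (delay before the first token, and speed of the rest) can be measured separately. It is only a minimal sketch, assuming the streaming loop can be wrapped as a Python iterable that yields one token at a time. `measure_stream` and its `clock` parameter are names made up for illustration; they are not part of example_ws.py.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Consume a token stream and time it.

    Returns (seconds until the first token, tokens/sec for the remaining
    tokens, total token count). `tokens` is assumed to be any iterable
    that yields tokens as they are generated.
    """
    start = clock()
    first_at = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_at is None:
            first_at = now  # moment the first token arrived
        count += 1
    end = clock()

    if first_at is None:
        # The stream produced nothing at all.
        return (end - start, 0.0, 0)

    rest_time = end - first_at
    # Rate over the tokens that came after the first one.
    rest_rate = (count - 1) / rest_time if rest_time > 0 and count > 1 else 0.0
    return (first_at - start, rest_rate, count)
```

Wrapping the generator this way keeps the slow part (time to first token) separate from the steady-state rate, so the two do not get averaged together.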
My chatbot works like this: it takes the user's message, detects the intent of the message, then builds an appropriate prompt for that intent using f-strings. This means the prompt always changes depending on the context. I mention this in case it helps explain why the first token is slow.
I want to know what the generation of the first token depends on. Is the prompt converted into some kind of mathematical representation first? If so, could that result be cached by storing it in a variable, so that only the message newly appended to the prompt has to be converted? Is that possible? Sorry if this sounds dumb, I'm not an AI programmer...
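To make the caching idea concrete, here is a rough sketch of what I have in mind. It is a toy only: `PrefixCache`, `feed`, and `common_prefix_len` are hypothetical names, the "tokens" are plain integers, and nothing here is the real library API. The assumption is that work already done for the start of a prompt can be reused as long as the new prompt begins with exactly the same tokens.

```python
from typing import List


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Toy stand-in for a model's cached prompt state (hypothetical)."""

    def __init__(self) -> None:
        self.cached: List[int] = []  # tokens whose state is already computed
        self.processed = 0           # total tokens actually "evaluated" so far

    def feed(self, prompt: List[int]) -> int:
        """Process a prompt, reusing the shared prefix.

        Returns how many tokens had to be evaluated this time.
        Simplification: a real model still needs to evaluate at least
        the last token to produce output, which this toy ignores.
        """
        keep = common_prefix_len(self.cached, prompt)
        new_tokens = prompt[keep:]       # only the part that differs
        self.processed += len(new_tokens)
        self.cached = list(prompt)       # cache now matches this prompt
        return len(new_tokens)
```

In this toy, appending a message to the end of the previous prompt only costs the new tokens, while changing anything near the start of the prompt throws away everything after the change. If that is how it really works, my f-string prompts that change at the beginning would never benefit from the cache.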