@@ -763,6 +763,8 @@ curl http://localhost:8080/v1/chat/completions \
 
 ### POST `/v1/embeddings`: OpenAI-compatible embeddings API
 
+This endpoint requires the model to use a pooling type other than `none`.
+
 *Options:*
 
 See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-reference/embeddings).
@@ -795,7 +797,45 @@ See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-r
 }'
 ```
 
-When `--pooling none` is used, the server will output an array of embeddings - one for each token in the input.
+### POST `/embeddings`: non-OpenAI-compatible embeddings API
+
+This endpoint supports `--pooling none`. In that case, the response contains one embedding for every input token.
+Note that the response format differs slightly from `/v1/embeddings`: it has no `"data"` sub-tree, and the
+embeddings are always returned as a vector of vectors.
+
+*Options:*
+
+Same as the `/v1/embeddings` endpoint.
+
+*Examples:*
+
+Same as the `/v1/embeddings` endpoint.
+
+**Response format**
+
+```json
+[
+  {
+    "index": 0,
+    "embedding": [
+      [ ... embedding for token 0 ... ],
+      [ ... embedding for token 1 ... ],
+      [ ... ],
+      [ ... embedding for token N-1 ... ]
+    ]
+  },
+  ...
+  {
+    "index": P,
+    "embedding": [
+      [ ... embedding for token 0 ... ],
+      [ ... embedding for token 1 ... ],
+      [ ... ],
+      [ ... embedding for token N-1 ... ]
+    ]
+  }
+]
+```
 
 ### GET `/slots`: Returns the current slots processing state
 
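A minimal client-side sketch of consuming the `/embeddings` response format shown above. It assumes the HTTP response body has already been parsed from JSON into Python lists and dicts with that shape. The helper names `per_token_embeddings` and `mean_pool` are hypothetical and not part of llama.cpp. Mean pooling is used only as one example of what a client might do with the per-token vectors returned under `--pooling none`.

```python
from __future__ import annotations

from typing import Any


def per_token_embeddings(response: list[dict[str, Any]]) -> dict[int, list[list[float]]]:
    """Map each input index to its list of per-token embedding vectors.

    Assumes the documented POST /embeddings shape: a top-level list of
    {"index": int, "embedding": [[float, ...], ...]} objects (no "data" key).
    """
    return {item["index"]: item["embedding"] for item in response}


def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average per-token vectors into a single vector (hypothetical client-side pooling)."""
    if not token_vectors:
        raise ValueError("no token embeddings to pool")
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    for vec in token_vectors:
        if len(vec) != dim:
            raise ValueError("inconsistent embedding dimensions")
        for i, x in enumerate(vec):
            sums[i] += x
    return [s / len(token_vectors) for s in sums]


# Made-up response for two inputs with 2-dimensional vectors
# (real vectors have the model's full embedding dimension).
sample = [
    {"index": 0, "embedding": [[1.0, 2.0], [3.0, 4.0]]},
    {"index": 1, "embedding": [[0.5, 0.5]]},
]

by_index = per_token_embeddings(sample)
print(len(by_index[0]))        # → 2
print(mean_pool(by_index[0]))  # → [2.0, 3.0]
```

Indexing by `"index"` rather than by list position keeps the mapping correct even if a client does not rely on the response order.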