.github/ISSUE_TEMPLATE/bug_report.md (2 additions, 2 deletions)
@@ -66,8 +66,8 @@ Try the following:
 3. `rm -rf _skbuild/` # delete any old builds
 4. `python -m pip install .`
 5. `cd ./vendor/llama.cpp`
-6. Follow [llama.cpp's instructions](https://github.com/ggerganov/llama.cpp#build) to `cmake` llama.cpp
-7. Run llama.cpp's `./main` with the same arguments you previously passed to llama-cpp-python and see if you can reproduce the issue. If you can, [log an issue with llama.cpp](https://github.com/ggerganov/llama.cpp/issues)
+6. Follow [llama.cpp's instructions](https://github.com/ggml-org/llama.cpp#build) to `cmake` llama.cpp
+7. Run llama.cpp's `./main` with the same arguments you previously passed to llama-cpp-python and see if you can reproduce the issue. If you can, [log an issue with llama.cpp](https://github.com/ggml-org/llama.cpp/issues)
llama_cpp/llama.py (4 additions, 4 deletions)
@@ -161,7 +161,7 @@ def __init__(
 n_ubatch: Physical batch size
 n_threads: Number of threads to use for generation
 n_threads_batch: Number of threads to use for batch processing
-rope_scaling_type: RoPE scaling type, from `enum llama_rope_scaling_type`. ref: https://github.com/ggerganov/llama.cpp/pull/2054
+rope_scaling_type: RoPE scaling type, from `enum llama_rope_scaling_type`. ref: https://github.com/ggml-org/llama.cpp/pull/2054
 pooling_type: Pooling type, from `enum llama_pooling_type`.
 rope_freq_base: RoPE base frequency, 0 = from model
 rope_freq_scale: RoPE frequency scaling factor, 0 = from model
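These docstring lines belong to `Llama.__init__`. A minimal construction sketch for the parameters named above; the model path and numeric values are placeholders, not part of this change:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example.gguf",  # placeholder path
    n_ctx=4096,
    n_threads=8,          # threads used for generation
    n_threads_batch=8,    # threads used for batch processing
    rope_freq_base=0.0,   # 0 = take the value from the model
    rope_freq_scale=0.0,  # 0 = take the value from the model
)
```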
@@ -1774,7 +1774,7 @@ def create_completion(
 max_tokens: The maximum number of tokens to generate. If max_tokens <= 0 or None, the maximum number of tokens to generate is unlimited and depends on n_ctx.
 temperature: The temperature to use for sampling.
 top_p: The top-p value to use for nucleus sampling. Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
-min_p: The min-p value to use for minimum p sampling. Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841
+min_p: The min-p value to use for minimum p sampling. Minimum P sampling as described in https://github.com/ggml-org/llama.cpp/pull/3841
 typical_p: The typical-p value to use for sampling. Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
 logprobs: The number of logprobs to return. If None, no logprobs are returned.
 echo: Whether to echo the prompt.
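A usage sketch for `create_completion` with the sampling parameters documented above; the prompt and values are illustrative and assume the `llm` instance from the previous example:

```python
output = llm.create_completion(
    prompt="Q: Name the planets in the solar system. A:",
    max_tokens=64,     # <= 0 or None would generate until n_ctx is exhausted
    temperature=0.7,
    top_p=0.95,        # nucleus sampling
    min_p=0.05,        # minimum-p sampling (llama.cpp PR #3841)
    typical_p=1.0,     # 1.0 leaves locally typical sampling effectively off
    logprobs=None,     # no logprobs returned
    echo=False,        # do not prepend the prompt to the returned text
)
print(output["choices"][0]["text"])
```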
@@ -1871,7 +1871,7 @@ def __call__(
 max_tokens: The maximum number of tokens to generate. If max_tokens <= 0 or None, the maximum number of tokens to generate is unlimited and depends on n_ctx.
 temperature: The temperature to use for sampling.
 top_p: The top-p value to use for nucleus sampling. Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
-min_p: The min-p value to use for minimum p sampling. Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841
+min_p: The min-p value to use for minimum p sampling. Minimum P sampling as described in https://github.com/ggml-org/llama.cpp/pull/3841
 typical_p: The typical-p value to use for sampling. Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
 logprobs: The number of logprobs to return. If None, no logprobs are returned.
 echo: Whether to echo the prompt.
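`__call__` documents the same parameters because calling the instance is a shorthand for `create_completion`; an equivalent one-off call (values illustrative):

```python
text = llm(
    "Q: What is the capital of France? A:",
    max_tokens=32,
    temperature=0.2,
    min_p=0.05,
)["choices"][0]["text"]
print(text)
```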
@@ -1971,7 +1971,7 @@ def create_chat_completion(
 temperature: The temperature to use for sampling.
 top_p: The top-p value to use for nucleus sampling. Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
 top_k: The top-k value to use for sampling. Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
-min_p: The min-p value to use for minimum p sampling. Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841
+min_p: The min-p value to use for minimum p sampling. Minimum P sampling as described in https://github.com/ggml-org/llama.cpp/pull/3841
 typical_p: The typical-p value to use for sampling. Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
 stream: Whether to stream the results.
 stop: A list of strings to stop generation when encountered.
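A sketch of `create_chat_completion` with the parameters listed above; the messages and values are illustrative, not part of this diff:

```python
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain min-p sampling in one sentence."},
    ],
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    min_p=0.05,
    stream=False,   # set True to iterate over streamed chunks instead
    stop=["</s>"],  # optional stop strings
)
print(response["choices"][0]["message"]["content"])
```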
llama_cpp/llama_cpp.py

 # /// Apply chat template. Inspired by hf apply_chat_template() on python.
 # /// Both "model" and "custom_template" are optional, but at least one is required. "custom_template" has higher precedence than "model"
-# /// NOTE: This function does not use a jinja parser. It only support a pre-defined list of template. See more: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
+# /// NOTE: This function does not use a jinja parser. It only support a pre-defined list of template. See more: https://github.com/ggml-org/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
 # /// @param tmpl A Jinja template to use for this chat. If this is nullptr, the model’s default chat template will be used instead.
 # /// @param chat Pointer to a list of multiple llama_chat_message
 # /// @param n_msg Number of llama_chat_message in this chat
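The `chat`/`n_msg` parameters described above are a C array of `llama_chat_message` plus its length. A rough ctypes sketch of building that pair from Python, assuming the bindings expose `llama_chat_message` with `role`/`content` char-pointer fields as in `llama.h`; the messages are made up, and the actual call into `llama_chat_apply_template` is left to the binding's own signature:

```python
import llama_cpp

# Illustrative messages; roles and contents are not from this diff.
messages = [
    (b"system", b"You are a helpful assistant."),
    (b"user", b"Hello!"),
]

# Build the `chat` array and the `n_msg` count that the comments above describe.
chat = (llama_cpp.llama_chat_message * len(messages))()
for dst, (role, content) in zip(chat, messages):
    dst.role = role        # c_char_p role string, e.g. b"user"
    dst.content = content  # c_char_p message text
n_msg = len(messages)
```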
@@ -3375,8 +3387,8 @@ class llama_sampler_i(ctypes.Structure):
 # struct llama_sampler {
-#     struct llama_sampler_i * iface;
-#     llama_sampler_context_t ctx;
+#     const struct llama_sampler_i * iface;
+#     llama_sampler_context_t ctx;
 # };
 class llama_sampler(ctypes.Structure):
     _fields_ = [
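For context, the binding mirrors that C struct with a ctypes `Structure`. A self-contained sketch; the `llama_sampler_i` and `llama_sampler_context_t` stand-ins below are simplified placeholders for the types the real bindings define earlier in the file:

```python
import ctypes

# Simplified stand-ins so the sketch runs on its own; the real bindings define
# these with a full table of sampler callbacks.
llama_sampler_context_t = ctypes.c_void_p

class llama_sampler_i(ctypes.Structure):
    pass  # function-pointer interface, fields omitted in this sketch

# Mirror of the C struct shown in the hunk above:
#   struct llama_sampler {
#       const struct llama_sampler_i * iface;
#       llama_sampler_context_t        ctx;
#   };
class llama_sampler(ctypes.Structure):
    _fields_ = [
        ("iface", ctypes.POINTER(llama_sampler_i)),
        ("ctx", llama_sampler_context_t),
    ]
```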
@@ -3410,6 +3422,16 @@ class llama_sampler(ctypes.Structure):

 # /// @details Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
 # /// @param candidates A vector of `llama_token_data` containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
 # /// @param tau The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
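In the high-level API these Mirostat knobs surface as `mirostat_mode`, `mirostat_tau` and `mirostat_eta` on the completion calls; a short example with illustrative values, again assuming the `llm` instance from earlier:

```python
output = llm.create_completion(
    prompt="Write a haiku about autumn.",  # illustrative prompt
    max_tokens=48,
    mirostat_mode=1,   # Mirostat 1.0, as described in the comment above
    mirostat_tau=5.0,  # target surprise (cross-entropy)
    mirostat_eta=0.1,  # learning rate for the surprise controller
)
print(output["choices"][0]["text"])
```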
 # /// @details Lazy grammar sampler, introduced in https://github.com/ggml-org/llama.cpp/pull/9639
+# /// @param trigger_patterns A list of patterns that will trigger the grammar sampler. Pattern will be matched from the start of the generation output, and grammar sampler will be fed content starting from its first match group.
+# /// @param trigger_tokens A list of tokens that will trigger the grammar sampler. Grammar sampler will be fed content starting from the trigger token included.