Llama.cpp#
- class langworks.middleware.llama_cpp.LlamaCpp#
Llama.cpp is an open source LLM inference engine designed to efficiently run on a wide range of hardware, including CPU-only and CPU+GPU hybrid configurations.
Langworks’ llama.cpp middleware provides a wrapper to access LLMs served via the llama.cpp server API.
Dependencies#
To employ the llama.cpp middleware, additional dependencies need to be installed. You may do so using pip:
pip install langworks[llamacpp]
Be sure to employ llama.cpp with LLGuidance enabled to make full use of all features provided by Langworks. It is furthermore recommended to initialize llama.cpp with --parallel set to a value higher than 1, so that llama.cpp may handle multiple requests simultaneously.
Fundamentals#
- __init__(url: str | Sequence[str], model: str, authenticator: Authenticator | Sequence[Authenticator] | None = None, timeout: int = 5, retries: int = 2, autoscale_threshold: tuple[float, float] = (0, 0), params: SamplingParams = None, output_cache: ScoredCache = None)#
Initializes the llama.cpp middleware.
Parameters#
Connection#
- url
URL or URLs of llama.cpp-instances where the model may be accessed.
Changed in version 0.3.0: A list of URLs may be specified, allowing multiple inference endpoints to be accessed simultaneously.
- model
Name of the model as assigned by the instance.
- authenticator
Authenticator that may be used to acquire an authentication key when the llama.cpp-instance requests authentication. Optionally, a list of authenticators may be provided, allowing an authenticator to be specified per llama.cpp-instance as identified by the URLs in url.
Changed in version 0.3.0: A list of authenticators may be specified, making it possible to manage authentication per llama.cpp-instance as specified using url.
- timeout
The number of seconds the client awaits the response of the llama.cpp-instance.
- retries
The number of times the client tries to submit the same request again after a timeout.
Balancing#
- autoscale_threshold
Pair specifying at what number of tasks per instance to scale up (first item) or scale down (second item). By default this is set to (0, 0), setting the middleware up to immediately scale up to use all resources, while never scaling down.
Added in version 0.3.0.
Configuration#
- params
Default sampling parameters to use when processing a prompt, specified using an instance of SamplingParams.
- output_cache
Cache to use for caching previous prompt outputs. Should be a ScoredCache or a subclass thereof.
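Taken together, the parameters above might be combined as in the following sketch. This is illustrative only: the URL, model name, and threshold values are placeholders, and SamplingParams is assumed to accept its documented properties as keyword arguments.

from langworks.middleware.llama_cpp import LlamaCpp, SamplingParams

# Illustrative configuration; URL and model name are placeholders for an actual deployment.
middleware = LlamaCpp(
    url                 = "http://localhost:8080",   # llama.cpp instance serving the model
    model               = "llama-3.1-8b-instruct",   # name as assigned by the instance
    timeout             = 30,                        # wait up to 30 seconds per response
    retries             = 2,                         # resubmit up to 2 times after a timeout
    autoscale_threshold = (8, 2),                    # scale up above 8 tasks per instance, down below 2
    params              = SamplingParams(temperature = 0.7, top_p = 0.9),
)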
Methods#
- exec(query: str = None, role: str = None, guidance: str = None, history: Thread = None, context: dict = None, params: SamplingParams = None) → tuple[Thread, dict[str, Any]]#
Generates a new message, following up on the passed message, using the given guidance and sampling parameters.
Parameters#
- query
The query to prompt the LLM with, optionally formatted using Langworks’ static DSL.
- role
The role of the agent stating this query, usually ‘user’, ‘system’ or ‘assistant’.
- guidance
Template for the message to be generated, formatted using Langworks’ dynamic DSL.
- history
Conversational history (thread) to prepend to the prompt.
- context
Context to reference when filling in the templated parts of the query, guidance and history. In case the Langwork or the input also define a context, the available contexts are merged. When duplicate attributes are observed, the value is copied from the most specific context, i.e. input context over Query context, and Query context over Langwork context.
- params
Sampling parameters, wrapped by a SamplingParams object, specifying how the LLM should select subsequent tokens.
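For illustration, a call to exec might look as sketched below. The query placeholder syntax, the context values, and the unpacking of the returned tuple are assumptions based on the signature above; the dict returned alongside the thread is taken to hold the updated context.

# Assumes 'middleware' is an initialized LlamaCpp instance as sketched earlier.
thread, context = middleware.exec(
    query   = "Summarize the following text: {{ text }}",   # static DSL placeholder (assumed Jinja-style)
    role    = "user",
    context = {"text": "Langworks wraps llama.cpp's server API."},
    params  = SamplingParams(temperature = 0.2, max_tokens = 256),
)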
- class langworks.middleware.llama_cpp.SamplingParams#
Specifies additional sampling parameters to be used when passing a prompt to llama.cpp.
Properties#
- dry: bool = False#
Enables sampling using the DRY repetition penalty, offering more fine-grained control over repetitions than repetition_penalty.
See also
Weidmann, P.E. (2024). DRY: A modern repetition penalty that reliably prevents looping. oobabooga/text-generation-webui#5677.
- dry_base: float = 1.75#
Controls how quickly the DRY repetition penalty grows as a repetition lengthens.
- dry_last_n: int = -1#
The number of previous tokens to consider when applying DRY sampling.
Note
Within llama.cpp this parameter is referred to as dry_penalty_last_n.
- dry_min_length: int = 2#
Minimal required length of repeated tokens subject to the DRY repetition penalty.
Note
Within llama.cpp this parameter is referred to as dry_allowed_length.
- dry_multiplier: float = 0.0#
Controls the extent to which the DRY repetition penalty is applied.
- dry_sequence_breakers: list[str] = None#
List of character sequences which, if generated, reset the DRY repetition penalty.
- frequency_penalty: float = 0.0#
A number between -2.0 and 2.0 that controls the number of times a token appears in the generated text, whereby a positive coefficient puts a constraint on repetition, whereas a negative coefficient encourages repetition.
- ignore_eos: bool = False#
Flag that controls whether or not generation should continue after the End of Sequence (EOS) token has been generated.
- include_stop: bool = True#
Flag that controls whether stop sequences are included in the output.
- logit_bias: dict[int, int] = None#
A dictionary indexed by encoded tokens, each assigned a number between -100 and 100 controlling the likelihood that a specific token is selected or ignored, with a positive number increasing this likelihood, whereas a negative number decreases it.
- logprobs: int = None#
The number of most likely tokens to return log-probabilities for, per position in the generated sequence.
- max_p: float = 0.1#
Puts an upper bound on the probability of the sampled tokens under consideration, allowing the most likely tokens to be discarded, which increases originality.
Note
Within llama.cpp this parameter is referred to as xtc_threshold.
See also
Weidmann, P.E. (2024). Exclude Top Choices (XTC): A sampler that boosts creativity, breaks writing clichés, and inhibits non-verbatim repetition. oobabooga/text-generation-webui#6335.
- max_p_allowance: float = 0.0#
Controls the probability that the least likely sampled token above the threshold specified by max_p is included despite exceeding the threshold.
Note
Within llama.cpp this parameter is referred to as xtc_probability.
- max_tokens: int = None#
The maximum number of tokens that may be generated per generation.
- min_p: float = 0.0#
A number between 0.0 and 1.0, specifying what a token’s minimal likelihood must be to be considered for selection.
- mirostat: Literal[0, 1, 2] = 0#
Enables Mirostat sampling, controlling perplexity during text generation.
A value of 0 disables Mirostat, 1 enables Mirostat, and 2 enables Mirostat 2.0.
See also
Basu et al. (2021). Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity. https://arxiv.org/abs/2007.14966
- mirostat_eta: float = 0.1#
Sets the Mirostat learning rate, parameter eta.
- mirostat_tau: float = 5.0#
Sets the Mirostat target entropy, parameter tau.
- overflow_keep: int = 0#
Specifies the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. The number excludes the BOS token. When set to -1, all tokens are retained.
Note
Within llama.cpp this parameter is referred to as n_keep.
- presence_penalty: float = 0.0#
A number between -2.0 and 2.0 that controls the likelihood that tokens appear which have not yet appeared in the generated text, whereby a positive coefficient increases these odds, whereas a negative coefficient decreases them.
- repeat_last_n: int = 64#
The number of previous tokens to consider when applying repetition_penalty.
- repetition_penalty: float = 0.0#
A number between -2.0 and 2.0 that controls the degree of repetition, taking into account both the generated text and the initial prompt. A positive number encourages usage of new tokens, whereas a negative number favours repetition.
- samplers: list[Literal['dry', 'top_k', 'typ_p', 'top_p', 'min_p', 'xtc', 'temperature']] = None#
The order samplers are applied in.
- samplers_keep: int = 0#
The minimum number of best tokens that samplers must retain, regardless of whether these tokens satisfy the conditions set by these samplers.
Note
Within llama.cpp this parameter is referred to as min_keep.
- seed: int = None#
A number used to initialize any pseudorandom number generators the model may use. It can be used to enforce a degree of determinism even when using non-zero temperatures.
- stop: list[str] = None#
List of character sequences which, if generated, stop further generation.
- temperature: float = 1.0#
A number between 0.0 and 2.0, with higher values increasing randomization, whereas lower values encourage deterministic output.
- temperature_dev: float = 0.0#
Maximum deviation allowed from temperature. When specified, the temperature is heuristically adjusted in proportion to the spread of the probabilities of the sampled tokens under consideration. The extent of this adjustment is capped by the value temperature_dev is set to.
Note
Within llama.cpp this parameter is referred to as dynatemp_range.
See also
Zhu et al. (2023). Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models. https://doi.org/10.48550/arXiv.2309.02772.
- temperature_dev_exp: float = 1.0#
Controls the degree to which temperature is adjusted when temperature_dev is specified. Above 1.0, temperature is barely adjusted unless the p-values among the tokens under consideration vary considerably. Below 1.0, adjustment is more pronounced even if p-values are relatively close.
Note
Within llama.cpp this parameter is referred to as dynatemp_exponent.
- top_k: int = -1#
The number of most likely tokens to consider when deciding on the next token during generation.
- top_n_sigma: float = -1.0#
Puts an upper bound on pre-softmax scores, stating the number of standard deviations a pre-softmax score may exceed the mean to be considered for sampling. Doing so is purported to increase coherence at higher temperatures.
See also
Tang et al. (2024). Top-nσ: Not All Logits Are You Need. https://doi.org/10.48550/arXiv.2411.07641.
- top_p: float = 1.0#
A number between 0.0 and 1.0, specifying what is considered a top percentile, constraining the selection of tokens to tokens with probabilities considered among these top percentiles. For example, when 0.2 is specified, a selection is made from the tokens with probabilities among the top 20 percentiles.
- typical_p: float = 1.0#
Specifies a range around the average probability of the sampled tokens under consideration, excluding tokens outside of this range. typical_p is generally applied after other samplers have already been applied.
See also
Meister et al. (2022). Locally Typical Sampling. https://arxiv.org/abs/2202.00666.
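To close, a hedged sketch of a SamplingParams configuration combining several of the properties documented above. The concrete values are purely illustrative, not recommendations, and SamplingParams is assumed to accept these properties as keyword arguments.

from langworks.middleware.llama_cpp import SamplingParams

# Illustrative values only.
params = SamplingParams(
    temperature           = 0.8,            # moderate randomization
    temperature_dev       = 0.5,            # allow heuristic temperature adjustment (dynatemp_range)
    top_p                 = 0.95,           # restrict selection to the top percentiles
    min_p                 = 0.05,           # drop tokens below this minimal likelihood
    dry                   = True,           # enable the DRY repetition penalty
    dry_multiplier        = 0.8,            # extent to which the DRY penalty is applied
    dry_sequence_breakers = ["\n", ":"],    # sequences that reset the DRY penalty
    max_tokens            = 512,            # cap the number of generated tokens
    stop                  = ["</answer>"],  # hypothetical stop sequence
    seed                  = 42,             # fix the pseudorandom seed for reproducibility
)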