vLLM#

class langworks.middleware.vllm.vLLM#

vLLM is an open source library for efficiently serving various LLMs using self-managed hardware. Langworks’ vLLM middleware provides a wrapper to access these LLMs using vLLM’s server API.

Dependencies#

To employ the vLLM middleware, additional dependencies need to be installed. You may do so using pip:

pip install langworks[vllm]

Fundamentals#

__init__(url: str | Sequence[str], model: str, authenticator: Authenticator | Sequence[Authenticator] | None = None, timeout: int = 5, retries: int = 2, autoscale_threshold: tuple[float, float] = (0, 0), params: SamplingParams = None, output_cache: ScoredCache = None)#

Initializes the vLLM middleware. A usage sketch follows the parameter descriptions below.

Parameters#

Connection#

url

URL or URLs of vLLM-instances where the model may be accessed.

Changed in version 0.3.0: A list of URLs may be specified, making it possible to access multiple inference endpoints simultaneously.

model

Name of the model as used in the Hugging Face repository.

authenticator

Authenticator that may be used to acquire an authentication key when the vLLM-instance requests authentication. Optionally, a list of authenticators may be provided, making it possible to specify an authenticator per vLLM-instance, as identified by the URLs in url.

Changed in version 0.3.0: A list of authenticators may be specified, making it possible to manage authentication per vLLM-instance as specified using url.

timeout

The number of seconds the client waits for a response from the vLLM-instance.

retries

The number of times the client resubmits the same request after a timeout.

Balancing#

autoscale_threshold

Pair specifying the number of tasks per instance at which to scale up (first item) or scale down (second item). By default this is set to (0, 0), causing the middleware to immediately scale up to use all available instances, while never scaling down.

Added in version 0.3.0.

Configuration#

params

Default sampling parameters to use when processing a prompt, specified using an instance of SamplingParams.

output_cache

Cache to use for caching previous prompt outputs. Should be a ScoredCache or subclass.
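
The snippet below sketches a minimal construction of the middleware based on the parameters described above. It is an illustrative sketch only: the endpoint URLs, model name, and parameter values are placeholders, and it assumes that SamplingParams accepts its documented properties as keyword arguments.

from langworks.middleware.vllm import vLLM, SamplingParams

# All values below are illustrative placeholders; substitute your own
# endpoints and model.
middleware = vLLM(

    # Two hypothetical vLLM instances serving the same model.
    url = [
        "http://gpu-node-1:8000",
        "http://gpu-node-2:8000",
    ],

    # Model name as used in the Hugging Face repository.
    model = "meta-llama/Llama-3.1-8B-Instruct",

    # Wait up to 10 seconds per request, resubmitting a timed-out request
    # up to three times.
    timeout = 10,
    retries = 3,

    # Scale up above four tasks per instance, scale down below one.
    autoscale_threshold = (4, 1),

    # Default sampling parameters, which may be overridden per call to exec().
    params = SamplingParams(temperature = 0.2, max_tokens = 512),
)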

Methods#

exec(query: str = None, role: str = None, guidance: str = None, history: Thread = None, context: dict = None, params: SamplingParams = None) tuple[Thread, dict[str, Any]]#

Generates a new message, following up on the passed query, using the given guidance and sampling parameters. A usage sketch follows the parameter descriptions below.

Parameters#

query

The query to prompt the LLM with, optionally formatted using Langworks’ static DSL.

role

The role of the agent stating this query, usually ‘user’, ‘system’ or ‘assistant’.

guidance

Template for the message to be generated, formatted using Langworks’ dynamic DSL.

history

Conversational history (thread) to prepend to the prompt.

context

Context to reference when filling in the templated parts of the query, guidance and history. In case the Langwork or the input also define a context, the available contexts are merged. When duplicate attributes are observed, the value is copied from the most specific context, i.e. input context over Query context, and Query context over Langwork context.

params

Sampling parameters, wrapped by a SamplingParams object, specifying how the LLM should select subsequent tokens.
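
A usage sketch of exec() follows, reusing the middleware instance constructed earlier. The query text and context keys are assumptions for illustration, and the placeholder notation shown is only a guess at the static DSL’s syntax; consult the DSL documentation for the actual notation.

# Illustrative call; middleware is the instance constructed above, and the
# {{ text }} placeholder assumes a Jinja-like syntax for the static DSL.
thread, context = middleware.exec(
    query   = "Summarise the following text: {{ text }}",
    role    = "user",
    context = {"text": "vLLM is an open source library for serving LLMs."},
    params  = SamplingParams(temperature = 0.0, max_tokens = 128),  # per-call override
)

# thread holds the updated conversational history (Thread); context holds the
# resulting context dictionary.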

class langworks.middleware.vllm.SamplingParams#

Specifies additional sampling parameters to be used when passing a prompt to vLLM.

Properties#

allowed_tokens: list[int] = None#

List of encoded tokens from which the LLM may sample.

Note

Within vLLM this parameter is referred to as allowed_token_ids.

bad_words: list[str] = None#

List of words that are not allowed to be generated.

frequency_penalty: float = 0.0#

A number between -2.0 and 2.0 that penalizes tokens according to how often they already appear in the generated text, whereby a positive coefficient constrains repetition, whereas a negative coefficient encourages it.

ignore_eos: bool = False#

Flag that controls whether or not generation should continue after the End of Sequence (EOS) token has been generated.

include_stop: bool = True#

Flag that controls whether stop sequences are included in the output.

logit_bias: dict[int, int] = None#

A dictionary indexed by encoded tokens, each assigned a number between -100 and 100 controlling the likelihood that the token is selected, with a positive number increasing this likelihood, and a negative number decreasing it.

logprobs: int = None#

The number of most probable tokens for which to return log probabilities, per position in the generated sequence.

max_tokens: int = None#

The maximum number of tokens that may be generated per generation.

min_p: float = 0.0#

A number between 0.0 and 1.0, specifying the minimum likelihood a token must have to be considered for selection.

min_tokens: int = 0#

The minimum number of tokens that must be generated per generation.

presence_penalty: float = 0.0#

A number between -2.0 and 2.0 that controls the likelihood of introducing tokens that have not yet appeared in the generated text, whereby a positive coefficient increases these odds, whereas a negative coefficient decreases them.

repetition_penalty: float = 0.0#

A number between -2.0 and 2.0 that controls the degree of repetition, taking into account both the generated text and the initial prompt. A positive number encourages usage of new tokens, whereas a negative number favours repetition.

seed: int = None#

A number used to initialize any pseudorandom number generators that the model may use. It can be used to enforce a degree of determinism even when using non-zero temperatures.

stop: list[str] = None#

List of character sequences that, when generated, stop further generation.

stop_tokens: list[int] = None#

List of encoded tokens that, when generated, stop further generation.

Note

Within vLLM this parameter is referred to as stop_token_ids.

temperature: float = 1.0#

A number between 0.0 and 2.0, with higher values increasing randomness, whereas lower values encourage more deterministic output.

top_k: int = -1#

The number of highest-probability tokens to consider when selecting the next token during generation.

top_p: float = 1.0#

A number between 0.0 and 1.0, constraining token selection to the smallest set of most probable tokens whose cumulative probability covers this value. For example, when 0.2 is specified, a selection is made only from the tokens within the top 20% of cumulative probability mass.
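
As a closing illustration, the sketch below assembles a SamplingParams object with a few commonly adjusted fields. The values are illustrative, and the sketch assumes the class accepts its documented properties as keyword arguments.

from langworks.middleware.vllm import SamplingParams

# A largely deterministic sampling profile (illustrative values).
params = SamplingParams(
    temperature       = 0.2,        # low randomness
    top_p             = 0.9,        # restrict sampling to the top 90% of probability mass
    max_tokens        = 256,        # cap the length of each generation
    stop              = ["\n\n"],   # halt generation at the first blank line
    frequency_penalty = 0.5,        # discourage verbatim repetition
    seed              = 42,         # reproducible output despite a non-zero temperature
)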