Balancer#

class langworks.middleware.balancer.Balancer#

Middleware that distributes queries among other middleware, providing load balancing. Optionally, load balancing may be combined with autoscaling, controlling the rate at which middleware are made available or unavailable.

Fundamentals#

__init__(middleware: Sequence[Middleware], autoscale_threshold: tuple[float, float] = (0, 0))#

Initializes the Balancer.

Parameters#

middleware

Instantiated middleware to which queries may be distributed, giving priority to middleware specified first.

autoscale_threshold

Pair of thresholds specifying the number of queries per middleware at which to scale up (first item) or scale down (second item). By default this is set to (0, 0), configuring the balancer to immediately scale up to use all resources, while never scaling down.
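
For illustration, a minimal construction sketch follows. The middleware instances are hypothetical stand-ins; any instantiated Middleware objects may be passed.

```python
from langworks.middleware.balancer import Balancer

# `endpoint_a` and `endpoint_b` are hypothetical stand-ins for any
# instantiated Middleware objects; this section does not prescribe a
# concrete middleware class.
balancer = Balancer(
    middleware=[endpoint_a, endpoint_b],  # endpoint_a is given priority
    # Assumed reading of the thresholds: scale up once a middleware
    # serves more than 8 queries, scale down once it serves fewer than 2.
    autoscale_threshold=(8, 2),
)
```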

Methods#

exec(query: str = None, role: str = None, guidance: str = None, history: Thread = None, context: dict = None, params: SamplingParams = None) → tuple[Thread, dict[str, Any]]#

Generates a new message, following up on the passed message using the given guidance and sampling parameters.

Parameters#

query

The query to prompt the LLM with, optionally formatted using Langworks’ static DSL.

role

The role of the agent stating this query, usually ‘user’, ‘system’ or ‘assistant’.

guidance

Template for the message to be generated, formatted using Langworks’ dynamic DSL.

history

Conversational history (thread) to prepend to the prompt.

context

Context to reference when filling in the templated parts of the query, guidance and history. If the Langwork or the input also defines a context, the available contexts are merged. When duplicate attributes are observed, the value is copied from the most specific context, i.e. input context over Query context, and Query context over Langwork context.

params

Sampling parameters, wrapped by a SamplingParams object, specifying how the LLM should select subsequent tokens.
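
A hedged usage sketch follows. The import path for SamplingParams and the templated placeholder in the query are assumptions, not prescribed by this section; the exec signature and return type are as documented above.

```python
from langworks.middleware.generic import SamplingParams  # assumed import path

# `balancer` is a Balancer instance as constructed under Fundamentals.
thread, context = balancer.exec(
    query="Summarise the document titled {{ title }}.",  # illustrative static-DSL placeholder
    role="user",
    # Keys given here override duplicates in the Query and Langwork contexts.
    context={"title": "Quarterly report"},
    params=SamplingParams(temperature=0.2),  # assumed parameter name
)
```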