Basics
Vecworks builds on top of Pypeworks, adding specialised nodes to easily interface with vector stores. These nodes employ various constructs to specify how these vector stores should be queried.
Retrievers
A Retriever is a specialised Pypeworks node designed to query vector stores. A retriever vectorizes the inputs passed to it as arguments and compares the resulting vectors with those held in the vector store. If this produces any matches, the retriever provides access to the associated metadata.
Regardless of the backend, all retrievers need to define the indices to query, the vectorizers to employ, and the metadata to retrieve:
retriever = Retriever(
    index = [
        vecworks.Index(
            # Index to access in the vector store
            name = "INDEX_NAME",
            # Argument to connect to this index
            bind = "ATTRIBUTE_NAME",
            # How to match data with the index
            distance = vecworks.DISTANCES.cosine,
            vectorizer = RemoteVectorizer(
                url = "http://localhost:5000/",
                vectorizer = "vectorizer-name"
            ),
            density = vecworks.DENSITY.dense
        )
    ],
    # Metadata to retrieve for each match
    retrieve = {
        "METADATA": "metadata_alias"
    }
)
In practice specific retrievers may need additional arguments to connect with a vector store. Refer to the documentation of the retriever you wish to employ for the full details.
Note that despite being designed to function as Pypeworks nodes, retrievers may also be used standalone. To invoke a retriever, use the exec() method as follows:
retriever.exec("query: What is a vector?")
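The retrieval steps described in this section (vectorize the input, compare it with stored vectors, return the metadata of the matches) can be illustrated with a small, self-contained sketch. It does not use Vecworks and does not reflect how any Vecworks retriever is implemented. The in-memory `STORE`, the keyword-counting `toy_vectorize()` function and the `toy_retrieve()` function are all hypothetical names introduced purely for illustration.

```python
import math

# Hypothetical in-memory "vector store": each entry pairs a vector with metadata.
STORE = [
    ([1.0, 0.0, 0.0], {"title": "Intro to vectors"}),
    ([0.0, 1.0, 0.0], {"title": "Cooking basics"}),
    ([0.7, 0.7, 0.0], {"title": "Vectors in the kitchen"}),
]


def toy_vectorize(text):
    """Stand-in vectorizer that counts a few keyword prefixes (not a real model)."""
    words = text.lower().split()
    return [
        float(sum(w.startswith("vector") for w in words)),
        float(sum(w.startswith("cook") for w in words)),
        float(sum(w.startswith("garden") for w in words)),
    ]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; treat zero vectors as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0
    return 1.0 - dot / norm


def toy_retrieve(query, top_k=2):
    """Vectorize the query, rank stored vectors by distance, return metadata."""
    query_vector = toy_vectorize(query)
    ranked = sorted(STORE, key=lambda entry: cosine_distance(query_vector, entry[0]))
    return [metadata for _, metadata in ranked[:top_k]]


results = toy_retrieve("query: What is a vector?")
print(results)
# [{'title': 'Intro to vectors'}, {'title': 'Vectors in the kitchen'}]
```

In an actual deployment, the vectorizer would be an embedding model and the ranking would be carried out by the vector store's index rather than by a linear scan.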
Indices
Vector stores retain their vector data using indices. These data structures are specifically designed to allow for efficient similarity matching during lookups. Software interfacing with vector stores must therefore specify what indices to access.
In Vecworks, access to indices is controlled through Index objects. These objects provide a generalized interface detailing what index to access (using name), how to connect arguments to the index (using bind), and how to match data with the index (using vectorizer, distance and density).
Density
Vectors may be represented either densely or sparsely. A dense vector stores every element in a contiguous block of memory, which allows calculations involving entire vectors to be processed more efficiently. A sparse vector, on the other hand, stores only its non-zero elements. It therefore requires less storage than a dense vector, making it possible to work with high-dimensional vectors that would otherwise not fit into memory.
As the specific application determines whether computation or storage needs to be optimized, Vecworks requires the user to specify whether to use dense or sparse vectors. This is done using the DENSITY enumeration.
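The storage trade-off between the two representations can be seen in a minimal pure-Python sketch. It is only an illustration of the concept: the dictionary-based sparse format and the helper names `to_sparse()`, `to_dense()` and `sparse_dot()` are assumptions made for this example, and do not describe how Vecworks or its backends store vectors.

```python
# A 10-dimensional vector with only two non-zero elements.
dense = [0.0, 0.0, 3.5, 0.0, 0.0, 0.0, 0.0, 1.2, 0.0, 0.0]


def to_sparse(vector):
    """Keep only the non-zero elements as an {index: value} mapping."""
    return {i: v for i, v in enumerate(vector) if v != 0.0}


def to_dense(sparse, size):
    """Expand an {index: value} mapping back into a full-length list."""
    return [sparse.get(i, 0.0) for i in range(size)]


def sparse_dot(a, b):
    """Dot product that only visits indices stored in the smaller mapping."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[i] for i, value in a.items() if i in b)


sparse = to_sparse(dense)
print(sparse)                    # {2: 3.5, 7: 1.2}
print(len(dense), len(sparse))   # 10 2
```

The dense list holds all ten elements, whereas the sparse mapping holds two. The price is that every element access goes through a lookup instead of a direct offset into memory.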
Distance
Distance functions are used to calculate the (dis)similarity between two vectors, i.e. a vector passed by argument and an indexed vector. As each distance function has its own specific uses, Vecworks supports various functions using DISTANCES, namely:
Name | Workings | Use cases |
---|---|---|
cosine | Compares the orientation of two vectors by calculating the angle between two non-zero vectors. | |
 | Takes two equal-length vectors, multiplies their elements pairwise and sums up the results. The sum is multiplied by -1 so that a higher inner product, which indicates greater similarity, yields a smaller distance. | |
 | Takes two equal-length vectors, calculates pairwise the absolute difference between their elements and sums up the results to produce a single distance measure. | |
 | Takes two equal-length vectors, calculates pairwise the difference between their elements and squares the results before summing them up into a single distance measure. | |
 | Counts the number of positions at which two vectors' corresponding elements differ. | |
 | Considers vectors as sets, calculating the degree of overlap between both vectors/sets. | |
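The workings listed in the table can be written out as plain Python functions. This is a reference sketch of the textbook definitions, not the code Vecworks or its backends run. The function names are chosen for this example and are not the member names of DISTANCES. Two further assumptions are labelled in the comments: the Euclidean variant takes the square root of the summed squares, and the Jaccard variant treats each vector as the set of positions holding a non-zero element.

```python
import math


def cosine_distance(a, b):
    """1 - cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def negative_inner_product(a, b):
    """Pairwise products summed, negated so that more similar means smaller."""
    return -sum(x * y for x, y in zip(a, b))


def manhattan_distance(a, b):
    """Sum of pairwise absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Square root of the summed squared differences (assumption: root is taken)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a, b):
    """Number of positions at which the elements differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a, b):
    """1 - |intersection| / |union| over positions with non-zero elements."""
    set_a = {i for i, x in enumerate(a) if x}
    set_b = {i for i, y in enumerate(b) if y}
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


a = [1, 0, 1, 1]
b = [1, 1, 0, 1]
print(negative_inner_product(a, b))  # -2
print(manhattan_distance(a, b))      # 2
print(hamming_distance(a, b))        # 2
print(jaccard_distance(a, b))        # 0.5
```

All functions assume equal-length inputs. `zip()` silently truncates to the shorter vector, so a production implementation would validate lengths first.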
Vectorizers
A Vectorizer helps transform data of any kind to a vector representation. The specifics of this transformation depend on the algorithms and models applied. These algorithms and models may be accessed through various vectorizers shipped with Vecworks. Refer to the documentation of these vectorizers for the specifics.
Note that indices do not require a vectorizer to be specified. If no vectorizer is specified for an index, the retriever assumes that any data relayed to that index has already been vectorized (represented as a NumPy ndarray, a SciPy sparray or a sparse SparseArray). Accordingly, the following code is valid too:
Retriever(
    index = [
        vecworks.Index(
            name = "INDEX_NAME",
            bind = "ATTRIBUTE_NAME",
            distance = vecworks.DISTANCES.cosine,
            density = vecworks.DENSITY.dense
        )
    ],
    retrieve = {
        "METADATA": "metadata_alias"
    }
)
Vectorizers, in turn, may also be used standalone. All vectorizers are required to implement the transform() method. With this method you may transform any data directly, without first having to pass this data through a retriever or the pipework that embeds this retriever. This allows for the following use:
from vecworks.vectorizers.sbert import sbertVectorizer

vectorizer = sbertVectorizer.create_from_string("all-MiniLM-L6-v2")

print(
    vectorizer.transform("Vectors may seem complex, but can be very useful!")
)
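To show what standalone use of a transform() method amounts to without installing a model, the following sketch defines a toy vectorizer. `ToyHashingVectorizer` is a hypothetical class written for this example. It is not part of Vecworks, it does not subclass any Vecworks base class, and its transform() signature (one string in, one list of floats out) is an assumption rather than the documented interface.

```python
import zlib


class ToyHashingVectorizer:
    """Hypothetical stand-in that maps text to a fixed-size count vector."""

    def __init__(self, dimensions=8):
        self.dimensions = dimensions

    def transform(self, text):
        """Hash each whitespace-separated token into one of `dimensions` buckets."""
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            bucket = zlib.crc32(token.encode("utf-8")) % self.dimensions
            vector[bucket] += 1.0
        return vector


vectorizer = ToyHashingVectorizer(dimensions=8)
vector = vectorizer.transform("Vectors may seem complex, but can be very useful!")
print(vector)
```

The output is a dense vector of eight counts that add up to the number of tokens in the input. A model-backed vectorizer would instead return an embedding whose dimensions are determined by the model.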