Basics#

Vecworks builds on top of Pypeworks, adding specialised nodes to easily interface with vector stores. These nodes employ various constructs to specify how these vector stores should be queried.

Retrievers#

A Retriever is a specialised Pypeworks node designed to query vector stores. A retriever takes the inputs passed to it as arguments and vectorizes them, so that they can be compared with the vectors held in the vector store. For any matches this produces, the retriever provides access to the associated metadata.

Regardless of the backend, all retrievers need to define the indices to query, the vectorizers to employ, and the metadata to retrieve:

retriever = Retriever(

    # Indices to query
    index = [

        vecworks.Index(

            # Index to access, and the argument to connect to it
            name         = "INDEX_NAME",
            bind         = "ATTRIBUTE_NAME",

            # How to match data with the index
            distance     = vecworks.DISTANCES.cosine,

            vectorizer   = RemoteVectorizer(
                url        = "http://localhost:5000/",
                vectorizer = "vectorizer-name"
            ),

            density      = vecworks.DENSITY.dense

        )

    ],

    # Metadata to retrieve
    retrieve = {
        "METADATA": "metadata_alias"
    }

)

In practice, specific retrievers may need additional arguments to connect to a vector store. Refer to the documentation of the retriever you wish to employ for full details.

Note that despite being designed to function as Pypeworks nodes, retrievers may also be used standalone. To invoke a retriever, use the exec() method as follows:

retriever.exec("query: What is a vector?")
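The flow described at the start of this section (vectorize the input, compare it with the stored vectors, return the metadata of the matches) can be illustrated without any vector store at all. The following is a minimal sketch in plain Python, not Vecworks code: STORE, toy_vectorize and retrieve are hypothetical names invented for this illustration, and the keyword-counting "vectorizer" is only a stand-in for a real model.

```python
import math
import re

# Hypothetical in-memory "vector store": each entry pairs a vector with metadata.
STORE = [
    {"vector": [1.0, 0.0, 0.0], "metadata": {"title": "Intro to vectors"}},
    {"vector": [0.0, 1.0, 0.0], "metadata": {"title": "Cooking basics"}},
    {"vector": [0.9, 0.1, 0.0], "metadata": {"title": "Linear algebra notes"}},
]

def toy_vectorize(text):
    """Stand-in vectorizer: counts three hand-picked keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(k)) for k in ("vector", "recipe", "matrix")]

def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)

def retrieve(query, top_k=2):
    """Vectorize the query, rank stored vectors by distance, return metadata."""
    q = toy_vectorize(query)
    ranked = sorted(STORE, key=lambda entry: cosine_distance(q, entry["vector"]))
    return [entry["metadata"] for entry in ranked[:top_k]]

print([m["title"] for m in retrieve("query: What is a vector?")])
# ['Intro to vectors', 'Linear algebra notes']
```

A real vector store replaces the linear scan in retrieve with an index built for efficient similarity matching, which is why retrievers need to know what indices to query.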

Indices#

Vector stores retain their vector data using indices. These data structures are specifically designed to allow for efficient similarity matching during lookups. Software interfacing with vector stores must therefore specify what indices to access.

In Vecworks access to indices is controlled through Index objects. These objects provide a generalized interface detailing what index to access (using name), how to connect arguments to the index (using bind), and how to match data with the index (using vectorizer, distance and density).

Density#

Vectors may be represented either densely or sparsely. A dense vector stores every element in a contiguous block of memory, allowing calculations involving entire vectors to be processed more efficiently. A sparse vector, on the other hand, stores only its non-zero elements. It therefore requires less storage than a dense vector, making it possible to work with high-dimensional vectors that would otherwise not fit into memory.

As the specific application determines whether computation or storage needs to be optimized, Vecworks requires the user to specify whether to use dense or sparse vectors. This is done using the DENSITY enumeration.
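To make the trade-off concrete, the sketch below contrasts the two representations in plain Python. It is an illustration only and says nothing about how Vecworks or any vector store lays out vectors internally: the dense vector is modelled as a list, the sparse vector as a dict mapping positions to non-zero values, and to_sparse, to_dense and sparse_dot are hypothetical helper names.

```python
def to_sparse(dense):
    """Keep only the non-zero elements, keyed by their position."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def to_dense(sparse, size):
    """Expand a sparse mapping back into a full-length list."""
    out = [0.0] * size
    for i, v in sparse.items():
        out[i] = v
    return out

def sparse_dot(a, b):
    """Inner product that only visits positions stored in the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

dense = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.5, 0.0]
sparse = to_sparse(dense)

print(sparse)                   # {2: 3.0, 6: 1.5}
print(len(dense), len(sparse))  # 8 stored elements versus 2
```

The dense list holds every element, including the zeros, while the sparse mapping grows only with the number of non-zero elements, at the cost of an extra lookup per element during calculations.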

Distance#

Distance functions are used to calculate the (dis)similarity between two vectors, i.e. a vector passed by argument and an indexed vector. As each distance function has its own specific uses, Vecworks supports several functions through DISTANCES, namely:

| Name | Workings | Use cases |
|---|---|---|
| cosine | Compares the orientation of two non-zero vectors by calculating the angle between them. | Semantic text matching; document classification |
| nip (negative inner product) | Takes two equal-length vectors, multiplying both vectors' elements pairwise and summing up the results. The sum is multiplied by -1 so that a higher inner product, i.e. greater similarity, yields a smaller distance. | Structural text similarity; ranking and recommendations |
| l1 (Manhattan distance) | Takes two equal-length vectors, calculating pairwise the absolute difference between both vectors' elements, summing up the results to produce a single distance measure. | Image matching; histogram matching |
| l2 (Euclidean distance) | Takes two equal-length vectors, calculating pairwise the difference between both vectors' elements, squaring the results before summing them up and taking the square root to produce a single distance measure. | Nearest neighbor search |
| hamming | Counts the number of positions at which two vectors' corresponding elements are different. | Error detection; fuzzy matching |
| jaccard | Considers vectors as sets, calculating the degree of overlap between both vectors/sets. | Network analysis; keyword search |
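The distance functions listed above can be written out in a few lines each. The sketch below uses the textbook definitions in plain Python; it is not Vecworks code, and a given vector store may compute these measures differently in detail. The function names mirror the names in the table for readability only. Two choices are assumptions of this sketch: cosine is expressed as one minus the cosine similarity, and jaccard treats the positions of non-zero elements as the members of each set.

```python
import math

def cosine(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nip(a, b):
    """Negative inner product: a higher inner product gives a lower distance."""
    return -sum(x * y for x, y in zip(a, b))

def l1(a, b):
    """Manhattan distance: sum of absolute pairwise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Number of positions at which the elements differ."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """1 - |A ∩ B| / |A ∪ B|, with sets built from non-zero positions."""
    set_a = {i for i, x in enumerate(a) if x != 0}
    set_b = {i for i, y in enumerate(b) if y != 0}
    union = set_a | set_b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)

a = [1, 0, 1, 1]
b = [1, 1, 0, 1]

for fn in (cosine, nip, l1, l2, hamming, jaccard):
    print(fn.__name__, fn(a, b))
```

Running the same pair of vectors through all six functions shows how differently they weigh the same disagreement, which is why the choice of distance depends on the application.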

Vectorizers#

A Vectorizer transforms data of any kind into a vector representation. The specifics of this transformation depend on the algorithms and models applied, which may be accessed through the various vectorizers shipped with Vecworks. Refer to the documentation of these vectorizers for details.

Note that indices do not require a vectorizer to be specified. If an index has no vectorizer, the retriever assumes that any data relayed to that index has already been vectorized, i.e. that it is represented by a NumPy ndarray, a SciPy sparray, or a SparseArray from the sparse library. Accordingly, the following code is valid too:

Retriever(

    index = [

        vecworks.Index(

            name         = "INDEX_NAME",
            bind         = "ATTRIBUTE_NAME",

            distance     = vecworks.DISTANCES.cosine,
            density      = vecworks.DENSITY.dense

        )

    ],

    retrieve = {
        "METADATA": "metadata_alias"
    }

)

On the other hand, vectorizers may also be used standalone. All vectorizers are required to implement the transform() method. With this method you may transform data directly, without first having to pass it through a retriever or the pipework that embeds this retriever. This allows for the following use:

from vecworks.vectorizers.sbert import sbertVectorizer

vectorizer = sbertVectorizer.create_from_string("all-MiniLM-L6-v2")

print(
    vectorizer.transform("Vectors may seem complex, but can be very useful!")
)
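To illustrate what an object with a transform() method boils down to, here is a toy vectorizer in plain Python. It is a hypothetical sketch and not part of Vecworks: the class name ToyHashingVectorizer, its constructor, and the assumption that transform() takes a single string and returns a list of floats are all invented for this example, so the signatures of the actual vectorizers may differ. It uses the hashing trick, in which each token is hashed into one of a fixed number of buckets.

```python
import hashlib

class ToyHashingVectorizer:
    """Hypothetical toy vectorizer exposing a transform() method."""

    def __init__(self, dimensions=8):
        self.dimensions = dimensions

    def transform(self, text):
        """Map text to a fixed-length vector of token counts per bucket."""
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            # A stable hash keeps the mapping identical across runs.
            digest = hashlib.md5(token.encode("utf-8")).digest()
            bucket = digest[0] % self.dimensions
            vector[bucket] += 1.0
        return vector

vectorizer = ToyHashingVectorizer(dimensions=8)
vector = vectorizer.transform("vectors are useful vectors")

print(len(vector))  # 8
print(sum(vector))  # 4.0, one count per token
```

Unlike a model-based vectorizer, this toy captures no meaning: two texts end up close together only if they share tokens or happen to collide in the same bucket.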