Similarity Search with Oracle’s Vector Datatype

Author: Bobby Curtis Published on: February 27, 2024 Published in: AI

In my last two posts I showed you what the Oracle Vector Datatype is and how to update existing data after adding a vector datatype column. In this post, we are going to quickly look at how to use those examples to implement a “similarity search” based on the vector stored in the vg_vec1 column of the table.

Assumptions

The prerequisites for this should be pretty straight forward, but I’ll make note of them here for reference.

Oracle Database 23.4 (limited availability at the moment)
python3.11 or later (what I’m using non-venv setup)
python_oracledb (2.0.0 or later (limited availability))
LLM API Key (Using Cohere)

With the prerequisites defined and ready to use, lets take a look at the code need implement a “similarity search” on top of the video game data stored in vector.video_games_vec.

If we do a quick query of vector.video_games_vec table, we see that vectors are in place for every row of the table.

SQL> select id, title, vg_vec1 from VECTOR.VIDEO_GAMES_VEC order by id;

With vectors in place, we now need to write some code that will compare the data in column vg_vec1 with a search term or question we are looking for.

Python

First we need to import the packages and define the LLM API key we need for python:

#! /usr/local/bin/python3
import oracledb
import cohere
import os
import sys
import array

#set Cohere API key
api_key = “eOubN0Yyj8ORoDk6MYjr"
co = cohere.Client(api_key)

As you can tell, we are importing the new oracledb package for python along with the package for cohere. The other packages are needed and are already installed with python. Within the same section, we are defining the API Key that is needed to run embeddings against Cohere.

Next, we are setting up a few definitions/functions for the process. I broke these into functions that partition what is happening with the script. There are two functions that we are going to define – database connection and vectorization.

Within the database connection function, we are simply making a call to connect to the database and return the connection for usage within the script. If connection cannot be made, then the connection is not returned.

#define database connection function

def database_connection():

    try:
        connection = oracledb.connect(
            user="<user>”,
            password="<password>”,
            dsn="100.166.255.254:1521/<service>”
            )
        print('connected’)
        return connection
    except:
        print('Could not make a connection’)

The next function is used to create embedding for the search term or question that we are looking for in the “similarity search”. Breaking this out into a function helps simplify the code and relegates the vectorization to a single step. Then the vector array of embeddings are retuned for further processing.

#define LLM embedding model

def cohere_vectorize(vInput):

    data =  vInput

    response = co.embed(
        texts=[data],
        model='embed-english-light-v3.0’,
        input_type=“search_query"
        )

    vector_value = response.embeddings[0]
    return vector_value

Lastly, we create a function called main(). This is just to compartmentalize main code into a single spot. Something, I like to do from time to time.

Within the main function is where all the work is done for the “similarity search”. We setup the connection to the database, the select statement that is going to be used, and what to do with the search.

def main():
    connection = database_connection()

    with connection.cursor() as select_cursor:
        select_stmt = """select title, genres, license
                        from vector.video_games_vec 
                        order by vector_distance(vg_vec1, :1, DOT) 
                        fetch first :2 rows only”""

        while True:

            prompt_input = input('Search Games or Game Question: ‘)
            if (prompt_input == "exit”):
                break

            if (prompt_input == "”):
                continue

            fetch_num = input("How many records to fetch: “)

            vec = cohere_vectorize(prompt_input)
            vec2 = array.array("d", vec)

            for(title, genres, license) in select_cursor.execute(select_stmt, [vec2, fetch_num]):
                print("\t" + title + "|" + genres + "|" + license)

A couple of things to understand about the code above. The select _stmt parameter defines the select statement that is used to query the database table and return a result set based on the vectors. Within this select statement, we are using vector_distance to compare two vectors – stored vector and new embedding vector for search. Then the algorithm used for the search – DOT. All this is done in the order by clause. Additionally, the query is limiting the results based on the first number or rows fetched.

To give a better view of this query, it looks like this:

select title, genres, license

from vector.video_games_vec

order by vector_distance(vg_vec1, :1, DOT)

fetch first :2 rows only

You will notice that we are using :1 and :2 within the SQL statement. These are used by the cursor.execute function in python to pass values.

Another thing to point out is that the returned embedding needs to be converted to a format that can be used by SQL. This is done by taking the embedding and converting it to a FLOAT64 value.

This is done with the line in the main function as:

vec2 = arrary.array(“d”, vec)

Seems trivial, but is needed for the search to work.

Lastly, we simply make a call to the main function for execution. Once execute, we can provide our search term or question related to a video game and get information back on the video game.

Note: One thing to note, the for loop in the main function that executes the select statement needs to be updated if more columns are added to the select statement.

Look at the below video, you will see that we are searching for various video games and looking to get the name of the video game, the genre it belongs in, and if it requires a license. The output is prefixed with a tab and broken up with pipe symbols.

See this simple vector search in action: here

When it comes down to the effectiveness of the search, it will be based on the quality of your data. If you have clean data, then your searches will be more efficient and yield better results.

Similarity Search with Oracle’s Vector Datatype

Assumptions

Python

Articles

Resources

Products

Company

Assumptions

Python

Related Posts

Articles

Resources

Products

Company