Skip links

Similarity Search with Oracle’s Vector Datatype, Python, and Flask

 

If you have been following the last few blog posts, this is the final one with regards to setting up similarity search with Oracle’s upcoming vector datatype. This unique datatype enables you to be similarity search within your applications quickly and easily while keeping everything secure behind Oracle standard security of the Oracle database.

If you have not keep up with the last few posts, you can go back and review the other three parts. These posts were designed to provide you with the basics of using Oracle’s Vector datatype, updating existing tables, and using similarity search from the command line.

Part 1: https://rheodata.com/vector-datatype/
Part 2: https://rheodata.com/oracle-vector-datatype-updating-table-data/
Part 3:https://rheodata.com/similarity-search-oracle-vector-datatype/

In this post, we are going to look at extending Part 3 by creating a simple Flask Application to do the similarity search through a web page. To do this, there had to be a few minor changes to the previously illustrated Python code. Let’s dive in and see how a similarity search can be done via a web page.

Prerequisites:

Like Part 3, the prerequisites are with additional added for Flask:

  • Oracle Database 23.4 (limited availability) 
  • Python 3.11 or later 
  • python_oracledb (2.0.0 or later (limited availability) 
  • LLM API Key (Cohere) 
  • Flask 3.0.2 
  • Werkzeug 3.0.1 

With the prerequisites set, we can now start looking at the code that will define the following application (see image below).

 

Python/HTML

This time around we are going to look at two different files – HTML and Python. This is what makes up the Flask Application we are using for this simple similarity search. The underlying table being used is the same as Part 3 – vector.video_games_vec.

Before we jump into the Python code, we need to define a template for the HTML page (index.html). This is the main page of the application.

<html>
  <head>
  <meta charset="UTF-8”>
  <meta name="viewport" content="width=device-width, initial-scale=1.0”>
  <style>
      body {
          font-family: 'Arial', sans-serif;
          background-color: #f4f4f4;
          margin: 0;
          padding: 0;
          box-sizing: border-box;
      }

      h1 {
          text-align: center;
          color: #333;
        }
      form {
          max-width: 450px;
          margin: 20px auto;
          background-color: #1C75BC;
          padding: 20px;
          border-radius: 8px;
          box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
        }
      label {
          display: block;
          margin-bottom: 10px;
          font-weight: bold;
      }

      input {
          width: 100%;
          padding: 10px;
          margin-bottom: 15px;
          border: 1px solid #ccc;
          border-radius: 4px;
          box-sizing: border-box;
        }
      button {
          background-color: #fff;
          color: #1C75BC;
          padding: 10px 15px;
          border: none;
          border-radius: 4px;
          cursor: pointer;
        }
      button:hover {
          background-color: #1C75BC;
        }
      h2 {
          margin-top: 20px;
          color: #333;
        }
        p {           color: #555;
        }

      /* New style for the output boundary */
      .output-container {
          max-width: 600px;
          margin: 20px auto;
          background-color: #fff;
          padding: 20px;
          border-radius: 8px;
          box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
      }
  </style>
  </head>
  <body>
      <br>
      <h1>Similarity Search</h1>
      <form method="post" action="{{ url_for('index') }}”>
          {{ form.hidden_tag() }}
          <label for="text”>
              <h3>Video Game Search:</h3>
          </label>
          {{ form.text(size=40) }}
          {{ form.submit() }}
        </form>
      {% if output %}
          <div class="output-container”>
              <h2>Video Games:</h2>
              {% for i in output %}
              <p>{{ i }}</p>
              {% endfor %}
          </div>
      {% endif %}
  </body>
</html>

Besides the CSS information, the key items to review ar the items in curly brackets ({{ }}). This is how Flask setups and uses items returned from the Python code. If you would like more on Flask and how it works with HTML – check out this page: https://flask.palletsprojects.com/en/3.0.x/

Now for Python … Finally!

The python code in this example is similar to the one in Part 3; however, it has been broken down into a few more functions to make it easier to use with Flask.

The first thing that needs to be done is import all the required packages for the application to work:

import oracledb
import cohere
import array
import time
import secrets
from flask import Flask, render_template, request, redirect, url_for
from flask_wtf import FlaskForm
from wtforms import StringField, SubmitField
from wtforms.validators import DataRequired

What you will notice here is the import of Flask and Werkzeug related items. There a good bit of items that needed to be imported but makes the application easier to develop.

To keep items simple, we defined the database connection into its own function. This allows us to call for the connection and get the connection in return. This is a simplified function, but keep in mind that this will only work against an Oracle Database 23c (23.4 – Limited Availability).

#define database connection function
def database_connection():
  try:
      connection = oracledb.connect(
          user="vector”,
          password=“<pwd>”,
          dsn=“xxx.xxx.xxx.xxx:1521/freepdb1”
          )
      print('connected’)
      return connection
  except:
        print('Could not make a connection’)

Next, we are going to define how to vectorize the video game title we are going to search for. Breaking this out into separate function allows us to call and return the vector value at anytime within the application.

#define LLM embedding model
def cohere_vectorize(vInput):
  co = cohere.Client("0Yyj8ORoDk6MYjSb8”)
  data = vInput
  response = co.embed(
    texts=[data],
      model='embed-english-light-v3.0’,
      input_type=“search_query"
      )
  vector_value = response.embeddings[0]
    return vector_value

The variable “vInput” is the title that we want to vectorize for our search.

The last function that we are going to define is a function to vectorize the video game title then turns it into a FLOAT64 vector that can be used with SQL for searching the database. Then return the vector string.

def exec_vec(text):
      vec = cohere_vectorize(text)
      vec2 = array.array("d", vec)
        return vec2
Application

With the functions we needed defined, we can now setup the application to perform the search of the vector.video_games_vec table.

First thing we need to do, is define the application. This done with the following statements:

app = Flask(__name__)

app.secret_key = secrets.token_hex(16)

Then we need to define a class for the form itself:

class Form(FlaskForm):

  text = StringField('Video Game Search: ', validators=[DataRequired()])
    submit = SubmitField('Submit’)

Next, tell the application how to route to the page and what cURL functions to use:

@app.route('/', methods=('GET', 'POST’))

Lastly, we need to define a function for the index.html page. This is simply called index(). This function sets up the following:

  1. calls the form class 
  2. variables/lists needed 
  3. SQL statement to use 
  4. opens the database connection 
  5. validate the form 
  6. retrieve the required information 
  7. closes the database connection 

The index() function looks as follows:

def index():

  form = Form()
  output_titles = []
  output_ids = []
  output_genres = []
  output_console = []
    binds = []
  select_stmt = """select id, title, genres, console
                  from vector.video_games_vec
                  order by vector_distance(vg_vec1, :1, DOT), id
                  fetch first 5 rows only”""

    connection = database_connection()
  if form.validate_on_submit():
      title = exec_vec(form.text.data)

      with connection.cursor() as cursor:
          for (id, title, genere, console,) in cursor.execute(select_stmt, [title]):
              output_ids.append(id)
              output_titles.append(title)
              output_genres.append(genere)
              output_console.append(console)
              #print(output_titles)
              #put all columns in a single list
      binds = list(zip(output_ids, output_titles, output_genres, output_console))
        #print(binds)
      return render_template('index.html', form=form, output=binds)
    connection.close()
    return render_template('index.html', form=form, output=None)
if __name__ == "__main__”:
    app.run(debug=True)

 

A couple of key items to point out in this function.

The first is the SQL statement. The select statement defines what we are looking for in the vector.video_games_vec table. In this case we are looking for ID, TITLE, GENRES, and CONSOLE. Then we are looking and ordering by the distance between each title by using the VECTOR_DATABASE function using DOT notation, then ordering by ID. Lastly, we are only fetching the first five rows only.

When this application executes this SQL statement against the vector.video_games_vec table, we will be using TITLE to find the video game in the table. All rows that are returned are then broken into four different lists. Then the lists are zipped together to give us all the information for the record via the binds list.

Lastly, we are telling the application to return the binds list to the output area on the index.html page before closing the connect to the database.

You can use this link to see similarity search in action:

Video Game Similarity Search

 

Other items to understand

You may have noticed that some video games returned more titles than expected. This is because of:

  • We are limiting the result set to the first five rows. This is normal behavior when the result set doesn’t have exactly five of the title.   
  • The proximity of the additional titles compared to the title being searched for