aiden-liu/rag-postgres

Why this repo

I've been building AI-related skills lately: playing around with AI platforms, LLMs, and learning LangChain. I want to build something usable, and RAG is a good starting point.

This post is a good walkthrough of how to do it.

Chunking

This notebook is awesome, and so is its visualisation of the different types of chunkers.
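As a minimal illustration of one chunking strategy, here is a fixed-size character chunker with overlap. The function name and the size/overlap values are illustrative, not from the repo:

```python
# A minimal fixed-size chunker with overlap, one of the simpler strategies
# among those the notebook above visualises. Sizes are illustrative.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 500, chunk_size=200, overlap=50)
```

Overlap keeps context that straddles a chunk boundary retrievable from both neighbouring chunks; fancier chunkers split on sentence or semantic boundaries instead of raw character counts.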

Embeddings

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness, and large distances suggest low relatedness.
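A toy sketch of "distance measures relatedness": given a query vector, the nearest stored vector is the most related. The 2-d vectors and labels below are made up for illustration; real embeddings like text-embedding-ada-002 have 1536 dimensions:

```python
import math

# Nearest-neighbour lookup over toy 2-d "embeddings". Real embeddings are
# produced by a model and have far more dimensions (1536 for ada-002).
def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

store = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [0.1, 0.9],
}
query = [0.88, 0.12]  # pretend embedding of "kitten"
nearest = min(store, key=lambda k: euclidean(store[k], query))
print(nearest)  # → "cat"
```

This is exactly what a vector store does at scale, with an index instead of a linear scan.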

In this demo, we use the Azure OpenAI embedding model text-embedding-ada-002.

Pgvector

For the tsvector column in the document_chunk table, which holds the parsed document text for full-text search, see the Postgres doc here.

On the same page, see also:

  • to_tsquery, for parsing queries;
  • ts_rank, for ranking search results;
  • ts_headline, for highlighting results.

An example can be found here.
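Putting the three functions together, here is a sketch of a full-text search query built as a SQL string. The table name document_chunk comes from this repo; the column names content and content_tsv are assumptions for illustration:

```python
# Sketch of a Postgres full-text search query combining to_tsquery,
# ts_rank and ts_headline. Column names content / content_tsv are assumed.
def build_fts_query(limit: int = 5) -> str:
    return f"""
    SELECT ts_headline('english', content, query) AS snippet,
           ts_rank(content_tsv, query) AS rank
    FROM document_chunk,
         to_tsquery('english', %s) AS query
    WHERE content_tsv @@ query
    ORDER BY rank DESC
    LIMIT {limit};
    """

sql = build_fts_query()
```

The %s placeholder would be filled by the database driver (e.g. psycopg) with the user's search terms.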

To calculate the distance or similarity between two vectors, pgvector supports the following operators:

  • [vector] <-> [vector]: L2 distance, or Euclidean distance
  • [vector] <+> [vector]: L1 distance, or Manhattan (taxicab) distance
  • [vector] <=> [vector]: cosine distance, equal to (1 - cosine similarity), where cosine similarity is the cosine of the angle between the two vectors.
  • [vector] <#> [vector]: negative inner product; the result is negated because Postgres only supports ASC-order index scans on operators.

For understanding vector distance and similarity, this blog is pretty neat.
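To make the four operators concrete, here is a plain-Python sketch of each over toy 2-d vectors (pgvector, of course, computes these in-database over indexed vector columns):

```python
import math

# Plain-Python equivalents of pgvector's four distance operators.
def l2(a, b):
    """<->: Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1(a, b):
    """<+>: Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    """<=>: cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def neg_inner(a, b):
    """<#>: negated inner product, so ASC index scans find the best match."""
    return -sum(x * y for x, y in zip(a, b))

a, b = [1.0, 2.0], [2.0, 0.0]
print(l2(a, b), l1(a, b), cosine(a, b), neg_inner(a, b))
```

Note the sign convention on <#>: the most similar vector has the most negative value, which is why ordering ascending still returns the best match first.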

Questions

  1. How to manage outdated documents?
  2. How to evaluate search results?
  3. How to tune the pipeline, e.g. the choice of embedding model or distance operator? What else can be tuned?

About

A RAG play with pgvector, Azure Document Intelligence and Azure OpenAI.
