feat(indexers): Added alternate scoring metrics #112
rileyok-ons wants to merge 13 commits into main
Conversation
…oned in docstring
Reviewed code (diff context):
`return result_df`
`def search(self, query: VectorStoreSearchInput, n_results=10, batch_size=8) -> VectorStoreSearchOutput:`
`def _check_norm_vdb(self):`
I like this functionality a lot, but I think it should be the vectoriser's job to output embeddings in the desired form, not the vector store changing them after the fact.
My preference would be to update the Vectorisers' .transform() methods to take an optional (default False) normalise argument, which applies this normalisation if set to True.
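For illustration, a minimal sketch of what that option might look like; the class, its `_embed()` helper, and the exact signature are hypothetical stand-ins rather than the package's actual Vectoriser API:

```python
import numpy as np

class DummyVectoriser:
    """Illustrative stand-in; the real Vectorisers embed text with GCP / Hugging Face models."""

    def _embed(self, texts: list[str]) -> np.ndarray:
        # placeholder embeddings: random vectors, one row per text
        rng = np.random.default_rng(seed=0)
        return rng.normal(size=(len(texts), 8))

    def transform(self, texts: list[str], normalise: bool = False) -> np.ndarray:
        embeddings = self._embed(texts)
        if normalise:
            # L2-normalise each row so a plain dot product behaves like cosine similarity
            norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
            embeddings = embeddings / np.clip(norms, 1e-12, None)
        return embeddings
```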
I disagree slightly here.
An informed user, who knows their embedding model already outputs normalised embeddings, should then be able to just use the dot-product metric, which would give them the effect of cosine similarity without having to do the extra norm checks and steps they would need if they set a cosine metric.
Also, I think it's a good idea to keep the vectorisers pure and not overcomplicate the argument logic, whereas the vectorstore, which is responsible for housing, reloading and metric calculations of the vectors, probably should be keeping a note on whether the vectors are normalised or not.
> An informed user, who knows their embedding model already outputs normalised embeddings, should then be able to just use the dot-product metric, which would give them the effect of cosine similarity without having to do the extra norm checks and steps they would need if they set a cosine metric.

I'm not sure I follow what you mean; if a user knows their embedding model already outputs normalised embeddings, they could just not set the normalise flag when creating the Vectoriser.

> Also, I think it's a good idea to keep the vectorisers pure and not overcomplicate the argument logic

This is an operation that happens directly on the vectors, a step before any use in a vector store or scoring. I think it fits in well with the task of the Vectoriser, and avoids the other issues you discussed, such as any need to duplicate vectors in the vector store and set/read metadata flags about whether the vector store is normalised.

Let's talk about it in our call later 👍
frayle-ons
left a comment
Looks really good so far! Very elegant logic that could be extended to many metrics or eventually replaced with something like FAISS.
Just one thing about how we record and reload the vectorstore, and how we track whether it's already normalised or not.
Reviewed code (diff context):
`vector_store.vectoriser_class = metadata["vectoriser_class"]`
`vector_store.hooks = {}`
`vector_store._check_norm_vdb()`
I'd favour a 'normalise once' approach:
- when the VDB is being constructed by _create_vector_store_index(), it checks if the user specified a metric that requires normalised vectors, normalises the created collection, and then saves the vectors to the polars df/parquet file.
- Then we'd record the 'metric' used in the metadata file.
- when the parquet is loaded back in with from_filespace(), we already know to use the appropriate metric as it's stored in the metadata file, and there's no need to redo the normalisation.
So I'd also take the 'metric_setting' parameter out of the class method from_filespace() and rely just on the metadata file.
This would mean fewer operations every time we load the vectorstore in after initial creation, potentially at the cost of losing the magnitude information and not being able to get it back without running the build step again with a different metric.
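A rough sketch of that flow, with plain JSON/NumPy files standing in for the actual parquet/metadata layout and `load_index` standing in for what from_filespace() would do; names and structure are assumptions for illustration only:

```python
import json
from pathlib import Path

import numpy as np

def build_index(vectors: np.ndarray, out_dir: Path, metric: str = "L2") -> None:
    """Normalise once at build time if the chosen metric requires it, then persist the metric."""
    if metric == "cosine":
        vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        metric = "IP"  # cosine on unit-normalised vectors is just an inner product
    np.save(out_dir / "vectors.npy", vectors)
    (out_dir / "metadata.json").write_text(json.dumps({"metric": metric}))

def load_index(out_dir: Path) -> tuple[np.ndarray, str]:
    """Reload vectors and read the metric from metadata, so no metric_setting argument is needed."""
    metadata = json.loads((out_dir / "metadata.json").read_text())
    return np.load(out_dir / "vectors.npy"), metadata["metric"]
```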
Resolved by adding a `normalize` meta field; if cosine is chosen with un-normalised vectors, we will normalise but warn the user.
frayle-ons
left a comment
Propose we rework to have only 2 metrics: inner product and L2 distance.
Inner product is the more general class of operation: it would imply cosine similarity is happening when IP is performed on normalised vectors, and that a plain dot product is happening when the vectors aren't normalised.
Similar case for L2.
So we would have just 2 metrics for now, ['IP', 'L2']. And then we should just not touch the vectorisers module at all for now, and let users make their own custom vectoriser if they really want normalisation rather than just using the base GCP or Huggingface models etc.
This would be more in line with FAISS's semantic search process of being agnostic about Vectoriser behaviour (it only assumes you're correctly using the same vectoriser), and may help us integrate that library down the road:
https://www.pinecone.io/learn/series/faiss/composite-indexes/
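A self-contained check of the equivalence being relied on here (illustrative only): inner product on unit-normalised vectors reproduces cosine similarity, while on raw vectors it is just the dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# 'IP' on unit-normalised vectors equals cosine similarity
a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(cosine, a_hat @ b_hat)

# 'IP' on the raw vectors is just the plain dot product
dot = a @ b
```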
What if we scrapped all 6 of these and just had ['IP', 'L2']?
I think L2 squared and IP squared should be a downstream postprocessing hook, as it's just a common scoring operation.
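A sketch of what such a downstream hook could look like; the score column name and the hook signature are assumptions rather than the package's actual API:

```python
import polars as pl

def square_scores(result_df: pl.DataFrame) -> pl.DataFrame:
    # hypothetical postprocessing hook: square the score column (e.g. turning L2 into L2^2)
    return result_df.with_columns(pl.col("score") ** 2)
```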
Seen this suggested previously; if we want this we can sort it.
I'm happy with that plan 👍
I'd like it if we added an example to one of the notebooks showing a way of wrapping one of the Vectorisers to add normalisation though, to tide users over until we properly offer normalisation as an option.
I can add that to this PR tomorrow, if nobody objects.
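Something along these lines, perhaps; `NormalisedVectoriser` and the wrapped `.transform()` signature are illustrative, not existing classes in the package:

```python
import numpy as np

class NormalisedVectoriser:
    """Wraps any existing vectoriser and L2-normalises whatever .transform() returns."""

    def __init__(self, base_vectoriser):
        self.base_vectoriser = base_vectoriser

    def transform(self, texts):
        embeddings = np.asarray(self.base_vectoriser.transform(texts), dtype=float)
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        return embeddings / np.clip(norms, 1e-12, None)

# usage (hypothetical): vectoriser = NormalisedVectoriser(HuggingFaceVectoriser(...))
```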
> What if we scrapped all 6 of these and just had ['IP', 'L2']?

Would you be okay with renaming 'IP' -> 'dot' for this? I think 'dot' would be more easily understood by users via docstrings, without needing to explore documentation etc. to find out / confirm that IP = inner product.
I would suggest we just leave it: if users really, really want it they can make their own custom vectoriser that wraps the Hugging Face vectoriser. But if you really wanted to, you could update the custom_vectoriser demo notebook to have a section on this, showing how to do it with the Hugging Face class?
We do already have one user group requesting this functionality (and currently using a custom wrapped HF Vectoriser to achieve it), so I think it is worth adding to the docs.
So... they did use a custom vectoriser? 😀
Not sure what the correct answer is. We definitely don't want to be adding a variant of every Vectoriser called VectoriserX_normalised, or a wrapper for each class. Maybe one utility wrapper that wraps around all our Vectoriser implementations... but what are the benefits/trade-offs of that new class, which we'd have to document further and keep compatible forever, versus guiding users in how to do it with our existing custom vectoriser / base class architecture?
Yes, one that I made for them as a one-off solution, since the package doesn't yet offer this. I'm saying it would be useful to have that knowledge made accessible in the documentation for other users.
frayle-ons
left a comment
This is looking really good! Would just request that we remove any logic or attributes for normalisation completely for now.
Leave it completely agnostic, so that whatever vectors the user's vectoriser produces determine whether cosine or dot product is performed.
In all cases use L2^2 = ||A||^2 + ||B||^2 - 2(A·B)
instead of L2^2 = 1 + 1 - 2(A·B)
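A quick numeric check of why the general form matters when the vectors aren't unit-normalised (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=8), rng.normal(size=8)  # not unit vectors

# general form: valid whether or not the vectors are normalised
l2_squared = a @ a + b @ b - 2 * (a @ b)
assert np.isclose(l2_squared, np.sum((a - b) ** 2))

# the shortcut 1 + 1 - 2(A·B) only matches when A and B are unit-normalised
shortcut = 1 + 1 - 2 * (a @ b)
print(np.isclose(l2_squared, shortcut))  # almost certainly False for random vectors
```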
✨ Summary
Users may have use cases where they want to use different search metrics, e.g. dot product or cosine (normalised dot product). To support this we have added a scoring-method attribute to the vectorstore that is accessed in search to calculate the desired metric.
The scoring logic has been abstracted out to a score method that returns scored output for a given query using the vectorstore's chosen metric. We have created a type alias of a Literal with all metrics for type hints and checking.
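For example, the metric type alias and scoring dispatch might look roughly like this; the metric names follow the review discussion ('IP', 'L2') and the exact signatures are assumptions:

```python
from typing import Literal

import numpy as np

Metric = Literal["IP", "L2"]  # type alias used for type hints and validation

def score(query_vector: np.ndarray, stored_vectors: np.ndarray, metric: Metric) -> np.ndarray:
    """Score every stored vector against the query using the vectorstore's chosen metric."""
    if metric == "IP":
        return stored_vectors @ query_vector                          # higher is more similar
    return np.linalg.norm(stored_vectors - query_vector, axis=1)      # L2: lower is closer
```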
📜 Changes Introduced
✅ Checklist
- (`terraform fmt` & `terraform validate`)

🔍 How to Test
Run with different metrics to see the difference
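For a quick stand-alone comparison (the real test would go through the vectorstore's search(); this sketch only shows that the two metrics can rank the same candidates differently):

```python
import numpy as np

rng = np.random.default_rng(2)
stored = rng.normal(size=(5, 8))   # pretend these are indexed embeddings
query = rng.normal(size=8)

ip_scores = stored @ query                          # inner product: higher is better
l2_scores = np.linalg.norm(stored - query, axis=1)  # L2 distance: lower is better

print("IP ranking:", np.argsort(-ip_scores))
print("L2 ranking:", np.argsort(l2_scores))
```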