@PouyaMohseni
Contributor

@PouyaMohseni PouyaMohseni commented Nov 13, 2025

  • Issue #407 (feature requests: ranking, fuzzy matching) discussed the ranking and fuzzy matching of search results. The system previously used only n-gram matching to compare the query with language labels.
  • This PR adds exact matching which, in contrast to n-gram matching, only matches when the query equals the whole term.
  • In the final results, exact matches are weighted twice as strongly as n-gram matches.

@dchiller
Contributor

Can you explain how in the PR description? The solution discussed in the issue doesn't seem to be the solution you went with here...that's fine! But you should briefly discuss the steps here.

Contributor

@dchiller dchiller left a comment

Can you say more about why we are adding this? What problem does this solve?

I see in the linked issue that both fuzzy matching and ranking are discussed. It looks to me like this is more related to the ranking than the fuzzy matching. Is that right?

What is the type of ranking that we are hoping to see? And how does this approach get us towards that ranking?

For example, I see in the issue that we want a search for "tar" to rank the instrument named "tar" above the ones named "guitar". I imagine your solution will achieve this result. But what if I search "ta"? Do I still want "tar" to come up before "guitar" in the results? Will this achieve that? Would something like Solr's Edge N-gram Tokenizer be more what we want?

I'm not sure that that's exactly what we want, but it certainly seems to me that a case where "tar" returns "tar" over "guitar", but "ta" doesn't, isn't necessarily what we want.
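
For reference, an edge n-gram field of the kind asked about above might look like the following. This is only a sketch: the fieldType name `text_edge_ngram` and the gram sizes are assumptions, not part of this PR. Because grams are generated only from the start of each token, "ta" would match "tar" but not "guitar".

```xml
<!-- Hypothetical fieldType (name and gram sizes are assumptions, not from this PR). -->
<fieldType name="text_edge_ngram" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: emit prefixes of each token, e.g. "tar" -> "ta", "tar" -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <!-- Query time: no n-grams, so the query is compared against the indexed prefixes as typed -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off is that infix matches (finding "guitar" from "tar") would no longer come from this field and would have to be supplied by a separate n-gram field if they are still wanted.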

<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

<!-- exact match field -->
<field name="text_exact" type="text_vector" indexed="true" stored="true" multiValued="true"/>
Contributor

Does this field really need to be stored? Are we ever going to use its value? Is it really a multiValued field?

Contributor Author

I think both text (text_ngram) and text_exact should not be stored but are multiValued.

Contributor

> I think both text (text_ngram) and text_exact should not be stored but are multiValued.

Agreed about not storing them. Not sure about multiValued though... what are the multiple values?

As far as I can see, both are created by copying a number of other fields into this field and indexing the result...

Member

I believe text fields, particularly those that are used as targets for copyfields, should generally be multivalued fields. As fields are copied into the text field I believe they are kept as a distinct value in that field, and not simply appended as one big string.

See: https://solr.apache.org/guide/solr/latest/indexing-guide/copy-fields.html

"In the example above, if the text destination field has data of its own in the input documents, the contents of the cat field will be added as additional values – just as if all of the values had originally been specified by the client. Remember to configure your fields as multivalued="true" if they will ultimately get multiple values (either from a multivalued source or from multiple copyField directives)."
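
As a rough illustration of the point above, the two destination fields could be declared as unstored, multiValued copyField targets. This is a sketch under assumptions: the source field names `label_ss` and `alias_ss` and the fieldType names are hypothetical, and only `wikidata_id_s`, `text_ngram`, and `text_exact` come from this PR.

```xml
<!-- Destination fields: indexed for search, not stored, and multiValued because
     each copyField directive contributes its own value(s). -->
<field name="text_ngram" type="text_ngram" indexed="true" stored="false" multiValued="true"/>
<field name="text_exact" type="text_exact" indexed="true" stored="false" multiValued="true"/>

<!-- Hypothetical source fields (label_ss, alias_ss are assumptions). -->
<copyField source="label_ss" dest="text_ngram"/>
<copyField source="alias_ss" dest="text_ngram"/>
<copyField source="label_ss" dest="text_exact"/>
<copyField source="alias_ss" dest="text_exact"/>

<!-- wikidata_id_s is copied to the exact-match field, per the commit notes in this PR. -->
<copyField source="wikidata_id_s" dest="text_exact"/>
```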

@PouyaMohseni PouyaMohseni marked this pull request as draft December 2, 2025 17:34
- remove unused fieldType
- add text_exact to /select and weight it relative to text_ngram
- moved wikidata_id_s to text_exact from text_ngram
@PouyaMohseni
Contributor Author

Here, exact_match gives higher weight to queries that match the labels or aliases exactly, without changing the overall matching and ranking performed by the n-gram field. For example, a search for ta still ranks guitar higher, as before.
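
One way the weighting described here could be expressed is with per-field boosts in the `/select` handler defaults. This is a minimal sketch, assuming the eDisMax query parser is used; the PR does not state which parser or parameters were actually chosen.

```xml
<!-- Sketch only: assumes eDisMax; the real handler configuration may differ. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Exact matches are boosted 2x relative to n-gram matches -->
    <str name="qf">text_exact^2 text_ngram</str>
  </lst>
</requestHandler>
```

With this arrangement a query still matches any document that the n-gram field matches; the exact-match field only raises the score of documents whose label or alias equals the query.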

@PouyaMohseni PouyaMohseni marked this pull request as ready for review December 4, 2025 17:52
@PouyaMohseni PouyaMohseni requested a review from dchiller December 9, 2025 22:43
<fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">

<!-- Exact match text field with term vectors -->
<fieldType name="text_vector" class="solr.TextField" positionIncrementGap="100" termVectors="true">
Contributor

What's the purpose of termVectors = true here? What is our current use-case for them?

Contributor Author

Thank you. I removed that. It was originally used for experimenting with highlighting matched parts, and I forgot to clear it.
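
With `termVectors` removed, an exact-match fieldType could be as small as the sketch below. The analyzer chain is an assumption (the PR's actual analyzer is not shown in this conversation), and the name `text_exact` is hypothetical: a keyword tokenizer keeps each copied value as a single token, and lowercasing makes the comparison case-insensitive.

```xml
<!-- Hypothetical exact-match fieldType; analyzer chain is an assumption. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Treat the whole field value as one token, so only full-value matches hit -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```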

@PouyaMohseni PouyaMohseni marked this pull request as draft December 17, 2025 17:11
@PouyaMohseni PouyaMohseni marked this pull request as ready for review December 17, 2025 19:36
@kyrieb-ekat
Contributor

Quick note about the failed E2E test: it looks like Google Translate isn't able to find the element .nav-link:has-text("À propos"), resulting in timeouts when checking its visibility. Is there an actual navigation link for "À propos" when the site language is French? Is it an encoding thing with the language switch, with the accent present?

@yinanazhou
Member

I've updated the E2E test in another PR. Should not be a problem after the changes get merged.
