Vectors and Cosine Similarity

Hello all,

I am trying to search some text that is stored in a list, with each sentence stored as a separate element. I want to search this list and return the best match for a search string, which is whatever the user types into a text box.

  • Is it possible to search for sentences based on cosine similarity and return only the sentence with the best cosine similarity match?

  • Is this package usable?

  • Can anyone show me a short example?

Thank you.

This is how I’d do it:

  1. Compute a score for each sentence.
  2. Sort the sentences based on the score.
  3. Pick the one(s) with the highest score.
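The three steps above can be sketched in a few lines. This is Python rather than Elm, purely to show the shape of the idea, and it is not the code from the Ellie link. The names `best_matches` and `substring_score` are made up for illustration, and the score (how often the query occurs in the sentence) is only a placeholder that you would swap for a better metric.

```python
def substring_score(query, sentence):
    # Placeholder score (an assumption): how many times the query
    # occurs in the sentence, ignoring case.
    return sentence.lower().count(query.lower())


def best_matches(sentences, query, score=substring_score):
    # Step 1: compute a score for each sentence.
    scored = [(score(query, s), s) for s in sentences]
    if not scored:
        return []
    # Step 2: sort by score, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Step 3: keep every sentence that ties for the highest score.
    top = scored[0][0]
    return [s for value, s in scored if value == top]


sentences = ["hello world", "our world is a world", "goodbye"]
print(best_matches(sentences, "world"))
```

Any function that takes the query and a sentence and returns a number can be passed as `score`, as long as a higher number means a better match.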

Example: https://ellie-app.com/92xVQCGMzJGa1

I don’t know anything about vectors and cosines, though.

To use cosine similarity, your sentences need to be embedded in a vector space. It doesn’t look like that’s the case for you.
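If you do want to try it, one simple way to get vectors is to count the words in each sentence and treat those counts as the coordinates. That choice is my assumption, not something a particular package prescribes, and the sketch below is in Python rather than Elm just to show the arithmetic. The function names are made up for illustration.

```python
import math
from collections import Counter


def to_vector(text):
    # Represent a text as word counts: one dimension per distinct word.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Dot product over the words the two vectors share.
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def best_match(sentences, query):
    # Assumes a non-empty list of sentences.
    query_vector = to_vector(query)
    return max(sentences, key=lambda s: cosine_similarity(query_vector, to_vector(s)))


sentences = ["hello there", "our world", "the whole wide world is big"]
print(best_match(sentences, "world"))
```

With word counts, shorter sentences that contain the query word score higher than longer ones, because the extra words increase the length of the vector without adding to the dot product.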

An alternative is to filter by Levenshtein distance, which is implemented as an Elm package, or to use the Jaro-Winkler function (which also has an Elm package).

You can use the above two packages to provide the score that @lydell talks about.

Thank you for the reply. Can you have a look at this Ellie and show me how to start?

What if I want to use cosine similarity? How do I embed these sentences into a vector space? Any starting point would be highly appreciated.

I’ve integrated Levenshtein here.

If your goal is to search a set of sentences by keyword, though, this may not be the best approach. For instance, for the query “world”, the first result doesn’t actually contain that word, because Levenshtein distance counts the number of single-character edits needed to turn one string into the other. So even if two strings share the same words, the edit distance can be very large because of the other words in the string.

The edit distance between “word” and “world” is much smaller than the distance between “world” and “our world”, even though “our world” should probably rank higher when the query is “world”.
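To make those numbers concrete, here is a textbook dynamic-programming version of Levenshtein distance. It is written in Python rather than Elm and is not the code of the Elm package, only an illustration of what the metric counts.

```python
def levenshtein(a, b):
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("word", "world"))       # 1: insert "l"
print(levenshtein("world", "our world"))  # 4: insert "our "
```

So with a lower distance meaning a better match, “word” beats “our world” for the query “world”, which is the opposite of what a keyword search should do.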

The field that studies the ranking of documents based on a query is Information Retrieval. A simple metric that is closer to what a search engine does is bag of words: it counts how many words of the query appear in each sentence and picks the sentence with the highest count. Maybe that’s fun to experiment with?
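A minimal sketch of that word-counting idea, again in Python rather than Elm. The names `bag_of_words_score` and `rank` are made up for illustration, and splitting on whitespace is a simplifying assumption: punctuation is not stripped, so “world.” and “world” would count as different words.

```python
def bag_of_words_score(query, sentence):
    # Count how many distinct query words appear in the sentence.
    sentence_words = set(sentence.lower().split())
    query_words = set(query.lower().split())
    return sum(1 for word in query_words if word in sentence_words)


def rank(sentences, query):
    # Highest score first; sentences with equal scores keep their original order.
    return sorted(sentences, key=lambda s: bag_of_words_score(query, s), reverse=True)


sentences = ["word", "our world", "hello"]
print(rank(sentences, "world"))
```

Here “our world” comes out on top for the query “world”, because whole words are compared instead of individual characters.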
