Vectors and Cosine Similarity

ahmadsherzai · June 1, 2020, 1:18pm

Hello all,

I am trying to search some text that is stored in a list, with each sentence stored as a separate element. I want to search this list and return the best match for a search string, which is whatever the user enters into a text box.

Is it possible to search for sentences based on cosine similarity and return only the sentence with the best cosine similarity match?
Is this package usable?
Can anyone show me a short example?

Thank You.

lydell · June 1, 2020, 5:53pm

This is how I’d do it:

Compute a score for each sentence.
Sort the sentences based on the score.
Pick the one(s) with the highest score.

Example: https://ellie-app.com/92xVQCGMzJGa1
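The three steps can also be sketched in a language-agnostic way. The Python below is only an illustration and is not the code behind the Ellie link; `rank_sentences` and `contains_query` are hypothetical names, and the substring check is a placeholder for whatever scoring function you end up choosing.

```python
from typing import Callable, List, Tuple


def rank_sentences(
    query: str,
    sentences: List[str],
    score: Callable[[str, str], float],
) -> List[Tuple[float, str]]:
    """Score every sentence against the query and sort best-first."""
    scored = [(score(query, sentence), sentence) for sentence in sentences]
    # Sort on the score only; the sort is stable, so ties keep their input order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def contains_query(query: str, sentence: str) -> float:
    """Placeholder score: 1.0 if the query occurs in the sentence, else 0.0."""
    return 1.0 if query.lower() in sentence.lower() else 0.0


sentences = ["hello there", "our world is big", "goodbye"]
ranked = rank_sentences("world", sentences, contains_query)
best_score, best_sentence = ranked[0]  # the highest-scoring sentence
```

Because the scoring function is passed in as an argument, it can be swapped for an edit distance or a word-overlap count without touching the ranking code.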

I don’t know anything about vectors and cosines, though.

folkertdev · June 1, 2020, 5:53pm

To use cosine similarity, your sentences need to be embedded in a vector space. It doesn’t look like that’s the case for you.
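For illustration, one simple way to get such an embedding is to treat each sentence as a vector of word counts, with one dimension per distinct word. The sketch below is in Python rather than Elm and assumes plain whitespace tokenization; `embed` and `cosine_similarity` are hypothetical helper names, not functions from any package mentioned in this thread.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Map text to a sparse word-count vector (one dimension per word)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so call it dissimilar
    return dot / (norm_a * norm_b)


sentences = ["hello world", "our world is big", "goodbye"]
query = embed("big world")
# Pick the sentence whose vector points most nearly the same way as the query.
best = max(sentences, key=lambda s: cosine_similarity(query, embed(s)))
```

With counts as coordinates the similarity ranges from 0 (no shared words) to 1 (identical word proportions). Weighting schemes such as TF-IDF refine the same idea.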

An alternative is to filter by Levenshtein distance, which is implemented as an Elm package, or to use the Jaro-Winkler function (which also has an Elm package).

You can use the above two packages to provide the score that @lydell talks about.

ahmadsherzai · June 1, 2020, 8:43pm

Thank you for the reply. Can you have a look at this Ellie and show me how to start?

What if I want to use cosine similarity? How do I embed these sentences into a vector space? Any starting point would be highly appreciated.

folkertdev · June 1, 2020, 9:15pm

I’ve integrated Levenshtein here.

If your goal is to search by keywords through a set of sentences though, this may not be the best approach. For instance, for the word “world”, the first result doesn’t actually include that word, because Levenshtein distance counts the number of “edits” needed to turn one string into the other. So even if two strings share the same words, the edit distance can be very large because of the other words in the string.

The edit distance between “word” and “world” is much shorter than between “world” and “our world”, even though “our world” should probably be ranked higher when the query is “world”.
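To make those numbers concrete, here is a minimal Python sketch of the standard dynamic-programming Levenshtein distance. It is not the implementation used by the Elm package, only an illustration of the same metric, and the function name `levenshtein` is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


near_miss = levenshtein("word", "world")         # 1: insert "l"
real_match = levenshtein("world", "our world")   # 4: insert "our "
```

Sorting by ascending distance therefore puts “word” ahead of “our world” for the query “world”, which is the ranking problem described above.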

The field that studies the ranking of documents based on a query is Information Retrieval. A simple metric that is closer to a search engine is bag of words. It just counts how many words of the query are in the sentences, and picks the best. Maybe that’s fun to experiment with?
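A rough sketch of that bag-of-words idea, again in Python rather than Elm. The tokenizer (lowercase runs of letters, digits and apostrophes) and the names `tokenize`, `bag_of_words_score` and `best_match` are assumptions made for this example.

```python
import re
from typing import List, Optional, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its distinct words, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def bag_of_words_score(query: str, sentence: str) -> int:
    """Count how many distinct query words also appear in the sentence."""
    return len(tokenize(query) & tokenize(sentence))


def best_match(query: str, sentences: List[str]) -> Optional[str]:
    """Return the sentence sharing the most words with the query, if any."""
    return max(sentences, key=lambda s: bag_of_words_score(query, s), default=None)


sentences = ["word", "our world", "hello there"]
result = best_match("world", sentences)
```

Here “our world” wins for the query “world” because it contains the exact word, while “word” scores zero. The trade-off is that misspellings in the query no longer match anything, so combining this with an edit distance per word may be worth trying.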

