In the latest issue of @alexkorban’s Elm Weekly, the first article caught my attention, as I have recently needed exactly this kind of deduplication of a List.
The article says that using List |> sort |> fold can be faster than Set |> fromList |> toList, and proves it by benchmarking the two approaches on a reversed list with few duplicates and on a list with lots of duplicates:
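For reference, here is roughly what the two approaches look like (my own reconstruction as a sketch, not the article’s exact code):

```elm
import Set


-- Deduplicate by round-tripping through a Set.
dedupViaSet : List comparable -> List comparable
dedupViaSet list =
    list |> Set.fromList |> Set.toList


-- Deduplicate by sorting first, then folding and dropping
-- any element equal to the previously kept one.
dedupViaSortFold : List comparable -> List comparable
dedupViaSortFold list =
    list
        |> List.sort
        |> List.foldr
            (\x acc ->
                case acc of
                    previous :: _ ->
                        if x == previous then
                            acc

                        else
                            x :: acc

                    [] ->
                        [ x ]
            )
            []
```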
But then I found out that the blog post is from 2018, and I was unable to run the old Ellie to see for myself. So I recreated the benchmark and ran it with the current Elm version.
What do you think the result looks like?
Well, it is one of those funny ones as… it is inconclusive.
For the small number of duplicates used in the original article, the sort |> fold method is still faster, but only about 4x instead of 10x.
But the second part, where you have a list with lots of duplicates, has a different outcome than in the article: Set |> fromList |> toList performs about 33% better, compared to being 2x slower in the article’s version.
So the conclusion? It is interesting to know that List |> sort |> fold is faster when you have fewer duplicates, but it is hard to tell at what number of duplicates it gets outperformed by the Set |> fromList |> toList approach.
Lesson from this short exercise: Trust no one! Rather, try the hypothesis out yourself if you have a chance.
Update
@Janiczek ran the benchmark with the --optimize flag, and here are the results:
Just a note: you should not benchmark things in Ellie since it is compiled with --debug, I believe (or at least it used to be; I haven’t checked for a while).
I just recently made optimizations in an app we have at work by replacing two lists with AnySet. I measured about a 5x difference on our production data, not by micro-benchmarking individual functions, but by measuring the performance of the whole view. In other words, a 10-line code change took the render of the whole app from 1.5s to 300ms. In our case, lists were causing worse-than-cubic complexity in algorithms all over the place.
I also think that de-duplication is not even a good thing to measure. It’s an extremely naive case. What you should care about in any more real-world benchmark is how your data structure will scale as complexity increases over time. Some people say that rendering is the most expensive operation every app does. I disagree. Most render transitions change very little in the DOM, setting one class, changing a few nodes - that’s the common case. Meanwhile, view code gets executed over and over after every model update. My advice: never use List.Extra.find, and don’t use AssocList.
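To illustrate the kind of pattern this advice is aimed at (a made-up example with hypothetical types, not code from anyone’s app): calling List.Extra.find for every item of another list is O(n × m), while indexing into a Dict once keeps each lookup logarithmic.

```elm
import Dict exposing (Dict)
import List.Extra


type alias User =
    { id : Int, name : String }


type alias Purchase =
    { userId : Int }


-- O(n * m): every purchase does a linear scan over the whole user list.
slowNames : List User -> List Purchase -> List String
slowNames users purchases =
    purchases
        |> List.filterMap
            (\purchase ->
                List.Extra.find (\user -> user.id == purchase.userId) users
                    |> Maybe.map .name
            )


-- O(n log n + m log n): index the users once, then look them up cheaply.
fastNames : List User -> List Purchase -> List String
fastNames users purchases =
    let
        usersById : Dict Int User
        usersById =
            users
                |> List.map (\user -> ( user.id, user ))
                |> Dict.fromList
    in
    purchases
        |> List.filterMap
            (\purchase ->
                Dict.get purchase.userId usersById
                    |> Maybe.map .name
            )
```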
Well, I’d say it depends on what you are doing. In a code base we have at work, we use some Dict String ComplexValue dicts which won’t exceed 20 elements (ok, in the very, very worst case it could end up at 50, but that really would be unlikely).
I really think we should use AssocList for this in order to be able to have some Key type more meaningful than String. Moreover, for really small dicts (e.g. fewer than 10 elements), I even think AssocList would be faster than Dict, because when using Dicts we might have some overhead (not sure, however).
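Something along these lines is what I have in mind (just a sketch, assuming the pzp1997/assoc-list API and made-up types):

```elm
import AssocList


-- A key type more meaningful than a raw String, which a plain Dict
-- could not use directly because it is not comparable.
type UserId
    = UserId Int


type alias ComplexValue =
    { name : String, score : Int }


users : AssocList.Dict UserId ComplexValue
users =
    AssocList.fromList
        [ ( UserId 1, { name = "Alice", score = 10 } )
        , ( UserId 2, { name = "Bob", score = 7 } )
        ]


lookup : UserId -> Maybe ComplexValue
lookup id =
    AssocList.get id users
```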
I would say “never say never” or make any other absolute statement when it comes to performance. It all ‘depends’. Much of the time it does not matter, especially as computers are so fast nowadays. Find the parts of your code base where it does matter, set up measurement, and then tweak → measure → tweak → measure until it’s good enough.
Of course, it’s always helpful to have insight into the big-O performance of various data structures. You do need a hypothesis to follow in order to make the tweaks you are going to measure, after all.
I think it’s generally better to distinguish between performance (how fast something runs) and efficiency (how efficient the algorithm is - this is the big O).
For something that has at most 50 elements, the AssocList will actually have better performance even with less efficiency. For instance, a single list cons is cheaper than a single insert into a set. Most problems come from how the data structure gets used later, though, once you start querying the data while iterating over other data.
Backtracking to the original example - a collection of unique items where order doesn’t matter (list sort and de-duplicate). IMO it doesn’t make sense to use a List for anything like this, even semantically. If one has a collection of unique values with no particular ordering, these values should be stored as a Set and converted to a list only when really necessary (e.g. because one needs to render them into Html).
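For example, something like this (a hypothetical model, just to illustrate the idea):

```elm
import Html exposing (Html)
import Set exposing (Set)


-- The model stores the unique values as a Set...
type alias Model =
    { selectedTags : Set String }


-- ...and only the view converts them to a List, at the point where an
-- ordered sequence of Html nodes is actually needed.
viewTags : Model -> Html msg
viewTags model =
    Html.ul []
        (model.selectedTags
            |> Set.toList
            |> List.map (\tag -> Html.li [] [ Html.text tag ])
        )
```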
On the other hand, these small collections (AssocList with 50 values) are unlikely to be a bottleneck of any kind, so over-optimizing by using List instead of Set seems like a wasted effort. Generally, the only good argument to do so is the comparable limitation. There are some solutions to this problem, even though they have their own trade-offs. I have 2 libraries which are sort of working around the lack of type classes or ML-style modules (to implement this the OCaml way) - any-set & any-dict.
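The basic idea (a rough sketch, assuming the Set.Any module from any-set; the exact API may differ slightly) is that the constructor takes a function turning your key into something comparable, so custom types can be used directly:

```elm
import Set.Any exposing (AnySet)


type Fruit
    = Apple
    | Banana
    | Cherry


-- The workaround for the comparable limitation: supply a function
-- that maps the custom type to a comparable value.
fruitToString : Fruit -> String
fruitToString fruit =
    case fruit of
        Apple ->
            "Apple"

        Banana ->
            "Banana"

        Cherry ->
            "Cherry"


fruits : AnySet String Fruit
fruits =
    Set.Any.fromList fruitToString [ Apple, Banana, Apple, Cherry ]


hasBanana : Bool
hasBanana =
    Set.Any.member Banana fruits
```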
Of course, saying never was quite an overstatement, but over the years I’ve seen a lot of Elm code that was performing really badly just because folks are reassured all the time that stuff like this doesn’t matter, and then they end up with a completely unusable application without having the slightest clue what is slowing it down. In almost all cases the rendering performance gets blamed, because of the window.requestAnimationFrame call that Elm does prior to calling the view code, while in fact in most if not all cases I’ve seen it was all in data-structure operations done in the view function. Quite often they then go down the Html.Lazy rabbit hole, often without understanding that this abstraction uses referential, not structural, equality, and they use it in cases where it will never cache (e.g. constructing a new record on every call of view, etc.). Anyway, as I said, my advice is to use sets/dicts, as that doesn’t require people to understand exponential growth, which is hard to reason about intuitively (at least for me it still is).
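To make the Html.Lazy point concrete (a hypothetical example; lazy only skips re-running the view function when its arguments are the same by reference as on the previous render):

```elm
import Html exposing (Html)
import Html.Lazy


type alias User =
    { name : String, age : Int }


type alias Model =
    { user : User }


viewUser : User -> Html msg
viewUser user =
    Html.text (user.name ++ " (" ++ String.fromInt user.age ++ ")")


-- Never cached: a brand new record is built on every render,
-- so it is never referentially equal to the previous argument.
badView : Model -> Html msg
badView model =
    Html.Lazy.lazy viewUser { name = model.user.name, age = model.user.age }


-- Can be cached: the same record value from the model is passed each
-- render, so lazy can skip calling viewUser when it hasn't changed.
goodView : Model -> Html msg
goodView model =
    Html.Lazy.lazy viewUser model.user
```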
EDIT: it might be worth mentioning that @kraklin and I were colleagues for several years, working on the same projects, so our suspicion when we see someone recommending a less efficient data structure comes from that shared experience.
Something that might help with this - if we had a wiki page for ‘Elm Data Structures’, listing all the core data structures, and user contributed packages too. I think someone did compile such a list at one time, maybe I’ll dig around and see if I can find it again. If this page also listed the big-O behaviour for all the common functions (insert, remove, find-by-key, find-by-value, append, …), it could serve as a handy guide when selecting data structures and understanding what is going on.
For example, recently I wanted to work with quite large arrays, but I don’t really know what the performance of slicing and appending the elm/core Array is. Asking on Slack led me to believe it’s O(log n).
I appreciate what you are saying though - for people like me with a computer science background, this stuff is not difficult to understand. For people coming to Elm perhaps from a more creative web design background, this kind of thing may expose a gap in their knowledge.