As an experiment and learning exercise, I spent some time this week building a deeper search index for Elm packages.
Motivation: I often find that the official Elm package index search returns no results, which I suspect is because it only indexes package authors, names, and summaries. Here, I’ve indexed all READMEs and module documentation, as well as exposed function and type names.
If anyone is interested, I also wrote a little blog post about the process. I had no experience with search before this project, and would be particularly glad for feedback on whether the vector space model is suitable for this kind of keyword search, or whether there are better alternatives to consider.
Nice project; I’ve run into the same situation a few times. I think it can still be improved: I searched for i18n and it didn’t find elm-i18next. Also, you don’t need to write a crawler; you can download the source database whenever you want:
As far as I can tell, the JSON you linked to only contains the package listing metadata (e.g. author, name, short description). Do you know if the READMEs and module documentation files are also available to retrieve in a similar format?
elm-i18next is an interesting case, because the distinct string i18n never appears in the package documentation (so you’d have to search for i18next, which is not ideal). I think the lack of substring matching is a weakness of the generic vector space model I’m using for building the index, so I’ll take this as a reason to learn about some other search indexing models.
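To make the substring problem concrete, here is a minimal sketch (not the code behind the index) comparing whole-word terms with character trigram terms in a bare-bones vector space model. The document text, the function names, and the choice of trigrams are all assumptions made for illustration; the model here uses raw term counts with cosine similarity and no IDF weighting.

```python
import math
from collections import Counter


def words(text):
    """Split text into lowercase alphanumeric words."""
    return "".join(c if c.isalnum() else " " for c in text.lower()).split()


def char_ngrams(text, n=3):
    """Break each word into overlapping character n-grams (short words stay whole)."""
    grams = []
    for w in words(text):
        if len(w) <= n:
            grams.append(w)
        else:
            grams.extend(w[i:i + n] for i in range(len(w) - n + 1))
    return grams


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def score(query, doc, tokenize):
    """Score a document against a query using the given tokenizer."""
    return cosine(Counter(tokenize(query)), Counter(tokenize(doc)))


# Hypothetical documentation snippet: it mentions "i18next" but never "i18n".
doc = "Load and use i18next translations"

word_score = score("i18n", doc, words)         # no shared whole-word term
ngram_score = score("i18n", doc, char_ngrams)  # shares "i18" and "18n"
```

With whole-word terms the query and the document have nothing in common, so the score is zero; with trigram terms they overlap on "i18" and "18n" and the document is found. The trade-off is a larger index and noisier matches, so this may or may not suit the real data.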
(I didn’t check the full source code of your scraper so I don’t know if you already use some of what I mention here.)
First, you can get a list of all packages and their versions from these URLs. The since-URL is great for incremental updates: you can put the number of items you already have in the URL and get only the newer ones instead of everything.
But that doesn’t seem to include the README. Getting the README without scraping would probably require downloading it from the GitHub repository. You can see the GitHub repository of a package in
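The incremental update idea could be sketched roughly like this. The `fetch_since` callable is a hypothetical stand-in for an HTTP request to the since-URL, and the sketch assumes the response is simply a list of entries you don't have yet; the real response format and ordering may differ.

```python
def incremental_update(known, fetch_since):
    """Return known entries plus any newer ones reported by the index.

    known:       list of entries already stored locally
    fetch_since: hypothetical callable taking the number of entries we
                 already have and returning only the entries added after that
    """
    newer = fetch_since(len(known))
    seen = set(known)
    # Guard against duplicates in case the response overlaps with what we have.
    return known + [entry for entry in newer if entry not in seen]


# Usage with a fake in-memory "server" instead of a real request.
server_entries = ["a/one@1.0.0", "b/two@1.0.0", "a/one@1.0.1"]


def fake_fetch(count):
    return server_entries[count:]


local = incremental_update(["a/one@1.0.0"], fake_fetch)
```

Only the count of stored entries is sent, so each run transfers just the entries added since the previous one.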