diff options
author | Camil Staps | 2021-06-23 21:11:09 +0200 |
---|---|---|
committer | Camil Staps | 2021-07-09 15:13:48 +0200 |
commit | 511d410970406f2dcd10f01a9892f5d0f9e09476 (patch) | |
tree | 9b8820f0222effa4dfd376498a589a67a0cd9050 /resources | |
parent | Add note about assistantships to Teaching section (diff) |
Add article about Cloogle
Diffstat (limited to 'resources')
-rw-r--r-- | resources/md/2021-07-26-cloogle-search-overview.md | 196 | ||||
-rw-r--r-- | resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug | 18 | ||||
-rw-r--r-- | resources/pug/finals/articles/index.pug | 7 | ||||
-rw-r--r-- | resources/pug/include/layout-articles.pug | 3 |
4 files changed, 224 insertions, 0 deletions
diff --git a/resources/md/2021-07-26-cloogle-search-overview.md b/resources/md/2021-07-26-cloogle-search-overview.md new file mode 100644 index 0000000..213e71c --- /dev/null +++ b/resources/md/2021-07-26-cloogle-search-overview.md @@ -0,0 +1,196 @@ +*[Cloogle][] is a search engine for the pure, functional programming language +[Clean][], similar to [Hoogle][] for [Haskell][]. In this post I go through the +design of the search backend and make a comparison with that of [Hoogle 5][].* + +[[toc]] + +# Overview + +Internally, Cloogle keeps the entire index in memory. This is possible because +it is relatively small. While we have different entry types for functions, type +definitions, classes, and other things, they are all stored in one big array, a +`NativeDB CloogleEntry Annotation`: + +```clean +:: *NativeDB v a :== *{!*Entry v a} + +:: Entry v a = + { value :: !v + , included :: !Bool + , annotations :: ![!a!] + } + +:: CloogleEntry + = FunctionEntry !FunctionEntry + | TypeDefEntry !TypeDefEntry + /* .. */ +``` + +Because the `NativeDB` is +[unique](https://en.wikipedia.org/wiki/Uniqueness_type), we can do mutable +updates on it. While searching, we keep all entries in the array, and use +`included` to signal whether an entry should be returned from a search result +or not. + +For a search request, we go multiple times through the database and toggle the +`included` flag as needed. For example, for the query +`map :: (a -> b) [a] -> [b]` we would first filter on the name `map` and then +on the type `(a -> b) [a] -> [b]`. The implementation of each filter is +independent of the others, so that each can be implemented in the way that is +most efficient for it. + +# Name search + +To do fuzzy searches for names, the index contains an `NGramIndex Index`: + +```clean +:: NGramIndex v = + { n :: !Int //* The parameter *n* for the size of the grams + , ci :: !Bool //* Whether matching is case-insensitive + , idx :: !Map String [v] //* The values + } +``` + +The values stored in this index are indices for the `NativeDB`. So if we would +search for `query`, we would take the 3-grams `que`, `uer`, and `ery` (*n* is +currently 3), and for each gram lookup the list of indices in `idx`. These +lists are then merged (they are sorted, so this can be done in O(n)), and we go +through the `NativeDB` to set `included` for the indices in the list. + +I experimented with a [trie](https://en.wikipedia.org/wiki/Trie) instead of a +binary tree for `idx`, but it wasn't faster. After all, we only need few +lookups in this structure. + +# Type search + +This is the complicated one. [Hoogle has a very fancy algorithm][Hoogle 5], +which creates fingerprints for type signatures based on various things, like +the arity, the number of type constructors, and the rarest type names. But +although I see the rationale behind it, I have always found that Hoogle tends +to find weird pairs similar, and I prefer something a bit more white-box. So +Cloogle returns *only* functions for which the type can be *unified* with the +query type. This has downsides (if you forget a parameter, you won't find your +function), but the advantage is that ranking is much easier. + +Originally, we would unify every type in the database with the query type, but +this is very slow. We now make use of an ordering on types. We say that +*t*<sub>1</sub> ≤ *t*<sub>2</sub> iff *t*<sub>2</sub> generalises +*t*<sub>1</sub>. From this follows that if the query type does not unify with +*t*<sub>2</sub>, it will also not unify with *t*<sub>1</sub>. This observation +allows us to prune away a lot of results. + +In the index, we keep a tree of all types, starting with the most general type, +`a`, in the root. The children are the more specific types, such as `Int`, +`c a`, and `a -> b`. This is represented in a simple data structure: + +```clean +:: TypeTree v = Node Type [v] [TypeTree v] +``` + +We keep a `TypeTree Index`. Isomorphic types are stored only once in the index, +and they keep a list of indices into the `NativeDB` of functions that have that +type. When looking for a query type, we do a depth-first walk over the type +tree. However, we prune a node's children when the node's type does not unify +with the query type. + +For example, consider the type tree below. If we search for `Char -> Int` we +only want to find nodes 1, 2, and 3. With the type tree, `a -> String` in node +6 does not unify, so we never visit nodes 7 and 8. + +```clean +Node "a" .. // 1 + [ Node "a -> b" .. // 2 + [ Node "a -> Int" .. // 3 + [ Node "Int -> Int" .. [] // 4 + , Node "String -> Int" .. [] // 5 + ] + , Node "a -> String" .. // 6 + [ Node "Int -> String" .. [] // 7 + , Node "String -> String" .. [] // 8 + ] + ] + ] +``` + +Observe that there are multiple ways to build the type tree. Instead of the +tree above, we could also group on type variable `a` first, such that node 3 +would have the type `Int -> b` and node 6 the type `String -> b`. How the type +is built depends on the order in which types are found by the index builder. It +does not really matter for the search algorithm. + +# Ranking + +The last interesting optimization concerns ranking. We go several times through +the index to determine which entries should be returned. Once we have the +result set we need to order it, but we do not want to have to reunify types to +determine the distance measure. This is what the `annotations` field in the +`NativeDB` is for: during search, you can leave annotations, which are then +picked up by the ranker to compute the distance measure. That way you only need +to look at the annotations instead of performing part of the search logic +again. Cloogle for instance keeps track of the unifier (for type search) and +the number of matching *n*-grams (for name search): + +```clean +:: Annotation + = MatchingNGramsQuery !Real //* The number of matching n-grams in the query + | MatchingNGramsResult !Real //* The number of matching n-grams in the result + | Unifier !Unifier //* For type search, the unifier + /* .. */ +``` + +It is not immediately obvious how these annotations should be turned into a +single distance measure. And we don't want to think about this. What Cloogle +does instead is define a list of variables based on which we might want to +rank, and define a list of ranking constraints. We then use a constraint solver +to compute the exact ranking formula. + +Some examples of ranking variables are: + +- The ratio of *n*-grams from the query that match in the result +- The ratio of *n*-grams from the result that match in the query +- The number of type variables in the unifier +- Whether the result comes from the standard library + +Some examples of ranking constraints are: + +- For the query `toInt`, prefer the class `toInt` over the function + `digitToInt`. +- For the query `Char`, prefer the built-in type `Char` over the class + `toChar`. + +Cloogle assumes that the ranking function is a sum of products of the ranking +variables with a weight. When it updates the index, as a final step it computes +the symbolic distance measure for all the ranking constraints, and passes these +to [Z3][] to find appropriate weights for all the ranking variables. + +If we come across a query that returns results in an odd order, we just need to +add a ranking constraint; we do not need to think and write a new formula +ourselves. Furthermore, the ranking constraints act as test cases for the +ranking function, so we are sure that changes to the formula will not introduce +regressions. + +# Conclusion + +Once results are ranked, we need to find detailed information for the first 15 +(or later ones, if a lower result page is requested). For example, we show the +exact unifier for type search. We cannot use annotations for this, because in +the type tree names of type variables are generalized. But because we need to +do this only for 15 results, we don't need to be very fast here. We simply +unify these results again to get the real, human-readable unifier, and like +this we fetch some other information as well. + +Overall, I'm quite happy with the orthogonal setup of different search +strategies. The implementation of name search and type search is completely +different, and each is optimized in different ways, but you can still combine +the search types in a single query. Also the annotations make the architecture +much simpler and avoid duplicating logic in the search and ranking functions. +Finally, using a constraint solver to determine the ranking weights has proven +to be a good basis for a solid ranking function. + +[Clean]: http://clean.cs.ru.nl/ +[Cloogle]: https://cloogle.org/ +[Haskell]: https://haskell.org/ +[Hoogle]: https://hoogle.haskell.org/ +[Hoogle 5]: https://neilmitchell.blogspot.com/2020/06/hoogle-searching-overview.html +[stats]: https://cloogle.org/stats/ +[Z3]: https://github.com/Z3Prover/z3 diff --git a/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug b/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug new file mode 100644 index 0000000..859f11c --- /dev/null +++ b/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug @@ -0,0 +1,18 @@ +extends /layout-articles.pug + +block prepend title + | Cloogle search overview - + +block prepend menu + - var page = '2021-07-26-cloogle-search-overview' + +block append breadcrumbs + +breadcrumb('Cloogle search overview') + +block subtitle + | Cloogle search overview +block subtitleDate + | 26 July 2021 + +block page + include:markdown ../../../md/2021-07-26-cloogle-search-overview.md diff --git a/resources/pug/finals/articles/index.pug b/resources/pug/finals/articles/index.pug index b3100dd..3a5264c 100644 --- a/resources/pug/finals/articles/index.pug +++ b/resources/pug/finals/articles/index.pug @@ -15,6 +15,13 @@ block page | You can also go back to my #[a(href='/') home page]. h1 + a(href='2021-07-26-cloogle-search-overview.html'). + 26 July 2021: Cloogle search overview + blockquote. + Cloogle is a search engine for the pure, functional programming language Clean, similar to Hoogle for Haskell. + In this post I go through the design of the search backend and make a comparison with that of Hoogle 5. + + h1 a(href='2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-1-introduction.html'). 23 June 2021: Compiling Clean in the browser with WebAssembly blockquote. diff --git a/resources/pug/include/layout-articles.pug b/resources/pug/include/layout-articles.pug index 10feaab..c35e38b 100644 --- a/resources/pug/include/layout-articles.pug +++ b/resources/pug/include/layout-articles.pug @@ -8,6 +8,9 @@ block append menu +menu( {name: 'Home', link: ''}, {name: 2021, menu: [ + { name: 'Cloogle search overview', + year: 2021, month: 7, day: 26 + }, { name: 'Compiling Clean in the browser with WebAssembly, part 3: Putting it all together', year: 2021, month: 6, day: 23 }, |