diff options
author | Camil Staps | 2021-08-02 16:17:05 +0200 |
---|---|---|
committer | Camil Staps | 2021-08-02 16:17:05 +0200 |
commit | d8289ad28290bd3081d6cf64a3bc5c25e65c718c (patch) | |
tree | a2f70ed2c5c85c484b99988b0cd25ce38be1b48c /resources/md/2021-07-26-cloogle-search-overview.md | |
parent | Add extra info about type search (diff) |
Update Cloogle article date
Diffstat (limited to 'resources/md/2021-07-26-cloogle-search-overview.md')
-rw-r--r-- | resources/md/2021-07-26-cloogle-search-overview.md | 212 |
1 files changed, 0 insertions, 212 deletions
diff --git a/resources/md/2021-07-26-cloogle-search-overview.md b/resources/md/2021-07-26-cloogle-search-overview.md deleted file mode 100644 index 6fb6e6e..0000000 --- a/resources/md/2021-07-26-cloogle-search-overview.md +++ /dev/null @@ -1,212 +0,0 @@ -*[Cloogle][] is a search engine for the pure, functional programming language -[Clean][], similar to [Hoogle][] for [Haskell][]. In this post I go through the -design of the search backend and make a comparison with that of [Hoogle 5][].* - -[[toc]] - -# Overview - -Internally, Cloogle keeps the entire index in memory. This is possible because -it is relatively small. While we have different entry types for functions, type -definitions, classes, and other things, they are all stored in one big array, a -`NativeDB CloogleEntry Annotation`: - -```clean -:: *NativeDB v a :== *{!*Entry v a} - -:: Entry v a = - { value :: !v - , included :: !Bool - , annotations :: ![!a!] - } - -:: CloogleEntry - = FunctionEntry !FunctionEntry - | TypeDefEntry !TypeDefEntry - /* .. */ -``` - -Because the `NativeDB` is -[unique](https://en.wikipedia.org/wiki/Uniqueness_type), we can do mutable -updates on it. While searching, we keep all entries in the array, and use -`included` to signal whether an entry should be returned from a search result -or not. - -For a search request, we go multiple times through the database and toggle the -`included` flag as needed. For example, for the query -`map :: (a -> b) [a] -> [b]` we would first filter on the name `map` and then -on the type `(a -> b) [a] -> [b]`. The implementation of each filter is -independent of the others, so that each can be implemented in the way that is -most efficient for it. - -# Name search - -To do fuzzy searches for names, the index contains an `NGramIndex Index`: - -```clean -:: NGramIndex v = - { n :: !Int //* The parameter *n* for the size of the grams - , ci :: !Bool //* Whether matching is case-insensitive - , idx :: !Map String [v] //* The values - } -``` - -The values stored in this index are indices for the `NativeDB`. So if we would -search for `query`, we would take the 3-grams `que`, `uer`, and `ery` (*n* is -currently 3), and for each gram lookup the list of indices in `idx`. These -lists are then merged (they are sorted, so this can be done in O(n)), and we go -through the `NativeDB` to set `included` for the indices in the list. - -I experimented with a [trie](https://en.wikipedia.org/wiki/Trie) instead of a -binary tree for `idx`, but it wasn't faster. After all, we only need few -lookups in this structure. - -# Type search - -This is the complicated one. [Hoogle has a very fancy algorithm][Hoogle 5], -which creates fingerprints for type signatures based on various things, like -the arity, the number of type constructors, and the rarest type names. But -although I see the rationale behind it, I have always found that Hoogle tends -to find weird pairs similar, and I prefer something a bit more white-box. So -Cloogle returns *only* functions for which the type can be *unified* with the -query type. This has downsides (if you forget a parameter, you won't find your -function), but the advantage is that ranking is much easier. - -Originally, we would unify every type in the database with the query type, but -this is very slow. We now make use of an ordering on types. We say that -*t*<sub>1</sub> ≤ *t*<sub>2</sub> iff *t*<sub>2</sub> generalises -*t*<sub>1</sub>. From this follows that if the query type does not unify with -*t*<sub>2</sub>, it will also not unify with *t*<sub>1</sub>. This observation -allows us to prune away a lot of results. - -In the index, we keep a tree of all types, starting with the most general type, -`a`, in the root. The children are the more specific types, such as `Int`, -`c a`, and `a -> b`. This is represented in a simple data structure: - -```clean -:: TypeTree v = Node Type [v] [TypeTree v] -``` - -We keep a `TypeTree Index`. Isomorphic types are stored only once in the index, -and they keep a list of indices into the `NativeDB` of functions that have that -type. When looking for a query type, we do a depth-first walk over the type -tree. However, we prune a node's children when the node's type does not unify -with the query type. - -For example, consider the type tree below. If we search for `Char -> Int` we -only want to find nodes 1, 2, and 3. With the type tree, `a -> String` in node -6 does not unify, so we never visit nodes 7 and 8. - -```clean -Node "a" .. // 1 - [ Node "a -> b" .. // 2 - [ Node "a -> Int" .. // 3 - [ Node "Int -> Int" .. [] // 4 - , Node "String -> Int" .. [] // 5 - ] - , Node "a -> String" .. // 6 - [ Node "Int -> String" .. [] // 7 - , Node "String -> String" .. [] // 8 - ] - ] - ] -``` - -Observe that there are multiple ways to build the type tree. Instead of the -tree above, we could also group on type variable `a` first, such that node 3 -would have the type `Int -> b` and node 6 the type `String -> b`. How the type -is built depends on the order in which types are found by the index builder. It -does not really matter for the search algorithm. - -Besides this, there are some nice tricks to improve the results. For instance, -type synonyms are resolved in both the type tree and the query. In Hoogle, you -won't find `length :: Foldable t => t a -> Int` with the query `String -> Int`, -because it does not resolve `String -> Int` to `[Char] -> Int`. In Cloogle, the -same query will find `size :: (a e) -> Int | Array a e`, because it knows that -`String` is an `{#} Char` (an array of characters). - -Also, universally quantified type variables can be used to restrict the search -to functions that are polymorphic. By default, a query like `[a] -> [a]` will -return results like `indexList :: [a] -> [Int]`, because the user might not -realize that the function they are looking for cannot be defined in a -polymorphic way (or the function in the index does not have the most general -type). However, if you are sure you only want results where `a` is not unified, -you can use `A.a: [a] -> [a]`. `A` here is similar to Haskell's `forall`. The -unification system takes care of the rest. - -# Ranking - -The last interesting optimization concerns ranking. We go several times through -the index to determine which entries should be returned. Once we have the -result set we need to order it, but we do not want to have to reunify types to -determine the distance measure. This is what the `annotations` field in the -`NativeDB` is for: during search, you can leave annotations, which are then -picked up by the ranker to compute the distance measure. That way you only need -to look at the annotations instead of performing part of the search logic -again. Cloogle for instance keeps track of the unifier (for type search) and -the number of matching *n*-grams (for name search): - -```clean -:: Annotation - = MatchingNGramsQuery !Real //* The number of matching n-grams in the query - | MatchingNGramsResult !Real //* The number of matching n-grams in the result - | Unifier !Unifier //* For type search, the unifier - /* .. */ -``` - -It is not immediately obvious how these annotations should be turned into a -single distance measure. And we don't want to think about this. What Cloogle -does instead is define a list of variables based on which we might want to -rank, and define a list of ranking constraints. We then use a constraint solver -to compute the exact ranking formula. - -Some examples of ranking variables are: - -- The ratio of *n*-grams from the query that match in the result -- The ratio of *n*-grams from the result that match in the query -- The number of type variables in the unifier -- Whether the result comes from the standard library - -Some examples of ranking constraints are: - -- For the query `toInt`, prefer the class `toInt` over the function - `digitToInt`. -- For the query `Char`, prefer the built-in type `Char` over the class - `toChar`. - -Cloogle assumes that the ranking function is a sum of products of the ranking -variables with a weight. When it updates the index, as a final step it computes -the symbolic distance measure for all the ranking constraints, and passes these -to [Z3][] to find appropriate weights for all the ranking variables. - -If we come across a query that returns results in an odd order, we just need to -add a ranking constraint; we do not need to think and write a new formula -ourselves. Furthermore, the ranking constraints act as test cases for the -ranking function, so we are sure that changes to the formula will not introduce -regressions. - -# Conclusion - -Once results are ranked, we need to find detailed information for the first 15 -(or later ones, if a lower result page is requested). For example, we show the -exact unifier for type search. We cannot use annotations for this, because in -the type tree names of type variables are generalized. But because we need to -do this only for 15 results, we don't need to be very fast here. We simply -unify these results again to get the real, human-readable unifier, and like -this we fetch some other information as well. - -Overall, I'm quite happy with the orthogonal setup of different search -strategies. The implementation of name search and type search is completely -different, and each is optimized in different ways, but you can still combine -the search types in a single query. Also the annotations make the architecture -much simpler and avoid duplicating logic in the search and ranking functions. -Finally, using a constraint solver to determine the ranking weights has proven -to be a good basis for a solid ranking function. - -[Clean]: http://clean.cs.ru.nl/ -[Cloogle]: https://cloogle.org/ -[Haskell]: https://haskell.org/ -[Hoogle]: https://hoogle.haskell.org/ -[Hoogle 5]: https://neilmitchell.blogspot.com/2020/06/hoogle-searching-overview.html -[stats]: https://cloogle.org/stats/ -[Z3]: https://github.com/Z3Prover/z3 |