*[Cloogle][] is a search engine for the pure, functional programming language [Clean][], similar to [Hoogle][] for [Haskell][]. In this post I go through the design of the search backend and make a comparison with that of [Hoogle 5][].*

[[toc]]

# Overview

Internally, Cloogle keeps the entire index in memory. This is possible because it is relatively small. While we have different entry types for functions, type definitions, classes, and other things, they are all stored in one big array, a `NativeDB CloogleEntry Annotation`:

```clean
:: *NativeDB v a :== *{!*Entry v a}

:: Entry v a =
    { value       :: !v
    , included    :: !Bool
    , annotations :: ![!a!]
    }

:: CloogleEntry
    = FunctionEntry !FunctionEntry
    | TypeDefEntry !TypeDefEntry
    /* .. */
```

Because the `NativeDB` is [unique](https://en.wikipedia.org/wiki/Uniqueness_type), we can do mutable updates on it. While searching, we keep all entries in the array and use `included` to signal whether an entry should be included in the search results or not. For a search request, we go through the database multiple times and toggle the `included` flag as needed. For example, for the query `map :: (a -> b) [a] -> [b]` we first filter on the name `map` and then on the type `(a -> b) [a] -> [b]`. The implementation of each filter is independent of the others, so each can be implemented in the way that is most efficient for it.

# Name search

To do fuzzy searches for names, the index contains an `NGramIndex Index`:

```clean
:: NGramIndex v =
    { n   :: !Int            //* The parameter *n* for the size of the grams
    , ci  :: !Bool           //* Whether matching is case-insensitive
    , idx :: !Map String [v] //* The values
    }
```

The values stored in this index are indices into the `NativeDB`. So if we search for `query`, we take the 3-grams `que`, `uer`, and `ery` (*n* is currently 3), and for each gram look up the list of indices in `idx`. These lists are then merged (they are sorted, so this can be done in O(n)), and we go through the `NativeDB` to set `included` for the indices in the merged list.

I experimented with a [trie](https://en.wikipedia.org/wiki/Trie) instead of a binary tree for `idx`, but it wasn't faster. After all, we only need a few lookups in this structure.

# Type search

This is the complicated one. [Hoogle has a very fancy algorithm][Hoogle 5], which creates fingerprints for type signatures based on various things, like the arity, the number of type constructors, and the rarest type names. But although I see the rationale behind it, I have always found that Hoogle tends to consider rather unrelated types similar, and I prefer something a bit more white-box. So Cloogle returns *only* functions for which the type can be *unified* with the query type. This has downsides (if you forget a parameter, you won't find your function), but the advantage is that ranking is much easier.

Originally, we would unify every type in the database with the query type, but this is very slow. We now make use of an ordering on types. We say that *t*₁ ≤ *t*₂ iff *t*₂ generalises *t*₁. From this it follows that if the query type does not unify with *t*₂, it will also not unify with *t*₁. This observation allows us to prune away a lot of results.

In the index, we keep a tree of all types, starting with the most general type, `a`, in the root. The children are the more specific types, such as `Int`, `c a`, and `a -> b`. This is represented in a simple data structure:

```clean
:: TypeTree v = Node Type [v] [TypeTree v]
```

We keep a `TypeTree Index`.
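Before going through an example, here is a minimal sketch of how such a tree can be searched with pruning. It is written in Haskell rather than Clean, and the `Type` representation and the `searchTree`/`unifiable` names are simplified stand-ins for illustration, not Cloogle's actual code:

```haskell
-- Stand-in type representation; Cloogle's real Type is richer.
data Type = Var String | Con String [Type]
  deriving (Eq, Show)

-- The same shape as Cloogle's TypeTree: a type, the database indices of the
-- entries that have (an isomorphic version of) that type, and the more
-- specific child types.
data TypeTree v = Node Type [v] [TypeTree v]

-- Depth-first walk, parameterised over a unifiability check. If the query
-- does not unify with a node's (more general) type, it cannot unify with any
-- of the (more specific) types below it either, so the whole subtree is pruned.
searchTree :: (Type -> Type -> Bool) -> Type -> TypeTree v -> [v]
searchTree unifiable query (Node t vs children)
  | query `unifiable` t = vs ++ concatMap (searchTree unifiable query) children
  | otherwise           = []
```

The real search also records the unifier it finds as an annotation, which is used later for ranking (see below).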
Isomorphic types are stored only once in the index; each node keeps a list of indices into the `NativeDB` of the functions that have that type.

When looking for a query type, we do a depth-first walk over the type tree. However, we prune a node's children when the node's type does not unify with the query type. For example, consider the type tree below. If we search for `Char -> Int` we only want to find nodes 1, 2, and 3. Because `a -> String` in node 6 does not unify, we never even visit nodes 7 and 8.

```clean
Node "a" ..                             // 1
    [ Node "a -> b" ..                  // 2
        [ Node "a -> Int" ..            // 3
            [ Node "Int -> Int" .. []       // 4
            , Node "String -> Int" .. []    // 5
            ]
        , Node "a -> String" ..         // 6
            [ Node "Int -> String" .. []    // 7
            , Node "String -> String" .. [] // 8
            ]
        ]
    ]
```

Observe that there are multiple ways to build the type tree. Instead of the tree above, we could also group on type variable `a` first, such that node 3 would have the type `Int -> b` and node 6 the type `String -> b`. How the tree is built depends on the order in which types are found by the index builder. It does not really matter for the search algorithm.

Besides this, there are some nice tricks to improve the results. For instance, type synonyms are resolved in both the type tree and the query. In Hoogle, you won't find `length :: Foldable t => t a -> Int` with the query `String -> Int`, because it does not resolve `String -> Int` to `[Char] -> Int`. In Cloogle, the same query will find `size :: (a e) -> Int | Array a e`, because it knows that `String` is a `{#Char}` (an array of characters).

Also, universally quantified type variables can be used to restrict the search to functions that are polymorphic. By default, a query like `[a] -> [a]` will return results like `indexList :: [a] -> [Int]`, because the user might not realize that the function they are looking for cannot be defined in a polymorphic way (or the function in the index does not have the most general type). However, if you are sure you only want results in which `a` is not unified, you can use `A.a: [a] -> [a]`. `A` here is similar to Haskell's `forall`. The unification system takes care of the rest.

# Ranking

The last interesting optimization concerns ranking. We go several times through the index to determine which entries should be returned. Once we have the result set we need to order it, but we do not want to re-unify types to determine the distance measure.

This is what the `annotations` field in the `NativeDB` is for: during search, you can leave annotations, which are then picked up by the ranker to compute the distance measure. That way you only need to look at the annotations instead of performing part of the search logic again. Cloogle for instance keeps track of the unifier (for type search) and the number of matching *n*-grams (for name search):

```clean
:: Annotation
    = MatchingNGramsQuery !Real  //* The number of matching n-grams in the query
    | MatchingNGramsResult !Real //* The number of matching n-grams in the result
    | Unifier !Unifier           //* For type search, the unifier
    /* .. */
```

It is not immediately obvious how these annotations should be turned into a single distance measure. And we don't want to have to think about this. What Cloogle does instead is define a list of variables on which we might want to rank, and a list of ranking constraints. We then use a constraint solver to compute the exact ranking formula.
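Before going into concrete examples, here is a rough sketch of that setup, again in Haskell with made-up names rather than Cloogle's actual variables and types. The distance of a result is a weighted sum of its ranking variables, and each ranking constraint states that one concrete result must end up with a smaller distance than another; keeping the weights symbolic turns every constraint into an inequality a solver can satisfy:

```haskell
import           Data.Map (Map)
import qualified Data.Map as Map

-- The ranking variables of one result, extracted from its annotations.
-- (The names and representation are illustrative, not Cloogle's actual set.)
type RankingVars = Map String Double

-- The distance measure is a weighted sum of the ranking variables.
distance :: Map String Double -> RankingVars -> Double
distance weights vars = sum (Map.intersectionWith (*) weights vars)

-- A ranking constraint: for some query, the "better" result has to end up
-- with a strictly smaller distance than the "worse" result.
data Constraint = Constraint { better :: RankingVars, worse :: RankingVars }

-- With concrete weights the constraints double as test cases; Cloogle keeps
-- the weights symbolic instead and asks Z3 for values satisfying all of them.
satisfies :: Map String Double -> Constraint -> Bool
satisfies weights c = distance weights (better c) < distance weights (worse c)
```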
Some examples of ranking variables are:

- The ratio of *n*-grams from the query that match in the result
- The ratio of *n*-grams from the result that match in the query
- The number of type variables in the unifier
- Whether the result comes from the standard library

Some examples of ranking constraints are:

- For the query `toInt`, prefer the class `toInt` over the function `digitToInt`.
- For the query `Char`, prefer the built-in type `Char` over the class `toChar`.

Cloogle assumes that the ranking function is a sum of the ranking variables, each multiplied by a weight. When it updates the index, as a final step it computes the symbolic distance measure for all the ranking constraints and passes these to [Z3][] to find appropriate weights for all the ranking variables. If we come across a query that returns results in an odd order, we just need to add a ranking constraint; we do not need to think up and write a new formula ourselves. Furthermore, the ranking constraints act as test cases for the ranking function, so we can be sure that changes to the formula will not introduce regressions.

# Conclusion

Once results are ranked, we need to find detailed information for the first 15 (or for later ones, if a later result page is requested). For example, we show the exact unifier for type search. We cannot use annotations for this, because in the type tree the names of type variables are generalized. But because we need to do this only for 15 results, we don't need to be very fast here. We simply unify these results again to get the real, human-readable unifier, and in the same pass we fetch some other information as well.

Overall, I'm quite happy with the orthogonal setup of the different search strategies. The implementations of name search and type search are completely different, and each is optimized in its own way, but you can still combine the search types in a single query. Also, the annotations make the architecture much simpler and avoid duplicating logic between the search and ranking functions. Finally, using a constraint solver to determine the ranking weights has proven to be a good basis for a solid ranking function.

[Clean]: https://clean-lang.org/
[Cloogle]: https://cloogle.org/
[Haskell]: https://haskell.org/
[Hoogle]: https://hoogle.haskell.org/
[Hoogle 5]: https://neilmitchell.blogspot.com/2020/06/hoogle-searching-overview.html
[stats]: https://cloogle.org/stats/
[Z3]: https://github.com/Z3Prover/z3