From d8289ad28290bd3081d6cf64a3bc5c25e65c718c Mon Sep 17 00:00:00 2001 From: Camil Staps Date: Mon, 2 Aug 2021 16:17:05 +0200 Subject: Update Cloogle article date --- resources/md/2021-07-26-cloogle-search-overview.md | 212 --------------------- resources/md/2021-08-02-cloogle-search-overview.md | 212 +++++++++++++++++++++ .../2021-07-26-cloogle-search-overview.pug | 18 -- .../2021-08-02-cloogle-search-overview.pug | 18 ++ resources/pug/finals/articles/index.pug | 4 +- 5 files changed, 232 insertions(+), 232 deletions(-) delete mode 100644 resources/md/2021-07-26-cloogle-search-overview.md create mode 100644 resources/md/2021-08-02-cloogle-search-overview.md delete mode 100644 resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug create mode 100644 resources/pug/finals/articles/2021-08-02-cloogle-search-overview.pug diff --git a/resources/md/2021-07-26-cloogle-search-overview.md b/resources/md/2021-07-26-cloogle-search-overview.md deleted file mode 100644 index 6fb6e6e..0000000 --- a/resources/md/2021-07-26-cloogle-search-overview.md +++ /dev/null @@ -1,212 +0,0 @@ -*[Cloogle][] is a search engine for the pure, functional programming language -[Clean][], similar to [Hoogle][] for [Haskell][]. In this post I go through the -design of the search backend and make a comparison with that of [Hoogle 5][].* - -[[toc]] - -# Overview - -Internally, Cloogle keeps the entire index in memory. This is possible because -it is relatively small. While we have different entry types for functions, type -definitions, classes, and other things, they are all stored in one big array, a -`NativeDB CloogleEntry Annotation`: - -```clean -:: *NativeDB v a :== *{!*Entry v a} - -:: Entry v a = - { value :: !v - , included :: !Bool - , annotations :: ![!a!] - } - -:: CloogleEntry - = FunctionEntry !FunctionEntry - | TypeDefEntry !TypeDefEntry - /* .. */ -``` - -Because the `NativeDB` is -[unique](https://en.wikipedia.org/wiki/Uniqueness_type), we can do mutable -updates on it. While searching, we keep all entries in the array, and use -`included` to signal whether an entry should be returned from a search result -or not. - -For a search request, we go multiple times through the database and toggle the -`included` flag as needed. For example, for the query -`map :: (a -> b) [a] -> [b]` we would first filter on the name `map` and then -on the type `(a -> b) [a] -> [b]`. The implementation of each filter is -independent of the others, so that each can be implemented in the way that is -most efficient for it. - -# Name search - -To do fuzzy searches for names, the index contains an `NGramIndex Index`: - -```clean -:: NGramIndex v = - { n :: !Int //* The parameter *n* for the size of the grams - , ci :: !Bool //* Whether matching is case-insensitive - , idx :: !Map String [v] //* The values - } -``` - -The values stored in this index are indices for the `NativeDB`. So if we would -search for `query`, we would take the 3-grams `que`, `uer`, and `ery` (*n* is -currently 3), and for each gram lookup the list of indices in `idx`. These -lists are then merged (they are sorted, so this can be done in O(n)), and we go -through the `NativeDB` to set `included` for the indices in the list. - -I experimented with a [trie](https://en.wikipedia.org/wiki/Trie) instead of a -binary tree for `idx`, but it wasn't faster. After all, we only need few -lookups in this structure. - -# Type search - -This is the complicated one. [Hoogle has a very fancy algorithm][Hoogle 5], -which creates fingerprints for type signatures based on various things, like -the arity, the number of type constructors, and the rarest type names. But -although I see the rationale behind it, I have always found that Hoogle tends -to find weird pairs similar, and I prefer something a bit more white-box. So -Cloogle returns *only* functions for which the type can be *unified* with the -query type. This has downsides (if you forget a parameter, you won't find your -function), but the advantage is that ranking is much easier. - -Originally, we would unify every type in the database with the query type, but -this is very slow. We now make use of an ordering on types. We say that -*t*1 ≤ *t*2 iff *t*2 generalises -*t*1. From this follows that if the query type does not unify with -*t*2, it will also not unify with *t*1. This observation -allows us to prune away a lot of results. - -In the index, we keep a tree of all types, starting with the most general type, -`a`, in the root. The children are the more specific types, such as `Int`, -`c a`, and `a -> b`. This is represented in a simple data structure: - -```clean -:: TypeTree v = Node Type [v] [TypeTree v] -``` - -We keep a `TypeTree Index`. Isomorphic types are stored only once in the index, -and they keep a list of indices into the `NativeDB` of functions that have that -type. When looking for a query type, we do a depth-first walk over the type -tree. However, we prune a node's children when the node's type does not unify -with the query type. - -For example, consider the type tree below. If we search for `Char -> Int` we -only want to find nodes 1, 2, and 3. With the type tree, `a -> String` in node -6 does not unify, so we never visit nodes 7 and 8. - -```clean -Node "a" .. // 1 - [ Node "a -> b" .. // 2 - [ Node "a -> Int" .. // 3 - [ Node "Int -> Int" .. [] // 4 - , Node "String -> Int" .. [] // 5 - ] - , Node "a -> String" .. // 6 - [ Node "Int -> String" .. [] // 7 - , Node "String -> String" .. [] // 8 - ] - ] - ] -``` - -Observe that there are multiple ways to build the type tree. Instead of the -tree above, we could also group on type variable `a` first, such that node 3 -would have the type `Int -> b` and node 6 the type `String -> b`. How the type -is built depends on the order in which types are found by the index builder. It -does not really matter for the search algorithm. - -Besides this, there are some nice tricks to improve the results. For instance, -type synonyms are resolved in both the type tree and the query. In Hoogle, you -won't find `length :: Foldable t => t a -> Int` with the query `String -> Int`, -because it does not resolve `String -> Int` to `[Char] -> Int`. In Cloogle, the -same query will find `size :: (a e) -> Int | Array a e`, because it knows that -`String` is an `{#} Char` (an array of characters). - -Also, universally quantified type variables can be used to restrict the search -to functions that are polymorphic. By default, a query like `[a] -> [a]` will -return results like `indexList :: [a] -> [Int]`, because the user might not -realize that the function they are looking for cannot be defined in a -polymorphic way (or the function in the index does not have the most general -type). However, if you are sure you only want results where `a` is not unified, -you can use `A.a: [a] -> [a]`. `A` here is similar to Haskell's `forall`. The -unification system takes care of the rest. - -# Ranking - -The last interesting optimization concerns ranking. We go several times through -the index to determine which entries should be returned. Once we have the -result set we need to order it, but we do not want to have to reunify types to -determine the distance measure. This is what the `annotations` field in the -`NativeDB` is for: during search, you can leave annotations, which are then -picked up by the ranker to compute the distance measure. That way you only need -to look at the annotations instead of performing part of the search logic -again. Cloogle for instance keeps track of the unifier (for type search) and -the number of matching *n*-grams (for name search): - -```clean -:: Annotation - = MatchingNGramsQuery !Real //* The number of matching n-grams in the query - | MatchingNGramsResult !Real //* The number of matching n-grams in the result - | Unifier !Unifier //* For type search, the unifier - /* .. */ -``` - -It is not immediately obvious how these annotations should be turned into a -single distance measure. And we don't want to think about this. What Cloogle -does instead is define a list of variables based on which we might want to -rank, and define a list of ranking constraints. We then use a constraint solver -to compute the exact ranking formula. - -Some examples of ranking variables are: - -- The ratio of *n*-grams from the query that match in the result -- The ratio of *n*-grams from the result that match in the query -- The number of type variables in the unifier -- Whether the result comes from the standard library - -Some examples of ranking constraints are: - -- For the query `toInt`, prefer the class `toInt` over the function - `digitToInt`. -- For the query `Char`, prefer the built-in type `Char` over the class - `toChar`. - -Cloogle assumes that the ranking function is a sum of products of the ranking -variables with a weight. When it updates the index, as a final step it computes -the symbolic distance measure for all the ranking constraints, and passes these -to [Z3][] to find appropriate weights for all the ranking variables. - -If we come across a query that returns results in an odd order, we just need to -add a ranking constraint; we do not need to think and write a new formula -ourselves. Furthermore, the ranking constraints act as test cases for the -ranking function, so we are sure that changes to the formula will not introduce -regressions. - -# Conclusion - -Once results are ranked, we need to find detailed information for the first 15 -(or later ones, if a lower result page is requested). For example, we show the -exact unifier for type search. We cannot use annotations for this, because in -the type tree names of type variables are generalized. But because we need to -do this only for 15 results, we don't need to be very fast here. We simply -unify these results again to get the real, human-readable unifier, and like -this we fetch some other information as well. - -Overall, I'm quite happy with the orthogonal setup of different search -strategies. The implementation of name search and type search is completely -different, and each is optimized in different ways, but you can still combine -the search types in a single query. Also the annotations make the architecture -much simpler and avoid duplicating logic in the search and ranking functions. -Finally, using a constraint solver to determine the ranking weights has proven -to be a good basis for a solid ranking function. - -[Clean]: http://clean.cs.ru.nl/ -[Cloogle]: https://cloogle.org/ -[Haskell]: https://haskell.org/ -[Hoogle]: https://hoogle.haskell.org/ -[Hoogle 5]: https://neilmitchell.blogspot.com/2020/06/hoogle-searching-overview.html -[stats]: https://cloogle.org/stats/ -[Z3]: https://github.com/Z3Prover/z3 diff --git a/resources/md/2021-08-02-cloogle-search-overview.md b/resources/md/2021-08-02-cloogle-search-overview.md new file mode 100644 index 0000000..6fb6e6e --- /dev/null +++ b/resources/md/2021-08-02-cloogle-search-overview.md @@ -0,0 +1,212 @@ +*[Cloogle][] is a search engine for the pure, functional programming language +[Clean][], similar to [Hoogle][] for [Haskell][]. In this post I go through the +design of the search backend and make a comparison with that of [Hoogle 5][].* + +[[toc]] + +# Overview + +Internally, Cloogle keeps the entire index in memory. This is possible because +it is relatively small. While we have different entry types for functions, type +definitions, classes, and other things, they are all stored in one big array, a +`NativeDB CloogleEntry Annotation`: + +```clean +:: *NativeDB v a :== *{!*Entry v a} + +:: Entry v a = + { value :: !v + , included :: !Bool + , annotations :: ![!a!] + } + +:: CloogleEntry + = FunctionEntry !FunctionEntry + | TypeDefEntry !TypeDefEntry + /* .. */ +``` + +Because the `NativeDB` is +[unique](https://en.wikipedia.org/wiki/Uniqueness_type), we can do mutable +updates on it. While searching, we keep all entries in the array, and use +`included` to signal whether an entry should be returned from a search result +or not. + +For a search request, we go multiple times through the database and toggle the +`included` flag as needed. For example, for the query +`map :: (a -> b) [a] -> [b]` we would first filter on the name `map` and then +on the type `(a -> b) [a] -> [b]`. The implementation of each filter is +independent of the others, so that each can be implemented in the way that is +most efficient for it. + +# Name search + +To do fuzzy searches for names, the index contains an `NGramIndex Index`: + +```clean +:: NGramIndex v = + { n :: !Int //* The parameter *n* for the size of the grams + , ci :: !Bool //* Whether matching is case-insensitive + , idx :: !Map String [v] //* The values + } +``` + +The values stored in this index are indices for the `NativeDB`. So if we would +search for `query`, we would take the 3-grams `que`, `uer`, and `ery` (*n* is +currently 3), and for each gram lookup the list of indices in `idx`. These +lists are then merged (they are sorted, so this can be done in O(n)), and we go +through the `NativeDB` to set `included` for the indices in the list. + +I experimented with a [trie](https://en.wikipedia.org/wiki/Trie) instead of a +binary tree for `idx`, but it wasn't faster. After all, we only need few +lookups in this structure. + +# Type search + +This is the complicated one. [Hoogle has a very fancy algorithm][Hoogle 5], +which creates fingerprints for type signatures based on various things, like +the arity, the number of type constructors, and the rarest type names. But +although I see the rationale behind it, I have always found that Hoogle tends +to find weird pairs similar, and I prefer something a bit more white-box. So +Cloogle returns *only* functions for which the type can be *unified* with the +query type. This has downsides (if you forget a parameter, you won't find your +function), but the advantage is that ranking is much easier. + +Originally, we would unify every type in the database with the query type, but +this is very slow. We now make use of an ordering on types. We say that +*t*1 ≤ *t*2 iff *t*2 generalises +*t*1. From this follows that if the query type does not unify with +*t*2, it will also not unify with *t*1. This observation +allows us to prune away a lot of results. + +In the index, we keep a tree of all types, starting with the most general type, +`a`, in the root. The children are the more specific types, such as `Int`, +`c a`, and `a -> b`. This is represented in a simple data structure: + +```clean +:: TypeTree v = Node Type [v] [TypeTree v] +``` + +We keep a `TypeTree Index`. Isomorphic types are stored only once in the index, +and they keep a list of indices into the `NativeDB` of functions that have that +type. When looking for a query type, we do a depth-first walk over the type +tree. However, we prune a node's children when the node's type does not unify +with the query type. + +For example, consider the type tree below. If we search for `Char -> Int` we +only want to find nodes 1, 2, and 3. With the type tree, `a -> String` in node +6 does not unify, so we never visit nodes 7 and 8. + +```clean +Node "a" .. // 1 + [ Node "a -> b" .. // 2 + [ Node "a -> Int" .. // 3 + [ Node "Int -> Int" .. [] // 4 + , Node "String -> Int" .. [] // 5 + ] + , Node "a -> String" .. // 6 + [ Node "Int -> String" .. [] // 7 + , Node "String -> String" .. [] // 8 + ] + ] + ] +``` + +Observe that there are multiple ways to build the type tree. Instead of the +tree above, we could also group on type variable `a` first, such that node 3 +would have the type `Int -> b` and node 6 the type `String -> b`. How the type +is built depends on the order in which types are found by the index builder. It +does not really matter for the search algorithm. + +Besides this, there are some nice tricks to improve the results. For instance, +type synonyms are resolved in both the type tree and the query. In Hoogle, you +won't find `length :: Foldable t => t a -> Int` with the query `String -> Int`, +because it does not resolve `String -> Int` to `[Char] -> Int`. In Cloogle, the +same query will find `size :: (a e) -> Int | Array a e`, because it knows that +`String` is an `{#} Char` (an array of characters). + +Also, universally quantified type variables can be used to restrict the search +to functions that are polymorphic. By default, a query like `[a] -> [a]` will +return results like `indexList :: [a] -> [Int]`, because the user might not +realize that the function they are looking for cannot be defined in a +polymorphic way (or the function in the index does not have the most general +type). However, if you are sure you only want results where `a` is not unified, +you can use `A.a: [a] -> [a]`. `A` here is similar to Haskell's `forall`. The +unification system takes care of the rest. + +# Ranking + +The last interesting optimization concerns ranking. We go several times through +the index to determine which entries should be returned. Once we have the +result set we need to order it, but we do not want to have to reunify types to +determine the distance measure. This is what the `annotations` field in the +`NativeDB` is for: during search, you can leave annotations, which are then +picked up by the ranker to compute the distance measure. That way you only need +to look at the annotations instead of performing part of the search logic +again. Cloogle for instance keeps track of the unifier (for type search) and +the number of matching *n*-grams (for name search): + +```clean +:: Annotation + = MatchingNGramsQuery !Real //* The number of matching n-grams in the query + | MatchingNGramsResult !Real //* The number of matching n-grams in the result + | Unifier !Unifier //* For type search, the unifier + /* .. */ +``` + +It is not immediately obvious how these annotations should be turned into a +single distance measure. And we don't want to think about this. What Cloogle +does instead is define a list of variables based on which we might want to +rank, and define a list of ranking constraints. We then use a constraint solver +to compute the exact ranking formula. + +Some examples of ranking variables are: + +- The ratio of *n*-grams from the query that match in the result +- The ratio of *n*-grams from the result that match in the query +- The number of type variables in the unifier +- Whether the result comes from the standard library + +Some examples of ranking constraints are: + +- For the query `toInt`, prefer the class `toInt` over the function + `digitToInt`. +- For the query `Char`, prefer the built-in type `Char` over the class + `toChar`. + +Cloogle assumes that the ranking function is a sum of products of the ranking +variables with a weight. When it updates the index, as a final step it computes +the symbolic distance measure for all the ranking constraints, and passes these +to [Z3][] to find appropriate weights for all the ranking variables. + +If we come across a query that returns results in an odd order, we just need to +add a ranking constraint; we do not need to think and write a new formula +ourselves. Furthermore, the ranking constraints act as test cases for the +ranking function, so we are sure that changes to the formula will not introduce +regressions. + +# Conclusion + +Once results are ranked, we need to find detailed information for the first 15 +(or later ones, if a lower result page is requested). For example, we show the +exact unifier for type search. We cannot use annotations for this, because in +the type tree names of type variables are generalized. But because we need to +do this only for 15 results, we don't need to be very fast here. We simply +unify these results again to get the real, human-readable unifier, and like +this we fetch some other information as well. + +Overall, I'm quite happy with the orthogonal setup of different search +strategies. The implementation of name search and type search is completely +different, and each is optimized in different ways, but you can still combine +the search types in a single query. Also the annotations make the architecture +much simpler and avoid duplicating logic in the search and ranking functions. +Finally, using a constraint solver to determine the ranking weights has proven +to be a good basis for a solid ranking function. + +[Clean]: http://clean.cs.ru.nl/ +[Cloogle]: https://cloogle.org/ +[Haskell]: https://haskell.org/ +[Hoogle]: https://hoogle.haskell.org/ +[Hoogle 5]: https://neilmitchell.blogspot.com/2020/06/hoogle-searching-overview.html +[stats]: https://cloogle.org/stats/ +[Z3]: https://github.com/Z3Prover/z3 diff --git a/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug b/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug deleted file mode 100644 index 859f11c..0000000 --- a/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug +++ /dev/null @@ -1,18 +0,0 @@ -extends /layout-articles.pug - -block prepend title - | Cloogle search overview - - -block prepend menu - - var page = '2021-07-26-cloogle-search-overview' - -block append breadcrumbs - +breadcrumb('Cloogle search overview') - -block subtitle - | Cloogle search overview -block subtitleDate - | 26 July 2021 - -block page - include:markdown ../../../md/2021-07-26-cloogle-search-overview.md diff --git a/resources/pug/finals/articles/2021-08-02-cloogle-search-overview.pug b/resources/pug/finals/articles/2021-08-02-cloogle-search-overview.pug new file mode 100644 index 0000000..416b075 --- /dev/null +++ b/resources/pug/finals/articles/2021-08-02-cloogle-search-overview.pug @@ -0,0 +1,18 @@ +extends /layout-articles.pug + +block prepend title + | Cloogle search overview - + +block prepend menu + - var page = '2021-08-02-cloogle-search-overview' + +block append breadcrumbs + +breadcrumb('Cloogle search overview') + +block subtitle + | Cloogle search overview +block subtitleDate + | 2 August 2021 + +block page + include:markdown ../../../md/2021-08-02-cloogle-search-overview.md diff --git a/resources/pug/finals/articles/index.pug b/resources/pug/finals/articles/index.pug index 3a5264c..b9fcfff 100644 --- a/resources/pug/finals/articles/index.pug +++ b/resources/pug/finals/articles/index.pug @@ -15,8 +15,8 @@ block page | You can also go back to my #[a(href='/') home page]. h1 - a(href='2021-07-26-cloogle-search-overview.html'). - 26 July 2021: Cloogle search overview + a(href='2021-08-02-cloogle-search-overview.html'). + 2 August 2021: Cloogle search overview blockquote. Cloogle is a search engine for the pure, functional programming language Clean, similar to Hoogle for Haskell. In this post I go through the design of the search backend and make a comparison with that of Hoogle 5. -- cgit v1.2.3