From d8289ad28290bd3081d6cf64a3bc5c25e65c718c Mon Sep 17 00:00:00 2001
From: Camil Staps
Date: Mon, 2 Aug 2021 16:17:05 +0200
Subject: Update Cloogle article date
---
resources/md/2021-07-26-cloogle-search-overview.md | 212 ---------------------
resources/md/2021-08-02-cloogle-search-overview.md | 212 +++++++++++++++++++++
.../2021-07-26-cloogle-search-overview.pug | 18 --
.../2021-08-02-cloogle-search-overview.pug | 18 ++
resources/pug/finals/articles/index.pug | 4 +-
5 files changed, 232 insertions(+), 232 deletions(-)
delete mode 100644 resources/md/2021-07-26-cloogle-search-overview.md
create mode 100644 resources/md/2021-08-02-cloogle-search-overview.md
delete mode 100644 resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug
create mode 100644 resources/pug/finals/articles/2021-08-02-cloogle-search-overview.pug
diff --git a/resources/md/2021-07-26-cloogle-search-overview.md b/resources/md/2021-07-26-cloogle-search-overview.md
deleted file mode 100644
index 6fb6e6e..0000000
--- a/resources/md/2021-07-26-cloogle-search-overview.md
+++ /dev/null
@@ -1,212 +0,0 @@
-*[Cloogle][] is a search engine for the pure, functional programming language
-[Clean][], similar to [Hoogle][] for [Haskell][]. In this post I go through the
-design of the search backend and make a comparison with that of [Hoogle 5][].*
-
-[[toc]]
-
-# Overview
-
-Internally, Cloogle keeps the entire index in memory. This is possible because
-it is relatively small. While we have different entry types for functions, type
-definitions, classes, and other things, they are all stored in one big array, a
-`NativeDB CloogleEntry Annotation`:
-
-```clean
-:: *NativeDB v a :== *{!*Entry v a}
-
-:: Entry v a =
- { value :: !v
- , included :: !Bool
- , annotations :: ![!a!]
- }
-
-:: CloogleEntry
- = FunctionEntry !FunctionEntry
- | TypeDefEntry !TypeDefEntry
- /* .. */
-```
-
-Because the `NativeDB` is
-[unique](https://en.wikipedia.org/wiki/Uniqueness_type), we can do mutable
-updates on it. While searching, we keep all entries in the array, and use
-`included` to signal whether an entry should be returned from a search result
-or not.
-
-For a search request, we go multiple times through the database and toggle the
-`included` flag as needed. For example, for the query
-`map :: (a -> b) [a] -> [b]` we would first filter on the name `map` and then
-on the type `(a -> b) [a] -> [b]`. The implementation of each filter is
-independent of the others, so that each can be implemented in the way that is
-most efficient for it.
-
-# Name search
-
-To do fuzzy searches for names, the index contains an `NGramIndex Index`:
-
-```clean
-:: NGramIndex v =
- { n :: !Int //* The parameter *n* for the size of the grams
- , ci :: !Bool //* Whether matching is case-insensitive
- , idx :: !Map String [v] //* The values
- }
-```
-
-The values stored in this index are indices for the `NativeDB`. So if we would
-search for `query`, we would take the 3-grams `que`, `uer`, and `ery` (*n* is
-currently 3), and for each gram lookup the list of indices in `idx`. These
-lists are then merged (they are sorted, so this can be done in O(n)), and we go
-through the `NativeDB` to set `included` for the indices in the list.
-
-I experimented with a [trie](https://en.wikipedia.org/wiki/Trie) instead of a
-binary tree for `idx`, but it wasn't faster. After all, we only need few
-lookups in this structure.
-
-# Type search
-
-This is the complicated one. [Hoogle has a very fancy algorithm][Hoogle 5],
-which creates fingerprints for type signatures based on various things, like
-the arity, the number of type constructors, and the rarest type names. But
-although I see the rationale behind it, I have always found that Hoogle tends
-to find weird pairs similar, and I prefer something a bit more white-box. So
-Cloogle returns *only* functions for which the type can be *unified* with the
-query type. This has downsides (if you forget a parameter, you won't find your
-function), but the advantage is that ranking is much easier.
-
-Originally, we would unify every type in the database with the query type, but
-this is very slow. We now make use of an ordering on types. We say that
-*t*1 ≤ *t*2 iff *t*2 generalises
-*t*1. From this follows that if the query type does not unify with
-*t*2, it will also not unify with *t*1. This observation
-allows us to prune away a lot of results.
-
-In the index, we keep a tree of all types, starting with the most general type,
-`a`, in the root. The children are the more specific types, such as `Int`,
-`c a`, and `a -> b`. This is represented in a simple data structure:
-
-```clean
-:: TypeTree v = Node Type [v] [TypeTree v]
-```
-
-We keep a `TypeTree Index`. Isomorphic types are stored only once in the index,
-and they keep a list of indices into the `NativeDB` of functions that have that
-type. When looking for a query type, we do a depth-first walk over the type
-tree. However, we prune a node's children when the node's type does not unify
-with the query type.
-
-For example, consider the type tree below. If we search for `Char -> Int` we
-only want to find nodes 1, 2, and 3. With the type tree, `a -> String` in node
-6 does not unify, so we never visit nodes 7 and 8.
-
-```clean
-Node "a" .. // 1
- [ Node "a -> b" .. // 2
- [ Node "a -> Int" .. // 3
- [ Node "Int -> Int" .. [] // 4
- , Node "String -> Int" .. [] // 5
- ]
- , Node "a -> String" .. // 6
- [ Node "Int -> String" .. [] // 7
- , Node "String -> String" .. [] // 8
- ]
- ]
- ]
-```
-
-Observe that there are multiple ways to build the type tree. Instead of the
-tree above, we could also group on type variable `a` first, such that node 3
-would have the type `Int -> b` and node 6 the type `String -> b`. How the type
-is built depends on the order in which types are found by the index builder. It
-does not really matter for the search algorithm.
-
-Besides this, there are some nice tricks to improve the results. For instance,
-type synonyms are resolved in both the type tree and the query. In Hoogle, you
-won't find `length :: Foldable t => t a -> Int` with the query `String -> Int`,
-because it does not resolve `String -> Int` to `[Char] -> Int`. In Cloogle, the
-same query will find `size :: (a e) -> Int | Array a e`, because it knows that
-`String` is an `{#} Char` (an array of characters).
-
-Also, universally quantified type variables can be used to restrict the search
-to functions that are polymorphic. By default, a query like `[a] -> [a]` will
-return results like `indexList :: [a] -> [Int]`, because the user might not
-realize that the function they are looking for cannot be defined in a
-polymorphic way (or the function in the index does not have the most general
-type). However, if you are sure you only want results where `a` is not unified,
-you can use `A.a: [a] -> [a]`. `A` here is similar to Haskell's `forall`. The
-unification system takes care of the rest.
-
-# Ranking
-
-The last interesting optimization concerns ranking. We go several times through
-the index to determine which entries should be returned. Once we have the
-result set we need to order it, but we do not want to have to reunify types to
-determine the distance measure. This is what the `annotations` field in the
-`NativeDB` is for: during search, you can leave annotations, which are then
-picked up by the ranker to compute the distance measure. That way you only need
-to look at the annotations instead of performing part of the search logic
-again. Cloogle for instance keeps track of the unifier (for type search) and
-the number of matching *n*-grams (for name search):
-
-```clean
-:: Annotation
- = MatchingNGramsQuery !Real //* The number of matching n-grams in the query
- | MatchingNGramsResult !Real //* The number of matching n-grams in the result
- | Unifier !Unifier //* For type search, the unifier
- /* .. */
-```
-
-It is not immediately obvious how these annotations should be turned into a
-single distance measure. And we don't want to think about this. What Cloogle
-does instead is define a list of variables based on which we might want to
-rank, and define a list of ranking constraints. We then use a constraint solver
-to compute the exact ranking formula.
-
-Some examples of ranking variables are:
-
-- The ratio of *n*-grams from the query that match in the result
-- The ratio of *n*-grams from the result that match in the query
-- The number of type variables in the unifier
-- Whether the result comes from the standard library
-
-Some examples of ranking constraints are:
-
-- For the query `toInt`, prefer the class `toInt` over the function
- `digitToInt`.
-- For the query `Char`, prefer the built-in type `Char` over the class
- `toChar`.
-
-Cloogle assumes that the ranking function is a sum of products of the ranking
-variables with a weight. When it updates the index, as a final step it computes
-the symbolic distance measure for all the ranking constraints, and passes these
-to [Z3][] to find appropriate weights for all the ranking variables.
-
-If we come across a query that returns results in an odd order, we just need to
-add a ranking constraint; we do not need to think and write a new formula
-ourselves. Furthermore, the ranking constraints act as test cases for the
-ranking function, so we are sure that changes to the formula will not introduce
-regressions.
-
-# Conclusion
-
-Once results are ranked, we need to find detailed information for the first 15
-(or later ones, if a lower result page is requested). For example, we show the
-exact unifier for type search. We cannot use annotations for this, because in
-the type tree names of type variables are generalized. But because we need to
-do this only for 15 results, we don't need to be very fast here. We simply
-unify these results again to get the real, human-readable unifier, and like
-this we fetch some other information as well.
-
-Overall, I'm quite happy with the orthogonal setup of different search
-strategies. The implementation of name search and type search is completely
-different, and each is optimized in different ways, but you can still combine
-the search types in a single query. Also the annotations make the architecture
-much simpler and avoid duplicating logic in the search and ranking functions.
-Finally, using a constraint solver to determine the ranking weights has proven
-to be a good basis for a solid ranking function.
-
-[Clean]: http://clean.cs.ru.nl/
-[Cloogle]: https://cloogle.org/
-[Haskell]: https://haskell.org/
-[Hoogle]: https://hoogle.haskell.org/
-[Hoogle 5]: https://neilmitchell.blogspot.com/2020/06/hoogle-searching-overview.html
-[stats]: https://cloogle.org/stats/
-[Z3]: https://github.com/Z3Prover/z3
diff --git a/resources/md/2021-08-02-cloogle-search-overview.md b/resources/md/2021-08-02-cloogle-search-overview.md
new file mode 100644
index 0000000..6fb6e6e
--- /dev/null
+++ b/resources/md/2021-08-02-cloogle-search-overview.md
@@ -0,0 +1,212 @@
+*[Cloogle][] is a search engine for the pure, functional programming language
+[Clean][], similar to [Hoogle][] for [Haskell][]. In this post I go through the
+design of the search backend and make a comparison with that of [Hoogle 5][].*
+
+[[toc]]
+
+# Overview
+
+Internally, Cloogle keeps the entire index in memory. This is possible because
+it is relatively small. While we have different entry types for functions, type
+definitions, classes, and other things, they are all stored in one big array, a
+`NativeDB CloogleEntry Annotation`:
+
+```clean
+:: *NativeDB v a :== *{!*Entry v a}
+
+:: Entry v a =
+ { value :: !v
+ , included :: !Bool
+ , annotations :: ![!a!]
+ }
+
+:: CloogleEntry
+ = FunctionEntry !FunctionEntry
+ | TypeDefEntry !TypeDefEntry
+ /* .. */
+```
+
+Because the `NativeDB` is
+[unique](https://en.wikipedia.org/wiki/Uniqueness_type), we can do mutable
+updates on it. While searching, we keep all entries in the array, and use
+`included` to signal whether an entry should be returned from a search result
+or not.
+
+For a search request, we go multiple times through the database and toggle the
+`included` flag as needed. For example, for the query
+`map :: (a -> b) [a] -> [b]` we would first filter on the name `map` and then
+on the type `(a -> b) [a] -> [b]`. The implementation of each filter is
+independent of the others, so that each can be implemented in the way that is
+most efficient for it.
+
+# Name search
+
+To do fuzzy searches for names, the index contains an `NGramIndex Index`:
+
+```clean
+:: NGramIndex v =
+ { n :: !Int //* The parameter *n* for the size of the grams
+ , ci :: !Bool //* Whether matching is case-insensitive
+ , idx :: !Map String [v] //* The values
+ }
+```
+
+The values stored in this index are indices for the `NativeDB`. So if we would
+search for `query`, we would take the 3-grams `que`, `uer`, and `ery` (*n* is
+currently 3), and for each gram lookup the list of indices in `idx`. These
+lists are then merged (they are sorted, so this can be done in O(n)), and we go
+through the `NativeDB` to set `included` for the indices in the list.
+
+I experimented with a [trie](https://en.wikipedia.org/wiki/Trie) instead of a
+binary tree for `idx`, but it wasn't faster. After all, we only need few
+lookups in this structure.
+
+# Type search
+
+This is the complicated one. [Hoogle has a very fancy algorithm][Hoogle 5],
+which creates fingerprints for type signatures based on various things, like
+the arity, the number of type constructors, and the rarest type names. But
+although I see the rationale behind it, I have always found that Hoogle tends
+to find weird pairs similar, and I prefer something a bit more white-box. So
+Cloogle returns *only* functions for which the type can be *unified* with the
+query type. This has downsides (if you forget a parameter, you won't find your
+function), but the advantage is that ranking is much easier.
+
+Originally, we would unify every type in the database with the query type, but
+this is very slow. We now make use of an ordering on types. We say that
+*t*1 ≤ *t*2 iff *t*2 generalises
+*t*1. From this follows that if the query type does not unify with
+*t*2, it will also not unify with *t*1. This observation
+allows us to prune away a lot of results.
+
+In the index, we keep a tree of all types, starting with the most general type,
+`a`, in the root. The children are the more specific types, such as `Int`,
+`c a`, and `a -> b`. This is represented in a simple data structure:
+
+```clean
+:: TypeTree v = Node Type [v] [TypeTree v]
+```
+
+We keep a `TypeTree Index`. Isomorphic types are stored only once in the index,
+and they keep a list of indices into the `NativeDB` of functions that have that
+type. When looking for a query type, we do a depth-first walk over the type
+tree. However, we prune a node's children when the node's type does not unify
+with the query type.
+
+For example, consider the type tree below. If we search for `Char -> Int` we
+only want to find nodes 1, 2, and 3. With the type tree, `a -> String` in node
+6 does not unify, so we never visit nodes 7 and 8.
+
+```clean
+Node "a" .. // 1
+ [ Node "a -> b" .. // 2
+ [ Node "a -> Int" .. // 3
+ [ Node "Int -> Int" .. [] // 4
+ , Node "String -> Int" .. [] // 5
+ ]
+ , Node "a -> String" .. // 6
+ [ Node "Int -> String" .. [] // 7
+ , Node "String -> String" .. [] // 8
+ ]
+ ]
+ ]
+```
+
+Observe that there are multiple ways to build the type tree. Instead of the
+tree above, we could also group on type variable `a` first, such that node 3
+would have the type `Int -> b` and node 6 the type `String -> b`. How the type
+is built depends on the order in which types are found by the index builder. It
+does not really matter for the search algorithm.
+
+Besides this, there are some nice tricks to improve the results. For instance,
+type synonyms are resolved in both the type tree and the query. In Hoogle, you
+won't find `length :: Foldable t => t a -> Int` with the query `String -> Int`,
+because it does not resolve `String -> Int` to `[Char] -> Int`. In Cloogle, the
+same query will find `size :: (a e) -> Int | Array a e`, because it knows that
+`String` is an `{#} Char` (an array of characters).
+
+Also, universally quantified type variables can be used to restrict the search
+to functions that are polymorphic. By default, a query like `[a] -> [a]` will
+return results like `indexList :: [a] -> [Int]`, because the user might not
+realize that the function they are looking for cannot be defined in a
+polymorphic way (or the function in the index does not have the most general
+type). However, if you are sure you only want results where `a` is not unified,
+you can use `A.a: [a] -> [a]`. `A` here is similar to Haskell's `forall`. The
+unification system takes care of the rest.
+
+# Ranking
+
+The last interesting optimization concerns ranking. We go several times through
+the index to determine which entries should be returned. Once we have the
+result set we need to order it, but we do not want to have to reunify types to
+determine the distance measure. This is what the `annotations` field in the
+`NativeDB` is for: during search, you can leave annotations, which are then
+picked up by the ranker to compute the distance measure. That way you only need
+to look at the annotations instead of performing part of the search logic
+again. Cloogle for instance keeps track of the unifier (for type search) and
+the number of matching *n*-grams (for name search):
+
+```clean
+:: Annotation
+ = MatchingNGramsQuery !Real //* The number of matching n-grams in the query
+ | MatchingNGramsResult !Real //* The number of matching n-grams in the result
+ | Unifier !Unifier //* For type search, the unifier
+ /* .. */
+```
+
+It is not immediately obvious how these annotations should be turned into a
+single distance measure. And we don't want to think about this. What Cloogle
+does instead is define a list of variables based on which we might want to
+rank, and define a list of ranking constraints. We then use a constraint solver
+to compute the exact ranking formula.
+
+Some examples of ranking variables are:
+
+- The ratio of *n*-grams from the query that match in the result
+- The ratio of *n*-grams from the result that match in the query
+- The number of type variables in the unifier
+- Whether the result comes from the standard library
+
+Some examples of ranking constraints are:
+
+- For the query `toInt`, prefer the class `toInt` over the function
+ `digitToInt`.
+- For the query `Char`, prefer the built-in type `Char` over the class
+ `toChar`.
+
+Cloogle assumes that the ranking function is a sum of products of the ranking
+variables with a weight. When it updates the index, as a final step it computes
+the symbolic distance measure for all the ranking constraints, and passes these
+to [Z3][] to find appropriate weights for all the ranking variables.
+
+If we come across a query that returns results in an odd order, we just need to
+add a ranking constraint; we do not need to think and write a new formula
+ourselves. Furthermore, the ranking constraints act as test cases for the
+ranking function, so we are sure that changes to the formula will not introduce
+regressions.
+
+# Conclusion
+
+Once results are ranked, we need to find detailed information for the first 15
+(or later ones, if a lower result page is requested). For example, we show the
+exact unifier for type search. We cannot use annotations for this, because in
+the type tree names of type variables are generalized. But because we need to
+do this only for 15 results, we don't need to be very fast here. We simply
+unify these results again to get the real, human-readable unifier, and like
+this we fetch some other information as well.
+
+Overall, I'm quite happy with the orthogonal setup of different search
+strategies. The implementation of name search and type search is completely
+different, and each is optimized in different ways, but you can still combine
+the search types in a single query. Also the annotations make the architecture
+much simpler and avoid duplicating logic in the search and ranking functions.
+Finally, using a constraint solver to determine the ranking weights has proven
+to be a good basis for a solid ranking function.
+
+[Clean]: http://clean.cs.ru.nl/
+[Cloogle]: https://cloogle.org/
+[Haskell]: https://haskell.org/
+[Hoogle]: https://hoogle.haskell.org/
+[Hoogle 5]: https://neilmitchell.blogspot.com/2020/06/hoogle-searching-overview.html
+[stats]: https://cloogle.org/stats/
+[Z3]: https://github.com/Z3Prover/z3
diff --git a/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug b/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug
deleted file mode 100644
index 859f11c..0000000
--- a/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug
+++ /dev/null
@@ -1,18 +0,0 @@
-extends /layout-articles.pug
-
-block prepend title
- | Cloogle search overview -
-
-block prepend menu
- - var page = '2021-07-26-cloogle-search-overview'
-
-block append breadcrumbs
- +breadcrumb('Cloogle search overview')
-
-block subtitle
- | Cloogle search overview
-block subtitleDate
- | 26 July 2021
-
-block page
- include:markdown ../../../md/2021-07-26-cloogle-search-overview.md
diff --git a/resources/pug/finals/articles/2021-08-02-cloogle-search-overview.pug b/resources/pug/finals/articles/2021-08-02-cloogle-search-overview.pug
new file mode 100644
index 0000000..416b075
--- /dev/null
+++ b/resources/pug/finals/articles/2021-08-02-cloogle-search-overview.pug
@@ -0,0 +1,18 @@
+extends /layout-articles.pug
+
+block prepend title
+ | Cloogle search overview -
+
+block prepend menu
+ - var page = '2021-08-02-cloogle-search-overview'
+
+block append breadcrumbs
+ +breadcrumb('Cloogle search overview')
+
+block subtitle
+ | Cloogle search overview
+block subtitleDate
+ | 2 August 2021
+
+block page
+ include:markdown ../../../md/2021-08-02-cloogle-search-overview.md
diff --git a/resources/pug/finals/articles/index.pug b/resources/pug/finals/articles/index.pug
index 3a5264c..b9fcfff 100644
--- a/resources/pug/finals/articles/index.pug
+++ b/resources/pug/finals/articles/index.pug
@@ -15,8 +15,8 @@ block page
| You can also go back to my #[a(href='/') home page].
h1
- a(href='2021-07-26-cloogle-search-overview.html').
- 26 July 2021: Cloogle search overview
+ a(href='2021-08-02-cloogle-search-overview.html').
+ 2 August 2021: Cloogle search overview
blockquote.
Cloogle is a search engine for the pure, functional programming language Clean, similar to Hoogle for Haskell.
In this post I go through the design of the search backend and make a comparison with that of Hoogle 5.
--
cgit v1.2.3