Update Cloogle article date

author: Camil Staps 2021-08-02 16:17:05 +0200
committer: Camil Staps 2021-08-02 16:17:05 +0200
commit: d8289ad28290bd3081d6cf64a3bc5c25e65c718c (patch)
tree: a2f70ed2c5c85c484b99988b0cd25ce38be1b48c /resources/md/2021-07-26-cloogle-search-overview.md
parent: Add extra info about type search (diff)
1 files changed, 0 insertions, 212 deletions
diff --git a/resources/md/2021-07-26-cloogle-search-overview.md b/resources/md/2021-07-26-cloogle-search-overview.md
deleted file mode 100644
index 6fb6e6e..0000000
--- a/resources/md/2021-07-26-cloogle-search-overview.md
+++ /dev/null
@@ -1,212 +0,0 @@
-*[Cloogle][] is a search engine for the pure, functional programming language
-[Clean][], similar to [Hoogle][] for [Haskell][]. In this post I go through the
-design of the search backend and make a comparison with that of [Hoogle 5][].*
-
-[[toc]]
-
-# Overview
-
-Internally, Cloogle keeps the entire index in memory. This is possible because
-it is relatively small. While we have different entry types for functions, type
-definitions, classes, and other things, they are all stored in one big array, a
-`NativeDB CloogleEntry Annotation`:
-
-```clean
-:: *NativeDB v a :== *{!*Entry v a}
-
-:: Entry v a =
-	{ value       :: !v
-	, included    :: !Bool
-	, annotations :: ![!a!]
-	}
-
-:: CloogleEntry
-	= FunctionEntry !FunctionEntry
-	| TypeDefEntry !TypeDefEntry
-	/* .. */
-```
-
-Because the `NativeDB` is
-[unique](https://en.wikipedia.org/wiki/Uniqueness_type), we can do mutable
-updates on it. While searching, we keep all entries in the array, and use
-`included` to signal whether an entry should be returned from a search result
-or not.
-
-For a search request, we go multiple times through the database and toggle the
-`included` flag as needed. For example, for the query
-`map :: (a -> b) [a] -> [b]` we would first filter on the name `map` and then
-on the type `(a -> b) [a] -> [b]`. The implementation of each filter is
-independent of the others, so that each can be implemented in the way that is
-most efficient for it.
-
-# Name search
-
-To do fuzzy searches for names, the index contains an `NGramIndex Index`:
-
-```clean
-:: NGramIndex v =
-	{ n   :: !Int            //* The parameter *n* for the size of the grams
-	, ci  :: !Bool           //* Whether matching is case-insensitive
-	, idx :: !Map String [v] //* The values
-	}
-```
-
-The values stored in this index are indices for the `NativeDB`. So if we would
-search for `query`, we would take the 3-grams `que`, `uer`, and `ery` (*n* is
-currently 3), and for each gram lookup the list of indices in `idx`. These
-lists are then merged (they are sorted, so this can be done in O(n)), and we go
-through the `NativeDB` to set `included` for the indices in the list.
-
-I experimented with a [trie](https://en.wikipedia.org/wiki/Trie) instead of a
-binary tree for `idx`, but it wasn't faster. After all, we only need few
-lookups in this structure.
-
-# Type search
-
-This is the complicated one. [Hoogle has a very fancy algorithm][Hoogle 5],
-which creates fingerprints for type signatures based on various things, like
-the arity, the number of type constructors, and the rarest type names. But
-although I see the rationale behind it, I have always found that Hoogle tends
-to find weird pairs similar, and I prefer something a bit more white-box. So
-Cloogle returns *only* functions for which the type can be *unified* with the
-query type. This has downsides (if you forget a parameter, you won't find your
-function), but the advantage is that ranking is much easier.
-
-Originally, we would unify every type in the database with the query type, but
-this is very slow. We now make use of an ordering on types. We say that
-*t*<sub>1</sub> &le; *t*<sub>2</sub> iff *t*<sub>2</sub> generalises
-*t*<sub>1</sub>. From this follows that if the query type does not unify with
-*t*<sub>2</sub>, it will also not unify with *t*<sub>1</sub>. This observation
-allows us to prune away a lot of results.
-
-In the index, we keep a tree of all types, starting with the most general type,
-`a`, in the root. The children are the more specific types, such as `Int`,
-`c a`, and `a -> b`. This is represented in a simple data structure:
-
-```clean
-:: TypeTree v = Node Type [v] [TypeTree v]
-```
-
-We keep a `TypeTree Index`. Isomorphic types are stored only once in the index,
-and they keep a list of indices into the `NativeDB` of functions that have that
-type. When looking for a query type, we do a depth-first walk over the type
-tree. However, we prune a node's children when the node's type does not unify
-with the query type.
-
-For example, consider the type tree below. If we search for `Char -> Int` we
-only want to find nodes 1, 2, and 3. With the type tree, `a -> String` in node
-6 does not unify, so we never visit nodes 7 and 8.
-
-```clean
-Node "a" ..                                  // 1
-	[ Node "a -> b" ..                       // 2
-		[ Node "a -> Int" ..                 // 3
-			[ Node "Int -> Int" .. []        // 4
-			, Node "String -> Int" .. []     // 5
-			]
-		, Node "a -> String" ..              // 6
-			[ Node "Int -> String" .. []     // 7
-			, Node "String -> String" .. []  // 8
-			]
-		]
-	]
-```
-
-Observe that there are multiple ways to build the type tree. Instead of the
-tree above, we could also group on type variable `a` first, such that node 3
-would have the type `Int -> b` and node 6 the type `String -> b`. How the type
-is built depends on the order in which types are found by the index builder. It
-does not really matter for the search algorithm.
-
-Besides this, there are some nice tricks to improve the results. For instance,
-type synonyms are resolved in both the type tree and the query. In Hoogle, you
-won't find `length :: Foldable t => t a -> Int` with the query `String -> Int`,
-because it does not resolve `String -> Int` to `[Char] -> Int`. In Cloogle, the
-same query will find `size :: (a e) -> Int | Array a e`, because it knows that
-`String` is an `{#} Char` (an array of characters).
-
-Also, universally quantified type variables can be used to restrict the search
-to functions that are polymorphic. By default, a query like `[a] -> [a]` will
-return results like `indexList :: [a] -> [Int]`, because the user might not
-realize that the function they are looking for cannot be defined in a
-polymorphic way (or the function in the index does not have the most general
-type). However, if you are sure you only want results where `a` is not unified,
-you can use `A.a: [a] -> [a]`. `A` here is similar to Haskell's `forall`. The
-unification system takes care of the rest.
-
-# Ranking
-
-The last interesting optimization concerns ranking. We go several times through
-the index to determine which entries should be returned. Once we have the
-result set we need to order it, but we do not want to have to reunify types to
-determine the distance measure. This is what the `annotations` field in the
-`NativeDB` is for: during search, you can leave annotations, which are then
-picked up by the ranker to compute the distance measure. That way you only need
-to look at the annotations instead of performing part of the search logic
-again. Cloogle for instance keeps track of the unifier (for type search) and
-the number of matching *n*-grams (for name search):
-
-```clean
-:: Annotation
-	= MatchingNGramsQuery !Real  //* The number of matching n-grams in the query
-	| MatchingNGramsResult !Real //* The number of matching n-grams in the result
-	| Unifier !Unifier  //* For type search, the unifier
-	/* .. */
-```
-
-It is not immediately obvious how these annotations should be turned into a
-single distance measure. And we don't want to think about this. What Cloogle
-does instead is define a list of variables based on which we might want to
-rank, and define a list of ranking constraints. We then use a constraint solver
-to compute the exact ranking formula.
-
-Some examples of ranking variables are:
-
-- The ratio of *n*-grams from the query that match in the result
-- The ratio of *n*-grams from the result that match in the query
-- The number of type variables in the unifier
-- Whether the result comes from the standard library
-
-Some examples of ranking constraints are:
-
-- For the query `toInt`, prefer the class `toInt` over the function
-  `digitToInt`.
-- For the query `Char`, prefer the built-in type `Char` over the class
-  `toChar`.
-
-Cloogle assumes that the ranking function is a sum of products of the ranking
-variables with a weight. When it updates the index, as a final step it computes
-the symbolic distance measure for all the ranking constraints, and passes these
-to [Z3][] to find appropriate weights for all the ranking variables.
-
-If we come across a query that returns results in an odd order, we just need to
-add a ranking constraint; we do not need to think and write a new formula
-ourselves. Furthermore, the ranking constraints act as test cases for the
-ranking function, so we are sure that changes to the formula will not introduce
-regressions.
-
-# Conclusion
-
-Once results are ranked, we need to find detailed information for the first 15
-(or later ones, if a lower result page is requested). For example, we show the
-exact unifier for type search. We cannot use annotations for this, because in
-the type tree names of type variables are generalized. But because we need to
-do this only for 15 results, we don't need to be very fast here. We simply
-unify these results again to get the real, human-readable unifier, and like
-this we fetch some other information as well.
-
-Overall, I'm quite happy with the orthogonal setup of different search
-strategies. The implementation of name search and type search is completely
-different, and each is optimized in different ways, but you can still combine
-the search types in a single query. Also the annotations make the architecture
-much simpler and avoid duplicating logic in the search and ranking functions.
-Finally, using a constraint solver to determine the ranking weights has proven
-to be a good basis for a solid ranking function.
-
-[Clean]: http://clean.cs.ru.nl/
-[Cloogle]: https://cloogle.org/
-[Haskell]: https://haskell.org/
-[Hoogle]: https://hoogle.haskell.org/
-[Hoogle 5]: https://neilmitchell.blogspot.com/2020/06/hoogle-searching-overview.html
-[stats]: https://cloogle.org/stats/
-[Z3]: https://github.com/Z3Prover/z3
author	Camil Staps	2021-08-02 16:17:05 +0200
committer	Camil Staps	2021-08-02 16:17:05 +0200
commit	d8289ad28290bd3081d6cf64a3bc5c25e65c718c (patch)
tree	a2f70ed2c5c85c484b99988b0cd25ce38be1b48c /resources/md/2021-07-26-cloogle-search-overview.md
parent	Add extra info about type search (diff)