Add article about Cloogle

author: Camil Staps 2021-06-23 21:11:09 +0200
committer: Camil Staps 2021-07-09 15:13:48 +0200
commit: 511d410970406f2dcd10f01a9892f5d0f9e09476 (patch)
tree: 9b8820f0222effa4dfd376498a589a67a0cd9050 /resources
parent: Add note about assistantships to Teaching section (diff)
4 files changed, 224 insertions, 0 deletions
diff --git a/resources/md/2021-07-26-cloogle-search-overview.md b/resources/md/2021-07-26-cloogle-search-overview.md
new file mode 100644
index 0000000..213e71c
--- /dev/null
+++ b/resources/md/2021-07-26-cloogle-search-overview.md
@@ -0,0 +1,196 @@
+*[Cloogle][] is a search engine for the pure, functional programming language
+[Clean][], similar to [Hoogle][] for [Haskell][]. In this post I go through the
+design of the search backend and make a comparison with that of [Hoogle 5][].*
+
+[[toc]]
+
+# Overview
+
+Internally, Cloogle keeps the entire index in memory. This is possible because
+it is relatively small. While we have different entry types for functions, type
+definitions, classes, and other things, they are all stored in one big array, a
+`NativeDB CloogleEntry Annotation`:
+
+```clean
+:: *NativeDB v a :== *{!*Entry v a}
+
+:: Entry v a =
+	{ value       :: !v
+	, included    :: !Bool
+	, annotations :: ![!a!]
+	}
+
+:: CloogleEntry
+	= FunctionEntry !FunctionEntry
+	| TypeDefEntry !TypeDefEntry
+	/* .. */
+```
+
+Because the `NativeDB` is
+[unique](https://en.wikipedia.org/wiki/Uniqueness_type), we can do mutable
+updates on it. While searching, we keep all entries in the array, and use
+`included` to signal whether an entry should be returned from a search result
+or not.
+
+For a search request, we go multiple times through the database and toggle the
+`included` flag as needed. For example, for the query
+`map :: (a -> b) [a] -> [b]` we would first filter on the name `map` and then
+on the type `(a -> b) [a] -> [b]`. The implementation of each filter is
+independent of the others, so that each can be implemented in the way that is
+most efficient for it.
+
+# Name search
+
+To do fuzzy searches for names, the index contains an `NGramIndex Index`:
+
+```clean
+:: NGramIndex v =
+	{ n   :: !Int            //* The parameter *n* for the size of the grams
+	, ci  :: !Bool           //* Whether matching is case-insensitive
+	, idx :: !Map String [v] //* The values
+	}
+```
+
+The values stored in this index are indices for the `NativeDB`. So if we would
+search for `query`, we would take the 3-grams `que`, `uer`, and `ery` (*n* is
+currently 3), and for each gram lookup the list of indices in `idx`. These
+lists are then merged (they are sorted, so this can be done in O(n)), and we go
+through the `NativeDB` to set `included` for the indices in the list.
+
+I experimented with a [trie](https://en.wikipedia.org/wiki/Trie) instead of a
+binary tree for `idx`, but it wasn't faster. After all, we only need few
+lookups in this structure.
+
+# Type search
+
+This is the complicated one. [Hoogle has a very fancy algorithm][Hoogle 5],
+which creates fingerprints for type signatures based on various things, like
+the arity, the number of type constructors, and the rarest type names. But
+although I see the rationale behind it, I have always found that Hoogle tends
+to find weird pairs similar, and I prefer something a bit more white-box. So
+Cloogle returns *only* functions for which the type can be *unified* with the
+query type. This has downsides (if you forget a parameter, you won't find your
+function), but the advantage is that ranking is much easier.
+
+Originally, we would unify every type in the database with the query type, but
+this is very slow. We now make use of an ordering on types. We say that
+*t*<sub>1</sub> &le; *t*<sub>2</sub> iff *t*<sub>2</sub> generalises
+*t*<sub>1</sub>. From this follows that if the query type does not unify with
+*t*<sub>2</sub>, it will also not unify with *t*<sub>1</sub>. This observation
+allows us to prune away a lot of results.
+
+In the index, we keep a tree of all types, starting with the most general type,
+`a`, in the root. The children are the more specific types, such as `Int`,
+`c a`, and `a -> b`. This is represented in a simple data structure:
+
+```clean
+:: TypeTree v = Node Type [v] [TypeTree v]
+```
+
+We keep a `TypeTree Index`. Isomorphic types are stored only once in the index,
+and they keep a list of indices into the `NativeDB` of functions that have that
+type. When looking for a query type, we do a depth-first walk over the type
+tree. However, we prune a node's children when the node's type does not unify
+with the query type.
+
+For example, consider the type tree below. If we search for `Char -> Int` we
+only want to find nodes 1, 2, and 3. With the type tree, `a -> String` in node
+6 does not unify, so we never visit nodes 7 and 8.
+
+```clean
+Node "a" ..                                  // 1
+	[ Node "a -> b" ..                       // 2
+		[ Node "a -> Int" ..                 // 3
+			[ Node "Int -> Int" .. []        // 4
+			, Node "String -> Int" .. []     // 5
+			]
+		, Node "a -> String" ..              // 6
+			[ Node "Int -> String" .. []     // 7
+			, Node "String -> String" .. []  // 8
+			]
+		]
+	]
+```
+
+Observe that there are multiple ways to build the type tree. Instead of the
+tree above, we could also group on type variable `a` first, such that node 3
+would have the type `Int -> b` and node 6 the type `String -> b`. How the type
+is built depends on the order in which types are found by the index builder. It
+does not really matter for the search algorithm.
+
+# Ranking
+
+The last interesting optimization concerns ranking. We go several times through
+the index to determine which entries should be returned. Once we have the
+result set we need to order it, but we do not want to have to reunify types to
+determine the distance measure. This is what the `annotations` field in the
+`NativeDB` is for: during search, you can leave annotations, which are then
+picked up by the ranker to compute the distance measure. That way you only need
+to look at the annotations instead of performing part of the search logic
+again. Cloogle for instance keeps track of the unifier (for type search) and
+the number of matching *n*-grams (for name search):
+
+```clean
+:: Annotation
+	= MatchingNGramsQuery !Real  //* The number of matching n-grams in the query
+	| MatchingNGramsResult !Real //* The number of matching n-grams in the result
+	| Unifier !Unifier  //* For type search, the unifier
+	/* .. */
+```
+
+It is not immediately obvious how these annotations should be turned into a
+single distance measure. And we don't want to think about this. What Cloogle
+does instead is define a list of variables based on which we might want to
+rank, and define a list of ranking constraints. We then use a constraint solver
+to compute the exact ranking formula.
+
+Some examples of ranking variables are:
+
+- The ratio of *n*-grams from the query that match in the result
+- The ratio of *n*-grams from the result that match in the query
+- The number of type variables in the unifier
+- Whether the result comes from the standard library
+
+Some examples of ranking constraints are:
+
+- For the query `toInt`, prefer the class `toInt` over the function
+  `digitToInt`.
+- For the query `Char`, prefer the built-in type `Char` over the class
+  `toChar`.
+
+Cloogle assumes that the ranking function is a sum of products of the ranking
+variables with a weight. When it updates the index, as a final step it computes
+the symbolic distance measure for all the ranking constraints, and passes these
+to [Z3][] to find appropriate weights for all the ranking variables.
+
+If we come across a query that returns results in an odd order, we just need to
+add a ranking constraint; we do not need to think and write a new formula
+ourselves. Furthermore, the ranking constraints act as test cases for the
+ranking function, so we are sure that changes to the formula will not introduce
+regressions.
+
+# Conclusion
+
+Once results are ranked, we need to find detailed information for the first 15
+(or later ones, if a lower result page is requested). For example, we show the
+exact unifier for type search. We cannot use annotations for this, because in
+the type tree names of type variables are generalized. But because we need to
+do this only for 15 results, we don't need to be very fast here. We simply
+unify these results again to get the real, human-readable unifier, and like
+this we fetch some other information as well.
+
+Overall, I'm quite happy with the orthogonal setup of different search
+strategies. The implementation of name search and type search is completely
+different, and each is optimized in different ways, but you can still combine
+the search types in a single query. Also the annotations make the architecture
+much simpler and avoid duplicating logic in the search and ranking functions.
+Finally, using a constraint solver to determine the ranking weights has proven
+to be a good basis for a solid ranking function.
+
+[Clean]: http://clean.cs.ru.nl/
+[Cloogle]: https://cloogle.org/
+[Haskell]: https://haskell.org/
+[Hoogle]: https://hoogle.haskell.org/
+[Hoogle 5]: https://neilmitchell.blogspot.com/2020/06/hoogle-searching-overview.html
+[stats]: https://cloogle.org/stats/
+[Z3]: https://github.com/Z3Prover/z3
diff --git a/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug b/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug
new file mode 100644
index 0000000..859f11c
--- /dev/null
+++ b/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug
@@ -0,0 +1,18 @@
+extends /layout-articles.pug
+
+block prepend title
+	| Cloogle search overview - 
+
+block prepend menu
+	- var page = '2021-07-26-cloogle-search-overview'
+
+block append breadcrumbs
+	+breadcrumb('Cloogle search overview')
+
+block subtitle
+	| Cloogle search overview
+block subtitleDate
+	| 26 July 2021
+
+block page
+	include:markdown ../../../md/2021-07-26-cloogle-search-overview.md
diff --git a/resources/pug/finals/articles/index.pug b/resources/pug/finals/articles/index.pug
index b3100dd..3a5264c 100644
--- a/resources/pug/finals/articles/index.pug
+++ b/resources/pug/finals/articles/index.pug
@@ -15,6 +15,13 @@ block page
 		| You can also go back to my #[a(href='/') home page].
 
 	h1
+		a(href='2021-07-26-cloogle-search-overview.html').
+			26 July 2021: Cloogle search overview
+	blockquote.
+		Cloogle is a search engine for the pure, functional programming language Clean, similar to Hoogle for Haskell.
+		In this post I go through the design of the search backend and make a comparison with that of Hoogle 5.
+
+	h1
 		a(href='2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-1-introduction.html').
 			23 June 2021: Compiling Clean in the browser with WebAssembly
 	blockquote.
diff --git a/resources/pug/include/layout-articles.pug b/resources/pug/include/layout-articles.pug
index 10feaab..c35e38b 100644
--- a/resources/pug/include/layout-articles.pug
+++ b/resources/pug/include/layout-articles.pug
@@ -8,6 +8,9 @@ block append menu
 	+menu(
 		{name: 'Home', link: ''},
 		{name: 2021, menu: [
+			{ name: 'Cloogle search overview',
+				year: 2021, month: 7, day: 26
+			},
 			{ name: 'Compiling Clean in the browser with WebAssembly, part 3: Putting it all together',
 				year: 2021, month: 6, day: 23
 			},
author	Camil Staps	2021-06-23 21:11:09 +0200
committer	Camil Staps	2021-07-09 15:13:48 +0200
commit	511d410970406f2dcd10f01a9892f5d0f9e09476 (patch)
tree	9b8820f0222effa4dfd376498a589a67a0cd9050 /resources
parent	Add note about assistantships to Teaching section (diff)