diff options
| author | Camil Staps | 2021-06-23 21:11:09 +0200 | 
|---|---|---|
| committer | Camil Staps | 2021-07-09 15:13:48 +0200 | 
| commit | 511d410970406f2dcd10f01a9892f5d0f9e09476 (patch) | |
| tree | 9b8820f0222effa4dfd376498a589a67a0cd9050 /resources | |
| parent | Add note about assistantships to Teaching section (diff) | |
Add article about Cloogle
Diffstat (limited to 'resources')
| -rw-r--r-- | resources/md/2021-07-26-cloogle-search-overview.md | 196 | ||||
| -rw-r--r-- | resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug | 18 | ||||
| -rw-r--r-- | resources/pug/finals/articles/index.pug | 7 | ||||
| -rw-r--r-- | resources/pug/include/layout-articles.pug | 3 | 
4 files changed, 224 insertions, 0 deletions
| diff --git a/resources/md/2021-07-26-cloogle-search-overview.md b/resources/md/2021-07-26-cloogle-search-overview.md new file mode 100644 index 0000000..213e71c --- /dev/null +++ b/resources/md/2021-07-26-cloogle-search-overview.md @@ -0,0 +1,196 @@ +*[Cloogle][] is a search engine for the pure, functional programming language +[Clean][], similar to [Hoogle][] for [Haskell][]. In this post I go through the +design of the search backend and make a comparison with that of [Hoogle 5][].* + +[[toc]] + +# Overview + +Internally, Cloogle keeps the entire index in memory. This is possible because +it is relatively small. While we have different entry types for functions, type +definitions, classes, and other things, they are all stored in one big array, a +`NativeDB CloogleEntry Annotation`: + +```clean +:: *NativeDB v a :== *{!*Entry v a} + +:: Entry v a = +	{ value       :: !v +	, included    :: !Bool +	, annotations :: ![!a!] +	} + +:: CloogleEntry +	= FunctionEntry !FunctionEntry +	| TypeDefEntry !TypeDefEntry +	/* .. */ +``` + +Because the `NativeDB` is +[unique](https://en.wikipedia.org/wiki/Uniqueness_type), we can do mutable +updates on it. While searching, we keep all entries in the array, and use +`included` to signal whether an entry should be returned from a search result +or not. + +For a search request, we go multiple times through the database and toggle the +`included` flag as needed. For example, for the query +`map :: (a -> b) [a] -> [b]` we would first filter on the name `map` and then +on the type `(a -> b) [a] -> [b]`. The implementation of each filter is +independent of the others, so that each can be implemented in the way that is +most efficient for it. + +# Name search + +To do fuzzy searches for names, the index contains an `NGramIndex Index`: + +```clean +:: NGramIndex v = +	{ n   :: !Int            //* The parameter *n* for the size of the grams +	, ci  :: !Bool           //* Whether matching is case-insensitive +	, idx :: !Map String [v] //* The values +	} +``` + +The values stored in this index are indices for the `NativeDB`. So if we would +search for `query`, we would take the 3-grams `que`, `uer`, and `ery` (*n* is +currently 3), and for each gram lookup the list of indices in `idx`. These +lists are then merged (they are sorted, so this can be done in O(n)), and we go +through the `NativeDB` to set `included` for the indices in the list. + +I experimented with a [trie](https://en.wikipedia.org/wiki/Trie) instead of a +binary tree for `idx`, but it wasn't faster. After all, we only need few +lookups in this structure. + +# Type search + +This is the complicated one. [Hoogle has a very fancy algorithm][Hoogle 5], +which creates fingerprints for type signatures based on various things, like +the arity, the number of type constructors, and the rarest type names. But +although I see the rationale behind it, I have always found that Hoogle tends +to find weird pairs similar, and I prefer something a bit more white-box. So +Cloogle returns *only* functions for which the type can be *unified* with the +query type. This has downsides (if you forget a parameter, you won't find your +function), but the advantage is that ranking is much easier. + +Originally, we would unify every type in the database with the query type, but +this is very slow. We now make use of an ordering on types. We say that +*t*<sub>1</sub> ≤ *t*<sub>2</sub> iff *t*<sub>2</sub> generalises +*t*<sub>1</sub>. From this follows that if the query type does not unify with +*t*<sub>2</sub>, it will also not unify with *t*<sub>1</sub>. This observation +allows us to prune away a lot of results. + +In the index, we keep a tree of all types, starting with the most general type, +`a`, in the root. The children are the more specific types, such as `Int`, +`c a`, and `a -> b`. This is represented in a simple data structure: + +```clean +:: TypeTree v = Node Type [v] [TypeTree v] +``` + +We keep a `TypeTree Index`. Isomorphic types are stored only once in the index, +and they keep a list of indices into the `NativeDB` of functions that have that +type. When looking for a query type, we do a depth-first walk over the type +tree. However, we prune a node's children when the node's type does not unify +with the query type. + +For example, consider the type tree below. If we search for `Char -> Int` we +only want to find nodes 1, 2, and 3. With the type tree, `a -> String` in node +6 does not unify, so we never visit nodes 7 and 8. + +```clean +Node "a" ..                                  // 1 +	[ Node "a -> b" ..                       // 2 +		[ Node "a -> Int" ..                 // 3 +			[ Node "Int -> Int" .. []        // 4 +			, Node "String -> Int" .. []     // 5 +			] +		, Node "a -> String" ..              // 6 +			[ Node "Int -> String" .. []     // 7 +			, Node "String -> String" .. []  // 8 +			] +		] +	] +``` + +Observe that there are multiple ways to build the type tree. Instead of the +tree above, we could also group on type variable `a` first, such that node 3 +would have the type `Int -> b` and node 6 the type `String -> b`. How the type +is built depends on the order in which types are found by the index builder. It +does not really matter for the search algorithm. + +# Ranking + +The last interesting optimization concerns ranking. We go several times through +the index to determine which entries should be returned. Once we have the +result set we need to order it, but we do not want to have to reunify types to +determine the distance measure. This is what the `annotations` field in the +`NativeDB` is for: during search, you can leave annotations, which are then +picked up by the ranker to compute the distance measure. That way you only need +to look at the annotations instead of performing part of the search logic +again. Cloogle for instance keeps track of the unifier (for type search) and +the number of matching *n*-grams (for name search): + +```clean +:: Annotation +	= MatchingNGramsQuery !Real  //* The number of matching n-grams in the query +	| MatchingNGramsResult !Real //* The number of matching n-grams in the result +	| Unifier !Unifier  //* For type search, the unifier +	/* .. */ +``` + +It is not immediately obvious how these annotations should be turned into a +single distance measure. And we don't want to think about this. What Cloogle +does instead is define a list of variables based on which we might want to +rank, and define a list of ranking constraints. We then use a constraint solver +to compute the exact ranking formula. + +Some examples of ranking variables are: + +- The ratio of *n*-grams from the query that match in the result +- The ratio of *n*-grams from the result that match in the query +- The number of type variables in the unifier +- Whether the result comes from the standard library + +Some examples of ranking constraints are: + +- For the query `toInt`, prefer the class `toInt` over the function +  `digitToInt`. +- For the query `Char`, prefer the built-in type `Char` over the class +  `toChar`. + +Cloogle assumes that the ranking function is a sum of products of the ranking +variables with a weight. When it updates the index, as a final step it computes +the symbolic distance measure for all the ranking constraints, and passes these +to [Z3][] to find appropriate weights for all the ranking variables. + +If we come across a query that returns results in an odd order, we just need to +add a ranking constraint; we do not need to think and write a new formula +ourselves. Furthermore, the ranking constraints act as test cases for the +ranking function, so we are sure that changes to the formula will not introduce +regressions. + +# Conclusion + +Once results are ranked, we need to find detailed information for the first 15 +(or later ones, if a lower result page is requested). For example, we show the +exact unifier for type search. We cannot use annotations for this, because in +the type tree names of type variables are generalized. But because we need to +do this only for 15 results, we don't need to be very fast here. We simply +unify these results again to get the real, human-readable unifier, and like +this we fetch some other information as well. + +Overall, I'm quite happy with the orthogonal setup of different search +strategies. The implementation of name search and type search is completely +different, and each is optimized in different ways, but you can still combine +the search types in a single query. Also the annotations make the architecture +much simpler and avoid duplicating logic in the search and ranking functions. +Finally, using a constraint solver to determine the ranking weights has proven +to be a good basis for a solid ranking function. + +[Clean]: http://clean.cs.ru.nl/ +[Cloogle]: https://cloogle.org/ +[Haskell]: https://haskell.org/ +[Hoogle]: https://hoogle.haskell.org/ +[Hoogle 5]: https://neilmitchell.blogspot.com/2020/06/hoogle-searching-overview.html +[stats]: https://cloogle.org/stats/ +[Z3]: https://github.com/Z3Prover/z3 diff --git a/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug b/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug new file mode 100644 index 0000000..859f11c --- /dev/null +++ b/resources/pug/finals/articles/2021-07-26-cloogle-search-overview.pug @@ -0,0 +1,18 @@ +extends /layout-articles.pug + +block prepend title +	| Cloogle search overview -  + +block prepend menu +	- var page = '2021-07-26-cloogle-search-overview' + +block append breadcrumbs +	+breadcrumb('Cloogle search overview') + +block subtitle +	| Cloogle search overview +block subtitleDate +	| 26 July 2021 + +block page +	include:markdown ../../../md/2021-07-26-cloogle-search-overview.md diff --git a/resources/pug/finals/articles/index.pug b/resources/pug/finals/articles/index.pug index b3100dd..3a5264c 100644 --- a/resources/pug/finals/articles/index.pug +++ b/resources/pug/finals/articles/index.pug @@ -15,6 +15,13 @@ block page  		| You can also go back to my #[a(href='/') home page].  	h1 +		a(href='2021-07-26-cloogle-search-overview.html'). +			26 July 2021: Cloogle search overview +	blockquote. +		Cloogle is a search engine for the pure, functional programming language Clean, similar to Hoogle for Haskell. +		In this post I go through the design of the search backend and make a comparison with that of Hoogle 5. + +	h1  		a(href='2021-06-23-compiling-clean-in-the-browser-with-webassembly-part-1-introduction.html').  			23 June 2021: Compiling Clean in the browser with WebAssembly  	blockquote. diff --git a/resources/pug/include/layout-articles.pug b/resources/pug/include/layout-articles.pug index 10feaab..c35e38b 100644 --- a/resources/pug/include/layout-articles.pug +++ b/resources/pug/include/layout-articles.pug @@ -8,6 +8,9 @@ block append menu  	+menu(  		{name: 'Home', link: ''},  		{name: 2021, menu: [ +			{ name: 'Cloogle search overview', +				year: 2021, month: 7, day: 26 +			},  			{ name: 'Compiling Clean in the browser with WebAssembly, part 3: Putting it all together',  				year: 2021, month: 6, day: 23  			}, | 
