*[Cloogle][] is a search engine for the pure, functional programming language
[Clean][], similar to [Hoogle][] for [Haskell][]. In this post I go through the
design of the search backend and make a comparison with that of [Hoogle 5][].*

[[toc]]

# Overview

Internally, Cloogle keeps the entire index in memory. This is possible because
it is relatively small. While we have different entry types for functions, type
definitions, classes, and other things, they are all stored in one big array, a
`NativeDB CloogleEntry Annotation`:

```clean
:: *NativeDB v a :== *{!*Entry v a}

:: Entry v a =
	{ value       :: !v
	, included    :: !Bool
	, annotations :: ![!a!]
	}

:: CloogleEntry
	= FunctionEntry !FunctionEntry
	| TypeDefEntry !TypeDefEntry
	/* .. */
```

Because the `NativeDB` is
[unique](https://en.wikipedia.org/wiki/Uniqueness_type), we can do mutable
updates on it. While searching, we keep all entries in the array, and use
`included` to signal whether an entry should be returned from a search result
or not.

For a search request, we go multiple times through the database and toggle the
`included` flag as needed. For example, for the query
`map :: (a -> b) [a] -> [b]` we would first filter on the name `map` and then
on the type `(a -> b) [a] -> [b]`. The implementation of each filter is
independent of the others, so that each can be implemented in the way that is
most efficient for it.

# Name search

To do fuzzy searches for names, the index contains an `NGramIndex Index`:

```clean
:: NGramIndex v =
	{ n   :: !Int            //* The parameter *n* for the size of the grams
	, ci  :: !Bool           //* Whether matching is case-insensitive
	, idx :: !Map String [v] //* The values
	}
```

The values stored in this index are indices for the `NativeDB`. So if we would
search for `query`, we would take the 3-grams `que`, `uer`, and `ery` (*n* is
currently 3), and for each gram lookup the list of indices in `idx`. These
lists are then merged (they are sorted, so this can be done in O(n)), and we go
through the `NativeDB` to set `included` for the indices in the list.

I experimented with a [trie](https://en.wikipedia.org/wiki/Trie) instead of a
binary tree for `idx`, but it wasn't faster. After all, we only need few
lookups in this structure.

# Type search

This is the complicated one. [Hoogle has a very fancy algorithm][Hoogle 5],
which creates fingerprints for type signatures based on various things, like
the arity, the number of type constructors, and the rarest type names. But
although I see the rationale behind it, I have always found that Hoogle tends
to find weird pairs similar, and I prefer something a bit more white-box. So
Cloogle returns *only* functions for which the type can be *unified* with the
query type. This has downsides (if you forget a parameter, you won't find your
function), but the advantage is that ranking is much easier.

Originally, we would unify every type in the database with the query type, but
this is very slow. We now make use of an ordering on types. We say that
*t*<sub>1</sub> &le; *t*<sub>2</sub> iff *t*<sub>2</sub> generalises
*t*<sub>1</sub>. From this follows that if the query type does not unify with
*t*<sub>2</sub>, it will also not unify with *t*<sub>1</sub>. This observation
allows us to prune away a lot of results.

In the index, we keep a tree of all types, starting with the most general type,
`a`, in the root. The children are the more specific types, such as `Int`,
`c a`, and `a -> b`. This is represented in a simple data structure:

```clean
:: TypeTree v = Node Type [v] [TypeTree v]
```

We keep a `TypeTree Index`. Isomorphic types are stored only once in the index,
and they keep a list of indices into the `NativeDB` of functions that have that
type. When looking for a query type, we do a depth-first walk over the type
tree. However, we prune a node's children when the node's type does not unify
with the query type.

For example, consider the type tree below. If we search for `Char -> Int` we
only want to find nodes 1, 2, and 3. With the type tree, `a -> String` in node
6 does not unify, so we never visit nodes 7 and 8.

```clean
Node "a" ..                                  // 1
	[ Node "a -> b" ..                       // 2
		[ Node "a -> Int" ..                 // 3
			[ Node "Int -> Int" .. []        // 4
			, Node "String -> Int" .. []     // 5
			]
		, Node "a -> String" ..              // 6
			[ Node "Int -> String" .. []     // 7
			, Node "String -> String" .. []  // 8
			]
		]
	]
```

Observe that there are multiple ways to build the type tree. Instead of the
tree above, we could also group on type variable `a` first, such that node 3
would have the type `Int -> b` and node 6 the type `String -> b`. How the type
is built depends on the order in which types are found by the index builder. It
does not really matter for the search algorithm.

Besides this, there are some nice tricks to improve the results. For instance,
type synonyms are resolved in both the type tree and the query. In Hoogle, you
won't find `length :: Foldable t => t a -> Int` with the query `String -> Int`,
because it does not resolve `String -> Int` to `[Char] -> Int`. In Cloogle, the
same query will find `size :: (a e) -> Int | Array a e`, because it knows that
`String` is an `{#} Char` (an array of characters).

Also, universally quantified type variables can be used to restrict the search
to functions that are polymorphic. By default, a query like `[a] -> [a]` will
return results like `indexList :: [a] -> [Int]`, because the user might not
realize that the function they are looking for cannot be defined in a
polymorphic way (or the function in the index does not have the most general
type). However, if you are sure you only want results where `a` is not unified,
you can use `A.a: [a] -> [a]`. `A` here is similar to Haskell's `forall`. The
unification system takes care of the rest.

# Ranking

The last interesting optimization concerns ranking. We go several times through
the index to determine which entries should be returned. Once we have the
result set we need to order it, but we do not want to have to reunify types to
determine the distance measure. This is what the `annotations` field in the
`NativeDB` is for: during search, you can leave annotations, which are then
picked up by the ranker to compute the distance measure. That way you only need
to look at the annotations instead of performing part of the search logic
again. Cloogle for instance keeps track of the unifier (for type search) and
the number of matching *n*-grams (for name search):

```clean
:: Annotation
	= MatchingNGramsQuery !Real  //* The number of matching n-grams in the query
	| MatchingNGramsResult !Real //* The number of matching n-grams in the result
	| Unifier !Unifier  //* For type search, the unifier
	/* .. */
```

It is not immediately obvious how these annotations should be turned into a
single distance measure. And we don't want to think about this. What Cloogle
does instead is define a list of variables based on which we might want to
rank, and define a list of ranking constraints. We then use a constraint solver
to compute the exact ranking formula.

Some examples of ranking variables are:

- The ratio of *n*-grams from the query that match in the result
- The ratio of *n*-grams from the result that match in the query
- The number of type variables in the unifier
- Whether the result comes from the standard library

Some examples of ranking constraints are:

- For the query `toInt`, prefer the class `toInt` over the function
  `digitToInt`.
- For the query `Char`, prefer the built-in type `Char` over the class
  `toChar`.

Cloogle assumes that the ranking function is a sum of products of the ranking
variables with a weight. When it updates the index, as a final step it computes
the symbolic distance measure for all the ranking constraints, and passes these
to [Z3][] to find appropriate weights for all the ranking variables.

If we come across a query that returns results in an odd order, we just need to
add a ranking constraint; we do not need to think and write a new formula
ourselves. Furthermore, the ranking constraints act as test cases for the
ranking function, so we are sure that changes to the formula will not introduce
regressions.

# Conclusion

Once results are ranked, we need to find detailed information for the first 15
(or later ones, if a lower result page is requested). For example, we show the
exact unifier for type search. We cannot use annotations for this, because in
the type tree names of type variables are generalized. But because we need to
do this only for 15 results, we don't need to be very fast here. We simply
unify these results again to get the real, human-readable unifier, and like
this we fetch some other information as well.

Overall, I'm quite happy with the orthogonal setup of different search
strategies. The implementation of name search and type search is completely
different, and each is optimized in different ways, but you can still combine
the search types in a single query. Also the annotations make the architecture
much simpler and avoid duplicating logic in the search and ranking functions.
Finally, using a constraint solver to determine the ranking weights has proven
to be a good basis for a solid ranking function.

[Clean]: https://clean-lang.org/
[Cloogle]: https://cloogle.org/
[Haskell]: https://haskell.org/
[Hoogle]: https://hoogle.haskell.org/
[Hoogle 5]: https://neilmitchell.blogspot.com/2020/06/hoogle-searching-overview.html
[stats]: https://cloogle.org/stats/
[Z3]: https://github.com/Z3Prover/z3