# Implementation
## Feasibility
The Plan mentions the following:
> We consider a vector space where every possible search field represents a binary parameter.
> A vector has `1` for the parameter if and only if it is included in the search (excluded from the blacklist).
> We will then run a hill-climbing algorithm through this higher-dimensional vector space
> in order to find a vector (an index setting) for which the ranking results are best.
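
The idea in the quoted plan can be sketched as follows. This is a minimal illustration, not code from the project: the function `hill_climb`, the toy scoring function `toy_score` and its weights are all hypothetical, and a real scoring function would have to re-rank and evaluate the queries for each candidate field selection.

```python
import random
from typing import Callable, Tuple

Vector = Tuple[int, ...]  # one bit per search field: 1 = field is searched


def hill_climb(score: Callable[[Vector], float], n_fields: int,
               seed: int = 0, max_steps: int = 1000) -> Vector:
    """Steepest-ascent hill climbing over binary field-selection vectors."""
    rng = random.Random(seed)
    current = tuple(rng.randint(0, 1) for _ in range(n_fields))
    current_score = score(current)
    for _ in range(max_steps):
        best, best_score = current, current_score
        # Neighbours differ from the current vector in exactly one field.
        for i in range(n_fields):
            neighbour = current[:i] + (1 - current[i],) + current[i + 1:]
            s = score(neighbour)
            if s > best_score:
                best, best_score = neighbour, s
        if best == current:  # no neighbour improves: local optimum reached
            break
        current, current_score = best, best_score
    return current


# Hypothetical stand-in for "ranking quality": each field adds or removes
# a fixed amount. These weights are made up for the example.
WEIGHTS = (0.5, -0.2, 0.3, -0.1)


def toy_score(v: Vector) -> float:
    return sum(w * bit for w, bit in zip(WEIGHTS, v))


best_fields = hill_climb(toy_score, n_fields=len(WEIGHTS))
```

With a realistic scoring function the search can stop in a local optimum, so several restarts from different seeds would normally be needed.
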
Soon after we began implementing this plan with a locally run version of nordlys, we encountered some issues.
The most notable was that our machines could not index the full DBpedia dataset in a reasonable amount of time with a reasonable amount of resources.
We therefore decided that the best option was to use a subset of the DBpedia dataset.
The subset we settled on consists of the documents that have a relevance score assigned for at least one query.
For each query, we then only considered its own results in our assessment.
This has the added benefit that the relevance (both the human assessment and the score) is precomputed.
As a result, simply parsing the files provided by nordlys is enough to implement any kind of field-selection assessment.
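
As a rough sketch of what such parsing could look like, the snippet below reads human relevance assessments. It assumes the assessments are stored in the common TREC qrels layout (`query_id iteration doc_id relevance`, whitespace-separated); that layout, the function name `parse_qrels` and the sample identifiers are assumptions for illustration and should be checked against the actual files.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map query id -> document id -> relevance grade (assumed TREC qrels layout)."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:  # skip blank or malformed lines
            continue
        query_id, _iteration, doc_id, relevance = parts
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Hypothetical sample lines, not taken from the real dataset.
sample = [
    "q1\tQ0\t<dbpedia:Example_A>\t2",
    "q1\tQ0\t<dbpedia:Example_B>\t0",
    "",
    "q2\tQ0\t<dbpedia:Example_A>\t1",
]
parsed = parse_qrels(sample)
```
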
Unfortunately, it turned out that hill-climbing was also out of the scope of the assignment.
With only two programmers, neither of whom has much experience implementing such algorithms, the task was slightly too much work.
Instead, we decided to take a different approach and statically analyse the importance of all fields.
The measure that we use takes the form of:

Here, `relevance` is the BM25 relevance stored by nordlys, `D` is the set of documents, `Q` is the set of queries, `tf` is the function that counts the number of times any of the query terms was found in the field, and `|f|` is the size of the field.
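
As an illustration only, the sketch below combines the ingredients listed above in one plausible way: for a field, it sums `relevance * tf / |f|` over all scored query-document pairs. This particular combination is an assumption and may differ from the measure actually used. The data layout and the names `tf` and `field_importance` are likewise hypothetical.

```python
from typing import Dict, List, Set, Tuple

Document = Dict[str, List[str]]  # field name -> tokens in that field


def tf(query_terms: Set[str], field_tokens: List[str]) -> int:
    """Count how often any query term occurs in the field."""
    return sum(1 for token in field_tokens if token in query_terms)


def field_importance(field: str,
                     queries: Dict[str, List[str]],
                     documents: Dict[str, Document],
                     relevance: Dict[Tuple[str, str], float]) -> float:
    """Assumed measure: sum of relevance * tf / |f| over scored (query, document) pairs."""
    total = 0.0
    for (query_id, doc_id), rel in relevance.items():
        tokens = documents[doc_id].get(field, [])
        if not tokens:  # the field is missing or empty in this document
            continue
        total += rel * tf(set(queries[query_id]), tokens) / len(tokens)
    return total


# Toy data, made up for the example.
queries = {"q1": ["blue", "car"]}
documents = {"d1": {"title": ["blue", "car"],
                    "abstract": ["a", "red", "car", "and", "bike"]}}
relevance = {("q1", "d1"): 2.0}

title_score = field_importance("title", queries, documents, relevance)
abstract_score = field_importance("abstract", queries, documents, relevance)
```

In this toy data the title scores higher than the abstract, because a larger share of its tokens matches the query.
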
## Implementation
## Intermediate Result