-rw-r--r-- | Implementation.md | 63 |
1 files changed, 38 insertions, 25 deletions
diff --git a/Implementation.md b/Implementation.md
index ae527d7..e340ace 100644
--- a/Implementation.md
+++ b/Implementation.md
@@ -1,30 +1,43 @@
 # Implementation
 ## Feasibility
-The Plan mentions the following:
-> We consider a vector space where every possible search field represents a binary parameter.
-> A vector has `1` for the parameter if and only if it is included in the search (excluded from the blacklist).
-> We will then run a hill-climbing algorithm through this higher-dimensional vector space
-> in order to find a vector (an index setting) for which the ranking results are best.
-
-Soon after we began trying to implement this feature using a locally run version of nordlys, we encountered some issues.
-The most notable being that our machines were unable to index the full DB-Pedia set in a reasonable amount of time, using a reasonable amount of resources.
-When we encountered this issue, we decided that the best options was using a subset of the DB-Pedia dataset.
-
-The subset that we settled on is the subset that has relevance scores assigned to them for any query.
-We then only considered the result of a given query in our assessment.
-
-The above has the added benefit that the relevance (both the human assessment and the score) are precomputed.
-This meant that simply parsing the files that are provided by nordlys is enough to implement any kind of field selected assessment.
-
-Unfortunately, it turned out that hill-climbing was also out of the scope of the assignment.
-Having only 2 programmers, both of whom have not a lot of experience in implementing such algorithms, made the task slightly to much work.
-Instead, we decided to take a different approach and statically analyse the importance of all fields.
-The meansure that we use take the form of:
-
-
-
-Where `relevance` is the bm25 relevance that is stored by nordlys, `D` is the set of documents, `Q` the set of queries, `tf` the function that counts the amount of times any of the query terms was found in that field and `|f|` the size of the field.
+Our [Plan](Plan.md) mentions the following:
+
+> We consider a vector space where every possible search field represents a
+> binary parameter. A vector has `1` for the parameter if and only if it is
+> included in the search (excluded from the blacklist). We will then run a
+> hill-climbing algorithm through this higher-dimensional vector space in order
+> to find a vector (an index setting) for which the ranking results are best.
+
+Soon after we began trying to implement this feature using a locally run
+version of Nordlys, we encountered some issues, the most notable being that
+our machines were unable to index the full DBPedia set in a reasonable amount
+of time, using a reasonable amount of resources. When we encountered this
+issue, we decided that the best option was to use a subset of the DBPedia
+dataset.
+
+The subset that we settled on is the subset that has relevance scores assigned
+to them for any query. We then only consider the result of a given query in our
+assessment.
+
+The above has the additional benefit that the relevance judgements (both the
+human assessment and the score) need not be computed. This meant that simply
+parsing the files that are provided by Nordlys is enough to implement any kind
+of field selected assessment.
+
+Unfortunately, it turned out that we also did not have resources to implement a
+hill-climbing algorithm. Having only 2 programmers made the task slightly too
+much work. Instead, we decided to take a different approach and statically
+analyse the importance of all fields. The measure that we use takes the form
+of:
+
+
+
+Where *relevance* is the BM25 relevance that is stored by Nordlys, *D* is the
+set of documents, *Q* the set of queries, *tf* the function that counts the
+amount of times any of the query terms was found in that field and |*f*| the
+size of the field.
 ## Code
@@ -43,7 +56,7 @@ We will now discuss the implementation of each of these files.
 ### `scrape.py`
 - In this file we read lines from `stdin`. These lines are supposed to come
- from a BM25 run. That way, we only download DBPedia entities that we
+ from a BM25 run. That way, we only download DBPedia entities that we
 actually need.
 ```python