1 files changed, 288 insertions, 0 deletions
diff --git a/Implementation.md b/Implementation.md
new file mode 100644
index 0000000..c19f119
--- /dev/null
+++ b/Implementation.md
@@ -0,0 +1,288 @@
+# Implementation
+
+## Feasibility
+Our [Plan](Plan.md) mentions the following:
+
+> We consider a vector space where every possible search field represents a
+> binary parameter. A vector has `1` for the parameter if and only if it is
+> included in the search (excluded from the blacklist). We will then run a
+> hill-climbing algorithm through this higher-dimensional vector space in order
+> to find a vector (an index setting) for which the ranking results are best.
+
+Soon after we began trying to implement this feature using a locally run
+version of Nordlys, we encountered some issues, the most notable being that
+our machines were unable to index the full DBPedia set in a reasonable amount
+of time, using a reasonable amount of resources. When we encountered this
+issue, we decided that the best option was to use a subset of the DBPedia
+dataset.
+
+The subset that we settled on consists of the entities that have a relevance
+score assigned for at least one query. In our assessment we then only consider,
+for each query, the entities that appear in its results.
+
+This has the additional benefit that the relevance judgements (both the
+human assessment and the score) need not be computed. Simply parsing the files
+that are provided by Nordlys is enough to implement any kind of
+field-selection assessment.
+
+Unfortunately, it turned out that we also did not have the resources to
+implement a hill-climbing algorithm: with only two programmers, the task was
+slightly too much work. Instead, we decided to take a different approach and
+statically analyse the importance of all fields. The measure that we use takes
+the form of:
+
+![Field Relevance Measure](http://mathurl.com/yc2ptq63.png "Field Relevance
+Measure")
+
+Where *relevance* is the BM25 relevance that is stored by Nordlys, *D* is the
+set of documents, *Q* the set of queries, *tf* the function that counts the
+number of times any of the query terms was found in that field and |*f*| the
+size of the field.
+
+The formula assumes that relevance is more or less linear. The logarithm is
+used because more occurrences of the same term are not as important as the
+first occurrence.
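+
+Written out in LaTeX, and assuming the image above matches what `check.py`
+computes, the measure for a field *f* reads as follows. Here we take |*f*| to be
+the number of values of *f* in document *d*, as counted by `run.py`, and the
+sum to run over the query-document pairs listed in the run:
+
+```latex
+\mathrm{score}(f) = \sum_{q \in Q} \sum_{d \in D}
+    \mathrm{relevance}(q, d) \cdot
+    \log\left(1 + \frac{\mathit{tf}(q, d, f)}{|f|}\right)
+```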
+
+## Code
+
+We use three Python programs that:
+
+1. Get DBPedia entities from the Nordlys API (`scrape.py`)
+2. For each entry in a BM25 run, list the DBPedia ID, the relevance score and,
+   for each field in the entity, how many of its values match at least one of
+   the query terms (`run.py`). This is the information that BM25 uses to
+   compute the relevance.
+3. Use that information to investigate how important each field is
+   (`check.py`).
+
+We will now discuss the implementation of each of these files.
+
+### `scrape.py`
+
+- In this file we read lines from `stdin`. These lines are supposed to come
+  from `qrels-v2.txt` (see Usage below), which has four tab-separated columns.
+  That way, we only download DBPedia entities that we actually need.
+
+```python
+if __name__ == '__main__':
+    for line in fileinput.input():
+        scrape(line)
+```
+
+- We split each input line on tabs. Only the DBPedia ID is relevant.
+
+```python
+def scrape(line):
+    index, query, dbpediaid, relevance = line.split('\t')
+    try:
+        get(dbpediaid)
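+    except Exception as e:
+        # Log the failure and continue with the next entity. The exception
+        # must be converted with str() before it can be concatenated.
+        with open(ERRORFILE, 'a') as f:
+            f.write(dbpediaid + '\t' + str(e) + '\n')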
+```
+
+- We store the entities one per file in the original JSON format. We use the ID
+  as the filename, but it may contain special characters such as slashes, so
+  we URL-encode it first.
+
+  Normally, Nordlys refuses queries from a Python user-agent, so we set the
+  user-agent to `Radboud University`, which Nordlys accepts. We did not hit
+  rate limiting.
+
+```python
+def get(dbpediaid):
+    outfile = os.path.join(OUTDIR, quote_plus(dbpediaid) + '.json')
+    if os.path.isfile(outfile):
+        return
+    url = 'http://api.nordlys.cc/ec/lookup_id/{}'.format(quote_plus(dbpediaid))
+    print(url)
+    result = urlopen(Request(url,
+        headers={'User-Agent': 'Radboud University'})).read()
+    with open(outfile, 'w') as f:
+        f.write(result.decode(encoding='UTF-8'))
+```
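+
+To illustrate the file naming, the following minimal sketch shows what
+`quote_plus` does to an ID. The helper name `entity_filename` and the example
+ID are made up for this illustration; they are not part of `scrape.py`.
+
+```python
+from urllib.parse import quote_plus
+
+
+def entity_filename(dbpediaid):
+    """Return a file name for an entity ID (hypothetical helper)."""
+    # quote_plus percent-encodes '<', ':', '/' and '>', so the result
+    # contains no path separators.
+    return quote_plus(dbpediaid) + '.json'
+
+
+# Made-up ID containing a slash.
+print(entity_filename('<dbpedia:AC/DC>'))
+# prints: %3Cdbpedia%3AAC%2FDC%3E.json
+```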
+
+### `run.py`
+
+- `queries_stopped.json` lists all query terms. We load this file once, then
+  process a run from `stdin`.
+
+```python
+if __name__ == '__main__':
+    with open('queries_stopped.json') as f:
+        queries = json.load(f)
+
+    for line in fileinput.input():
+        run(queries, line)
+```
+
+- We split the line in the run file. For each field we check (1) how many
+  values there are and (2) how many values match a query term.
+
+```python
+def run(queries, line):
+    query, _, dbpediaid, _, relevance, method = line.split('\t')
+    terms = queries[query].split()
+    try:
+        result = get(dbpediaid)
+        if result is None:
+            return
+        for field, values in result.items():
+            matches = 0
+            for value in values:
+                if match(value, terms):
+                    matches += 1
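+            # print() already appends the newline that separates records.
+            print('{}\t{}\t{}\t{}\t{}\t{}'.format(
+                query, dbpediaid, relevance, field, len(values), matches))
+    except Exception as e:
+        # Report on stderr (needs `import sys`) so that stdout, which is
+        # redirected to the fields file, only contains records.
+        print(dbpediaid, e, file=sys.stderr)
+        with open(ERRORFILE, 'a') as f:
+            f.write(dbpediaid + '\t' + str(e) + '\n')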
+```
+
+- For simplicity, we do not use lemmatisation or synonym resolution here, which
+  could be an improvement in a next version.
+
+```python
+def match(value, terms):
+    for v in value.split():
+        if v in terms:
+            return True
+    return False
+```
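+
+As a first step towards such an improvement, matching could ignore case and
+surrounding punctuation. The sketch below is a hypothetical variant, not what
+`run.py` does; the name `match_normalised` is an assumption made for this
+example.
+
+```python
+import string
+
+
+def match_normalised(value, terms):
+    """Hypothetical variant of `match`: ignores case and punctuation."""
+    wanted = {term.lower() for term in terms}
+    # str() guards against non-string values such as numbers.
+    for token in str(value).split():
+        # Strip punctuation from both ends of the token, then lowercase it.
+        if token.strip(string.punctuation).lower() in wanted:
+            return True
+    return False
+```
+
+With this variant, a value such as `'Albert Einstein,'` matches the query term
+`einstein`, which the exact comparison in `match` would miss.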
+
+- `get` simply gets the file that we stored with `scrape.py`:
+
+```python
+def get(dbpediaid):
+    outfile = os.path.join(DATADIR, quote_plus(dbpediaid) + '.json')
+    if not os.path.isfile(outfile):
+        return None
+    with open(outfile) as f:
+        return json.load(f)
+```
+
+### `check.py`
+
+- We keep a dictionary of scores per field and simply compute our weight score:
+
+```python
+if __name__ == '__main__':
+    scores = dict()
+    for line in fileinput.input():
+        query, dbpediaid, relevance, field, nvalues, nmatches = line.split('\t')
+        if field not in scores:
+            scores[field] = 0
+        scores[field] += float(relevance) * log(1 + int(nmatches)/int(nvalues))
+```
+
+- Then we print all scores:
+
+```python
+    for field, score in scores.items():
+        print('{}\t{}'.format(field, score))
+```
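+
+The same computation can be sketched as a self-contained function that is easy
+to try on a few lines. The function name `field_scores` and the sample lines
+are assumptions made for this example. Skipping blank lines and fields without
+values is an addition that `check.py` does not have.
+
+```python
+from collections import defaultdict
+from math import log
+
+
+def field_scores(lines):
+    """Sum relevance * log(1 + matches / values) per field."""
+    scores = defaultdict(float)
+    for line in lines:
+        line = line.rstrip('\n')
+        if not line:
+            continue  # skip blank lines
+        query, dbpediaid, relevance, field, nvalues, nmatches = line.split('\t')
+        if int(nvalues) == 0:
+            continue  # avoid dividing by zero for empty fields
+        scores[field] += float(relevance) * log(1 + int(nmatches) / int(nvalues))
+    return dict(scores)
+
+
+# Made-up records in the output format of run.py.
+sample = [
+    'q1\t<dbpedia:A>\t2.0\t<rdfs:label>\t1\t1\n',
+    'q1\t<dbpedia:A>\t2.0\t<dbo:abstract>\t1\t0\n',
+]
+for field, score in field_scores(sample).items():
+    print('{}\t{}'.format(field, score))
+```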
+
+### Usage
+
+All this allows for a fairly simple workflow:
+
+```bash
+mkdir data
+./scrape.py < qrels-v2.txt
+./run.py < bm25.run > fields.txt
+./check.py < fields.txt | sort -k2 -n > scores.txt
+```
+
+This assumes that you have the following files from Nordlys:
+
+- [`qrels-v2.txt`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/qrels-v2.txt) (entity list)
+- [`bm25.run`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/runs/bm25.run) (BM25 relevance judgements)
+- [`queries_stopped.json`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/queries_stopped.json) (query terms)
+
+The system is agnostic with regard to the ranking function (BM25 or another
+method).
+
+## Intermediate Results
+These are the thirty most important fields as found by our measure when used on
+the BM25 relevance scores, in ascending order of score:
+
+| Field                        | Score     | Used by Nordlys |
+|------------------------------|----------:|:---------------:|
+| `<dbp:imageFlag>`            |   2205.50 | ![][n]          |
+| `<dbp:office>`               |   2246.90 | ![][n]          |
+| `<dbp:pushpinMapCaption>`    |   2357.07 | ![][n]          |
+| `<dbp:description>`          |   2357.35 | ![][n]          |
+| `<dbp:placeOfBirth>`         |   2384.14 | ![][n]          |
+| `<dbp:fastTime>`             |   2440.73 | ![][n]          |
+| `<dbp:imageMap>`             |   2485.96 | ![][n]          |
+| `<dbp:writer>`               |   2689.86 | ![][n]          |
+| `<dbp:alt>`                  |   2691.94 | ![][n]          |
+| `<foaf:givenName>`           |   2694.41 | ![][y]          |
+| `<dbp:poleTime>`             |   2698.75 | ![][n]          |
+| `<dbp:country>`              |   2836.44 | ![][n]          |
+| `<dbp:type>`                 |   3248.58 | ![][n]          |
+| `<dbo:office>`               |   3425.58 | ![][n]          |
+| `<dbp:location>`             |   3430.20 | ![][n]          |
+| `<dbp:officialName>`         |   4316.34 | ![][y]          |
+| `<dbp:quote>`                |   4470.38 | ![][n]          |
+| `<dbp:imageCaption>`         |   4480.06 | ![][n]          |
+| `<dbp:producer>`             |   4704.52 | ![][n]          |
+| `<dbp:mapCaption>`           |   8040.36 | ![][n]          |
+| `<dbp:title>`                |  10999.72 | ![][n]          |
+| `<dbp:shortDescription>`     |  22065.46 | ![][n]          |
+| `<dc:description>`           |  23442.34 | ![][n]          |
+| `<dbp:caption>`              |  24697.75 | ![][n]          |
+| `<dbp:name>`                 |  25500.42 | ![][y]          |
+| `<foaf:name>`                |  32860.37 | ![][y]          |
+| `<dbo:wikiPageWikiLinkText>` |  86218.71 | ![][y]          |
+| `<rdfs:label>`               | 105358.89 | ![][y]          |
+| `<rdfs:comment>`             | 514446.08 | ![][n]          |
+| `<dbo:abstract>`             | 581355.57 | ![][n]          |
+
+We see that many of the relevant fields are actually [not used by
+Nordlys](https://iai-group.github.io/DBpedia-Entity/index_details.html).
+However, this is not yet an indication that these fields should be added to the
+index. After all, adding more fields means more computation time to build the
+index and to retrieve search results.
+
+In fact, we expect that many of the unused fields overlap in content with
+fields that *are* indexed. For example, the `<dbo:abstract>` field will
+probably match because the title is repeated in the abstract.
+
+We can perform the same analysis on the human assessments. This gives a rather
+different list of fields:
+
+| Field                         | Score    | Rank for BM25 | Used by Nordlys |
+|-------------------------------|---------:|--------------:|:---------------:|
+| `<dbp:pushpinMapCaption>`     |   133.77 |            28 | ![][n]          |
+| `<dbp:foundation>`            |   136.32 |           266 | ![][n]          |
+| `<dbp:imageCaption>`          |   139.85 |            13 | ![][n]          |
+| `<dbp:bridgeName>`            |   164.91 |            49 | ![][n]          |
+| `<dbp:imageFlag>`             |   166.35 |            30 | ![][n]          |
+| `<dbp:mapCaption>`            |   170.93 |            11 | ![][n]          |
+| `<dbo:foundingYear>`          |   173.92 |           299 | ![][n]          |
+| `<dbp:producer>`              |   186.37 |            12 | ![][n]          |
+| `<dbp:ground>`                |   297.25 |           802 | ![][n]          |
+| `<dbp:title>`                 |   328.93 |            10 | ![][n]          |
+| `<dc:description>`            |   332.05 |             8 | ![][n]          |
+| `<dbp:shortDescription>`      |   334.79 |             9 | ![][n]          |
+| `<dbp:caption>`               |   648.73 |             7 | ![][n]          |
+| `<foaf:givenName>`            |  1436.74 |            21 | ![][y]          |
+| `<dbp:name>`                  |  1961.98 |             6 | ![][y]          |
+| `<foaf:name>`                 |  2086.67 |             5 | ![][y]          |
+| `<dbo:wikiPageWikiLinkText>`  |  2897.51 |             4 | ![][y]          |
+| `<rdfs:label>`                |  3483.06 |             3 | ![][y]          |
+| `<rdfs:comment>`              | 12323.46 |             2 | ![][n]          |
+| `<dbo:abstract>`              | 13002.74 |             1 | ![][n]          |
+
+Based on this, one may want to try adding fields like `<dbp:caption>` to the
+index.
+
+Conversely, this information can also be used to improve the relevance measure.
+Apparently, `<dbp:ground>`, `<dbo:foundingYear>` and `<dbp:foundation>` are
+quite relevant according to human assessors, but not at all according to BM25.
+
+[y]: http://i.stack.imgur.com/iro5J.png
+[n]: http://i.stack.imgur.com/asAya.png