-rw-r--r-- | Implementation.md | 161
-rw-r--r-- | README.md         |   2
-rwxr-xr-x | check.py          |   3
3 files changed, 163 insertions, 3 deletions
diff --git a/Implementation.md b/Implementation.md
index 337b9fe..b60b8b9 100644
--- a/Implementation.md
+++ b/Implementation.md
@@ -26,6 +26,165 @@ The measure that we use takes the form of:

Where `relevance` is the BM25 relevance score that is stored by Nordlys, `D` is
the set of documents, `Q` the set of queries, `tf` the function that counts the
number of times any of the query terms was found in that field, and `|f|` the
size of the field.

-## Implementation
+## Code

We use three Python programs that:

1. Get DBPedia entities from the Nordlys API (`scrape.py`).
2. For each entry in a BM25 run, list the DBPedia ID, the relevance score and,
   for each field in the entity, how many of the values match at least one of
   the query terms. This is the information that BM25 uses to compute the
   relevance (`run.py`).
3. Use that information to investigate how important each field is
   (`check.py`).

We will now discuss the implementation of each of these files.

### `scrape.py`

- In this file we read lines from `stdin`. These lines are supposed to come
  from the qrels file listed in the usage section below; that way, we only
  download the DBPedia entities that we actually need.

```python
if __name__ == '__main__':
    for line in fileinput.input():
        scrape(line)
```

- We split the lines; only the DBPedia ID is relevant.

```python
def scrape(line):
    index, query, dbpediaid, relevance = line.split('\t')
    try:
        get(dbpediaid)
    except Exception as e:
        # Record failed downloads so they can be retried later.
        with open(ERRORFILE, 'a') as f:
            f.write(dbpediaid + '\t' + str(e) + '\n')
```

- We store the entities one per file, in the original JSON format. We use the
  ID as the filename, but have to avoid special characters, so we URL-encode
  it, which also removes slashes.

  Normally, Nordlys refuses queries from a Python user agent, so we set the
  user agent to `Radboud University`, which Nordlys happily accepts. We did
  not hit rate limiting.

```python
def get(dbpediaid):
    outfile = os.path.join(OUTDIR, quote_plus(dbpediaid) + '.json')
    if os.path.isfile(outfile):
        return  # already downloaded
    url = 'http://api.nordlys.cc/ec/lookup_id/{}'.format(quote_plus(dbpediaid))
    print(url)
    result = urlopen(Request(url,
        headers={'User-Agent': 'Radboud University'})).read()
    with open(outfile, 'w') as f:
        f.write(result.decode(encoding='UTF-8'))
```

### `run.py`

- `queries_stopped.json` lists all query terms. We load this file once, then
  process a run from `stdin`.

```python
if __name__ == '__main__':
    with open('queries_stopped.json') as f:
        queries = json.load(f)

    for line in fileinput.input():
        run(queries, line)
```

- We split each line of the run file. For each field we check (1) how many
  values there are and (2) how many of those values match a query term.

```python
def run(queries, line):
    query, _, dbpediaid, _, relevance, method = line.split('\t')
    terms = queries[query].split()
    try:
        result = get(dbpediaid)
        if result is None:
            return
        for field, values in result.items():
            # Count the values in this field that contain a query term.
            matches = 0
            for value in values:
                if match(value, terms):
                    matches += 1
            print('{}\t{}\t{}\t{}\t{}\t{}'.format(
                query, dbpediaid, relevance, field, len(values), matches))
    except Exception as e:
        print(dbpediaid)
        print(e)
        with open(ERRORFILE, 'a') as f:
            f.write(dbpediaid + '\t' + str(e) + '\n')
```

- For simplicity, we do not use lemmatisation or synonym resolution here,
  which could be an improvement in a next version. The `match` helper only
  does exact term matching:

```python
def match(value, terms):
    # A value matches when any of its whitespace-separated tokens is
    # literally one of the query terms.
    for v in value.split():
        if v in terms:
            return True
    return False
```

- `get` simply reads back the file that we stored with `scrape.py`:

```python
def get(dbpediaid):
    outfile = os.path.join(DATADIR, quote_plus(dbpediaid) + '.json')
    if not os.path.isfile(outfile):
        return None
    with open(outfile) as f:
        return json.load(f)
```

### `check.py`

- We keep a dictionary of scores per field and simply compute our weight score:

```python
if __name__ == '__main__':
    scores = dict()
    for line in fileinput.input():
        query, dbpediaid, relevance, field, nvalues, nmatches = line.split('\t')
        if field not in scores:
            scores[field] = 0
        scores[field] += float(relevance) * log(1 + int(nmatches)/int(nvalues))
```

- Then we print all scores:

```python
    for field, score in scores.items():
        print('{}\t{}'.format(field, score))
```

### Usage

All this allows for a fairly simple workflow:

```bash
mkdir data
./scrape.py < qrels-v2.txt
./run.py < bm25.run > fields.txt
./check.py < fields.txt | sort -k2 -n > scores.txt
```

This assumes that you have the following files from Nordlys:

- [`qrels-v2.txt`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/qrels-v2.txt) (relevance judgements, used here as the list of entities to download)
- [`bm25.run`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/runs/bm25.run) (a BM25 run with relevance scores)
- [`queries_stopped.json`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/queries_stopped.json) (query terms)

The system is agnostic with regard to the ranking function (BM25 or any other
method).

## Intermediate Result

diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
 # Practical Assignment Repository Team 4
 0. [Installation](Install.md)
 1. [The Plan](Plan.md)
-
+2. [The Implementation](Implementation.md)

diff --git a/check.py b/check.py
--- a/check.py
+++ b/check.py
@@ -1,6 +1,7 @@
 #!/usr/bin/env python3

 import fileinput
+from math import log

 if __name__ == '__main__':
     scores = dict()
@@ -8,7 +9,7 @@ if __name__ == '__main__':
         query, dbpediaid, relevance, field, nvalues, nmatches = line.split('\t')
         if field not in scores:
             scores[field] = 0
-        scores[field] += float(relevance) * int(nmatches) / int(nvalues)
+        scores[field] += float(relevance) * log(1 + int(nmatches)/int(nvalues))

     for field, score in scores.items():
         print('{}\t{}'.format(field, score))
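The snippet below is not part of the repository; it is a minimal sketch that
illustrates, with made-up numbers, how the new log-damped weighting in
`check.py` compares with the old linear ratio. Variable names mirror those in
`check.py`.

```python
from math import log

# Illustration only: hypothetical (relevance, nmatches, nvalues) triples,
# not taken from the assignment data.
examples = [
    (10.0, 1, 1),     # every value in the field matches a query term
    (10.0, 50, 100),  # half of the values match
    (10.0, 1, 100),   # a single match in a large field
]

for relevance, nmatches, nvalues in examples:
    linear = relevance * nmatches / nvalues           # old check.py weighting
    damped = relevance * log(1 + nmatches / nvalues)  # new weighting
    print('{}/{}\tlinear={:.3f}\tlog-damped={:.3f}'.format(
        nmatches, nvalues, linear, damped))
```

Since log(1 + x) <= x, the damped variant caps the contribution of a fully
matching field at relevance * log 2 and shrinks the gap between fully and
partially matching fields.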
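The `run.py` notes above mention lemmatisation or synonym resolution as a
possible next step for `match`. Purely as a sketch of that idea, assuming NLTK
and its WordNet data are available (the assignment code does not use NLTK), a
lemmatised variant could look like this:

```python
# Sketch only: lemmatised matching as a possible future improvement to run.py.
# Assumes: pip install nltk, plus nltk.download('wordnet') for the lemmatiser data.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def match_lemmatised(value, terms):
    # Compare lemmas instead of raw tokens, so e.g. 'cities' matches the term 'city'.
    term_lemmas = {lemmatizer.lemmatize(t) for t in terms}
    return any(lemmatizer.lemmatize(v) in term_lemmas for v in value.split())
```

Synonym resolution could be layered on at the same point, for instance by also
comparing WordNet synsets of the tokens.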