# Implementation

## Feasibility

Our [Plan](Plan.md) mentions the following:

> We consider a vector space where every possible search field represents a
> binary parameter. A vector has `1` for the parameter if and only if it is
> included in the search (excluded from the blacklist). We will then run a
> hill-climbing algorithm through this higher-dimensional vector space in order
> to find a vector (an index setting) for which the ranking results are best.

Soon after we began implementing this feature on a locally run version of
Nordlys, we encountered some issues, the most notable being that our machines
were unable to index the full DBPedia set in a reasonable amount of time, using
a reasonable amount of resources.

When we encountered this issue, we decided that the best option was to use a
subset of the DBPedia dataset: those entities that have a relevance score
assigned to them for at least one query. We then only consider the results of a
given query in our assessment.

This has the additional benefit that the relevance judgements (both the human
assessment and the score) need not be computed. Simply parsing the files that
Nordlys provides is enough to implement any kind of field-selected assessment.

Unfortunately, it turned out that we also did not have the resources to
implement a hill-climbing algorithm: with only two programmers, the task was
slightly too much work. Instead, we took a different approach and statically
analysed the importance of all fields. The measure that we use takes the form
of:

$$\mathrm{score}(f) = \sum_{q \in Q} \sum_{d \in D} \mathrm{relevance}(q, d) \cdot \log\left(1 + \frac{\mathrm{tf}_f(q, d)}{|f|}\right)$$

where *relevance* is the BM25 relevance that is stored by Nordlys, *D* is the
set of documents, *Q* the set of queries, *tf* the function that counts the
number of times any of the query terms is found in the field *f*, and |*f*| the
size of the field. The formula assumes that relevance is more or less linear.
The logarithm is used because repeated occurrences of the same term are not as
important as the first occurrence.

## Code

We use three Python programs that:

1. get DBPedia entities from the Nordlys API (`scrape.py`);
2. for each entry in a BM25 run, list the DBPedia ID, the relevance score and,
   for each field in the entity, how many of its values match at least one of
   the query terms; this is the information that BM25 uses to compute the
   relevance (`run.py`);
3. use that information to investigate how important each field is
   (`check.py`).

We will now discuss the implementation of each of these files.

### `scrape.py`

- In this file we read lines from `stdin`. These lines are supposed to come
  from the relevance judgements file (`qrels-v2.txt`). That way, we only
  download DBPedia entities that we actually need.

  ```python
  import fileinput

  if __name__ == '__main__':
      for line in fileinput.input():
          scrape(line)
  ```

- We split each line of the judgements file. Only the DBPedia ID is relevant.

  ```python
  def scrape(line):
      index, query, dbpediaid, relevance = line.split('\t')
      try:
          get(dbpediaid)
      except Exception as e:
          with open(ERRORFILE, 'a') as f:
              f.write(dbpediaid + '\t' + str(e) + '\n')
  ```

- We store the entities one per file, in the original JSON format. We use the
  ID as the filename, but have to avoid special characters, so we URL-encode
  the ID (which also takes care of slashes). Normally, Nordlys refuses queries
  from a Python user agent, so we set the user agent to `Radboud University`,
  which Nordlys happily accepts. We did not hit rate limiting.
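  As a short illustration of the filename scheme (the entity ID below is a
  hypothetical example; the real IDs come from the input file), `quote_plus`
  percent-encodes every special character, so the result is safe to use as a
  filename:

  ```python
  from urllib.parse import quote_plus

  # Hypothetical DBPedia-Entity style ID; '<', ':' and '>' are percent-encoded,
  # and a '/' would become '%2F'.
  print(quote_plus('<dbpedia:Audi_A4>') + '.json')
  # -> %3Cdbpedia%3AAudi_A4%3E.json
  ```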
  The `get` function then downloads an entity and stores it under that name:

  ```python
  import os
  from urllib.parse import quote_plus
  from urllib.request import Request, urlopen

  def get(dbpediaid):
      outfile = os.path.join(OUTDIR, quote_plus(dbpediaid) + '.json')
      if os.path.isfile(outfile):
          return
      url = 'http://api.nordlys.cc/ec/lookup_id/{}'.format(quote_plus(dbpediaid))
      print(url)
      result = urlopen(Request(url, headers={'User-Agent': 'Radboud University'})).read()
      with open(outfile, 'w') as f:
          f.write(result.decode(encoding='UTF-8'))
  ```

### `run.py`

- `queries_stopped.json` lists all query terms. We load this file once, then
  process a run from `stdin`.

  ```python
  import fileinput
  import json

  if __name__ == '__main__':
      with open('queries_stopped.json') as f:
          queries = json.load(f)
      for line in fileinput.input():
          run(queries, line)
  ```

- We split each line of the run file. For each field we check (1) how many
  values it has and (2) how many of those values match a query term.

  ```python
  def run(queries, line):
      query, _, dbpediaid, _, relevance, method = line.split('\t')
      terms = queries[query].split()
      try:
          result = get(dbpediaid)
          if result is None:
              return
          for field, values in result.items():
              matches = 0
              for value in values:
                  if match(value, terms):
                      matches += 1
              # print() appends the newline itself.
              print('{}\t{}\t{}\t{}\t{}\t{}'.format(
                  query, dbpediaid, relevance, field, len(values), matches))
      except Exception as e:
          print(dbpediaid)
          print(e)
          with open(ERRORFILE, 'a') as f:
              f.write(dbpediaid + '\t' + str(e) + '\n')
  ```

- For simplicity, we do not use lemmatisation or synonym resolution here,
  which could be an improvement in a next version (see the sketch after the
  usage section below).

  ```python
  def match(value, terms):
      for v in value.split():
          if v in terms:
              return True
      return False
  ```

- `get` simply reads the file that we stored with `scrape.py`; entities that
  were never scraped are skipped in `run`:

  ```python
  def get(dbpediaid):
      outfile = os.path.join(DATADIR, quote_plus(dbpediaid) + '.json')
      if not os.path.isfile(outfile):
          return None
      with open(outfile) as f:
          return json.load(f)
  ```

### `check.py`

- We keep a dictionary of scores per field and simply compute our weight
  score:

  ```python
  import fileinput
  from math import log

  if __name__ == '__main__':
      scores = dict()
      for line in fileinput.input():
          query, dbpediaid, relevance, field, nvalues, nmatches = \
              line.split('\t')
          if field not in scores:
              scores[field] = 0
          scores[field] += float(relevance) * log(1 + int(nmatches) / int(nvalues))
  ```

- Then we print all scores:

  ```python
  for field, score in scores.items():
      print('{}\t{}'.format(field, score))
  ```

### Usage

All this allows for a fairly simple workflow:

```bash
mkdir data
./scrape.py < qrels-v2.txt
./run.py < bm25.run > fields.txt
./check.py < fields.txt | sort -k2 -n > scores.txt
```

This assumes that you have the following files from Nordlys:

- [`qrels-v2.txt`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/qrels-v2.txt)
  (relevance judgements; we use it as the list of entities to scrape)
- [`bm25.run`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/runs/bm25.run)
  (a BM25 run with relevance scores)
- [`queries_stopped.json`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/queries_stopped.json)
  (query terms)

The system is agnostic with regard to the ranking function (BM25 or any other
method).
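As noted in the `run.py` section, `match` only performs exact token matching.
A minimal sketch of a stemmed variant, assuming NLTK is installed (the name
`match_stemmed` is ours and not part of the current code):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def match_stemmed(value, terms):
    # Stem both sides so that, for example, 'engines' matches 'engine'.
    stemmed_terms = {stemmer.stem(t) for t in terms}
    return any(stemmer.stem(v) in stemmed_terms for v in value.split())
```

Stemming both the field values and the query terms keeps the comparison
symmetric; synonym resolution would additionally require a resource such as
WordNet.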
## Intermediate Results

These are the thirty most important fields as found by our measure when used on
the BM25 relevance scores:

| Field                         | Score     | Used by Nordlys |
|-------------------------------|----------:|:---------------:|
| ``                            |   2205.50 | ![][n]          |
| ``                            |   2246.90 | ![][n]          |
| ``                            |   2357.07 | ![][n]          |
| ``                            |   2357.35 | ![][n]          |
| ``                            |   2384.14 | ![][n]          |
| ``                            |   2440.73 | ![][n]          |
| ``                            |   2485.96 | ![][n]          |
| ``                            |   2689.86 | ![][n]          |
| ``                            |   2691.94 | ![][n]          |
| ``                            |   2694.41 | ![][y]          |
| ``                            |   2698.75 | ![][n]          |
| ``                            |   2836.44 | ![][n]          |
| ``                            |   3248.58 | ![][n]          |
| ``                            |   3425.58 | ![][n]          |
| ``                            |   3430.20 | ![][n]          |
| ``                            |   4316.34 | ![][y]          |
| ``                            |   4470.38 | ![][n]          |
| ``                            |   4480.06 | ![][n]          |
| ``                            |   4704.52 | ![][n]          |
| ``                            |   8040.36 | ![][n]          |
| ``                            |  10999.72 | ![][n]          |
| ``                            |  22065.46 | ![][n]          |
| ``                            |  23442.34 | ![][n]          |
| ``                            |  24697.75 | ![][n]          |
| ``                            |  25500.42 | ![][y]          |
| ``                            |  32860.37 | ![][y]          |
| ``                            |  86218.71 | ![][y]          |
| ``                            | 105358.89 | ![][y]          |
| ``                            | 514446.08 | ![][n]          |
| ``                            | 581355.57 | ![][n]          |

We see that many of the relevant fields are actually
[not used by Nordlys](https://iai-group.github.io/DBpedia-Entity/index_details.html).
However, this is not yet an indication that these fields should be added to the
index. After all, adding more fields means more computation time to build the
index and to retrieve search results. In fact, we expect that many of the
unused fields display similarities with fields that *are* indexed. For example,
the `` field will probably match because the title is repeated in the abstract.

We can perform the same analysis on the human assessments. This gives a rather
different list of fields:

| Field                          | Score    | Rank for BM25 | Used by Nordlys |
|--------------------------------|---------:|--------------:|:---------------:|
| ``                             |   133.77 |            28 | ![][n]          |
| ``                             |   136.32 |           266 | ![][n]          |
| ``                             |   139.85 |            13 | ![][n]          |
| ``                             |   164.91 |            49 | ![][n]          |
| ``                             |   166.35 |            30 | ![][n]          |
| ``                             |   170.93 |            11 | ![][n]          |
| ``                             |   173.92 |           299 | ![][n]          |
| ``                             |   186.37 |            12 | ![][n]          |
| ``                             |   297.25 |           802 | ![][n]          |
| ``                             |   328.93 |            10 | ![][n]          |
| ``                             |   332.05 |             8 | ![][n]          |
| ``                             |   334.79 |             9 | ![][n]          |
| ``                             |   648.73 |             7 | ![][n]          |
| ``                             |  1436.74 |            21 | ![][y]          |
| ``                             |  1961.98 |             6 | ![][y]          |
| ``                             |  2086.67 |             5 | ![][y]          |
| ``                             |  2897.51 |             4 | ![][y]          |
| ``                             |  3483.06 |             3 | ![][y]          |
| ``                             | 12323.46 |             2 | ![][n]          |
| ``                             | 13002.74 |             1 | ![][n]          |

Based on this, one may want to try adding fields like `` to the index.
Conversely, this information can also be used to improve the relevance measure.
Apparently, ``, `` and `` are quite relevant according to human assessors, but
not at all according to BM25.

[y]: http://i.stack.imgur.com/iro5J.png
[n]: http://i.stack.imgur.com/asAya.png
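One way to quantify the disagreement between the two lists is a rank
correlation over the fields that occur in both. A minimal sketch, assuming
SciPy is available and that the human-assessment scores were written to a
(hypothetical) `scores_human.txt` in the same format as `scores.txt`:

```python
import csv

from scipy.stats import spearmanr

def read_scores(path):
    # Each line is '<field>\t<score>', as written by check.py.
    with open(path) as f:
        return {field: float(score)
                for field, score in csv.reader(f, delimiter='\t')}

bm25 = read_scores('scores.txt')         # scores from the BM25 run
human = read_scores('scores_human.txt')  # hypothetical: scores from the qrels
common = sorted(set(bm25) & set(human))
rho, p = spearmanr([bm25[f] for f in common], [human[f] for f in common])
print('Spearman rho over {} fields: {:.2f} (p = {:.3f})'.format(
    len(common), rho, p))
```

A low correlation would support the observation above that BM25 and the human
assessors value rather different fields.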