# Implementation

## Feasibility

The Plan mentions the following:

> We consider a vector space where every possible search field represents a binary parameter.
> A vector has `1` for the parameter if and only if it is included in the search (excluded from the blacklist).
> We will then run a hill-climbing algorithm through this higher-dimensional vector space
> in order to find a vector (an index setting) for which the ranking results are best.

Soon after we began implementing this feature on a locally run version of Nordlys, we encountered some issues. The most notable was that our machines were unable to index the full DBpedia set in a reasonable amount of time, using a reasonable amount of resources.

We therefore decided that the best option was to use a subset of the DBpedia dataset: the entities that have a relevance score assigned to them for at least one query. We then only considered the results of a given query in our assessment.

This has the added benefit that the relevance (both the human assessment and the score) is precomputed, so simply parsing the files provided by Nordlys is enough to implement any kind of field-selection assessment.

Unfortunately, it turned out that hill-climbing was also out of scope for the assignment. With only two programmers, neither of whom has much experience implementing such algorithms, the task was slightly too much work. Instead, we took a different approach and statically analysed the importance of all fields. The measure that we use takes the following form:

$$\mathrm{score}(f) = \sum_{q \in Q} \sum_{d \in D} \mathrm{relevance}(q, d) \cdot \log\left(1 + \frac{\mathrm{tf}(q, d, f)}{|f|}\right)$$

where `relevance` is the BM25 relevance that is stored by Nordlys, `D` is the set of documents, `Q` the set of queries, `tf` the function that counts how many of the field's values contain at least one of the query terms, and `|f|` the size of the field.

## Code

We use three Python programs:

1. `scrape.py` downloads DBpedia entities from the Nordlys API.
2. `run.py` lists, for each entry in a BM25 run, the DBpedia ID, the relevance score and, for each field in the entity, how many of its values match at least one of the query terms. This is the information that BM25 uses to compute the relevance.
3. `check.py` uses that information to investigate how important each field is.

We will now discuss the implementation of each of these files.

### `scrape.py`

- In this file we read lines from `stdin`. These lines are supposed to come from the relevance judgements (`qrels-v2.txt`). That way, we only download DBpedia entities that we actually need.

```python
import fileinput

if __name__ == '__main__':
    for line in fileinput.input():
        scrape(line)
```

- We split each line of the qrels file. Only the DBpedia ID is relevant. Entities that fail to download are recorded in an error file.

```python
def scrape(line):
    index, query, dbpediaid, relevance = line.split('\t')
    try:
        get(dbpediaid)
    except Exception as e:
        with open(ERRORFILE, 'a') as f:
            # str(e) is needed: an Exception cannot be concatenated to a str
            f.write(dbpediaid + '\t' + str(e) + '\n')
```

- We store the entities one per file in the original JSON format. We use the ID as the filename, but have to avoid special characters, so we URL-encode the ID, which also eliminates slashes. Normally, Nordlys refuses queries from a Python user-agent, so we set the user-agent to `Radboud University` and Nordlys happily accepts our requests. We did not hit rate limiting.
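For instance (the entity ID below is made up for illustration), `quote_plus` percent-encodes every character that would be unsafe in a filename, slashes included:

```python
from urllib.parse import quote_plus

# Made-up entity ID; '<', ':', '/' and '>' are all percent-encoded,
# so the result is safe to use as a filename.
print(quote_plus('<dbpedia:Audi/A4>'))  # -> %3Cdbpedia%3AAudi%2FA4%3E
```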
- The download function itself first checks whether we already stored the entity; only then does it query the Nordlys API:

```python
import os
from urllib.parse import quote_plus
from urllib.request import Request, urlopen

def get(dbpediaid):
    outfile = os.path.join(OUTDIR, quote_plus(dbpediaid) + '.json')
    if os.path.isfile(outfile):
        return
    url = 'http://api.nordlys.cc/ec/lookup_id/{}'.format(quote_plus(dbpediaid))
    print(url)
    result = urlopen(Request(url, headers={'User-Agent': 'Radboud University'})).read()
    with open(outfile, 'w') as f:
        f.write(result.decode(encoding='UTF-8'))
```

### `run.py`

- `queries_stopped.json` lists all query terms. We load this file once, then process a run from `stdin`.

```python
import fileinput
import json
import os
from urllib.parse import quote_plus

if __name__ == '__main__':
    with open('queries_stopped.json') as f:
        queries = json.load(f)
    for line in fileinput.input():
        run(queries, line)
```

- We split each line of the run file. For each field we check (1) how many values it has and (2) how many of those values match a query term.

```python
def run(queries, line):
    query, _, dbpediaid, _, relevance, method = line.split('\t')
    terms = queries[query].split()
    try:
        result = get(dbpediaid)
        if result is None:
            return
        for field, values in result.items():
            matches = 0
            for value in values:
                if match(value, terms):
                    matches += 1
            # print() adds its own newline, so the format string should not
            print('{}\t{}\t{}\t{}\t{}\t{}'.format(
                query, dbpediaid, relevance, field, len(values), matches))
    except Exception as e:
        print(dbpediaid)
        print(e)
        with open(ERRORFILE, 'a') as f:
            f.write(dbpediaid + '\t' + str(e) + '\n')
```

- For simplicity, we do not use lemmatisation or synonym resolution here; that could be an improvement in a next version.

```python
def match(value, terms):
    for v in value.split():
        if v in terms:
            return True
    return False
```

- `get` simply reads the file that we stored with `scrape.py`:

```python
def get(dbpediaid):
    outfile = os.path.join(DATADIR, quote_plus(dbpediaid) + '.json')
    if not os.path.isfile(outfile):
        return None
    with open(outfile) as f:
        return json.load(f)
```

### `check.py`

- We keep a dictionary of scores per field and simply accumulate our weight score:

```python
import fileinput
from math import log

if __name__ == '__main__':
    scores = dict()
    for line in fileinput.input():
        query, dbpediaid, relevance, field, nvalues, nmatches = line.split('\t')
        if field not in scores:
            scores[field] = 0
        scores[field] += float(relevance) * log(1 + int(nmatches) / int(nvalues))
```

- Then we print all scores:

```python
for field, score in scores.items():
    print('{}\t{}'.format(field, score))
```

### Usage

All this allows for a fairly simple workflow:

```bash
mkdir data
./scrape.py < qrels-v2.txt
./run.py < bm25.run > fields.txt
./check.py < fields.txt | sort -k2 -n > scores.txt
```

This assumes that you have the following files from Nordlys:

- [`qrels-v2.txt`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/qrels-v2.txt) (relevance judgements; we use it as the list of entities to download)
- [`bm25.run`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/runs/bm25.run) (a BM25 run: ranked results with relevance scores)
- [`queries_stopped.json`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/queries_stopped.json) (query terms)

The system is agnostic with regard to the ranking function (BM25 or any other method).
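To make the measure concrete, the sketch below reproduces the accumulation of `check.py` by hand on a few `fields.txt` rows for a single field; the relevance scores and counts are invented for illustration:

```python
from math import log

# Invented (relevance, nvalues, nmatches) triples for one field.
rows = [(7.91, 1, 1), (4.05, 1, 1), (2.33, 1, 0)]

# Same accumulation as check.py: relevance * log(1 + nmatches / nvalues).
score = sum(rel * log(1 + nmatches / nvalues) for rel, nvalues, nmatches in rows)
print(score)  # (7.91 + 4.05) * log(2), roughly 8.29
```

Rows without matches contribute nothing, so a field only scores high when it matches query terms in documents that the ranking function already considers relevant.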
## Intermediate Results

These are the thirty most important fields as found by our measure:

| Field | Score | Used by Nordlys |
|------------------------------|----------:|:---------------:|
| `` | 2205.50 | ![][n] |
| `` | 2246.90 | ![][n] |
| `` | 2357.07 | ![][n] |
| `` | 2357.35 | ![][n] |
| `` | 2384.14 | ![][n] |
| `` | 2440.73 | ![][n] |
| `` | 2485.96 | ![][n] |
| `` | 2689.86 | ![][n] |
| `` | 2691.94 | ![][n] |
| `` | 2694.41 | ![][y] |
| `` | 2698.75 | ![][n] |
| `` | 2836.44 | ![][n] |
| `` | 3248.58 | ![][n] |
| `` | 3425.58 | ![][n] |
| `` | 3430.20 | ![][n] |
| `` | 4316.34 | ![][y] |
| `` | 4470.38 | ![][n] |
| `` | 4480.06 | ![][n] |
| `` | 4704.52 | ![][n] |
| `` | 8040.36 | ![][n] |
| `` | 10999.72 | ![][n] |
| `` | 22065.46 | ![][n] |
| `` | 23442.34 | ![][n] |
| `` | 24697.75 | ![][n] |
| `` | 25500.42 | ![][y] |
| `` | 32860.37 | ![][y] |
| `` | 86218.71 | ![][y] |
| `` | 105358.89 | ![][y] |
| `` | 514446.08 | ![][n] |
| `` | 581355.57 | ![][n] |

We see that many of the relevant fields are actually [not used by Nordlys](https://iai-group.github.io/DBpedia-Entity/index_details.html). However, this is not yet an indication that these fields should be added to the index. After all, adding more fields means more computation time to build the index and to retrieve search results. In fact, we expect that many of the fields not used actually display similarities with fields that *are* indexed. For example, the `` field will probably match because the title is repeated in the abstract.

[y]: http://i.stack.imgur.com/iro5J.png
[n]: http://i.stack.imgur.com/asAya.png
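This redundancy hypothesis could be tested directly on the entities downloaded by `scrape.py`. The sketch below is a rough illustration, not part of our pipeline; the function name is ours, and the field keys `<rdfs:label>` and `<dbo:abstract>` are assumptions about how the entity JSON names its title and abstract fields:

```python
import json
import os
from urllib.parse import quote_plus

DATADIR = 'data'  # the directory that scrape.py writes to

# Assumed field names; the actual keys in the entity JSON may differ.
TITLE_FIELD = '<rdfs:label>'
ABSTRACT_FIELD = '<dbo:abstract>'

def title_repeated_in_abstract(dbpediaid):
    """Return True if every title term occurs (as a substring) in the abstract."""
    path = os.path.join(DATADIR, quote_plus(dbpediaid) + '.json')
    with open(path) as f:
        entity = json.load(f)
    title = ' '.join(entity.get(TITLE_FIELD, [])).lower()
    abstract = ' '.join(entity.get(ABSTRACT_FIELD, [])).lower()
    return all(term in abstract for term in title.split())
```

Counting how often this returns `True` over the qrels entities would show how much of such a field's signal is already covered by an indexed field.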