-rw-r--r-- | Implementation.md | 161
-rw-r--r-- | README.md         |   2
-rwxr-xr-x | check.py          |   3
3 files changed, 163 insertions, 3 deletions
diff --git a/Implementation.md b/Implementation.md
index 337b9fe..b60b8b9 100644
--- a/Implementation.md
+++ b/Implementation.md
@@ -26,6 +26,165 @@ The measure that we use takes the form of:

Where `relevance` is the BM25 relevance score that is stored by Nordlys, `D` is
the set of documents, `Q` the set of queries, `tf` the function that counts the
number of times any of the query terms was found in that field, and `|f|` the
size of the field.

-## Implementation
+## Code

We use three Python programs that:

1. Get DBPedia entities from the Nordlys API (`scrape.py`).
2. For each entry in a BM25 run, list the DBPedia ID, the relevance score and,
   for each field in the entity, how many of the values match at least one of
   the query terms. This is the information that BM25 uses to compute the
   relevance (`run.py`).
3. Use that information to investigate how important each field is
   (`check.py`).

We will now discuss the implementation of each of these files.

### `scrape.py`

- In this file we read lines from `stdin`. These lines are supposed to come
  from the qrels file listed in the usage section below; that way, we only
  download the DBPedia entities that we actually need.

```python
if __name__ == '__main__':
    for line in fileinput.input():
        scrape(line)
```

- We split the lines; only the DBPedia ID is relevant.

```python
def scrape(line):
    index, query, dbpediaid, relevance = line.split('\t')
    try:
        get(dbpediaid)
    except Exception as e:
        # Record failed downloads so they can be retried later.
        with open(ERRORFILE, 'a') as f:
            f.write(dbpediaid + '\t' + str(e) + '\n')
```

- We store the entities one per file, in the original JSON format. We use the
  ID as the filename, but have to avoid special characters, so we URL-encode
  it, which also removes slashes.

  Normally, Nordlys refuses queries from a Python user agent, so we set the
  user agent to `Radboud University`, which Nordlys happily accepts. We did
  not hit rate limiting.

```python
def get(dbpediaid):
    outfile = os.path.join(OUTDIR, quote_plus(dbpediaid) + '.json')
    if os.path.isfile(outfile):
        return  # already downloaded
    url = 'http://api.nordlys.cc/ec/lookup_id/{}'.format(quote_plus(dbpediaid))
    print(url)
    result = urlopen(Request(url,
        headers={'User-Agent': 'Radboud University'})).read()
    with open(outfile, 'w') as f:
        f.write(result.decode(encoding='UTF-8'))
```

### `run.py`

- `queries_stopped.json` lists all query terms. We load this file once, then
  process a run from `stdin`.

```python
if __name__ == '__main__':
    with open('queries_stopped.json') as f:
        queries = json.load(f)

    for line in fileinput.input():
        run(queries, line)
```

- We split each line of the run file. For each field we check (1) how many
  values there are and (2) how many of those values match a query term.

```python
def run(queries, line):
    query, _, dbpediaid, _, relevance, method = line.split('\t')
    terms = queries[query].split()
    try:
        result = get(dbpediaid)
        if result is None:
            return
        for field, values in result.items():
            # Count the values in this field that contain a query term.
            matches = 0
            for value in values:
                if match(value, terms):
                    matches += 1
            print('{}\t{}\t{}\t{}\t{}\t{}'.format(
                query, dbpediaid, relevance, field, len(values), matches))
    except Exception as e:
        print(dbpediaid)
        print(e)
        with open(ERRORFILE, 'a') as f:
            f.write(dbpediaid + '\t' + str(e) + '\n')
```

- For simplicity, we do not use lemmatisation or synonym resolution here,
  which could be an improvement in a next version. The `match` helper only
  does exact term matching:

```python
def match(value, terms):
    # A value matches when any of its whitespace-separated tokens is
    # literally one of the query terms.
    for v in value.split():
        if v in terms:
            return True
    return False
```

- `get` simply reads back the file that we stored with `scrape.py`:

```python
def get(dbpediaid):
    outfile = os.path.join(DATADIR, quote_plus(dbpediaid) + '.json')
    if not os.path.isfile(outfile):
        return None
    with open(outfile) as f:
        return json.load(f)
```

### `check.py`

- We keep a dictionary of scores per field and simply compute our weight score:

```python
if __name__ == '__main__':
    scores = dict()
    for line in fileinput.input():
        query, dbpediaid, relevance, field, nvalues, nmatches = line.split('\t')
        if field not in scores:
            scores[field] = 0
        scores[field] += float(relevance) * log(1 + int(nmatches)/int(nvalues))
```

- Then we print all scores:

```python
    for field, score in scores.items():
        print('{}\t{}'.format(field, score))
```

### Usage

All this allows for a fairly simple workflow:

```bash
mkdir data
./scrape.py < qrels-v2.txt
./run.py < bm25.run > fields.txt
./check.py < fields.txt | sort -k2 -n > scores.txt
```

This assumes that you have the following files from Nordlys:

- [`qrels-v2.txt`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/qrels-v2.txt) (relevance judgements, used here as the list of entities to download)
- [`bm25.run`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/runs/bm25.run) (a BM25 run with relevance scores)
- [`queries_stopped.json`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/queries_stopped.json) (query terms)

The system is agnostic with regard to the ranking function (BM25 or any other
method).

## Intermediate Result

diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
 # Practical Assignment Repository Team 4
 0. [Installation](Install.md)
 1. [The Plan](Plan.md)
-
+2. [The Implementation](Implementation.md)

diff --git a/check.py b/check.py
--- a/check.py
+++ b/check.py
@@ -1,6 +1,7 @@
 #!/usr/bin/env python3

 import fileinput
+from math import log

 if __name__ == '__main__':
     scores = dict()
@@ -8,7 +9,7 @@ if __name__ == '__main__':
         query, dbpediaid, relevance, field, nvalues, nmatches = line.split('\t')
         if field not in scores:
             scores[field] = 0
-        scores[field] += float(relevance) * int(nmatches) / int(nvalues)
+        scores[field] += float(relevance) * log(1 + int(nmatches)/int(nvalues))

     for field, score in scores.items():
         print('{}\t{}'.format(field, score))
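The snippet below is not part of the repository; it is a minimal sketch that
illustrates, with made-up numbers, how the new log-damped weighting in
`check.py` compares with the old linear ratio. Variable names mirror those in
`check.py`.

```python
from math import log

# Illustration only: hypothetical (relevance, nmatches, nvalues) triples,
# not taken from the assignment data.
examples = [
    (10.0, 1, 1),     # every value in the field matches a query term
    (10.0, 50, 100),  # half of the values match
    (10.0, 1, 100),   # a single match in a large field
]

for relevance, nmatches, nvalues in examples:
    linear = relevance * nmatches / nvalues           # old check.py weighting
    damped = relevance * log(1 + nmatches / nvalues)  # new weighting
    print('{}/{}\tlinear={:.3f}\tlog-damped={:.3f}'.format(
        nmatches, nvalues, linear, damped))
```

Since log(1 + x) <= x, the damped variant caps the contribution of a fully
matching field at relevance * log 2 and shrinks the gap between fully and
partially matching fields.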
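The `run.py` notes above mention lemmatisation or synonym resolution as a
possible next step for `match`. Purely as a sketch of that idea, assuming NLTK
and its WordNet data are available (the assignment code does not use NLTK), a
lemmatised variant could look like this:

```python
# Sketch only: lemmatised matching as a possible future improvement to run.py.
# Assumes: pip install nltk, plus nltk.download('wordnet') for the lemmatiser data.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def match_lemmatised(value, terms):
    # Compare lemmas instead of raw tokens, so e.g. 'cities' matches the term 'city'.
    term_lemmas = {lemmatizer.lemmatize(t) for t in terms}
    return any(lemmatizer.lemmatize(v) in term_lemmas for v in value.split())
```

Synonym resolution could be layered on at the same point, for instance by also
comparing WordNet synsets of the tokens.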