diff options
Diffstat (limited to 'Implementation.md')
-rw-r--r-- | Implementation.md | 288 |
1 files changed, 288 insertions, 0 deletions
diff --git a/Implementation.md b/Implementation.md new file mode 100644 index 0000000..c19f119 --- /dev/null +++ b/Implementation.md @@ -0,0 +1,288 @@ +# Implementation + +## Feasibility +Our [Plan](Plan.md) mentions the following: + +> We consider a vector space where every possible search field represents a +> binary parameter. A vector has `1` for the parameter if and only if it is +> included in the search (excluded from the blacklist). We will then run a +> hill-climbing algorithm through this higher-dimensional vector space in order +> to find a vector (an index setting) for which the ranking results are best. + +Soon after we began trying to implement this feature using a locally run +version of Nordlys, we encountered some issues, the most notable being that +our machines were unable to index the full DBPedia set in a reasonable amount +of time, using a reasonable amount of resources. When we encountered this +issue, we decided that the best option was to use a subset of the DBPedia +dataset. + +The subset that we settled on is the subset that has relevance scores assigned +to them for any query. We then only consider the result of a given query in our +assessment. + +The above has the additional benefit that the relevance judgements (both the +human assessment and the score) need not be computed. This meant that simply +parsing the files that are provided by Nordlys is enough to implement any kind +of field selected assessment. + +Unfortunately, it turned out that we also did not have resources to implement a +hill-climbing algorithm. Having only 2 programmers made the task slightly too +much work. Instead, we decided to take a different approach and statically +analyse the importance of all fields. The measure that we use takes the form +of: + + + +Where *relevance* is the BM25 relevance that is stored by Nordlys, *D* is the +set of documents, *Q* the set of queries, *tf* the function that counts the +amount of times any of the query terms was found in that field and |*f*| the +size of the field. + +The formula assumes that relevance is more or less linear. The logarithm is +used because more occurrences of the same term are not as important as the +first occurrence. + +## Code + +We use three Python programs that: + +1. Get DBPedia entities from the Nordlys API (`scrape.py`) +2. For each entry in a BM25 run, list DBPedia ID, relevance score and for each + field in the entity how many of the values match with at least one of the + query terms. This information is what BM25 uses to compute the relevance. + This file is `run.py`. +3. Use that information to investigate how important each field is + (`check.py`). + +We will now discuss the implementation of each of these files. + +### `scrape.py` + +- In this file we read lines from `stdin`. These lines are supposed to come + from a BM25 run. That way, we only download DBPedia entities that we + actually need. + +```python +if __name__ == '__main__': + for line in fileinput.input(): + scrape(line) +``` + +- We split the lines from the run file. Only the DBPedia ID is relevant. + +```python +def scrape(line): + index, query, dbpediaid, relevance = line.split('\t') + try: + get(dbpediaid) + except Exception as e: + with open(ERRORFILE, 'a') as f: + f.write(dbpediaid + '\t' + e + '\n') +``` + +- We store the entities one per file in the original JSON format. We use the ID + as filename, but have to prevent special characters so URL-encode it and + remove slashes. + + Normally, Nordlys will refuse queries from a Python user-agent. So we adapt + the user-agent to `Radboud University` and Nordlys accepts it happily. We did + not hit rate limiting. + +```python +def get(dbpediaid): + outfile = os.path.join(OUTDIR, quote_plus(dbpediaid) + '.json') + if os.path.isfile(outfile): + return + url = 'http://api.nordlys.cc/ec/lookup_id/{}'.format(quote_plus(dbpediaid)) + print(url) + result = urlopen(Request(url, + headers={'User-Agent': 'Radboud University'})).read() + with open(outfile, 'w') as f: + f.write(result.decode(encoding='UTF-8')) +``` + +### `run.py` + +- `queries_stopped.json` lists all query terms. We load this file once, then + process a run from `stdin`. + +```python +if __name__ == '__main__': + with open('queries_stopped.json') as f: + queries = json.load(f) + + for line in fileinput.input(): + run(queries, line) +``` + +- We split the line in the run file. For each field we check (1) how many + values there are and (2) how many values match a query term. + +```python +def run(queries, line): + query, _, dbpediaid, _, relevance, method = line.split('\t') + terms = queries[query].split() + try: + result = get(dbpediaid) + if result is None: + return + for field, values in result.items(): + matches = 0 + for value in values: + if match(value, terms): + matches += 1 + print('{}\t{}\t{}\t{}\t{}\t{}\n'.format( + query, dbpediaid, relevance, field, len(values), matches)) + except Exception as e: + print(dbpediaid) + print(e) + with open(ERRORFILE, 'a') as f: + f.write(dbpediaid + '\t' + e + '\n') +``` + +- For simplicity, we do not use lemmatisation or synonym resolution here, which + could be an improvement in a next version. + +```python +def match(value, terms): + for v in value.split(): + if v in terms: + return True + return False +``` + +- `get` simply gets the file that we stored with `scrape.py`: + +```python +def get(dbpediaid): + outfile = os.path.join(DATADIR, quote_plus(dbpediaid) + '.json') + if not os.path.isfile(outfile): + return None + with open(outfile) as f: + return json.load(f) +``` + +### `check.py` + +- We keep a dictionary of scores per field and simply compute our weight score: + +```python +if __name__ == '__main__': + scores = dict() + for line in fileinput.input(): + query, dbpediaid, relevance, field, nvalues, nmatches = line.split('\t') + if field not in scores: + scores[field] = 0 + scores[field] += float(relevance) * log(1 + int(nmatches)/int(nvalues)) +``` + +- Then we print all scores: + +```python + for field, score in scores.items(): + print('{}\t{}'.format(field, score)) +``` + +### Usage + +All this allows for a fairly simple workflow: + +```bash +mkdir data +./scrape.py < qrels-v2.txt +./run.py < bm25.run > fields.txt +./check.py < fields.txt | sort -k2 -n > scores.txt +``` + +This assumes that you have the following files from Nordlys: + +- [`qrels-v2.txt`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/qrels-v2.txt) (entity list) +- [`bm25.run`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/runs/bm25.run) (BM25 relevance judgements) +- [`queries_stopped.json`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/queries_stopped.json) (query terms) + +The system is agnostic with regards to the ranking function (BM25 or another +method). + +## Intermediate Results +These are the thirty most important fields as found by our measure when used on +the BM25 relevance scores: + +| Field | Score | Used by Nordlys | +|------------------------------|----------:|:---------------:| +| `<dbp:imageFlag>` | 2205.50 | ![][n] | +| `<dbp:office>` | 2246.90 | ![][n] | +| `<dbp:pushpinMapCaption>` | 2357.07 | ![][n] | +| `<dbp:description>` | 2357.35 | ![][n] | +| `<dbp:placeOfBirth>` | 2384.14 | ![][n] | +| `<dbp:fastTime>` | 2440.73 | ![][n] | +| `<dbp:imageMap>` | 2485.96 | ![][n] | +| `<dbp:writer>` | 2689.86 | ![][n] | +| `<dbp:alt>` | 2691.94 | ![][n] | +| `<foaf:givenName>` | 2694.41 | ![][y] | +| `<dbp:poleTime>` | 2698.75 | ![][n] | +| `<dbp:country>` | 2836.44 | ![][n] | +| `<dbp:type>` | 3248.58 | ![][n] | +| `<dbo:office>` | 3425.58 | ![][n] | +| `<dbp:location>` | 3430.20 | ![][n] | +| `<dbp:officialName>` | 4316.34 | ![][y] | +| `<dbp:quote>` | 4470.38 | ![][n] | +| `<dbp:imageCaption>` | 4480.06 | ![][n] | +| `<dbp:producer>` | 4704.52 | ![][n] | +| `<dbp:mapCaption>` | 8040.36 | ![][n] | +| `<dbp:title>` | 10999.72 | ![][n] | +| `<dbp:shortDescription>` | 22065.46 | ![][n] | +| `<dc:description>` | 23442.34 | ![][n] | +| `<dbp:caption>` | 24697.75 | ![][n] | +| `<dbp:name>` | 25500.42 | ![][y] | +| `<foaf:name>` | 32860.37 | ![][y] | +| `<dbo:wikiPageWikiLinkTent>` | 86218.71 | ![][y] | +| `<rdfs:label>` | 105358.89 | ![][y] | +| `<rdfs:comment>` | 514446.08 | ![][n] | +| `<dbo:abstract>` | 581355.57 | ![][n] | + +We see that many of the relevant fields are actually [not used by +Nordlys](https://iai-group.github.io/DBpedia-Entity/index_details.html). +However, this is not yet an indication that these fields should be added to the +index. After all, adding more fields means more computation time to build the +index and to retrieve search results. + +In fact, we expect that many of the fields not used actually display +similarities with fields that *are* indexed. For example, the `<dbo:abstract>` +field will probably match because the title is repeated in the abstract. + +We can perform the same analysis on the human assessments. This gives a rather +different list of fields: + +| Field | Score | Rank for BM25 | Used by Nordlys | +|-------------------------------|---------:|--------------:|:---------------:| +| `<dbp:pushpinMapCaption>` | 133.77 | 28 | ![][n] | +| `<dbp:foundation>` | 136.32 | 266 | ![][n] | +| `<dbp:imageCaption>` | 139.85 | 13 | ![][n] | +| `<dbp:bridgeName>` | 164.91 | 49 | ![][n] | +| `<dbp:imageFlag>` | 166.35 | 30 | ![][n] | +| `<dbp:mapCaption>` | 170.93 | 11 | ![][n] | +| `<dbo:foundingYear>` | 173.92 | 299 | ![][n] | +| `<dbp:producer>` | 186.37 | 12 | ![][n] | +| `<dbp:ground>` | 297.25 | 802 | ![][n] | +| `<dbp:title>` | 328.93 | 10 | ![][n] | +| `<dc:description>` | 332.05 | 8 | ![][n] | +| `<dbp:shortDescription>` | 334.79 | 9 | ![][n] | +| `<dbp:caption>` | 648.73 | 7 | ![][n] | +| `<foaf:givenName>` | 1436.74 | 21 | ![][y] | +| `<dbp:name>` | 1961.98 | 6 | ![][y] | +| `<foaf:name>` | 2086.67 | 5 | ![][y] | +| `<dbo:wikiPageWikiLinkText>` | 2897.51 | 4 | ![][y] | +| `<rdfs:label>` | 3483.06 | 3 | ![][y] | +| `<rdfs:comment>` | 12323.46 | 2 | ![][n] | +| `<dbo:abstract>` | 13002.74 | 1 | ![][n] | + +Based on this, one may want to try adding fields like `<dbp:caption>` to the +index. + +Conversely, this information can also be used to improve the relevance measure. +Apparently, `<dbp:ground>`, `<dbo:foundingYear>` and `<dbp:foundation>` are +quite relevant according to human assessors, but not at all according to BM25. + +[y]: http://i.stack.imgur.com/iro5J.png +[n]: http://i.stack.imgur.com/asAya.png |