# Implementation
## Feasibility
Our [Plan](Plan.md) mentions the following:
> We consider a vector space where every possible search field represents a
> binary parameter. A vector has `1` for the parameter if and only if it is
> included in the search (excluded from the blacklist). We will then run a
> hill-climbing algorithm through this higher-dimensional vector space in order
> to find a vector (an index setting) for which the ranking results are best.
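
For reference, the hill-climbing search over such field-inclusion vectors could look roughly like the sketch below. This was never implemented; the `evaluate` function (build an index with the selected fields and return a ranking quality score) is hypothetical.

```python
import random

def hill_climb(fields, evaluate, iterations=100):
    # Start with every field included (1 = indexed, 0 = blacklisted).
    current = {f: 1 for f in fields}
    best = evaluate(current)
    for _ in range(iterations):
        # Flip a single randomly chosen field and keep the change only if
        # the ranking quality improves.
        candidate = dict(current)
        flipped = random.choice(fields)
        candidate[flipped] ^= 1
        score = evaluate(candidate)
        if score > best:
            current, best = candidate, score
    return current, best
```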
Soon after we began trying to implement this feature using a locally run
version of Nordlys, we encountered some issues, the most notable being that
our machines were unable to index the full DBPedia set in a reasonable amount
of time, using a reasonable amount of resources. When we encountered this
issue, we decided that the best option was to use a subset of the DBPedia
dataset.
The subset we settled on consists of the entities that have relevance scores
assigned to them for at least one query. Our assessment then only considers
these results for any given query.
This has the additional benefit that the relevance judgements (both the human
assessments and the BM25 scores) need not be computed: simply parsing the files
provided by Nordlys is enough to implement any kind of field-selection
assessment.
Unfortunately, it turned out that we also did not have the resources to
implement a hill-climbing algorithm: with only two programmers the task was
slightly too much work. Instead, we took a different approach and statically
analysed the importance of all fields. The measure that we use takes the
following form:

$$\text{score}(f) = \sum_{q \in Q} \sum_{d \in D} \text{relevance}(q, d) \cdot \log\left(1 + \frac{\text{tf}(q, d, f)}{|f|}\right)$$

Here *relevance*(*q*, *d*) is the BM25 relevance stored by Nordlys, *D* is the
set of documents, *Q* the set of queries, *tf* the function that counts the
number of times any of the query terms occurs in field *f* of document *d*,
and |*f*| the size of the field.
The formula assumes that relevance behaves more or less linearly. The logarithm
is used because repeated occurrences of the same term are not as important as
the first occurrence.
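
As an illustration of the measure, this is how the score accumulates for a single field over a few (query, document) pairs; the numbers below are made up:

```python
from math import log

# Hypothetical observations for one field:
# (BM25 relevance, number of matching values, total number of values).
observations = [(12.3, 2, 4), (8.1, 0, 3), (4.7, 1, 1)]

score = sum(relevance * log(1 + matches / nvalues)
            for relevance, matches, nvalues in observations)
print(score)  # this field's total importance score
```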
## Code
We use three Python programs that:
1. Get DBPedia entities from the Nordlys API (`scrape.py`)
2. For each entry in a BM25 run, list the DBPedia ID, the relevance score and,
for each field in the entity, how many of its values match at least one of
the query terms; this is the information BM25 uses to compute the relevance.
This file is `run.py`.
3. Use that information to investigate how important each field is
(`check.py`).
We will now discuss the implementation of each of these files.
### `scrape.py`
- In this file we read lines from `stdin`. These lines are supposed to come
  from the `qrels-v2.txt` relevance judgements (see the Usage section below).
  That way, we only download DBPedia entities that we actually need.
```python
import fileinput

if __name__ == '__main__':
    for line in fileinput.input():
        scrape(line)
```
- We split each input line. Only the DBPedia ID is relevant.
```python
def scrape(line):
    index, query, dbpediaid, relevance = line.split('\t')
    try:
        get(dbpediaid)
    except Exception as e:
        # Log failed downloads so they can be retried later.
        with open(ERRORFILE, 'a') as f:
            f.write(dbpediaid + '\t' + str(e) + '\n')
```
- We store the entities one per file in the original JSON format. We use the ID
  as the filename, but have to avoid special characters in it, so we URL-encode
  the ID (which also gets rid of slashes).
  Normally, Nordlys refuses queries from a Python user-agent, so we set the
  user-agent to `Radboud University`, which Nordlys happily accepts. We did not
  run into any rate limiting.
```python
import os
from urllib.parse import quote_plus
from urllib.request import Request, urlopen

def get(dbpediaid):
    outfile = os.path.join(OUTDIR, quote_plus(dbpediaid) + '.json')
    if os.path.isfile(outfile):
        return  # already downloaded
    url = 'http://api.nordlys.cc/ec/lookup_id/{}'.format(quote_plus(dbpediaid))
    print(url)
    result = urlopen(Request(url,
                             headers={'User-Agent': 'Radboud University'})).read()
    with open(outfile, 'w') as f:
        f.write(result.decode(encoding='UTF-8'))
```
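
For example, an ID of the form `<dbpedia:Audi_A4>` (the exact shape depends on the input file) turns into a safe filename:

```python
from urllib.parse import quote_plus

quote_plus('<dbpedia:Audi_A4>')  # '%3Cdbpedia%3AAudi_A4%3E'
quote_plus('dbpedia:AC/DC')      # 'dbpedia%3AAC%2FDC', no slash left
```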
### `run.py`
- `queries_stopped.json` lists all query terms. We load this file once, then
process a run from `stdin`.
```python
import fileinput
import json

if __name__ == '__main__':
    with open('queries_stopped.json') as f:
        queries = json.load(f)
    for line in fileinput.input():
        run(queries, line)
```
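
The code assumes that `queries_stopped.json` maps query IDs to stopword-removed query text, roughly as follows (the ID and text here are only illustrative):

```python
queries = {"query-001": "female rock singers"}
queries["query-001"].split()  # ['female', 'rock', 'singers']
```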
- We split each line of the run file. For each field we check (1) how many
  values there are and (2) how many of them match a query term.
```python
def run(queries, line):
    query, _, dbpediaid, _, relevance, method = line.split('\t')
    terms = queries[query].split()
    try:
        result = get(dbpediaid)
        if result is None:
            return
        for field, values in result.items():
            # Count how many of the field's values contain a query term.
            matches = 0
            for value in values:
                if match(value, terms):
                    matches += 1
            # print() already appends the newline.
            print('{}\t{}\t{}\t{}\t{}\t{}'.format(
                query, dbpediaid, relevance, field, len(values), matches))
    except Exception as e:
        print(dbpediaid)
        print(e)
        with open(ERRORFILE, 'a') as f:
            f.write(dbpediaid + '\t' + str(e) + '\n')
```
- For simplicity, we do not use lemmatisation or synonym resolution here, which
could be an improvement in a next version.
```python
def match(value, terms):
    for v in value.split():
        if v in terms:
            return True
    return False
```
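
Note that matching is an exact, case-sensitive token comparison. For example, with illustrative terms:

```python
terms = ['rock', 'singers']
match('punk rock band', terms)  # True: 'rock' is one of the query terms
match('Rock music', terms)      # False: 'Rock' != 'rock', no lowercasing
```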
- `get` simply gets the file that we stored with `scrape.py`:
```python
def get(dbpediaid):
    outfile = os.path.join(DATADIR, quote_plus(dbpediaid) + '.json')
    if not os.path.isfile(outfile):
        return None
    with open(outfile) as f:
        return json.load(f)
```
### `check.py`
- We keep a dictionary of scores per field and simply compute our weight score:
```python
import fileinput
from math import log

if __name__ == '__main__':
    scores = dict()
    for line in fileinput.input():
        query, dbpediaid, relevance, field, nvalues, nmatches = line.split('\t')
        if field not in scores:
            scores[field] = 0
        scores[field] += float(relevance) * log(1 + int(nmatches)/int(nvalues))
```
- Then we print all scores:
```python
for field, score in scores.items():
    print('{}\t{}'.format(field, score))
```
### Usage
All this allows for a fairly simple workflow:
```bash
mkdir data
./scrape.py < qrels-v2.txt
./run.py < bm25.run > fields.txt
./check.py < fields.txt | sort -k2 -n > scores.txt
```
This assumes that you have the following files from Nordlys:
- [`qrels-v2.txt`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/qrels-v2.txt) (human relevance judgements, also the entity list for `scrape.py`)
- [`bm25.run`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/runs/bm25.run) (BM25 run with relevance scores)
- [`queries_stopped.json`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/queries_stopped.json) (stopped query terms)
The system is agnostic with regard to the ranking function (BM25 or any other
method).
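
The intermediate `fields.txt` contains one tab-separated line per (query, entity, field) combination as written by `run.py`, and `scores.txt` one line per field as written by `check.py`. For illustration (the `fields.txt` values are made up; the `scores.txt` line is taken from the results below):

```
# fields.txt: query, entity, relevance, field, #values, #matches
query-001	<dbpedia:Some_Entity>	11.92	<rdfs:label>	1	1

# scores.txt: field, accumulated score
<rdfs:label>	105358.89
```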
## Intermediate Results
These are the thirty most important fields according to our measure, applied to
the BM25 relevance scores (in ascending order of score):
| Field | Score | Used by Nordlys |
|------------------------------|----------:|:---------------:|
| `<dbp:imageFlag>` | 2205.50 | ![][n] |
| `<dbp:office>` | 2246.90 | ![][n] |
| `<dbp:pushpinMapCaption>` | 2357.07 | ![][n] |
| `<dbp:description>` | 2357.35 | ![][n] |
| `<dbp:placeOfBirth>` | 2384.14 | ![][n] |
| `<dbp:fastTime>` | 2440.73 | ![][n] |
| `<dbp:imageMap>` | 2485.96 | ![][n] |
| `<dbp:writer>` | 2689.86 | ![][n] |
| `<dbp:alt>` | 2691.94 | ![][n] |
| `<foaf:givenName>` | 2694.41 | ![][y] |
| `<dbp:poleTime>` | 2698.75 | ![][n] |
| `<dbp:country>` | 2836.44 | ![][n] |
| `<dbp:type>` | 3248.58 | ![][n] |
| `<dbo:office>` | 3425.58 | ![][n] |
| `<dbp:location>` | 3430.20 | ![][n] |
| `<dbp:officialName>` | 4316.34 | ![][y] |
| `<dbp:quote>` | 4470.38 | ![][n] |
| `<dbp:imageCaption>` | 4480.06 | ![][n] |
| `<dbp:producer>` | 4704.52 | ![][n] |
| `<dbp:mapCaption>` | 8040.36 | ![][n] |
| `<dbp:title>` | 10999.72 | ![][n] |
| `<dbp:shortDescription>` | 22065.46 | ![][n] |
| `<dc:description>` | 23442.34 | ![][n] |
| `<dbp:caption>` | 24697.75 | ![][n] |
| `<dbp:name>` | 25500.42 | ![][y] |
| `<foaf:name>` | 32860.37 | ![][y] |
| `<dbo:wikiPageWikiLinkText>` | 86218.71 | ![][y] |
| `<rdfs:label>` | 105358.89 | ![][y] |
| `<rdfs:comment>` | 514446.08 | ![][n] |
| `<dbo:abstract>` | 581355.57 | ![][n] |
We see that many of the relevant fields are actually [not used by
Nordlys](https://iai-group.github.io/DBpedia-Entity/index_details.html).
However, this is not yet an indication that these fields should be added to the
index. After all, adding more fields means more computation time to build the
index and to retrieve search results.
In fact, we expect that many of the unused fields overlap considerably with
fields that *are* indexed. For example, the `<dbo:abstract>` field will
probably match because the title is repeated in the abstract.
We can perform the same analysis on the human assessments. This gives a rather
different list of fields:
| Field | Score | Rank for BM25 | Used by Nordlys |
|-------------------------------|---------:|--------------:|:---------------:|
| `<dbp:pushpinMapCaption>` | 133.77 | 28 | ![][n] |
| `<dbp:foundation>` | 136.32 | 266 | ![][n] |
| `<dbp:imageCaption>` | 139.85 | 13 | ![][n] |
| `<dbp:bridgeName>` | 164.91 | 49 | ![][n] |
| `<dbp:imageFlag>` | 166.35 | 30 | ![][n] |
| `<dbp:mapCaption>` | 170.93 | 11 | ![][n] |
| `<dbo:foundingYear>` | 173.92 | 299 | ![][n] |
| `<dbp:producer>` | 186.37 | 12 | ![][n] |
| `<dbp:ground>` | 297.25 | 802 | ![][n] |
| `<dbp:title>` | 328.93 | 10 | ![][n] |
| `<dc:description>` | 332.05 | 8 | ![][n] |
| `<dbp:shortDescription>` | 334.79 | 9 | ![][n] |
| `<dbp:caption>` | 648.73 | 7 | ![][n] |
| `<foaf:givenName>` | 1436.74 | 21 | ![][y] |
| `<dbp:name>` | 1961.98 | 6 | ![][y] |
| `<foaf:name>` | 2086.67 | 5 | ![][y] |
| `<dbo:wikiPageWikiLinkText>` | 2897.51 | 4 | ![][y] |
| `<rdfs:label>` | 3483.06 | 3 | ![][y] |
| `<rdfs:comment>` | 12323.46 | 2 | ![][n] |
| `<dbo:abstract>` | 13002.74 | 1 | ![][n] |
Based on this, one may want to try adding fields like `<dbp:caption>` to the
index.
Conversely, this information can also be used to improve the relevance measure.
Apparently, `<dbp:ground>`, `<dbo:foundingYear>` and `<dbp:foundation>` are
quite relevant according to human assessors, but not at all according to BM25.
[y]: http://i.stack.imgur.com/iro5J.png
[n]: http://i.stack.imgur.com/asAya.png