# Implementation

## Feasibility
The Plan mentions the following:
> We consider a vector space where every possible search field represents a binary parameter.
> A vector has `1` for the parameter if and only if it is included in the search (excluded from the blacklist).
> We will then run a hill-climbing algorithm through this higher-dimensional vector space
> in order to find a vector (an index setting) for which the ranking results are best.
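
For concreteness, the loop that the Plan describes could be sketched as follows. This is only an illustration: `evaluate` is a hypothetical function that would rebuild the index with the selected fields, run the benchmark queries and score the resulting ranking.

```python
import random

def hill_climb(fields, evaluate, iterations=100):
    # Start from an index setting in which every field is included.
    current = {field: 1 for field in fields}
    best = evaluate(current)
    for _ in range(iterations):
        # Flip one bit: include or exclude a single field.
        candidate = dict(current)
        flipped = random.choice(fields)
        candidate[flipped] = 1 - candidate[flipped]
        score = evaluate(candidate)
        # Move to the neighbour only if the ranking improves.
        if score > best:
            current, best = candidate, score
    return current, best
```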

Soon after we began trying to implement this feature using a locally run version of Nordlys, we encountered some issues.
The most notable was that our machines were unable to index the full DBpedia set in a reasonable amount of time, using a reasonable amount of resources.
When we encountered this issue, we decided that the best option was to use a subset of the DBpedia dataset.

The subset that we settled on consists of the entities that have relevance scores assigned to them for at least one query.
We then only considered the results for a given query in our assessment.

This subset has the added benefit that the relevance (both the human assessment and the BM25 score) is precomputed.
This means that simply parsing the files provided by Nordlys is enough to implement any kind of field-selection assessment.

Unfortunately, it turned out that hill-climbing was also out of the scope of the assignment.
With only two programmers, neither of whom has much experience implementing such algorithms, the task was slightly too much work.
Instead, we decided to take a different approach and statically analyse the importance of all fields.
The measure that we use takes the form:

$$\mathrm{score}(f) = \sum_{q \in Q} \sum_{d \in D} \mathrm{relevance}(d, q) \cdot \log\left(1 + \frac{\mathit{tf}(f, d, q)}{|f|}\right)$$

Here `relevance` is the BM25 relevance score stored by Nordlys, `D` is the set of documents, `Q` the set of queries, `tf` the function that counts how many of the field's values contain at least one of the query terms, and `|f|` the number of values in the field (both as computed by `run.py` below).
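
To make the measure concrete with invented numbers: a document with BM25 relevance 2 for some query, in which 3 of a field's 6 values match a query term, contributes

$$2 \cdot \log\left(1 + \frac{3}{6}\right) \approx 0.81$$

to that field's score (natural logarithm, as in `check.py` below); these contributions are summed over all document-query pairs.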

## Code

We use three Python programs that:

1. Get DBPedia entities from the Nordlys API (`scrape.py`)
2. For each entry in a BM25 run, list the DBpedia ID, the relevance score
   and, for each field in the entity, how many of the values match at least
   one of the query terms (`run.py`). This information is what BM25 itself
   uses to compute the relevance.
3. Use that information to investigate how important each field is
   (`check.py`).

We will now discuss the implementation of each of these files.

### `scrape.py`

- In this file we read lines from `stdin`. These lines are supposed to come
  from the relevance judgements file (`qrels-v2.txt`). That way, we only
  download DBpedia entities that we actually need.

```python
import fileinput

if __name__ == '__main__':
    # Read input lines from stdin (or from files given as arguments).
    for line in fileinput.input():
        scrape(line)
```

- We split each line of the qrels file. Only the DBpedia ID is relevant.

```python
def scrape(line):
    # A qrels line is: query ID, iteration, DBpedia ID, relevance.
    index, query, dbpediaid, relevance = line.split('\t')
    try:
        get(dbpediaid)
    except Exception as e:
        # ERRORFILE is a module-level constant naming an error log.
        # str(e) is needed: concatenating the exception itself would raise.
        with open(ERRORFILE, 'a') as f:
            f.write(dbpediaid + '\t' + str(e) + '\n')
```

- We store the entities one per file, in the original JSON format. We use the
  ID as the filename, but have to avoid special characters in it, so we
  URL-encode the ID (which also gets rid of slashes).

  Normally, Nordlys refuses queries from a Python user-agent, so we set the
  user-agent to `Radboud University`, which Nordlys happily accepts. We did
  not hit any rate limiting.

```python
# Python 3 imports used by this excerpt.
import os
from urllib.parse import quote_plus
from urllib.request import Request, urlopen

def get(dbpediaid):
    # OUTDIR is a module-level constant naming the output directory
    # (the `data` directory created in the Usage section below).
    outfile = os.path.join(OUTDIR, quote_plus(dbpediaid) + '.json')
    if os.path.isfile(outfile):
        return
    url = 'http://api.nordlys.cc/ec/lookup_id/{}'.format(quote_plus(dbpediaid))
    print(url)
    result = urlopen(Request(url,
        headers={'User-Agent': 'Radboud University'})).read()
    with open(outfile, 'w') as f:
        f.write(result.decode(encoding='UTF-8'))
```

### `run.py`

- `queries_stopped.json` lists all query terms. We load this file once, then
  process a run from `stdin`. (The rough shape of this file is sketched after
  the code below.)

```python
import fileinput
import json

if __name__ == '__main__':
    with open('queries_stopped.json') as f:
        queries = json.load(f)

    # Process the run from stdin; the queries file is no longer needed here.
    for line in fileinput.input():
        run(queries, line)
```
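
For reference, `queries_stopped.json` maps query IDs to stopped query strings. Loaded into Python it is roughly of this shape (the IDs and wording here are illustrative, not taken from the actual file):

```python
queries = {
    'INEX_LD-20120121': 'vietnam war facts',         # hypothetical entry
    'SemSearch_ES-101': 'american embassy nairobi',  # hypothetical entry
}
# run() then splits such a string into individual terms:
# queries['INEX_LD-20120121'].split() == ['vietnam', 'war', 'facts']
```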

- We split each line of the run file. For each field we check (1) how many
  values it has and (2) how many of those values match a query term.

```python
def run(queries, line):
    # A run line is: query ID, Q0, DBpedia ID, rank, score, method.
    query, _, dbpediaid, _, relevance, method = line.split('\t')
    terms = queries[query].split()
    try:
        result = get(dbpediaid)
        if result is None:
            return
        for field, values in result.items():
            matches = 0
            for value in values:
                if match(value, terms):
                    matches += 1
            # print() appends its own newline; a trailing '\n' in the format
            # string would emit blank lines that break parsing in check.py.
            print('{}\t{}\t{}\t{}\t{}\t{}'.format(
                query, dbpediaid, relevance, field, len(values), matches))
    except Exception as e:
        print(dbpediaid)
        print(e)
        with open(ERRORFILE, 'a') as f:
            f.write(dbpediaid + '\t' + str(e) + '\n')
```

- For simplicity, we use neither lemmatisation nor synonym resolution here;
  adding them could be an improvement in a next version. A case-insensitive
  variant is sketched after the code below.

```python
def match(value, terms):
    for v in value.split():
        if v in terms:
            return True
    return False
```
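
Note also that `match` compares tokens verbatim, so a capitalised field value will not match a lowercase query term. A possible refinement (a sketch, not part of the current code) is to normalise case on both sides:

```python
def match_case_insensitive(value, terms):
    # Lower-case both the field value's tokens and the query terms.
    lowered = {term.lower() for term in terms}
    return any(token.lower() in lowered for token in value.split())
```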

- `get` simply gets the file that we stored with `scrape.py`:

```python
def get(dbpediaid):
    # DATADIR is a module-level constant pointing at the directory where
    # scrape.py stored the entities (`data` in the Usage section below).
    outfile = os.path.join(DATADIR, quote_plus(dbpediaid) + '.json')
    if not os.path.isfile(outfile):
        return None
    with open(outfile) as f:
        return json.load(f)
```

### `check.py`

- We keep a dictionary of scores per field and simply accumulate our field-relevance measure:

```python
import fileinput
from math import log

if __name__ == '__main__':
    scores = dict()
    for line in fileinput.input():
        query, dbpediaid, relevance, field, nvalues, nmatches = line.split('\t')
        if field not in scores:
            scores[field] = 0
        # Accumulate the field-relevance measure defined in Feasibility.
        scores[field] += float(relevance) * log(1 + int(nmatches)/int(nvalues))
```

- Then we print all scores:

```python
    for field, score in scores.items():
        print('{}\t{}'.format(field, score))
```

### Usage

All this allows for a fairly simple workflow:

```bash
mkdir data
./scrape.py < qrels-v2.txt
./run.py < bm25.run > fields.txt
./check.py < fields.txt | sort -k2 -n > scores.txt
```

This assumes that you have the following files from Nordlys:

- [`qrels-v2.txt`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/qrels-v2.txt) (relevance judgements; we use it as the entity list)
- [`bm25.run`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/runs/bm25.run) (BM25 run with relevance scores)
- [`queries_stopped.json`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/queries_stopped.json) (query terms)

The system is agnostic with regard to the ranking function (BM25 or any other
method).
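
Because `sort -k2 -n` sorts ascending, the most important fields end up at the bottom of `scores.txt`. Its final lines correspond to the top entries of the table in the next section, roughly (scores shown rounded):

```
<rdfs:comment>	514446.08
<dbo:abstract>	581355.57
```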

## Intermediate Results
These are the thirty most important fields as found by our measure:

| Field                        | Score     | Used by Nordlys |
|------------------------------|----------:|:---------------:|
| `<dbp:imageFlag>`            |   2205.50 | ![][n]          |
| `<dbp:office>`               |   2246.90 | ![][n]          |
| `<dbp:pushpinMapCaption>`    |   2357.07 | ![][n]          |
| `<dbp:description>`          |   2357.35 | ![][n]          |
| `<dbp:placeOfBirth>`         |   2384.14 | ![][n]          |
| `<dbp:fastTime>`             |   2440.73 | ![][n]          |
| `<dbp:imageMap>`             |   2485.96 | ![][n]          |
| `<dbp:writer>`               |   2689.86 | ![][n]          |
| `<dbp:alt>`                  |   2691.94 | ![][n]          |
| `<foaf:givenName>`           |   2694.41 | ![][y]          |
| `<dbp:poleTime>`             |   2698.75 | ![][n]          |
| `<dbp:country>`              |   2836.44 | ![][n]          |
| `<dbp:type>`                 |   3248.58 | ![][n]          |
| `<dbo:office>`               |   3425.58 | ![][n]          |
| `<dbp:location>`             |   3430.20 | ![][n]          |
| `<dbp:officialName>`         |   4316.34 | ![][y]          |
| `<dbp:quote>`                |   4470.38 | ![][n]          |
| `<dbp:imageCaption>`         |   4480.06 | ![][n]          |
| `<dbp:producer>`             |   4704.52 | ![][n]          |
| `<dbp:mapCaption>`           |   8040.36 | ![][n]          |
| `<dbp:title>`                |  10999.72 | ![][n]          |
| `<dbp:shortDescription>`     |  22065.46 | ![][n]          |
| `<dc:description>`           |  23442.34 | ![][n]          |
| `<dbp:caption>`              |  24697.75 | ![][n]          |
| `<dbp:name>`                 |  25500.42 | ![][y]          |
| `<foaf:name>`                |  32860.37 | ![][y]          |
| `<dbo:wikiPageWikiLinkText>` |  86218.71 | ![][y]          |
| `<rdfs:label>`               | 105358.89 | ![][y]          |
| `<rdfs:comment>`             | 514446.08 | ![][n]          |
| `<dbo:abstract>`             | 581355.57 | ![][n]          |

We see that many of the relevant fields are actually [not used by
Nordlys](https://iai-group.github.io/DBpedia-Entity/index_details.html).
However, this is not yet an indication that these fields should be added to the
index. After all, adding more fields means more computation time to build the
index and to retrieve search results.

In fact, we expect that many of the unused fields largely overlap with fields
that *are* indexed. For example, the `<dbo:abstract>` field will probably
match because the title is repeated in the abstract.

[y]: http://i.stack.imgur.com/iro5J.png
[n]: http://i.stack.imgur.com/asAya.png