
# Plan

## The Idea
The DBpedia-Entity repository provides base rankings for a selection of retrieval algorithms on multiple sets of queries.
These base rankings [were obtained](https://iai-group.github.io/DBpedia-Entity/index_details.html) by running the ranking algorithms on the dataset,
	where the dataset was reduced to contain only a subset of all possible fields.
Some of these fields also have a special function:

> | Field | Description | Predicates | Notes |
> | --- | --- | --- | --- |
> | Names | Names of the entity | `<foaf:name>`, `<dbp:name>`, `<foaf:givenName>`, `<foaf:surname>`, `<dbp:officialName>`, `<dbp:fullname>`, `<dbp:nativeName>`, `<dbp:birthName>`, `<dbo:birthName>`, `<dbp:nickname>`, `<dbp:showName>`, `<dbp:shipName>`, `<dbp:clubname>`, `<dbp:unitName>`, `<dbp:otherName>`, `<dbo:formerName>`, `<dbp:birthname>`, `<dbp:alternativeNames>`, `<dbp:otherNames>`, `<dbp:names>`, `<rdfs:label>` | |
> | Categories | Entity types | `<dcterms:subject>` | |
> | Similar entity names | Entity  name variants | `!<dbo:wikiPageRedirects>`, `!<dbo:wikiPageDisambiguates>`, `<dbo:wikiPageWikiLinkText>` | `!` denotes reverse direction (i.e. `<o, p, s>`) |
> | Attributes | Literal attributes of entity | All `<s, p, o>`, where *"o"* is a literal and *"p"* is not in *Names*, *Categories*, *Similar entity names*, and blacklist predicates. For each `<s, p, o>` triple, if `p matches <dbp:.*>` both *p* and *o* are stored (i.e. *"p o"* is indexed). | |
> | Related entity names | URI relations of entity|  Similar to *Attributes* field, but *"o"* should be a URI. | |

> ### Index B
>  - Anchor texts (i.e. contents of `<dbo:wikiPageWikiLinkText>` predicate) are added to both "similar entity names" and "attributes" fields.
>  - Entity URIs are resolved differently for the "related entity names" field. Names for related entities are extracted in the same way as it is done for "names" field (see predicates for "names" in the above table), but only one arbitrary name is used for each related entity.
>  - Category URIs are resolved using `category_labels_en.ttl` file
>  - Predicate URIs are resolved using `infobox_property_definitions_en.ttl` file. If a name for a predicate is not defined, a predicate is omitted.

However, not all of the remaining fields are used for ranking;
	some are simply ignored.
The ignored fields are listed in the [Nordlys repository](https://github.com/iai-group/nordlys/blob/master/data/config/index_dbpedia_2015_10.config.json),
	under the `blacklist` key.
We could not find out how the Nordlys group decided which fields to put on the blacklist;
	it may be that the selection is based on educated, but subjective, guesses.

It is important to base this blacklist on actual observations,
	because this may improve the results of the retrieval function.
Hence, we want to find a better, objectively produced, reproducible blacklist.

### Our Approach
We consider a vector space in which every possible search field is a binary parameter.
A vector has `1` for a parameter iff the corresponding field is included in the search (that is, excluded from the blacklist).
We will then run a hill-climbing algorithm through this high-dimensional vector space
	in order to find a vector (an index setting) for which the ranking results are best.
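
As a rough illustration of this search, the sketch below runs steepest-ascent hill climbing over binary field vectors. It is only a sketch under assumptions: the field names, the `toy_score` function and its weights are hypothetical stand-ins. In the actual setup the score would come from re-ranking with the given field selection and measuring the result, which is not shown here.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def hill_climb(
    n_fields: int,
    score: Callable[[Sequence[int]], float],
    start: Optional[Sequence[int]] = None,
    max_steps: int = 1000,
) -> Tuple[List[int], float]:
    """Steepest-ascent hill climbing over binary vectors.

    A neighbour differs from the current vector in exactly one position,
    i.e. one field is added to or removed from the blacklist.
    """
    # Default start: every field included (empty blacklist).
    current = list(start) if start is not None else [1] * n_fields
    current_score = score(current)
    for _ in range(max_steps):
        best_neighbour, best_score = None, current_score
        for i in range(n_fields):
            neighbour = current.copy()
            neighbour[i] ^= 1  # flip inclusion of field i
            s = score(neighbour)
            if s > best_score:
                best_neighbour, best_score = neighbour, s
        if best_neighbour is None:
            break  # local optimum: no single flip improves the score
        current, current_score = best_neighbour, best_score
    return current, current_score


# Hypothetical fields and weights, for illustration only.
FIELDS = ["field_a", "field_b", "field_c", "field_d"]
WEIGHTS = [3, -2, 1, -1]


def toy_score(vector: Sequence[int]) -> float:
    """Stand-in for a ranking-quality measurement of one index setting."""
    return sum(w for w, bit in zip(WEIGHTS, vector) if bit)


best_vector, best_value = hill_climb(len(FIELDS), toy_score)
blacklist = [name for name, bit in zip(FIELDS, best_vector) if bit == 0]
print(best_vector, best_value, blacklist)
```

Hill climbing only guarantees a local optimum, so the result may depend on the starting vector; restarting from several starting points is one common way to reduce that dependence.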

We measure the quality of the ranking using Normalized Discounted Cumulative Gain (NDCG).
This is the same method Nordlys [used](http://nordlys.readthedocs.io/en/latest/er.html?highlight=NDCG#benchmark-results) for benchmarking,
	which allows us to verify our first results.
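
For reference, a minimal sketch of NDCG at a cutoff `k` is given below. NDCG has several variants; this one assumes linear gain and a `log2(rank + 1)` discount, which is an assumption on our part and not necessarily the exact variant used for the Nordlys benchmarks. The function names and the default cutoff are illustrative.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k results (ranks start at 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(
    ranked_relevances: Sequence[float],
    judged_relevances: Sequence[float],
    k: int = 10,
) -> float:
    """NDCG@k: DCG of the ranking divided by the DCG of an ideal ranking.

    ranked_relevances: relevance grades of the returned entities, in rank order.
    judged_relevances: all relevance grades judged for the query, in any order.
    """
    ideal = dcg(sorted(judged_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant entity exists for this query
    return dcg(ranked_relevances, k) / ideal
```

A perfect ordering scores 1, and any ordering that places less relevant entities above more relevant ones scores lower.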

We will start with a single ranking function (the fastest, or the one we can get working most easily),
	but might extend the experiment to more ranking functions.
At first sight, that extension does not seem particularly interesting;
	it would be 'more of the same'.

## Nordlys
Nordlys is a toolkit for entity-oriented and semantic search. It currently supports four entity-oriented tasks, which could be useful for our project:
- `Entity cataloging`
- `Entity retrieval`: returns a ranked list of entities in response to a query
- `Entity linking in queries`: identifies entities in a query and links them to the corresponding entry in the knowledge base
- `Target type identification`: detects the target types (or categories) of a query

The Nordlys toolkit was used to create the results described above, so it provides us with the means to reproduce them.
In addition, Nordlys provides a Python interface that can be used to implement the hill-climbing algorithm.

The data underlying these results is also bundled with the Nordlys Python package, and has already been indexed.
This allows us to use the Python package without having to convert or index the data ourselves.