# Plan

## The Idea
The DBpedia-Entity repository provides base rankings for a selection of retrieval algorithms over multiple sets of queries.
These base rankings were obtained by running the algorithms on a version of the dataset reduced to a subset of all possible fields.
In particular, the fields used by the base rankings were:

| Field | Description | Predicates | Notes |
| --- | --- | --- | --- |
| Names | Names of the entity | `<foaf:name>`, `<dbp:name>`, `<foaf:givenName>`, `<foaf:surname>`, `<dbp:officialName>`, `<dbp:fullname>`, `<dbp:nativeName>`, `<dbp:birthName>`, `<dbo:birthName>`, `<dbp:nickname>`, `<dbp:showName>`, `<dbp:shipName>`, `<dbp:clubname>`, `<dbp:unitName>`, `<dbp:otherName>`, `<dbo:formerName>`, `<dbp:birthname>`, `<dbp:alternativeNames>`, `<dbp:otherNames>`, `<dbp:names>`, `<rdfs:label>` | |
| Categories | Entity types | `<dcterms:subject>` | |
| Similar entity names | Entity name variants | `!<dbo:wikiPageRedirects>`, `!<dbo:wikiPageDisambiguates>`, `<dbo:wikiPageWikiLinkText>` | `!` denotes reverse direction (i.e. `<o, p, s>`) |
| Attributes | Literal attributes of entity | All `<s, p, o>` triples where *"o"* is a literal and *"p"* is not among the *Names*, *Categories*, *Similar entity names*, or blacklist predicates. For each `<s, p, o>` triple, if *p* matches `<dbp:.*>`, both *p* and *o* are stored (i.e. *"p o"* is indexed). | |
| Related entity names | URI relations of entity | Similar to the *Attributes* field, but *"o"* should be a URI. | |

The fields are built from the following files of the 2015-10 dump:
- `anchor_text_en.ttl`
- `article_categories_en.ttl`
- `disambiguations_en.ttl`
- `infobox_properties_en.ttl`
- `instance_types_transitive_en.ttl`
- `labels_en.ttl`
- `long_abstracts_en.ttl`
- `mappingbased_literals_en.ttl`
- `mappingbased_objects_en.ttl`
- `page_links_en.ttl`
- `persondata_en.ttl`
- `short_abstracts_en.ttl`
- `transitive_redirects_en.ttl`
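
The field rules in the table above can be sketched in Python as follows. This is only an illustration, not the code that produced the base rankings: the function name `assign_field`, the string representation of triples, the abbreviated *Names* set, and the empty blacklist are all assumptions made for the example. It also assumes that *Related entity names* follows the same `<dbp:.*>` rule as *Attributes*, which the table does not state explicitly.

```python
import re

# Predicate sets taken from the table; the Names set is abbreviated here.
NAME_PREDICATES = {"<foaf:name>", "<dbp:name>", "<rdfs:label>"}
CATEGORY_PREDICATES = {"<dcterms:subject>"}
SIMILAR_FORWARD = {"<dbo:wikiPageWikiLinkText>"}
SIMILAR_REVERSE = {"<dbo:wikiPageRedirects>", "<dbo:wikiPageDisambiguates>"}
BLACKLIST = frozenset()  # assumption: the blacklist is not listed in this document

DBP_PATTERN = re.compile(r"<dbp:.*>")


def is_uri(term):
    """Treat terms written in angle brackets as URIs, everything else as literals."""
    return term.startswith("<") and term.endswith(">")


def assign_field(s, p, o):
    """Return (entity, field, indexed_text) for one triple, or None if it is skipped."""
    if p in NAME_PREDICATES:
        return s, "names", o
    if p in CATEGORY_PREDICATES:
        return s, "categories", o
    if p in SIMILAR_REVERSE:
        # Reverse direction: the object is the entity, the subject is the name variant.
        return o, "similar_entity_names", s
    if p in SIMILAR_FORWARD:
        return s, "similar_entity_names", o
    if p in BLACKLIST:
        return None
    # For <dbp:...> predicates both p and o are indexed; otherwise only o.
    text = f"{p} {o}" if DBP_PATTERN.fullmatch(p) else o
    field = "related_entity_names" if is_uri(o) else "attributes"
    return s, field, text
```

A real pipeline would additionally have to parse the `.ttl` files and resolve object URIs to readable names; both steps are left out of this sketch.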

Our hypothesis is that the fields are not all equally important.
Our idea is therefore to use a hill-climbing algorithm to determine which combination of fields (and possibly field weights) produces the best retrieval results.
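
To make the idea concrete, here is a minimal sketch of a stochastic hill climber over field weights. It is one possible variant, not a settled design: the names `hill_climb`, `toy_evaluate`, and `TARGET` are hypothetical, and `toy_evaluate` is a made-up stand-in objective. In the actual experiment the evaluation function would run retrieval with the candidate weights and return an effectiveness score for the resulting ranking.

```python
import random

FIELDS = [
    "names",
    "categories",
    "similar_entity_names",
    "attributes",
    "related_entity_names",
]

# Made-up target used only so the sketch can run; it is not an experimental result.
TARGET = {
    "names": 0.4,
    "categories": 0.1,
    "similar_entity_names": 0.2,
    "attributes": 0.2,
    "related_entity_names": 0.1,
}


def toy_evaluate(weights):
    """Stand-in objective: higher is better, maximal when weights equal TARGET."""
    return -sum((weights[f] - TARGET[f]) ** 2 for f in FIELDS)


def hill_climb(evaluate, fields, step=0.1, max_iters=200, seed=0):
    """Stochastic hill climbing: perturb one weight, keep the change if the score improves."""
    rng = random.Random(seed)
    weights = {f: 1.0 / len(fields) for f in fields}  # start from uniform weights
    best = evaluate(weights)
    for _ in range(max_iters):
        field = rng.choice(fields)
        delta = rng.choice([-step, step])
        candidate = dict(weights)
        candidate[field] = max(0.0, candidate[field] + delta)  # weights stay non-negative
        total = sum(candidate.values())
        if total == 0:
            continue
        # Renormalise so the weights always sum to 1.
        candidate = {f: w / total for f, w in candidate.items()}
        score = evaluate(candidate)
        if score > best:
            weights, best = candidate, score
    return weights, best


if __name__ == "__main__":
    final_weights, final_score = hill_climb(toy_evaluate, FIELDS)
    print(final_weights, final_score)
```

A weight that reaches zero corresponds to dropping that field entirely, so the same search covers both field selection and field weighting. Plain hill climbing can get stuck in local optima, so random restarts may be worth considering.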

## Nordlys
The Nordlys toolkit was used to create the results described above, so it provides us with the means to reproduce them.
In addition, Nordlys provides a Python interface that can be used to implement the hill-climbing algorithm.

The data behind these results is also bundled with the Nordlys Python package and has already been indexed.
This allows us to use the Python package without having to convert or index the data ourselves.