# Plan

## The Idea
The DBpedia-Entity repository provides base rankings for a select number of retrieval algorithms over multiple query sets.
These base rankings were obtained by running the algorithms on a version of the dataset reduced to a subset of all possible fields.
In particular, the fields used by the base rankings were:

| Field | Description | Predicates | Notes |
| --- | --- | --- | --- |
| Names | Names of the entity | `<foaf:name>`, `<dbp:name>`, `<foaf:givenName>`, `<foaf:surname>`, `<dbp:officialName>`, `<dbp:fullname>`, `<dbp:nativeName>`, `<dbp:birthName>`, `<dbo:birthName>`, `<dbp:nickname>`, `<dbp:showName>`, `<dbp:shipName>`, `<dbp:clubname>`, `<dbp:unitName>`, `<dbp:otherName>`, `<dbo:formerName>`, `<dbp:birthname>`, `<dbp:alternativeNames>`, `<dbp:otherNames>`, `<dbp:names>`, `<rdfs:label>` | |
| Categories | Entity types | `<dcterms:subject>` | |
| Similar entity names | Entity name variants | `!<dbo:wikiPageRedirects>`, `!<dbo:wikiPageDisambiguates>`, `<dbo:wikiPageWikiLinkText>` | `!` denotes reverse direction (i.e. `<o, p, s>`) |
| Attributes | Literal attributes of entity | All `<s, p, o>`, where *"o"* is a literal and *"p"* is not in *Names*, *Categories*, *Similar entity names*, or the blacklisted predicates. For each `<s, p, o>` triple, if `p matches <dbp:.*>` both *p* and *o* are stored (i.e. *"p o"* is indexed). | |
| Related entity names | URI relations of entity | Similar to *Attributes* field, but *"o"* should be a URI. | |

These fields were built from the following files of the 2015-10 dump:
- `anchor_text_en.ttl`
- `article_categories_en.ttl`
- `disambiguations_en.ttl`
- `infobox_properties_en.ttl`
- `instance_types_transitive_en.ttl`
- `labels_en.ttl`
- `long_abstracts_en.ttl`
- `mappingbased_literals_en.ttl`
- `mappingbased_objects_en.ttl`
- `page_links_en.ttl`
- `persondata_en.ttl`
- `short_abstracts_en.ttl`
- `transitive_redirects_en.ttl`
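
The field rules in the table can be sketched as a small routing function. This is only an illustration under several assumptions: predicates are passed as prefixed strings without angle brackets, the predicate sets are abbreviated, the blacklist (not listed in this plan) is left empty, and the *"p o"* rule is assumed to carry over to *Related entity names*. The function name `assign_field` and the field keys are hypothetical.

```python
import re

# Abbreviated predicate sets; the full lists are in the table above.
NAME_PREDICATES = {"foaf:name", "dbp:name", "rdfs:label"}
CATEGORY_PREDICATES = {"dcterms:subject"}
# Predicates marked with "!" in the table are indexed in reverse direction.
REVERSE_SIMILAR = {"dbo:wikiPageRedirects", "dbo:wikiPageDisambiguates"}
FORWARD_SIMILAR = {"dbo:wikiPageWikiLinkText"}
BLACKLIST = set()  # assumption: the actual blacklist is not given in this plan


def assign_field(s, p, o, o_is_literal):
    """Return (entity, field, text) for one triple, or None if it is skipped."""
    if p in NAME_PREDICATES:
        return (s, "names", o)
    if p in CATEGORY_PREDICATES:
        return (s, "categories", o)
    if p in REVERSE_SIMILAR:
        # Reverse direction <o, p, s>: the text is attached to the object entity.
        return (o, "similar_entity_names", s)
    if p in FORWARD_SIMILAR:
        return (s, "similar_entity_names", o)
    if p in BLACKLIST:
        return None
    # For dbp: predicates both predicate and object are indexed ("p o").
    text = f"{p} {o}" if re.match(r"dbp:.*", p) else o
    field = "attributes" if o_is_literal else "related_entity_names"
    return (s, field, text)
```

For example, a redirect triple from `dbr:B` to `dbr:A` would contribute `dbr:B` to the *Similar entity names* field of `dbr:A`, not the other way around.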

Our hypothesis is that the fields are not equally important.
Our idea is therefore to use a hill-climbing algorithm to determine which combination of fields (or possibly field weights) produces the best output.
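
To make the idea concrete, the following is a minimal sketch of one simple greedy hill-climbing variant over per-field weights. It is not a settled design: the `evaluate` callback is hypothetical (it stands in for whatever retrieval metric ends up scoring a weighted ranking), and the weight range, step size, and field keys are assumptions.

```python
import random

FIELDS = ["names", "categories", "similar_entity_names",
          "attributes", "related_entity_names"]


def hill_climb(evaluate, step=0.1, max_iters=100, seed=0):
    """Greedy hill climbing over per-field weights in [0, 1].

    `evaluate` is a hypothetical callback mapping a weight dict to a score
    (higher is better), e.g. a retrieval metric for the resulting ranking.
    """
    rng = random.Random(seed)
    weights = {f: rng.random() for f in FIELDS}  # random starting point
    best = evaluate(weights)
    for _ in range(max_iters):
        improved = False
        for field in FIELDS:
            for delta in (step, -step):
                candidate = dict(weights)
                # Move one weight by one step, clamped to [0, 1].
                candidate[field] = min(1.0, max(0.0, weights[field] + delta))
                score = evaluate(candidate)
                if score > best:
                    weights, best, improved = candidate, score, True
        if not improved:
            break  # local optimum: no single-step change helps
    return weights, best
```

Restricting the weights to 0 or 1 would turn this into a search over field subsets. Since hill climbing only finds local optima, restarting from several seeds and keeping the best result is a common mitigation.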