From 288b5e20b18ca5ee784fbd0bd4adfc49a6db9947 Mon Sep 17 00:00:00 2001
From: Erin van der Veen
Date: Fri, 29 Sep 2017 11:06:31 +0200
Subject: Write Idea section of plan

---
 Plan.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
 create mode 100644 Plan.md

diff --git a/Plan.md b/Plan.md
new file mode 100644
index 0000000..78bf7c5
--- /dev/null
+++ b/Plan.md
@@ -0,0 +1,32 @@
+# Plan
+
+## The Idea
+The DBpedia-Entity repository has base rankings for a select number of retrieval algorithms for multiple sets of queries.
+These base rankings were obtained by running the algorithms on the dataset, where the dataset was reduced to contain only a subset of all possible fields.
+In particular, the fields used by the base rankings were:
+
+| Field | Description | Predicates | Notes |
+| --- | --- | --- | --- |
+| Names | Names of the entity | ``, ``, ``, ``, ``, ``, ``, ``, ``, ``, ``, ``, ``, ``, ``, ``, ``, ``, ``, ``, `` | |
+| Categories | Entity types | `` | |
+| Similar entity names | Entity name variants | `!`, `!`, `` | `!` denotes reverse direction (i.e. ``) |
+| Attributes | Literal attributes of entity | All ``, where *"o"* is a literal and *"p"* is not in *Names*, *Categories*, *Similar entity names*, and blacklist predicates. For each `` triple, if `p matches ` both *p* and *o* are stored (i.e. *"p o"* is indexed). | |
+| Related entity names | URI relations of entity | Similar to *Attributes* field, but *"o"* should be a URI. | |
+
+These fields are drawn from the following files of the 2015-10 dump:
+- `anchor_text_en.ttl`
+- `article_categories_en.ttl`
+- `disambiguations_en.ttl`
+- `infobox_properties_en.ttl`
+- `instance_types_transitive_en.ttl`
+- `labels_en.ttl`
+- `long_abstracts_en.ttl`
+- `mappingbased_literals_en.ttl`
+- `mappingbased_objects_en.ttl`
+- `page_links_en.ttl`
+- `persondata_en.ttl`
+- `short_abstracts_en.ttl`
+- `transitive_redirects_en.ttl`
+
+Our hypothesis is that not all of the fields are of similar importance.
+As such, our idea is to use some kind of hill-climbing algorithm to determine which combination of fields (or possibly field weights) produces the best output.
--
cgit v1.2.3
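
The hill-climbing idea in the plan could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the method the plan commits to: the search space is taken to be subsets of the five fields from the table, a neighbour is a subset that differs by one field, and `toy_score` with its `TOY_GAIN` values is a made-up stand-in for whatever evaluation would actually be used (for instance, re-running a retrieval algorithm on the selected fields and scoring the ranking).

```python
from typing import Callable, FrozenSet, Sequence, Tuple

# Field names taken from the table in the plan.
FIELDS = (
    "names",
    "categories",
    "similar_entity_names",
    "attributes",
    "related_entity_names",
)

Score = Callable[[FrozenSet[str]], float]


def hill_climb(fields: Sequence[str], score: Score) -> Tuple[FrozenSet[str], float]:
    """Greedy search over field subsets, starting from all fields."""
    current = frozenset(fields)
    current_score = score(current)
    while True:
        # A neighbour differs from the current subset by exactly one field
        # (added or removed); the empty subset is skipped.
        neighbours = [current ^ {f} for f in fields if current ^ {f}]
        if not neighbours:
            return current, current_score
        best = max(neighbours, key=score)
        best_score = score(best)
        if best_score <= current_score:
            return current, current_score  # local optimum reached
        current, current_score = best, best_score


# Hypothetical per-field gains, invented purely for this demonstration.
TOY_GAIN = {
    "names": 5,
    "categories": 2,
    "similar_entity_names": 1,
    "attributes": -1,
    "related_entity_names": -2,
}


def toy_score(subset: FrozenSet[str]) -> float:
    """Stand-in for a real evaluation of a retrieval run on `subset`."""
    return sum(TOY_GAIN[f] for f in subset)


if __name__ == "__main__":
    best_fields, best_score = hill_climb(FIELDS, toy_score)
    print(sorted(best_fields), best_score)
```

With the invented gains above, the search drops the two fields with negative gain and stops. Hill climbing only finds a local optimum, so with a real scoring function it may be worth restarting from several random subsets. Searching over field weights rather than subsets would need a different neighbour definition (for example, nudging one weight by a small step).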