Diffstat (limited to 'Plan.md')
-rw-r--r-- Plan.md 86
1 files changed, 41 insertions, 45 deletions
diff --git a/Plan.md b/Plan.md
index c258d12..26f5b0d 100644
--- a/Plan.md
+++ b/Plan.md
@@ -2,53 +2,49 @@
## The Idea
The DBpedia-Entity repository has base rankings for a select number of retrieval algorithms for multiple sets of queries.
-These base rankings were obtained by running the algorithms on the dataset, where the dataset was reduced to contain only a subset of all possible fields.
-In particular, the fields used by the base rankings were:
-
-| Field | Description | Predicates | Notes |
-| --- | --- | --- | --- |
-| Names | Names of the entity | `<foaf:name>`, `<dbp:name>`, `<foaf:givenName>`, `<foaf:surname>`, `<dbp:officialName>`, `<dbp:fullname>`, `<dbp:nativeName>`, `<dbp:birthName>`, `<dbo:birthName>`, `<dbp:nickname>`, `<dbp:showName>`, `<dbp:shipName>`, `<dbp:clubname>`, `<dbp:unitName>`, `<dbp:otherName>`, `<dbo:formerName>`, `<dbp:birthname>`, `<dbp:alternativeNames>`, `<dbp:otherNames>`, `<dbp:names>`, `<rdfs:label>` | |
-| Categories | Entity types | `<dcterms:subject>` | |
-| Similar entity names | Entity name variants | `!<dbo:wikiPageRedirects>`, `!<dbo:wikiPageDisambiguates>`, `<dbo:wikiPageWikiLinkText>` | `!` denotes reverse direction (i.e. `<o, p, s>`) |
-| Attributes | Literal attributes of entity | All `<s, p, o>`, where *"o"* is a literal and *"p"* is not in *Names*, *Categories*, *Similar entity names*, and blacklist predicates. For each `<s, p, o>` triple, if `p matches <dbp:.*>` both *p* and *o* are stored (i.e. *"p o"* is indexed). | |
-| Related entity names | URI relations of entity| Similar to *Attributes* field, but *"o"* should be a URI. | |
-
-Of the following files from the 2015-10 dump:
-- `anchor_text_en.ttl`
-- `article_categories_en.ttl`
-- `disambiguations_en.ttl`
-- `infobox_properties_en.ttl`
-- `instance_types_transitive_en.ttl`
-- `labels_en.ttl`
-- `long_abstracts_en.ttl`
-- `mappingbased_literals_en.ttl`
-- `mappingbased_objects_en.ttl`
-- `page_links_en.ttl`
-- `persondata_en.ttl`
-- `short_abstracts_en.ttl`
-- `transitive_redirects_en.ttl`
-
-There are two indexes that are used for this result.
-<!-- TODO: Are they? -->
-Both Indexes are likely implemented by the Nordlys package that we will describe below.
-
-### Index A
- - A new field called "catchall" is used; it encompasses the content of all other fields. Duplicate values are not removed in this field.
-
-### Index B
- - Anchor texts (i.e. contents of `<dbo:wikiPageWikiLinkText>` predicate) are added to both "similar entity names" and "attributes" fields.
- - Entity URIs are resolved differently for the "related entity names" field. Names for related entities are extracted in the same way as it is done for "names" field (see predicates for "names" in the above table), but only one arbitrary name is used for each related entity.
- - Category URIs are resolved using `category_labels_en.ttl` file
- - Predicate URIs are resolved using `infobox_property_definitions_en.ttl` file. If a name for a predicate is not defined, a predicate is omitted.
-
-More information about the way it was indexed can be found [here](https://iai-group.github.io/DBpedia-Entity/index_details.html).
+These base rankings [were obtained](https://iai-group.github.io/DBpedia-Entity/index_details.html) by running tests with the ranking algorithms on the dataset,
+ where the dataset was reduced to contain only a subset of all possible fields.
+Also, some fields had a special function:
+
+> | Field | Description | Predicates | Notes |
+> | --- | --- | --- | --- |
+> | Names | Names of the entity | `<foaf:name>`, `<dbp:name>`, `<foaf:givenName>`, `<foaf:surname>`, `<dbp:officialName>`, `<dbp:fullname>`, `<dbp:nativeName>`, `<dbp:birthName>`, `<dbo:birthName>`, `<dbp:nickname>`, `<dbp:showName>`, `<dbp:shipName>`, `<dbp:clubname>`, `<dbp:unitName>`, `<dbp:otherName>`, `<dbo:formerName>`, `<dbp:birthname>`, `<dbp:alternativeNames>`, `<dbp:otherNames>`, `<dbp:names>`, `<rdfs:label>` | |
+> | Categories | Entity types | `<dcterms:subject>` | |
+> | Similar entity names | Entity name variants | `!<dbo:wikiPageRedirects>`, `!<dbo:wikiPageDisambiguates>`, `<dbo:wikiPageWikiLinkText>` | `!` denotes reverse direction (i.e. `<o, p, s>`) |
+> | Attributes | Literal attributes of entity | All `<s, p, o>`, where *"o"* is a literal and *"p"* is not in *Names*, *Categories*, *Similar entity names*, and blacklist predicates. For each `<s, p, o>` triple, if `p matches <dbp:.*>` both *p* and *o* are stored (i.e. *"p o"* is indexed). | |
+> | Related entity names | URI relations of entity| Similar to *Attributes* field, but *"o"* should be a URI. | |
+
+> ### Index B
+> - Anchor texts (i.e. contents of `<dbo:wikiPageWikiLinkText>` predicate) are added to both "similar entity names" and "attributes" fields.
+> - Entity URIs are resolved differently for the "related entity names" field. Names for related entities are extracted in the same way as it is done for "names" field (see predicates for "names" in the above table), but only one arbitrary name is used for each related entity.
+> - Category URIs are resolved using `category_labels_en.ttl` file
+> - Predicate URIs are resolved using `infobox_property_definitions_en.ttl` file. If a name for a predicate is not defined, a predicate is omitted.
+
+However, not all information in the remaining fields is used for ranking;
+ some fields are simply ignored.
+Which fields are ignored can be found in the [Nordlys repository](https://github.com/iai-group/nordlys/blob/master/data/config/index_dbpedia_2015_10.config.json),
+ in the `blacklist` key.
+We could not find out how the Nordlys group decided which fields to put in the blacklist;
+ the choice may simply rest on educated, but subjective, guesses.
+
+It is important to base this blacklist on actual observations,
+ because this may improve the results of the retrieval function.
+Hence, we want to find a better, objectively produced, reproducible blacklist.
### Our Approach
-Our hypothesis is that not all of the fields are of similar importance.
-As such, our idea is to use some kind of Hill-Climbing algorithm to determine just what combination of fields (or possible weights) produces the best ranking.
-
-The quality of the ranking is determined by measuring effectiveness in terms of Normalized Discounted Cumulative Gain (NDCG).
-We calculate the NDCG@10, NDCG@100 for (each of) the ranking function(s) (BM25, PRMS, MLM, LM), so the hill climbing algorithm can compare the output for the various combinations of fields and so we can analyze if the effectiveness of our rankings are an improvement over the base rankings described in the DBpedia-Entity repo.
+We consider a vector space in which every possible search field is a binary parameter.
+A vector has `1` for a parameter iff the corresponding field is included in the search (i.e. excluded from the blacklist).
+We will then run a hill-climbing algorithm through this high-dimensional vector space
+ in order to find a vector (an index setting) for which the ranking results are best.
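
A minimal sketch of such a search, under stated assumptions: `hill_climb`, `toy_score` and `HELPFUL` are hypothetical names used only for illustration. In the real setting the score function would index with the given field selection and return an evaluation score, which is far more expensive than the toy function below. The sketch uses steepest-ascent hill climbing, which is one possible variant and not necessarily the one that will be chosen.

```python
from typing import Callable, Optional, Sequence, Tuple

Vector = Tuple[int, ...]


def hill_climb(
    n_fields: int,
    score: Callable[[Vector], float],
    start: Optional[Sequence[int]] = None,
    max_steps: int = 1000,
) -> Tuple[Vector, float]:
    """Steepest-ascent hill climbing over binary field-inclusion vectors."""
    # Start with every field included unless a start vector is given.
    current: Vector = tuple(start) if start is not None else (1,) * n_fields
    best = score(current)
    for _ in range(max_steps):
        # Neighbours differ from the current vector in exactly one field.
        candidates = []
        for i in range(n_fields):
            neighbour = current[:i] + (1 - current[i],) + current[i + 1:]
            candidates.append((score(neighbour), neighbour))
        if not candidates:
            break
        top_score, top_vector = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            break  # no neighbour is better: a (local) optimum
        current, best = top_vector, top_score
    return current, best


# Hypothetical stand-in for an expensive evaluation run: it rewards
# agreeing with a made-up "ideal" field selection.
HELPFUL: Vector = (1, 0, 1, 1, 0)


def toy_score(vector: Vector) -> float:
    return float(sum(1 for a, b in zip(vector, HELPFUL) if a == b))


if __name__ == "__main__":
    print(hill_climb(len(HELPFUL), toy_score))
```

Hill climbing can stop in a local optimum, so in practice several runs from different start vectors may be needed; the toy score has a single optimum and does not show this.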
+
+We measure the quality of the ranking using Normalized Discounted Cumulative Gain (NDCG).
+This is the same method Nordlys [used](http://nordlys.readthedocs.io/en/latest/er.html?highlight=NDCG#benchmark-results) for benchmarking,
+ which allows us to verify our first results.
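
For reference, a minimal sketch of NDCG@k, assuming linear gain and a log2 rank discount (other variants use an exponential gain of `2^rel - 1`; which variant a given evaluation tool uses should be checked). The function names and the entity identifiers in the example are hypothetical.

```python
import math
from typing import Dict, Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(ranking: Sequence[str], qrels: Dict[str, int], k: int) -> float:
    """NDCG@k of a ranked list of entity ids against graded judgements."""
    # Entities without a judgement are treated as non-relevant (gain 0).
    gains = [qrels.get(entity, 0) for entity in ranking]
    # The ideal ranking orders all judged entities by decreasing relevance.
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    # Made-up judgements for a single query.
    qrels = {"<dbpedia:A>": 2, "<dbpedia:B>": 1, "<dbpedia:C>": 0}
    print(ndcg_at_k(["<dbpedia:B>", "<dbpedia:A>"], qrels, k=10))
```

The per-query values would then be averaged over all queries in a query set to obtain one score per index setting.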
+
+We will use only one ranking function to start with (the fastest, or the one we can get working most easily),
+ but might extend this to more ranking functions.
+At first sight, that extension does not seem particularly interesting;
+ it would be 'more of the same'.
## Nordlys
Nordlys is a toolkit for entity-oriented and semantic search. It currently supports four entity-oriented tasks, which could be useful for our project. These entity-oriented tasks are: