From e6bf1a23452db1c01c5cf0293c6c04e42283f96b Mon Sep 17 00:00:00 2001
From: Erin van der Veen
Date: Fri, 29 Sep 2017 11:53:51 +0200
Subject: Mention that there are two ways to Index the data

---
 Plan.md | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/Plan.md b/Plan.md
index 30e0e46..5a2611f 100644
--- a/Plan.md
+++ b/Plan.md
@@ -28,6 +28,21 @@ Of the following files from the 2015-10 dump:
 - `short_abstracts_en.ttl`
 - `transitive_redirects_en.ttl`
 
+There are two indexes that are used for this result.
+
+Both indexes are likely implemented by the Nordlys package that we will describe below.
+
+###Index A
+ - A new field called "catchall" is used; it encompasses the content of all other fields. Duplicate values are not removed in this field.
+
+###Index B
+ - Anchor texts (i.e. contents of `` predicate) are added to both "similar entity names" and "attributes" fields.
+ - Entity URIs are resolved differently for the "related entity names" field. Names for related entities are extracted in the same way as it is done for the "names" field (see predicates for "names" in the above table), but only one arbitrary name is used for each related entity.
+ - Category URIs are resolved using the `category_labels_en.ttl` file.
+ - Predicate URIs are resolved using the `infobox_property_definitions_en.ttl` file. If a name for a predicate is not defined, the predicate is omitted.
+
+More information about the way it was indexed can be found [here](https://iai-group.github.io/DBpedia-Entity/index_details.html).
+
 Our hypothesis is that not all of the fields are of similar importance.
 As such, our idea is to use some kind of Hill-Climbing algorithm to determine just what combination of fields (or possible weights) produces the best output.
 
@@ -43,5 +58,3 @@ In addition, Nordlys provides a Python interface that can be used to implement t
 The data that is used by the results is also bundled with the Nordlys Python package, and has already been indexed.
 This allows us to use the Python package without having to convert/index the data ourselves.
 
-
-
-- 
cgit v1.2.3
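
The plan text in this patch proposes "some kind of Hill-Climbing algorithm" over field weights without fixing the details. The following is a minimal sketch of one possible variant (greedy coordinate-wise hill climbing with a fixed step size), not the authors' method. The function `hill_climb`, the `toy_score` function, and the hidden target weights are all hypothetical; in a real experiment the score would come from evaluating a retrieval run, and nothing here calls the Nordlys API.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, float]


def hill_climb(
    fields: List[str],
    score: Callable[[Weights], float],
    step: float = 0.1,
    max_iters: int = 100,
) -> Tuple[Weights, float]:
    """Greedy hill climbing: nudge one field weight at a time and keep
    any change that strictly improves the score."""
    weights = {f: 1.0 for f in fields}  # assumption: start with equal weights
    best = score(weights)
    for _ in range(max_iters):
        improved = False
        for f in fields:
            for delta in (step, -step):
                candidate = dict(weights)
                # Clamp at zero; rounding avoids floating-point drift.
                candidate[f] = max(0.0, round(candidate[f] + delta, 10))
                s = score(candidate)
                if s > best:
                    weights, best = candidate, s
                    improved = True
        if not improved:
            break  # local optimum: no single-step change helps
    return weights, best


# Hypothetical stand-in for a retrieval quality measure: the score is
# highest when the weights match an (invented) hidden target.
TARGET = {"names": 1.5, "attributes": 0.5, "catchall": 1.0}


def toy_score(w: Weights) -> float:
    return -sum((w[f] - TARGET[f]) ** 2 for f in TARGET)


best_weights, best_score = hill_climb(list(TARGET), toy_score)
print(best_weights, best_score)
```

Hill climbing of this kind only finds a local optimum, so the result can depend on the starting weights and the step size; restarting from several random starting points is a common way to reduce that risk.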