# Evaluation

## Explanation of results

The results we obtained are described in the second blog post, on the implementation. In this evaluation we discuss which fields could be added to the index, in order to judge the importance of each field. The table of scores and ranks shows that not all fields that BM25 finds relevant are included in Nordlys. Nordlys only uses `<foaf:givenName>`, `<dbp:name>`, `<foaf:name>`, `<dbo:wikiPageWikiLinkText>` and `<rdfs:label>`. Some of the remaining fields may be worth adding to the index. Below we evaluate the fields that appear in our lists of important fields, as found by our measure for both the BM25 relevance scores and our human assessments.

The fields with the top two ranks in both the BM25 and the human assessment rankings are `<dbo:abstract>` and `<rdfs:comment>`. Even though BM25 ranks these fields so highly, we do not recommend adding both, since `<rdfs:comment>` is simply a shorter version of `<dbo:abstract>`. In addition, both fields contain long texts, so adding both to the index would likely increase computation time considerably. Instead, we recommend adding only the `<rdfs:comment>` field.

Two other fields we might want to add are `<dc:description>`, which has BM25 rank 8, and `<dbp:shortDescription>`, which has rank 9. These descriptions are likely to match what users search for, so we recommend having one of them in the index. Since the two fields overlap heavily, we recommend adding only the higher-ranked `<dc:description>`. The descriptions in both fields are already relatively short, so the difference in computation between them would not be large, which leaves the rank as the deciding factor.

Some fields that scored well when using our human assessment as a relevance measure turned out to have a low ranking when using BM25. These fields are `<dbp:ground>` (802), `<dbo:foundingYear>` (299) and `<dbp:foundation>` (266). We suspect that the reason for this difference is that many DBpedia entries do not contain these fields, because they are too specific. For example, countries, people and many organisations do not have a founding year in DBpedia. Other fields in our lists that are also too specific are `<dbp:bridgeName>` and `<dbp:producer>` (which we had a hard time even finding in DBpedia).

Another field with a high score that we do not want to add is `<dbp:caption>`, which has BM25 rank 7. Since `<dbp:caption>` can be the caption of an image or a table, which is often added in support of information contained in other fields, this field does not provide much new information. For similar reasons we also do not add `<dbp:mapCaption>`, `<dbp:imageCaption>` and `<dbp:pushpinMapCaption>`.

## Conclusion
Our conclusion is that it would be best to add `<rdfs:comment>` and `<dc:description>`, because we think those are the fields that would influence the results most. Both fields describe the topic itself, whereas the fields currently used by Nordlys (`<foaf:givenName>`, `<dbp:name>`, `<foaf:name>`, `<dbo:wikiPageWikiLinkText>` and `<rdfs:label>`) mostly contain names. For example, a user who is looking for the football player Messi but does not know his name could use the query "argentine footballplayer". With the added fields, that user is more likely to find the entry for Messi, since this is how DBpedia describes him.

## Further Research

### Hill climbing to validate or improve the results of our statistical analysis of the importance of fields.
Unfortunately, we could not apply hill climbing in our own research because we did not have enough programmers to carry it out. In further research, it would still be interesting to apply hill climbing for search engine optimization, because it takes the value of BM25 into account, which is something we have not yet managed to do in our research.

In particular, hill climbing takes duplicate data across multiple fields into account. For most Wikipedia articles, the name of the page also occurs in the abstract of that page. Therefore, adding the name of the page might not actually improve the evaluation of the search algorithm. Our current data does not take such correlations into account.

The inverse might also be true: some fields that we think are not relevant (because they do not often contain a search term) might contain the search term in such specific cases that adding them still improves the overall evaluation of the system.

It would be interesting to see how this hill climbing algorithm can optimize search.
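
To make the idea concrete, here is a minimal sketch of greedy hill climbing over index fields. It is not something we ran. The function `hill_climb_fields`, the `evaluate` callback and the toy coverage data are all assumptions made up for illustration; in a real experiment `evaluate` would rebuild the index with the given fields and return a retrieval score such as mean NDCG.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def hill_climb_fields(
    baseline: Sequence[str],
    candidates: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], float]:
    """Greedily add the candidate field that improves the score the most.

    Stops as soon as no remaining candidate gives a strict improvement.
    """
    selected = list(baseline)
    best_score = evaluate(frozenset(selected))
    remaining = [f for f in candidates if f not in selected]

    while remaining:
        best_field = None
        for field in remaining:
            score = evaluate(frozenset(selected + [field]))
            if score > best_score:
                best_score, best_field = score, field
        if best_field is None:
            break  # local optimum: no single field helps any more
        selected.append(best_field)
        remaining.remove(best_field)

    return selected, best_score


# Hypothetical toy data, NOT measured results: which of five made-up
# queries (numbered 1-5) each field would help answer.
TOY_COVERAGE = {
    "rdfs:label": {1, 2},
    "rdfs:comment": {2, 3, 4},
    "dbo:abstract": {2, 3, 4},  # duplicates the information in rdfs:comment
    "dc:description": {5},
}


def toy_evaluate(fields: FrozenSet[str]) -> float:
    """Stand-in for a real evaluation: fraction of the toy queries answered."""
    answered = set()
    for field in fields:
        answered |= TOY_COVERAGE[field]
    return len(answered) / 5


if __name__ == "__main__":
    fields, score = hill_climb_fields(
        ["rdfs:label"],
        ["rdfs:comment", "dbo:abstract", "dc:description"],
        toy_evaluate,
    )
    print(fields, score)
```

In this toy setup the duplicated field is never selected, because once one of the two overlapping fields is in the index the other no longer improves the score. That is the kind of correlation our per-field analysis cannot see.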

### Adding the fields to the index and comparing the new NDCG scores with baseline runs.
In further research, it would also be interesting to actually implement the suggestions we make here. We would then add the fields `<rdfs:comment>` and `<dc:description>` to the index and compare the resulting NDCG scores with those of the baseline runs, to see whether the extra fields really improve the search results.
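
As a rough sketch of how such a comparison could be scored, the snippet below computes NDCG@k for one ranked list. It assumes the common formulation with linear gains and a log2 rank discount; other variants exist (for example with exponential gains), so the numbers should be checked against whatever evaluation tool produced the baseline scores. The function names and the example gain values are our own illustration, not measured data.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k results (rank starts at 1)."""
    return sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(
    ranked_gains: Sequence[float],
    judged_gains: Sequence[float],
    k: int,
) -> float:
    """NDCG@k for one query.

    ranked_gains: relevance grades of the returned results, in ranked order.
    judged_gains: relevance grades of all judged entities for the query,
                  used to build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant entities judged for this query
    return dcg_at_k(ranked_gains, k) / ideal


if __name__ == "__main__":
    # Hypothetical grades for one query: a baseline run and a run
    # with extra fields in the index.
    judged = [3, 1]
    print("baseline:", ndcg_at_k([1, 3], judged, k=2))
    print("extended:", ndcg_at_k([3, 1], judged, k=2))
```

Averaging `ndcg_at_k` over all queries for the baseline index and for the extended index would give the two numbers to compare.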