-rw-r--r--  .gitignore          4
-rw-r--r--  Evaluation.md      30
-rw-r--r--  Fields.md          41
-rw-r--r--  Implementation.md 288
-rw-r--r--  Plan.md            60
-rw-r--r--  README.md           6
-rwxr-xr-x  check.py           15
-rwxr-xr-x  df.py              26
-rwxr-xr-x  run.py             52
-rwxr-xr-x  scrape.py          33
10 files changed, 553 insertions, 2 deletions
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..5baa1a3
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,4 @@
+errors
+data/
+run.txt
+scores.txt
diff --git a/Evaluation.md b/Evaluation.md
new file mode 100644
index 0000000..c5cba19
--- /dev/null
+++ b/Evaluation.md
@@ -0,0 +1,30 @@
+# Evaluation
+
+## Explanation of results
+
+The results we obtained are described in the second blog post, [Implementation](Implementation.md). In this evaluation we discuss which fields could be added to the index, based on the importance of each field. As the table with scores and ranks shows, not all fields that BM25 finds relevant are included in Nordlys: only `<foaf:givenName>`, `<dbp:name>`, `<foaf:name>`, `<dbo:wikiPageWikiLinkText>` and `<rdfs:label>` are used. There may thus be fields worth adding to the index. We now evaluate the fields that appear in our lists of important fields, as found by our measure, for both the BM25 relevance scores and our human assessments.
+
+The fields with the top two ranks in both the BM25 and the human assessment rankings are `<dbo:abstract>` and `<rdfs:comment>`. Even though both are ranked highly by BM25, we do not recommend adding them both, since `<rdfs:comment>` is simply a shorter version of `<dbo:abstract>`. Moreover, since both fields contain large texts, indexing them both would likely increase computing time considerably. Instead, we recommend adding only the `<rdfs:comment>` field.
+
+Two other fields we might want to add are `<dc:description>`, which has BM25 rank 8, and `<dbp:shortDescription>`, at rank 9. Such descriptions are likely to be searched for. Since there is a lot of overlap between these two fields, we recommend adding only the higher-ranked `<dc:description>`. The descriptions are already relatively short, so the difference in computation between the two fields would not be very large anyway.
+
+Some fields that scored well using our human assessment as a relevance measure turned out to have a low ranking under BM25. These fields are `<dbp:ground>` (802), `<dbo:foundingYear>` (299) and `<dbp:foundation>` (266). We suspect that the reason for this difference is that many DBpedia entries do not contain these fields, because they are too specific. For example, countries, people and many organisations do not have a founding year in DBpedia. Other fields in our lists that are also too specific are `<dbp:bridgeName>` and `<dbp:producer>` (which we had a hard time even finding in DBpedia).
+
+Another field with a high score that we do not want to add is `<dbp:caption>`, at rank 7 for BM25. Since `<dbp:caption>` holds the caption of an image or a table, which is usually added in support of information contained in other fields, it does not provide much new information. For similar reasons we also do not add `<dbp:mapCaption>`, `<dbp:imageCaption>` and `<dbp:pushpinMapCaption>`.
+
+## Conclusion
+
+Our conclusion is that it would be best to add `<rdfs:comment>` and `<dc:description>`, because we think those are the fields that would influence the results most significantly. Both fields describe the topic itself, rather than just its names, whereas Nordlys currently only uses `<foaf:givenName>`, `<dbp:name>`, `<foaf:name>`, `<dbo:wikiPageWikiLinkText>` and `<rdfs:label>`. So if a user wants to look up, say, the football player Messi without knowing his name, they could use the query "argentine football player" and would be more likely to find the entity they are looking for, since this matches the description of Messi in DBpedia.
+
+## Further Research
+
+### Hill climbing to validate or improve the results of our static analysis of the importance of fields.
+
+Unfortunately, we could not apply hill climbing in our own research, because we did not have enough programmers to carry it out. In further research it would still be interesting to apply hill climbing to search-engine optimization, because it evaluates the actual rankings produced with BM25, something we did not yet manage to do in our research.
+
+In particular, hill climbing takes duplicate data in multiple fields into account. For most Wikipedia articles the name of the page also occurs in the abstract of that page, so adding the name field might not actually improve the evaluation of the search algorithm. Our current data does not take such correlations into account.
+
+The inverse might also be true: some fields that we think are not relevant (because they do not often contain a search term) might contain the search term in exactly those specific cases where it increases the overall evaluation of the system.
+
+It would be interesting to see how this hill climbing algorithm can optimize search.
+
+### Adding the fields to the index and comparing the new NDCG scores with baseline runs.
+
+In further research it would also be interesting to actually implement the suggestions we make here. We would add the fields `<rdfs:comment>` and `<dc:description>` to the index, produce new runs, and compare the new NDCG scores with those of the baseline runs to see whether the fields really improve the search results.
diff --git a/Fields.md b/Fields.md
new file mode 100644
index 0000000..121cc96
--- /dev/null
+++ b/Fields.md
@@ -0,0 +1,41 @@
+Below is a list of the categories of fields that we think are important, based on the number of occurrences of each field:
+
+Names ( with occurrences > 1000 )
+
+{
+
+ <dbp:showName>, <dbp:filename>, <dbo:ingredientName>, <dbp:nativeName>, <dbp:fullname>, <dbp:leaderName>,
+ <dbp:officialName>, <dbp:subdivisionName>, <dbp:nickname>, <dbo:birthName>, <dbp:alternativeNames>,
+ <dbp:birthName>, <foaf:surname>, <foaf:givenName>, <dbp:name>, <foaf:name>
+
+}
+
+Dates ( with occurrences > 1000 )
+
+{
+
+ <dbo:foundingDate>, <dbp:establishedDate>, <dbp:releaseDate>, <dbo:releaseDate>, <dbp:deathDate>, <dbp:date>,
+ <dbp:dateOfDeath>, <dbo:deathDate>, <dbp:birthDate>, <dbp:dateOfBirth>, <dbo:birthDate>,
+
+ <dbp:year>, <dbo:activeYearsEndyear>, <dbo:foundingYear>, <dbp:years>, <dbp:yearsActive>, <dbo:deathYear>,
+ <dbo:activeYearsStartYear>, <dbo:birthYear>
+
+}
+
+Top occurring fields ( with occurrences > 10k ) ( including fields blacklisted by Nordlys )
+
+{
+
+    <dbp:location>, <dbp:website>, <dbo:genre>, <dbp:birthPlace>, <dbp:placeOfBirth>, <dbo:birthPlace>, <dbp:title>,
+    <dbp:type>, <dbp:country>, <dbp:genre>, <dbp:shortDescription>, <dc:description>, <dbp:caption>, <foaf:homepage>,
+    !<dbo:wikiPageDisambiguates>, <dbo:thumbnail>, <foaf:depiction>, <rdf:type>, !<dbo:wikiPageRedirects>, fb:<owl:sameAs>,
+    <dbo:wikiPageWikiLinkText>, <dcterms:subject>, <dbo:wikiPageWikiLink>, <dbo:abstract>, <dbo:wikiPageID>,
+    <dbo:wikiPageOutDegree>, <rdfs:comment>, <rdfs:label>
+
+ Duplicates with previous categories:
+
+ <foaf:surname>, <foaf:givenName>, <dbp:name>,
+    <dbp:birthDate>, <dbo:birthDate>, <dbp:dateOfBirth>,
+ <dbo:birthYear>
+
+}
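+
+These occurrence counts can be approximated from the entities downloaded by
+`scrape.py`; a minimal sketch, assuming the entity JSON files are in `data/`:
+
+```python
+import json
+import os
+from collections import Counter
+
+counts = Counter()
+for filename in os.listdir('data'):
+    with open(os.path.join('data', filename)) as f:
+        counts.update(json.load(f).keys())  # one count per field per entity
+
+for field, n in counts.most_common():
+    print('{}\t{}'.format(field, n))
+```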
diff --git a/Implementation.md b/Implementation.md
new file mode 100644
index 0000000..c19f119
--- /dev/null
+++ b/Implementation.md
@@ -0,0 +1,288 @@
+# Implementation
+
+## Feasibility
+Our [Plan](Plan.md) mentions the following:
+
+> We consider a vector space where every possible search field represents a
+> binary parameter. A vector has `1` for the parameter if and only if it is
+> included in the search (excluded from the blacklist). We will then run a
+> hill-climbing algorithm through this higher-dimensional vector space in order
+> to find a vector (an index setting) for which the ranking results are best.
+
+Soon after we began trying to implement this feature using a locally run
+version of Nordlys, we encountered some issues, the most notable being that
+our machines were unable to index the full DBpedia set in a reasonable amount
+of time using a reasonable amount of resources. We therefore decided that the
+best option was to use a subset of the DBpedia dataset.
+
+The subset we settled on consists of the entities that have a relevance score
+assigned to them for at least one query. We then only consider the results of
+a given query in our assessment.
+
+This has the additional benefit that the relevance judgements (both the human
+assessments and the BM25 scores) need not be computed: simply parsing the
+files provided by Nordlys is enough to implement any kind of per-field
+assessment.
+
+Unfortunately, it turned out that we also did not have the resources to
+implement a hill-climbing algorithm; with only two programmers the task was
+slightly too much work. Instead, we took a different approach and statically
+analysed the importance of all fields. The measure that we use takes the form
+of:
+
+*score*(*f*) = Σ<sub>*d* ∈ *D*</sub> Σ<sub>*q* ∈ *Q*</sub> *relevance*(*d*, *q*) · log(1 + *tf*(*f*, *q*, *d*) / |*f*|)
+
+Where *relevance* is the BM25 relevance stored by Nordlys, *D* is the set of
+documents, *Q* the set of queries, *tf* the number of values of the field that
+contain at least one of the query terms, and |*f*| the total number of values
+in the field.
+
+The formula assumes that relevance behaves more or less linearly. The
+logarithm is used because further occurrences of the same term are not as
+important as the first occurrence.
+
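+To illustrate the damping, consider a field with four values; the numbers
+below are purely illustrative:
+
+```python
+from math import log
+
+# log(1 + matches/|f|) grows sub-linearly in the fraction of matching values
+for matches in [0, 1, 2, 4]:
+    print(matches, round(log(1 + matches / 4), 3))
+# prints: 0 0.0 / 1 0.223 / 2 0.405 / 4 0.693
+```
+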
+## Code
+
+We use three Python programs that:
+
+1. Get DBpedia entities from the Nordlys API (`scrape.py`)
+2. For each entry in a BM25 run, list the DBpedia ID, the relevance score
+   and, for each field in the entity, how many of the values match at least
+   one of the query terms (`run.py`). This is the information that BM25 uses
+   to compute the relevance.
+3. Use that information to investigate how important each field is
+ (`check.py`).
+
+We will now discuss the implementation of each of these files.
+
+### `scrape.py`
+
+- In this file we read lines from `stdin`. These lines are supposed to come
+  from a BM25 run; that way, we only download DBpedia entities that we
+  actually need.
+
+```python
+if __name__ == '__main__':
+ for line in fileinput.input():
+ scrape(line)
+```
+
+- We split each line from the run file. Only the DBpedia ID is relevant.
+
+```python
+def scrape(line):
+ index, query, dbpediaid, relevance = line.split('\t')
+ try:
+ get(dbpediaid)
+ except Exception as e:
+ with open(ERRORFILE, 'a') as f:
+            f.write(dbpediaid + '\t' + str(e) + '\n')
+```
+
+- We store the entities one per file, in the original JSON format. We use the
+  ID as the filename, URL-encoding it to avoid special characters (this also
+  removes slashes). In the request URL itself we keep slashes intact, as
+  shown below.
+
+  Normally, Nordlys refuses queries from a Python user agent, so we set the
+  `User-Agent` header to `Radboud University`, which Nordlys happily accepts.
+  We did not hit any rate limiting.
+
+```python
+def get(dbpediaid):
+ outfile = os.path.join(OUTDIR, quote_plus(dbpediaid) + '.json')
+ if os.path.isfile(outfile):
+ return
+    # percent-encode the ID, but keep slashes intact in the URL path
+    url = 'http://api.nordlys.cc/ec/lookup_id/{}'.format(
+        quote_plus(dbpediaid.replace('/', '___SLASH')).replace('___SLASH', '/'))
+ print(url)
+ result = urlopen(Request(url,
+ headers={'User-Agent': 'Radboud University'})).read()
+ with open(outfile, 'w') as f:
+ f.write(result.decode(encoding='UTF-8'))
+```
+
+### `run.py`
+
+- `queries_stopped.json` lists all query terms. We load this file once, then
+ process a run from `stdin`.
+
+```python
+if __name__ == '__main__':
+ with open('queries_stopped.json') as f:
+ queries = json.load(f)
+
+ for line in fileinput.input():
+ run(queries, line)
+```
+
+- We split each line of the run file. For each field we check (1) how many
+  values it has and (2) how many of those values match a query term.
+
+```python
+def run(queries, line):
+    try:
+        query, _, dbpediaid, _, relevance, method = line.split('\t')
+    except ValueError:  # qrels files have fewer columns
+        query, _, dbpediaid, relevance = line.split('\t')
+    terms = queries[query].split()
+    try:
+        result = get(dbpediaid)
+        if result is None:
+            return
+        for field, values in result.items():
+            matches = 0
+            for value in values:
+                if match(value, terms):
+                    matches += 1
+            print('{}\t{}\t{}\t{}\t{}\t{}'.format(
+                query, dbpediaid, float(relevance), field, len(values), matches))
+    except Exception as e:
+        print(dbpediaid)
+        print(e)
+        with open(ERRORFILE, 'a') as f:
+            f.write(dbpediaid + '\t' + str(e) + '\n')
+```
+
+- For simplicity, we do not use lemmatisation or synonym resolution here;
+  this could be an improvement in a next version (a sketch of a slightly more
+  robust matcher follows below).
+
+```python
+def match(value, terms):
+ for v in value.split():
+ if v in terms:
+ return True
+ return False
+```
+
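+  Even without full lemmatisation, a next version could at least normalise
+  case and punctuation before matching. A sketch (not part of our current
+  pipeline; it assumes the query terms are already lowercase):
+
+```python
+import string
+
+def match_normalised(value, terms):
+    # lowercase and strip punctuation before comparing; `terms` is assumed
+    # to contain lowercase terms (as in queries_stopped.json)
+    table = str.maketrans('', '', string.punctuation)
+    for v in value.lower().translate(table).split():
+        if v in terms:
+            return True
+    return False
+```
+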
+- `get` simply gets the file that we stored with `scrape.py`:
+
+```python
+def get(dbpediaid):
+ outfile = os.path.join(DATADIR, quote_plus(dbpediaid) + '.json')
+ if not os.path.isfile(outfile):
+ return None
+ with open(outfile) as f:
+ return json.load(f)
+```
+
+### `check.py`
+
+- We keep a dictionary of scores per field and simply accumulate the measure
+  defined above:
+
+```python
+if __name__ == '__main__':
+ scores = dict()
+ for line in fileinput.input():
+ query, dbpediaid, relevance, field, nvalues, nmatches = line.split('\t')
+ if field not in scores:
+ scores[field] = 0
+ scores[field] += float(relevance) * log(1 + int(nmatches)/int(nvalues))
+```
+
+- Then we print all scores:
+
+```python
+ for field, score in scores.items():
+ print('{}\t{}'.format(field, score))
+```
+
+### Usage
+
+All this allows for a fairly simple workflow:
+
+```bash
+mkdir data
+./scrape.py < qrels-v2.txt
+./run.py < bm25.run > fields.txt
+./check.py < fields.txt | sort -k2 -n > scores.txt
+```
+
+This assumes that you have the following files from Nordlys:
+
+- [`qrels-v2.txt`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/qrels-v2.txt) (human relevance judgements; also serves as the entity list)
+- [`bm25.run`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/runs/bm25.run) (BM25 run with relevance scores)
+- [`queries_stopped.json`](https://github.com/iai-group/nordlys/blob/master/data/dbpedia-entity-v2/queries_stopped.json) (query terms)
+
+The system is agnostic with regard to the ranking function (BM25 or any other
+method).
+
+## Intermediate Results
+
+These are the thirty most important fields as found by our measure when
+applied to the BM25 relevance scores:
+
+| Field | Score | Used by Nordlys |
+|------------------------------|----------:|:---------------:|
+| `<dbp:imageFlag>` | 2205.50 | ![][n] |
+| `<dbp:office>` | 2246.90 | ![][n] |
+| `<dbp:pushpinMapCaption>` | 2357.07 | ![][n] |
+| `<dbp:description>` | 2357.35 | ![][n] |
+| `<dbp:placeOfBirth>` | 2384.14 | ![][n] |
+| `<dbp:fastTime>` | 2440.73 | ![][n] |
+| `<dbp:imageMap>` | 2485.96 | ![][n] |
+| `<dbp:writer>` | 2689.86 | ![][n] |
+| `<dbp:alt>` | 2691.94 | ![][n] |
+| `<foaf:givenName>` | 2694.41 | ![][y] |
+| `<dbp:poleTime>` | 2698.75 | ![][n] |
+| `<dbp:country>` | 2836.44 | ![][n] |
+| `<dbp:type>` | 3248.58 | ![][n] |
+| `<dbo:office>` | 3425.58 | ![][n] |
+| `<dbp:location>` | 3430.20 | ![][n] |
+| `<dbp:officialName>` | 4316.34 | ![][y] |
+| `<dbp:quote>` | 4470.38 | ![][n] |
+| `<dbp:imageCaption>` | 4480.06 | ![][n] |
+| `<dbp:producer>` | 4704.52 | ![][n] |
+| `<dbp:mapCaption>` | 8040.36 | ![][n] |
+| `<dbp:title>` | 10999.72 | ![][n] |
+| `<dbp:shortDescription>` | 22065.46 | ![][n] |
+| `<dc:description>` | 23442.34 | ![][n] |
+| `<dbp:caption>` | 24697.75 | ![][n] |
+| `<dbp:name>` | 25500.42 | ![][y] |
+| `<foaf:name>` | 32860.37 | ![][y] |
+| `<dbo:wikiPageWikiLinkText>` | 86218.71  | ![][y]          |
+| `<rdfs:label>` | 105358.89 | ![][y] |
+| `<rdfs:comment>` | 514446.08 | ![][n] |
+| `<dbo:abstract>` | 581355.57 | ![][n] |
+
+We see that many of the relevant fields are actually [not used by
+Nordlys](https://iai-group.github.io/DBpedia-Entity/index_details.html).
+However, this is not yet an indication that these fields should be added to the
+index. After all, adding more fields means more computation time to build the
+index and to retrieve search results.
+
+In fact, we expect that many of the fields not used actually display
+similarities with fields that *are* indexed. For example, the `<dbo:abstract>`
+field will often match simply because the title is repeated in the abstract.
+
+We can perform the same analysis on the human assessments. This gives a rather
+different list of fields:
+
+| Field | Score | Rank for BM25 | Used by Nordlys |
+|-------------------------------|---------:|--------------:|:---------------:|
+| `<dbp:pushpinMapCaption>` | 133.77 | 28 | ![][n] |
+| `<dbp:foundation>` | 136.32 | 266 | ![][n] |
+| `<dbp:imageCaption>` | 139.85 | 13 | ![][n] |
+| `<dbp:bridgeName>` | 164.91 | 49 | ![][n] |
+| `<dbp:imageFlag>` | 166.35 | 30 | ![][n] |
+| `<dbp:mapCaption>` | 170.93 | 11 | ![][n] |
+| `<dbo:foundingYear>` | 173.92 | 299 | ![][n] |
+| `<dbp:producer>` | 186.37 | 12 | ![][n] |
+| `<dbp:ground>` | 297.25 | 802 | ![][n] |
+| `<dbp:title>` | 328.93 | 10 | ![][n] |
+| `<dc:description>` | 332.05 | 8 | ![][n] |
+| `<dbp:shortDescription>` | 334.79 | 9 | ![][n] |
+| `<dbp:caption>` | 648.73 | 7 | ![][n] |
+| `<foaf:givenName>` | 1436.74 | 21 | ![][y] |
+| `<dbp:name>` | 1961.98 | 6 | ![][y] |
+| `<foaf:name>` | 2086.67 | 5 | ![][y] |
+| `<dbo:wikiPageWikiLinkText>` | 2897.51 | 4 | ![][y] |
+| `<rdfs:label>` | 3483.06 | 3 | ![][y] |
+| `<rdfs:comment>` | 12323.46 | 2 | ![][n] |
+| `<dbo:abstract>` | 13002.74 | 1 | ![][n] |
+
+Based on this, one may want to try adding fields like `<dbp:caption>` to the
+index.
+
+Conversely, this information can also be used to improve the relevance measure.
+Apparently, `<dbp:ground>`, `<dbo:foundingYear>` and `<dbp:foundation>` are
+quite relevant according to human assessors, but not at all according to BM25.
+
+[y]: http://i.stack.imgur.com/iro5J.png
+[n]: http://i.stack.imgur.com/asAya.png
diff --git a/Plan.md b/Plan.md
new file mode 100644
index 0000000..26f5b0d
--- /dev/null
+++ b/Plan.md
@@ -0,0 +1,60 @@
+# Plan
+
+## The Idea
+The DBpedia-Entity repository has base rankings for a selection of retrieval algorithms on multiple sets of queries.
+These base rankings [were obtained](https://iai-group.github.io/DBpedia-Entity/index_details.html) by running the ranking algorithms on the dataset,
+ where the dataset was reduced to contain only a subset of all possible fields.
+Also, some fields had a special function:
+
+> | Field | Description | Predicates | Notes |
+> | --- | --- | --- | --- |
+> | Names | Names of the entity | `<foaf:name>`, `<dbp:name>`, `<foaf:givenName>`, `<foaf:surname>`, `<dbp:officialName>`, `<dbp:fullname>`, `<dbp:nativeName>`, `<dbp:birthName>`, `<dbo:birthName>`, `<dbp:nickname>`, `<dbp:showName>`, `<dbp:shipName>`, `<dbp:clubname>`, `<dbp:unitName>`, `<dbp:otherName>`, `<dbo:formerName>`, `<dbp:birthname>`, `<dbp:alternativeNames>`, `<dbp:otherNames>`, `<dbp:names>`, `<rdfs:label>` | |
+> | Categories | Entity types | `<dcterms:subject>` | |
+> | Similar entity names | Entity name variants | `!<dbo:wikiPageRedirects>`, `!<dbo:wikiPageDisambiguates>`, `<dbo:wikiPageWikiLinkText>` | `!` denotes reverse direction (i.e. `<o, p, s>`) |
+> | Attributes | Literal attributes of entity | All `<s, p, o>`, where *"o"* is a literal and *"p"* is not in *Names*, *Categories*, *Similar entity names*, and blacklist predicates. For each `<s, p, o>` triple, if `p matches <dbp:.*>` both *p* and *o* are stored (i.e. *"p o"* is indexed). | |
+> | Related entity names | URI relations of entity| Similar to *Attributes* field, but *"o"* should be a URI. | |
+
+> ### Index B
+> - Anchor texts (i.e. contents of `<dbo:wikiPageWikiLinkText>` predicate) are added to both "similar entity names" and "attributes" fields.
+> - Entity URIs are resolved differently for the "related entity names" field. Names for related entities are extracted in the same way as it is done for "names" field (see predicates for "names" in the above table), but only one arbitrary name is used for each related entity.
+> - Category URIs are resolved using `category_labels_en.ttl` file
+> - Predicate URIs are resolved using `infobox_property_definitions_en.ttl` file. If a name for a predicate is not defined, a predicate is omitted.
+
+However, not all information in the remaining fields is used to base the ranking on;
+ some fields are simply ignored.
+Which fields are ignored can be found in the [Nordlys repository](https://github.com/iai-group/nordlys/blob/master/data/config/index_dbpedia_2015_10.config.json),
+ under the `blacklist` key.
+We could not find how the Nordlys group decided on putting these fields in the blacklist &mdash;
+ it might be that this is just based on educated, but subjective, guesses.
+
+It is important to base this blacklist on actual observations,
+ because doing so may improve the results of the retrieval function.
+Hence, we want to find a better, objectively produced, reproducible blacklist.
+
+### Our Approach
+We consider a vector space where every possible search field represents a binary parameter.
+A vector has `1` for the parameter iff it is included in the search (excluded from the blacklist).
+We will then run a hill-climbing algorithm through this higher-dimensional vector space
+ in order to find a vector (an index setting) for which the ranking results are best.
+
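+A minimal sketch of this search is given below. The `evaluate` function,
+which would rebuild the index for a given field selection and return the mean
+NDCG over the benchmark queries, is hypothetical:
+
+```python
+import random
+
+def hill_climb(fields, evaluate, iterations=100):
+    # start from the all-ones vector: every field included in the index
+    current = {f: True for f in fields}
+    best = evaluate(current)
+    for _ in range(iterations):
+        candidate = dict(current)
+        flipped = random.choice(fields)  # flip one random dimension
+        candidate[flipped] = not candidate[flipped]
+        score = evaluate(candidate)
+        if score > best:  # keep the neighbour only if it improves the NDCG
+            current, best = candidate, score
+    return current, best
+```
+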
+We measure the quality of the ranking using Normalized Discounted Cumulative Gain (NDCG).
+This is the same method Nordlys [used](http://nordlys.readthedocs.io/en/latest/er.html?highlight=NDCG#benchmark-results) for benchmarking,
+ which allows us to verify our first results.
+
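+For reference, NDCG can be computed as follows (a minimal sketch; the graded
+relevance judgements come from the benchmark's qrels):
+
+```python
+from math import log2
+
+def dcg(gains):
+    # discounted cumulative gain of a ranked list of relevance grades
+    return sum(g / log2(i + 2) for i, g in enumerate(gains))
+
+def ndcg(ranked_gains, k=100):
+    # normalise by the ideal DCG: the same grades sorted in decreasing order
+    ideal = dcg(sorted(ranked_gains, reverse=True)[:k])
+    return dcg(ranked_gains[:k]) / ideal if ideal > 0 else 0.0
+```
+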
+We will use only one ranking function to start with (the fastest, or the one we can get working most easily),
+ but we might extend this to more ranking functions.
+At first sight that does not seem particularly interesting;
+ it would be 'more of the same'.
+
+## Nordlys
+Nordlys is a toolkit for entity-oriented and semantic search. It currently supports four entity-oriented tasks that could be useful for our project:
+- `Entity cataloging`
+- `Entity retrieval`: returns a ranked list of entities in response to a query
+- `Entity linking in queries`: identifies entities in a query and links them to the corresponding entry in the knowledge base
+- `Target type identification`: detects the target types (or categories) of a query
+
+The Nordlys toolkit was used to create the results described above; as such, it provides us with the means to reproduce them.
+In addition, Nordlys provides a Python interface that we can use to implement the hill-climbing algorithm.
+
+The data behind these results is also bundled with the Nordlys Python package, and has already been indexed.
+This allows us to use the Python package without having to convert or index the data ourselves.
diff --git a/README.md b/README.md
index f089b13..db3dd6a 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,4 @@
-# IR-2017-4
-Practical Assignment Repository Team 4
+# Practical Assignment Repository Team 4
+1. [The Plan](Plan.md)
+2. [The Implementation](Implementation.md)
+3. [Evaluation](Evaluation.md)
diff --git a/check.py b/check.py
new file mode 100755
index 0000000..44a7f70
--- /dev/null
+++ b/check.py
@@ -0,0 +1,15 @@
+#!/usr/bin/env python3
+
+import fileinput
+from math import log
+
+if __name__ == '__main__':
+    # Accumulate a score per field over all (query, entity) pairs produced
+    # by run.py: relevance * log(1 + fraction of field values matching).
+    scores = dict()
+    for line in fileinput.input():
+        query, dbpediaid, relevance, field, nvalues, nmatches = line.split('\t')
+        if field not in scores:
+            scores[field] = 0
+        scores[field] += float(relevance) * log(1 + int(nmatches)/int(nvalues))
+
+    # Print the aggregated score for every field (tab-separated).
+    for field, score in scores.items():
+        print('{}\t{}'.format(field, score))
diff --git a/df.py b/df.py
new file mode 100755
index 0000000..0137cd8
--- /dev/null
+++ b/df.py
@@ -0,0 +1,26 @@
+#!/usr/bin/env python3
+
+import os
+import json
+from collections import Counter
+
+if __name__ == '__main__':
+    # Load all query terms into one set.
+    with open('queries_stopped.json') as f:
+        queries = json.load(f)
+    terms = set(t for q in queries.values() for t in q.split())
+
+    # Per field, collect the query terms occurring in its values.
+    store = dict()
+
+    for filename in os.listdir('information-retrieval-data/'):
+        with open('information-retrieval-data/' + filename) as f:
+            entity = json.load(f)
+            for field, values in entity.items():
+                if field not in store:
+                    store[field] = []
+                store[field] += [v.lower() for value in values for v in value.split() if v in terms]
+
+    # Print, per field and per term, how often the term occurs.
+    for field in store:
+        cnt = Counter(store[field])
+        for term, count in cnt.items():
+            print('{}\t{}\t{}'.format(field, term, count))
diff --git a/run.py b/run.py
new file mode 100755
index 0000000..1551717
--- /dev/null
+++ b/run.py
@@ -0,0 +1,52 @@
+#!/usr/bin/env python3
+
+import fileinput
+import json
+import os
+from urllib.parse import quote_plus
+
+DATADIR = '/home/camil/temp/information-retrieval-data'
+ERRORFILE = 'errors'
+
+def get(dbpediaid):
+    # Load the entity JSON previously downloaded by scrape.py, if present.
+ outfile = os.path.join(DATADIR, quote_plus(dbpediaid) + '.json')
+ if not os.path.isfile(outfile):
+ return None
+ with open(outfile) as f:
+ return json.load(f)
+
+def match(value, terms):
+    # True iff any whitespace-separated token of `value` is a query term.
+ for v in value.split():
+ if v in terms:
+ return True
+ return False
+
+def run(queries, line):
+    # Parse one line of a run (or qrels) file and print, per field, how many
+    # of the field's values match at least one query term.
+ try:
+ query, _, dbpediaid, _, relevance, method = line.split('\t')
+ except ValueError: # For qrels.txt
+ query, _, dbpediaid, relevance = line.split('\t')
+ terms = queries[query].split()
+ try:
+ result = get(dbpediaid)
+ if result is None:
+ return
+ for field, values in result.items():
+ matches = 0
+ for value in values:
+ if match(value, terms):
+ matches += 1
+ print('{}\t{}\t{}\t{}\t{}\t{}'.format(
+ query, dbpediaid, float(relevance), field, len(values), matches))
+ except Exception as e:
+ print(dbpediaid)
+ print(e)
+ with open(ERRORFILE, 'a') as f:
+            f.write(dbpediaid + '\t' + str(e) + '\n')
+
+if __name__ == '__main__':
+ with open('queries_stopped.json') as f:
+ queries = json.load(f)
+
+ for line in fileinput.input():
+ run(queries, line)
diff --git a/scrape.py b/scrape.py
new file mode 100755
index 0000000..067a16c
--- /dev/null
+++ b/scrape.py
@@ -0,0 +1,33 @@
+#!/usr/bin/env python3
+
+import fileinput
+import os
+from urllib.parse import quote_plus
+from urllib.request import urlopen, Request
+
+OUTDIR = 'data'
+ERRORFILE = 'errors'
+
+def get(dbpediaid):
+    # Download one entity from the Nordlys API and cache it on disk as JSON.
+ outfile = os.path.join(OUTDIR, quote_plus(dbpediaid) + '.json')
+ if os.path.isfile(outfile):
+ return
+    # percent-encode the ID, but keep slashes intact in the URL path
+    url = 'http://api.nordlys.cc/ec/lookup_id/{}'.format(
+        quote_plus(dbpediaid.replace('/', '___SLASH')).replace('___SLASH', '/'))
+ print(url)
+ result = urlopen(Request(url,
+ headers={'User-Agent': 'Radboud University'})).read()
+ with open(outfile, 'w') as f:
+ f.write(result.decode(encoding='UTF-8'))
+
+def scrape(line):
+    # Each input line comes from a qrels file; only the DBpedia ID is used.
+ index, query, dbpediaid, relevance = line.split('\t')
+ try:
+ get(dbpediaid)
+ except Exception as e:
+ with open(ERRORFILE, 'a') as f:
+            f.write(dbpediaid + '\t' + str(e) + '\n')
+
+if __name__ == '__main__':
+ for line in fileinput.input():
+ scrape(line)