author Erin van der Veen 2017-12-15 12:42:51 +0100
committer Erin van der Veen 2017-12-15 12:42:51 +0100
commit e555d48548d294c80b5ac7502cec07d3d469d04b
tree b0c579f196ac513f6ce1c9cc90ed0d95990dd15e
parent Calculate term frequency
parent Add analysis on qrels
Merge branch 'implementation' of github.com:rubigdata/IR-2017-4 into implementation
-rw-r--r-- Implementation.md 150
-rw-r--r-- Install.md 45
-rw-r--r-- README.md 1
-rwxr-xr-x run.py 9
4 files changed, 130 insertions, 75 deletions
diff --git a/Implementation.md b/Implementation.md
index b60b8b9..1a81e6f 100644
--- a/Implementation.md
+++ b/Implementation.md
@@ -1,30 +1,47 @@
# Implementation
## Feasibility
-The Plan mentions the following:
-> We consider a vector space where every possible search field represents a binary parameter.
-> A vector has `1` for the parameter if and only if it is included in the search (excluded from the blacklist).
-> We will then run a hill-climbing algorithm through this higher-dimensional vector space
-> in order to find a vector (an index setting) for which the ranking results are best.
-
-Soon after we began trying to implement this feature using a locally run version of nordlys, we encountered some issues.
-The most notable being that our machines were unable to index the full DB-Pedia set in a reasonable amount of time, using a reasonable amount of resources.
-When we encountered this issue, we decided that the best options was using a subset of the DB-Pedia dataset.
-
-The subset that we settled on is the subset that has relevance scores assigned to them for any query.
-We then only considered the result of a given query in our assessment.
-
-The above has the added benefit that the relevance (both the human assessment and the score) are precomputed.
-This meant that simply parsing the files that are provided by nordlys is enough to implement any kind of field selected assessment.
-
-Unfortunately, it turned out that hill-climbing was also out of the scope of the assignment.
-Having only 2 programmers, both of whom have not a lot of experience in implementing such algorithms, made the task slightly to much work.
-Instead, we decided to take a different approach and statically analyse the importance of all fields.
-The meansure that we use take the form of:
-
-![Field Relevance Measure](http://mathurl.com/yc2ptq63.png "Field Relevance Measure")
-
-Where `relevance` is the bm25 relevance that is stored by nordlys, `D` is the set of documents, `Q` the set of queries, `tf` the function that counts the amount of times any of the query terms was found in that field and `|f|` the size of the field.
+Our [Plan](Plan.md) mentions the following:
+
+> We consider a vector space where every possible search field represents a
+> binary parameter. A vector has `1` for the parameter if and only if it is
+> included in the search (excluded from the blacklist). We will then run a
+> hill-climbing algorithm through this higher-dimensional vector space in order
+> to find a vector (an index setting) for which the ranking results are best.
+
+Soon after we began trying to implement this feature using a locally run
+version of Nordlys, we encountered some issues, the most notable being that
+our machines were unable to index the full DBPedia set in a reasonable amount
+of time, using a reasonable amount of resources. When we encountered this
+issue, we decided that the best option was to use a subset of the DBPedia
+dataset.
+
+The subset that we settled on is the subset that has relevance scores assigned
+to them for any query. We then only consider the result of a given query in our
+assessment.
+
+The above has the additional benefit that the relevance judgements (both the
+human assessment and the score) need not be computed. This meant that simply
+parsing the files that are provided by Nordlys is enough to implement any kind
+of field selected assessment.
+
+Unfortunately, it turned out that we also did not have resources to implement a
+hill-climbing algorithm. Having only 2 programmers made the task slightly too
+much work. Instead, we decided to take a different approach and statically
+analyse the importance of all fields. The measure that we use takes the form
+of:
+
+![Field Relevance Measure](http://mathurl.com/yc2ptq63.png "Field Relevance
+Measure")
+
+Where *relevance* is the BM25 relevance that is stored by Nordlys, *D* is the
+set of documents, *Q* the set of queries, *tf* the function that counts the
+amount of times any of the query terms was found in that field and |*f*| the
+size of the field.
+
+The formula assumes that relevance is more or less linear. The logarithm is
+used because more occurrences of the same term are not as important as the
+first occurrence.
## Code
@@ -43,7 +60,7 @@ We will now discuss the implementation of each of these files.
### `scrape.py`
- In this file we read lines from `stdin`. These lines are supposed to come
- from a BM25 run. That way, we only download DBPedia entities that we
+ from a BM25 run. That way, we only download DBPedia entities that we
actually need.
```python
@@ -187,4 +204,85 @@ This assumes that you have the following files from Nordlys:
The system is agnostic with regards to the ranking function (BM25 or another
method).
-## Intermediate Result
+## Intermediate Results
+These are the thirty most important fields as found by our measure when used on
+the BM25 relevance scores:
+
+| Field | Score | Used by Nordlys |
+|------------------------------|----------:|:---------------:|
+| `<dbp:imageFlag>` | 2205.50 | ![][n] |
+| `<dbp:office>` | 2246.90 | ![][n] |
+| `<dbp:pushpinMapCaption>` | 2357.07 | ![][n] |
+| `<dbp:description>` | 2357.35 | ![][n] |
+| `<dbp:placeOfBirth>` | 2384.14 | ![][n] |
+| `<dbp:fastTime>` | 2440.73 | ![][n] |
+| `<dbp:imageMap>` | 2485.96 | ![][n] |
+| `<dbp:writer>` | 2689.86 | ![][n] |
+| `<dbp:alt>` | 2691.94 | ![][n] |
+| `<foaf:givenName>` | 2694.41 | ![][y] |
+| `<dbp:poleTime>` | 2698.75 | ![][n] |
+| `<dbp:country>` | 2836.44 | ![][n] |
+| `<dbp:type>` | 3248.58 | ![][n] |
+| `<dbo:office>` | 3425.58 | ![][n] |
+| `<dbp:location>` | 3430.20 | ![][n] |
+| `<dbp:officialName>` | 4316.34 | ![][y] |
+| `<dbp:quote>` | 4470.38 | ![][n] |
+| `<dbp:imageCaption>` | 4480.06 | ![][n] |
+| `<dbp:producer>` | 4704.52 | ![][n] |
+| `<dbp:mapCaption>` | 8040.36 | ![][n] |
+| `<dbp:title>` | 10999.72 | ![][n] |
+| `<dbp:shortDescription>` | 22065.46 | ![][n] |
+| `<dc:description>` | 23442.34 | ![][n] |
+| `<dbp:caption>` | 24697.75 | ![][n] |
+| `<dbp:name>` | 25500.42 | ![][y] |
+| `<foaf:name>` | 32860.37 | ![][y] |
+| `<dbo:wikiPageWikiLinkText>` | 86218.71 | ![][y] |
+| `<rdfs:label>` | 105358.89 | ![][y] |
+| `<rdfs:comment>` | 514446.08 | ![][n] |
+| `<dbo:abstract>` | 581355.57 | ![][n] |
+
+We see that many of the relevant fields are actually [not used by
+Nordlys](https://iai-group.github.io/DBpedia-Entity/index_details.html).
+However, this is not yet an indication that these fields should be added to the
+index. After all, adding more fields means more computation time to build the
+index and to retrieve search results.
+
+In fact, we expect that many of the fields not used actually display
+similarities with fields that *are* indexed. For example, the `<dbo:abstract>`
+field will probably match because the title is repeated in the abstract.
+
+We can perform the same analysis on the human assessments. This gives a rather
+different list of fields:
+
+| Field | Score | Rank for BM25 | Used by Nordlys |
+|-------------------------------|---------:|--------------:|:---------------:|
+| `<dbp:pushpinMapCaption>` | 133.77 | 28 | ![][n] |
+| `<dbp:foundation>` | 136.32 | 266 | ![][n] |
+| `<dbp:imageCaption>` | 139.85 | 13 | ![][n] |
+| `<dbp:bridgeName>` | 164.91 | 49 | ![][n] |
+| `<dbp:imageFlag>` | 166.35 | 30 | ![][n] |
+| `<dbp:mapCaption>` | 170.93 | 11 | ![][n] |
+| `<dbo:foundingYear>` | 173.92 | 299 | ![][n] |
+| `<dbp:producer>` | 186.37 | 12 | ![][n] |
+| `<dbp:ground>` | 297.25 | 802 | ![][n] |
+| `<dbp:title>` | 328.93 | 10 | ![][n] |
+| `<dc:description>` | 332.05 | 8 | ![][n] |
+| `<dbp:shortDescription>` | 334.79 | 9 | ![][n] |
+| `<dbp:caption>` | 648.73 | 7 | ![][n] |
+| `<foaf:givenName>` | 1436.74 | 21 | ![][y] |
+| `<dbp:name>` | 1961.98 | 6 | ![][y] |
+| `<foaf:name>` | 2086.67 | 5 | ![][y] |
+| `<dbo:wikiPageWikiLinkText>` | 2897.51 | 4 | ![][y] |
+| `<rdfs:label>` | 3483.06 | 3 | ![][y] |
+| `<rdfs:comment>` | 12323.46 | 2 | ![][n] |
+| `<dbo:abstract>` | 13002.74 | 1 | ![][n] |
+
+Based on this, one may want to try adding fields like `<dbp:caption>` to the
+index.
+
+Conversely, this information can also be used to improve the relevance measure.
+Apparently, `<dbp:ground>`, `<dbo:foundingYear>` and `<dbp:foundation>` are
+quite relevant according to human assessors, but not at all according to BM25.
+
+[y]: http://i.stack.imgur.com/iro5J.png
+[n]: http://i.stack.imgur.com/asAya.png
diff --git a/Install.md b/Install.md
deleted file mode 100644
index 212b0cc..0000000
--- a/Install.md
+++ /dev/null
@@ -1,45 +0,0 @@
-# Installation instructions
-
-We use software by the [Nordlys][] group.
-The installation instructions are [here](http://nordlys.readthedocs.io/en/latest/installation.html).
-
-For now, we don't work on the full dataset so we use step 3.1B, **not** 3.1A.
-
-Concretely, the following needs to be done:
-
-```bash
-# Requirements
-apt install mongodb
-service mongodb start
-
-# Nordlys repo
-git clone https://github.com/iai-group/nordlys.git
-cd nordlys
-
-# Install python code
-pip install -r requirements.txt
-
-# Load data
-./scripts/load_mongo_dumps.sh
-```
-
-If you get encoding errors, you may need to change `python` to `python3` in
-the scripts in `./scripts`.
-
-Now we need elasticsearch and build the index:
-
-```bash
-wget -qO- https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.3.4/elasticsearch-2.3.4.tar.gz | tar xzv
-cd elasticsearch*/bin
-./elasticsearch
-```
-
-And in an other shell:
-
-```bash
-./scripts/build_dbpedia_index.sh
-```
-
-However, that currently only adds 97 documents which does not seem reasonable.
-
-[Nordlys]: http://nordlys.cc/
diff --git a/README.md b/README.md
index 5350acc..d080980 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,3 @@
# Practical Assignment Repository Team 4
-0. [Installation](Install.md)
1. [The Plan](Plan.md)
2. [The Implementation](Implementation.md)
diff --git a/run.py b/run.py
index 7b42ea8..1551717 100755
--- a/run.py
+++ b/run.py
@@ -22,7 +22,10 @@ def match(value, terms):
     return False

 def run(queries, line):
-    query, _, dbpediaid, _, relevance, method = line.split('\t')
+    try:
+        query, _, dbpediaid, _, relevance, method = line.split('\t')
+    except ValueError: # For qrels.txt
+        query, _, dbpediaid, relevance = line.split('\t')
     terms = queries[query].split()
     try:
         result = get(dbpediaid)
@@ -33,8 +36,8 @@ def run(queries, line):
             for value in values:
                 if match(value, terms):
                     matches += 1
-            print('{}\t{}\t{}\t{}\t{}\t{}\n'.format(
-                query, dbpediaid, relevance, field, len(values), matches))
+            print('{}\t{}\t{}\t{}\t{}\t{}'.format(
+                query, dbpediaid, float(relevance), field, len(values), matches))
     except Exception as e:
         print(dbpediaid)
         print(e)
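
Note on the output format: after this change, `run.py` prints one tab-separated row per field with the columns query, DBPedia id, relevance, field, number of values and matches. That is the input the field relevance measure in `Implementation.md` needs. The formula itself is only linked as an image, so the sketch below is a guess at its shape, not the authors' implementation. It assumes that each row contributes `relevance * log(1 + tf) / |f|` to its field, with `tf` taken from the matches column and `|f|` from the number of values. The function name `field_scores` is hypothetical.

```python
import math
from collections import defaultdict


def field_scores(lines):
    """Aggregate run.py output rows into one score per field.

    Each row is expected to hold six tab-separated columns:
    query, dbpediaid, relevance, field, number of values, matches.
    """
    scores = defaultdict(float)
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) != 6:
            # Not a data row, e.g. the id or error message printed
            # by the except branch in run.py.
            continue
        _, _, relevance, field, size, matches = parts
        size, matches = int(size), int(matches)
        if size == 0:
            continue  # an empty field cannot contribute
        # Assumed form of the measure: relevance * log(1 + tf) / |f|.
        scores[field] += float(relevance) * math.log(1 + matches) / size
    return dict(scores)
```

Because the rows for a BM25 run and for `qrels.txt` share the same six columns, the same aggregation could be applied to both, which would yield the two kinds of ranking compared in the Intermediate Results tables.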