From 9afaec05649f76afbd5f06c7b29ab0ec33b6e86e Mon Sep 17 00:00:00 2001
From: Camil Staps
Date: Fri, 15 Dec 2017 11:15:52 +0100
Subject: Remove Install.md

---
 Install.md | 45 ---------------------------------------------
 README.md  |  1 -
 2 files changed, 46 deletions(-)
 delete mode 100644 Install.md

diff --git a/Install.md b/Install.md
deleted file mode 100644
index 212b0cc..0000000
--- a/Install.md
+++ /dev/null
@@ -1,45 +0,0 @@
-# Installation instructions
-
-We use software by the [Nordlys][] group.
-The installation instructions are [here](http://nordlys.readthedocs.io/en/latest/installation.html).
-
-For now, we don't work on the full dataset so we use step 3.1B, **not** 3.1A.
-
-Concretely, the following needs to be done:
-
-```bash
-# Requirements
-apt install mongodb
-service mongodb start
-
-# Nordlys repo
-git clone https://github.com/iai-group/nordlys.git
-cd nordlys
-
-# Install python code
-pip install -r requirements.txt
-
-# Load data
-./scripts/load_mongo_dumps.sh
-```
-
-If you get encoding errors, you may need to change `python` to `python3` in
-the scripts in `./scripts`.
-
-Now we need elasticsearch and build the index:
-
-```bash
-wget -qO- https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.3.4/elasticsearch-2.3.4.tar.gz | tar xzv
-cd elasticsearch*/bin
-./elasticsearch
-```
-
-And in an other shell:
-
-```bash
-./scripts/build_dbpedia_index.sh
-```
-
-However, that currently only adds 97 documents which does not seem reasonable.
-
-[Nordlys]: http://nordlys.cc/
diff --git a/README.md b/README.md
index 5350acc..d080980 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,3 @@
 # Practical Assignment Repository Team 4
-0. [Installation](Install.md)
 1. [The Plan](Plan.md)
 2.
[The Implementation](Implementation.md)
-- 
cgit v1.2.3


From 2f8ac2a790326c1092e40ee895829c8ee3a86da2 Mon Sep 17 00:00:00 2001
From: Camil Staps
Date: Fri, 15 Dec 2017 11:52:00 +0100
Subject: Add results

---
 Implementation.md | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/Implementation.md b/Implementation.md
index b60b8b9..ae527d7 100644
--- a/Implementation.md
+++ b/Implementation.md
@@ -187,4 +187,51 @@ This assumes that you have the following files from Nordlys:
 
 The system is agnostic with regards to the ranking function (BM25 or another
 method).
-## Intermediate Result
+## Intermediate Results
+These are the thirty most important fields as found by our measure:
+
+| Field | Score | Used by Nordlys |
+|------------------------------|----------:|:---------------:|
+| `` | 2205.50 | ![][n] |
+| `` | 2246.90 | ![][n] |
+| `` | 2357.07 | ![][n] |
+| `` | 2357.35 | ![][n] |
+| `` | 2384.14 | ![][n] |
+| `` | 2440.73 | ![][n] |
+| `` | 2485.96 | ![][n] |
+| `` | 2689.86 | ![][n] |
+| `` | 2691.94 | ![][n] |
+| `` | 2694.41 | ![][y] |
+| `` | 2698.75 | ![][n] |
+| `` | 2836.44 | ![][n] |
+| `` | 3248.58 | ![][n] |
+| `` | 3425.58 | ![][n] |
+| `` | 3430.20 | ![][n] |
+| `` | 4316.34 | ![][y] |
+| `` | 4470.38 | ![][n] |
+| `` | 4480.06 | ![][n] |
+| `` | 4704.52 | ![][n] |
+| `` | 8040.36 | ![][n] |
+| `` | 10999.72 | ![][n] |
+| `` | 22065.46 | ![][n] |
+| `` | 23442.34 | ![][n] |
+| `` | 24697.75 | ![][n] |
+| `` | 25500.42 | ![][y] |
+| `` | 32860.37 | ![][y] |
+| `` | 86218.71 | ![][y] |
+| `` | 105358.89 | ![][y] |
+| `` | 514446.08 | ![][n] |
+| `` | 581355.57 | ![][n] |
+
+We see that many of the relevant fields are actually [not used by
+Nordlys](https://iai-group.github.io/DBpedia-Entity/index_details.html).
+However, this is not yet an indication that these fields should be added to the
+index.
After all, adding more fields means more computation time to build the
+index and to retrieve search results.
+
+In fact, we expect that many of the fields not used actually display
+similarities with fields that *are* indexed. For example, the ``
+field will probably match because the title is repeated in the abstract.
+
+[y]: http://i.stack.imgur.com/iro5J.png
+[n]: http://i.stack.imgur.com/asAya.png
-- 
cgit v1.2.3


From 4e035c5dc4c3fc27a114ab81817dc467a4fd2095 Mon Sep 17 00:00:00 2001
From: Camil Staps
Date: Fri, 15 Dec 2017 12:04:05 +0100
Subject: Copy editing

---
 Implementation.md | 63 +++++++++++++++++++++++++++++++++----------------------
 1 file changed, 38 insertions(+), 25 deletions(-)

diff --git a/Implementation.md b/Implementation.md
index ae527d7..e340ace 100644
--- a/Implementation.md
+++ b/Implementation.md
@@ -1,30 +1,43 @@
 # Implementation
 ## Feasibility
-The Plan mentions the following:
-> We consider a vector space where every possible search field represents a binary parameter.
-> A vector has `1` for the parameter if and only if it is included in the search (excluded from the blacklist).
-> We will then run a hill-climbing algorithm through this higher-dimensional vector space
-> in order to find a vector (an index setting) for which the ranking results are best.
-
-Soon after we began trying to implement this feature using a locally run version of nordlys, we encountered some issues.
-The most notable being that our machines were unable to index the full DB-Pedia set in a reasonable amount of time, using a reasonable amount of resources.
-When we encountered this issue, we decided that the best options was using a subset of the DB-Pedia dataset.
-
-The subset that we settled on is the subset that has relevance scores assigned to them for any query.
-We then only considered the result of a given query in our assessment.
-
-The above has the added benefit that the relevance (both the human assessment and the score) are precomputed.
-This meant that simply parsing the files that are provided by nordlys is enough to implement any kind of field selected assessment.
-
-Unfortunately, it turned out that hill-climbing was also out of the scope of the assignment.
-Having only 2 programmers, both of whom have not a lot of experience in implementing such algorithms, made the task slightly to much work.
-Instead, we decided to take a different approach and statically analyse the importance of all fields.
-The meansure that we use take the form of:
-
-![Field Relevance Measure](http://mathurl.com/yc2ptq63.png "Field Relevance Measure")
-
-Where `relevance` is the bm25 relevance that is stored by nordlys, `D` is the set of documents, `Q` the set of queries, `tf` the function that counts the amount of times any of the query terms was found in that field and `|f|` the size of the field.
+Our [Plan](Plan.md) mentions the following:
+
+> We consider a vector space where every possible search field represents a
+> binary parameter. A vector has `1` for the parameter if and only if it is
+> included in the search (excluded from the blacklist). We will then run a
+> hill-climbing algorithm through this higher-dimensional vector space in order
+> to find a vector (an index setting) for which the ranking results are best.
+
+Soon after we began trying to implement this feature using a locally run
+version of Nordlys, we encountered some issues, the most notable being that
+our machines were unable to index the full DBPedia set in a reasonable amount
+of time, using a reasonable amount of resources. When we encountered this
+issue, we decided that the best option was to use a subset of the DBPedia
+dataset.
+
+The subset that we settled on is the subset that has relevance scores assigned
+to them for any query. We then only consider the result of a given query in our
+assessment.
+
+The above has the additional benefit that the relevance judgements (both the
+human assessment and the score) need not be computed.
This meant that simply
+parsing the files that are provided by Nordlys is enough to implement any kind
+of field selected assessment.
+
+Unfortunately, it turned out that we also did not have resources to implement a
+hill-climbing algorithm. Having only 2 programmers made the task slightly too
+much work. Instead, we decided to take a different approach and statically
+analyse the importance of all fields. The measure that we use takes the form
+of:
+
+![Field Relevance Measure](http://mathurl.com/yc2ptq63.png "Field Relevance
+Measure")
+
+Where *relevance* is the BM25 relevance that is stored by Nordlys, *D* is the
+set of documents, *Q* the set of queries, *tf* the function that counts the
+amount of times any of the query terms was found in that field and |*f*| the
+size of the field.
 
 ## Code
 
 We use three Python programs that:
@@ -43,7 +56,7 @@ We will now discuss the implementation of each of these files.
 ### `scrape.py`
 
 - In this file we read lines from `stdin`. These lines are supposed to come
-  from a BM25 run.  That way, we only download DBPedia entities that we
+  from a BM25 run. That way, we only download DBPedia entities that we
   actually need.
 
   ```python
-- 
cgit v1.2.3


From 7c668cf2a00f770f54ac11597d9149fa42f502a5 Mon Sep 17 00:00:00 2001
From: Camil Staps
Date: Fri, 15 Dec 2017 12:38:43 +0100
Subject: Add analysis on qrels

---
 Implementation.md | 40 +++++++++++++++++++++++++++++++++++++++-
 run.py            |  9 ++++++---
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/Implementation.md b/Implementation.md
index e340ace..1a81e6f 100644
--- a/Implementation.md
+++ b/Implementation.md
@@ -39,6 +39,10 @@ set of documents, *Q* the set of queries, *tf* the function that counts the
 amount of times any of the query terms was found in that field and |*f*| the
 size of the field.
 
+The formula assumes that relevance is more or less linear. The logarithm is
+used because more occurrences of the same term are not as important as the
+first occurrence.
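The measure itself is only linked above as a rendered image, so the exact weighting is not recoverable here; as a sketch of how a score of this general shape (relevance-weighted, log-dampened match counts, normalised by field size) could be computed from the columns that `run.py` prints — the precise formula in the image may differ:

```python
from collections import defaultdict
from math import log

def field_scores(records):
    """Aggregate one importance score per field.

    `records` holds one tuple per (query, document, field) observation, in the
    column order that `run.py` prints: query id, DBpedia id, relevance, field
    name, number of values in the field, and number of query-term matches.
    """
    scores = defaultdict(float)
    for _query, _dbpediaid, relevance, field, size, matches in records:
        if size > 0:
            # Relevance-weighted, log-dampened match count, normalised by the
            # size of the field so that long fields are not favoured per se.
            scores[field] += relevance * log(1 + matches) / size
    return scores
```

Summing over all (query, document) pairs rewards fields that frequently contain query terms in highly relevant documents, while the normalisation penalises fields that match merely because they are large.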
+
 ## Code
 
 We use three Python programs that:
@@ -201,7 +205,8 @@ The system is agnostic with regards to the ranking function (BM25 or another
 method).
 
 ## Intermediate Results
-These are the thirty most important fields as found by our measure:
+These are the thirty most important fields as found by our measure when used on
+the BM25 relevance scores:
 
 | Field | Score | Used by Nordlys |
 |------------------------------|----------:|:---------------:|
@@ -246,5 +251,38 @@ In fact, we expect that many of the fields not used actually display
 similarities with fields that *are* indexed. For example, the ``
 field will probably match because the title is repeated in the abstract.
 
+We can perform the same analysis on the human assessments. This gives a rather
+different list of fields:
+
+| Field | Score | Rank for BM25 | Used by Nordlys |
+|-------------------------------|---------:|--------------:|:---------------:|
+| `` | 133.77 | 28 | ![][n] |
+| `` | 136.32 | 266 | ![][n] |
+| `` | 139.85 | 13 | ![][n] |
+| `` | 164.91 | 49 | ![][n] |
+| `` | 166.35 | 30 | ![][n] |
+| `` | 170.93 | 11 | ![][n] |
+| `` | 173.92 | 299 | ![][n] |
+| `` | 186.37 | 12 | ![][n] |
+| `` | 297.25 | 802 | ![][n] |
+| `` | 328.93 | 10 | ![][n] |
+| `` | 332.05 | 8 | ![][n] |
+| `` | 334.79 | 9 | ![][n] |
+| `` | 648.73 | 7 | ![][n] |
+| `` | 1436.74 | 21 | ![][y] |
+| `` | 1961.98 | 6 | ![][y] |
+| `` | 2086.67 | 5 | ![][y] |
+| `` | 2897.51 | 4 | ![][y] |
+| `` | 3483.06 | 3 | ![][y] |
+| `` | 12323.46 | 2 | ![][n] |
+| `` | 13002.74 | 1 | ![][n] |
+
+Based on this, one may want to try adding fields like `` to the
+index.
+
+Conversely, this information can also be used to improve the relevance measure.
+Apparently, ``, `` and `` are
+quite relevant according to human assessors, but not at all according to BM25.
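One way to make such disagreements concrete is to look for fields whose rank differs sharply between the human assessments and BM25, using the "Rank for BM25" column above. A sketch — the field names below are placeholders, since the real names appear in the tables:

```python
def rank_disagreements(qrels_rank, bm25_rank, threshold=100):
    """Fields ranked very differently by human assessors and by BM25.

    Both arguments map a field name to its rank (1 = most important). Returns
    the shared fields whose ranks differ by at least `threshold`, biggest
    disagreement first.
    """
    common = qrels_rank.keys() & bm25_rank.keys()
    diff = {f: abs(bm25_rank[f] - qrels_rank[f]) for f in common}
    return sorted((f for f in common if diff[f] >= threshold),
                  key=lambda f: -diff[f])

# Hypothetical field names with ranks taken from the tables above.
qrels = {'field_a': 19, 'field_b': 14, 'field_c': 12, 'name': 1}
bm25 = {'field_a': 266, 'field_b': 299, 'field_c': 802, 'name': 21}
print(rank_disagreements(qrels, bm25))  # → ['field_c', 'field_b', 'field_a']
```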
+
 [y]: http://i.stack.imgur.com/iro5J.png
 [n]: http://i.stack.imgur.com/asAya.png
diff --git a/run.py b/run.py
index 7b42ea8..1551717 100755
--- a/run.py
+++ b/run.py
@@ -22,7 +22,10 @@ def match(value, terms):
     return False
 
 def run(queries, line):
-    query, _, dbpediaid, _, relevance, method = line.split('\t')
+    try:
+        query, _, dbpediaid, _, relevance, method = line.split('\t')
+    except ValueError: # For qrels.txt
+        query, _, dbpediaid, relevance = line.split('\t')
    terms = queries[query].split()
     try:
         result = get(dbpediaid)
@@ -33,8 +36,8 @@ def run(queries, line):
         for value in values:
             if match(value, terms):
                 matches += 1
-        print('{}\t{}\t{}\t{}\t{}\t{}\n'.format(
-            query, dbpediaid, relevance, field, len(values), matches))
+        print('{}\t{}\t{}\t{}\t{}\t{}'.format(
+            query, dbpediaid, float(relevance), field, len(values), matches))
     except Exception as e:
         print(dbpediaid)
         print(e)
-- 
cgit v1.2.3
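The `run.py` change above lets the same code consume both run lines (six tab-separated columns) and qrels lines (four columns) by catching the unpacking error. The parsing can be exercised in isolation — the entity and query ids in the sample lines are illustrative:

```python
def parse_line(line):
    # Run files have six tab-separated columns (query, Q0, entity, rank,
    # score, method); qrels files have four (query, iteration, entity,
    # relevance), which makes the six-way unpack raise ValueError.
    try:
        query, _, dbpediaid, _, relevance, _method = line.rstrip('\n').split('\t')
    except ValueError:  # For qrels.txt
        query, _, dbpediaid, relevance = line.rstrip('\n').split('\t')
    return query, dbpediaid, float(relevance)

print(parse_line('INEX_LD-2009022\tQ0\t<dbpedia:Coffee>\t1\t11.98\tBM25'))
# → ('INEX_LD-2009022', '<dbpedia:Coffee>', 11.98)
print(parse_line('INEX_LD-2009022\t0\t<dbpedia:Coffee>\t1'))
# → ('INEX_LD-2009022', '<dbpedia:Coffee>', 1.0)
```

Converting `relevance` with `float` keeps the two sources comparable, since run files store real-valued scores while qrels store integer judgements.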