Assignment 5

author: Camil Staps 2015-12-12 16:47:44 +0000
committer: Camil Staps 2015-12-12 16:47:44 +0000
commit: c6f86bdb722aac53bb39b0d78d2b538b6f07a692 (patch)
tree: a13212987d15f369b0a448df87b49bcca4cc7f51 /Assignment 5/report
parent: Finish assignment 4 (diff)
1 files changed, 156 insertions, 0 deletions
diff --git a/Assignment 5/report/assignment5.tex b/Assignment 5/report/assignment5.tex
new file mode 100644
index 0000000..47ff27e
--- /dev/null
+++ b/Assignment 5/report/assignment5.tex
@@ -0,0 +1,156 @@
+\documentclass[10pt,a4paper]{article}
+
+\usepackage[margin=2cm]{geometry}
+\usepackage{graphicx}
+
+\let\assignment5
+
+\usepackage{enumitem}
+\setenumerate[1]{label=\assignment.\arabic*.}
+\setenumerate[2]{label=\arabic*.}
+\setenumerate[3]{label=\roman*.}
+
+\usepackage{fancyhdr}
+\renewcommand{\headrulewidth}{0pt}
+\renewcommand{\footrulewidth}{0pt}
+\fancyhead{}
+%\fancyfoot[C]{Copyright {\textcopyright} 2015 Camil Staps}
+\pagestyle{fancy}
+
+\usepackage{caption}
+\usepackage{subcaption}
+\usepackage[hidelinks]{hyperref}
+
+\usepackage{listings}
+\lstset{basicstyle=\small\ttfamily,columns=flexible,breaklines=true}
+
+\usepackage{nicefrac}
+
+\parindent0pt
+
+\title{Data Mining - assignment \assignment}
+\author{Camil Staps\\\small{s4498062}}
+
+\begin{document}
+
+\maketitle
+\thispagestyle{fancy}
+
+\begin{enumerate}
+    \item This is the output:
+
+        \begin{lstlisting}[gobble=12]
+            Mining for frequent itemsets by the Apriori algorithm
+            Mining for associations by the Apriori algorithm
+            Apriori analysis done, extracting results
+            
+            
+            RESULTS:
+            
+            Frequent itemsets:
+            Item: 6[Sup. 100]
+            Item: 2[Sup. 83.3]
+            Item: 2 6[Sup. 83.3]
+            Item: 7[Sup. 83.3]
+            Item: 7 6[Sup. 83.3]
+            
+            
+            Association rules:
+            Rule: 6 <- [Conf. 100,Sup. 100]
+            Rule: 6 <- 2[Conf. 100,Sup. 83.3]
+            Rule: 6 <- 7[Conf. 100,Sup. 83.3]
+        \end{lstlisting}
+
+        Rule 1 simply says that everyone studies physics. Either the subject is very popular, or the data was selected on this basis (e.g.\ we could be interested only in physics students).
+
+        Rules 2 and 3 say that taking math or chemistry is a good indicator for taking physics. However, since everyone in this dataset takes physics, these rules are not particularly interesting.
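+
+        As a check on the numbers above, here is a short worked computation. The notation is an assumption of this sketch and does not appear in the output: $s(X)$ is the support of itemset $X$ as printed in the listing, and the confidence of a rule $Y \leftarrow X$ is taken to be $s(X \cup Y)/s(X)$.
+        $$c(6 \leftarrow 2) = \frac{s(\{2,6\})}{s(\{2\})} = \frac{83.3\%}{83.3\%} = 100\%, \qquad c(6 \leftarrow 7) = \frac{s(\{6,7\})}{s(\{7\})} = \frac{83.3\%}{83.3\%} = 100\%.$$
+        Rule 1 has an empty antecedent, so its confidence reduces to $s(\{6\}) = 100\%$.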
+
+    \item \begin{enumerate}[start=2]
+            \item Here are some rules with high confidence:
+        
+                \begin{lstlisting}[gobble=20]
+                    [Conf. 99.7,Sup. 31.2]:
+                        Star Wars (1977) <- Empire Strikes Back, The (1980), Raiders of the Lost Ark (1981), Return of the Jedi (1983)
+                    [Conf. 91.8,Sup. 32.2]:
+                        Star Wars (1977) <- Indiana Jones and the Last Crusade (1989)
+                    [Conf. 86.6,Sup. 33.5]:
+                        Star Wars (1977) <- Star Trek: First Contact (1996)
+                    [Conf. 85.1,Sup. 30.3]:
+                        Raiders of the Lost Ark (1981) <- Fugitive, The (1993)
+                    [Conf. 81.6,Sup. 33.9]:
+                        Fargo (1996) <- Twelve Monkeys (1995)
+                \end{lstlisting}
+        
+                I haven't seen any of these movies, so I can't say much about this. Some research tells me many of them (like in the first rule) are Star Wars movies, so it isn't surprising they are correlated.
+        
+                Some more research tells me there are connections between Star Wars and Indiana Jones, so the second rule makes sense as well.
+        
+                Wikipedia\footnote{\url{https://en.wikipedia.org/wiki/Comparison_of_Star_Trek_and_Star_Wars}} says there are similarities between Star Wars and Star Trek, and that the two have influenced each other. That would explain the third rule.
+        
+                Both Raiders of the Lost Ark and The Fugitive are in the categories Action and Adventure according to the IMDb. Harrison Ford was the lead in both movies.
+        
+                Fargo and Twelve Monkeys are both listed as thrillers in the IMDb. Apart from that and their release dates, I can't find anything the two have in common at first glance.
+        
+            \item The movies bought by most users are:
+        
+                \begin{lstlisting}[gobble=20]
+                    Star Wars (1977) (support: 61.8)
+                    Contact (1997) (support: 54.0)
+                    Fargo (1996) (support: 53.9)
+                    Return of the Jedi (1983) (support: 53.8)
+                \end{lstlisting}
+        
+                There are few rules with more than three items, because only one itemset with more than three items (\texttt{[172, 174, 181, 50]}) has a support greater than or equal to 30\%. Apparently, there is no other set of four or more movies that at least 30\% of the users bought together. Lowering the minimum support would, of course, give more rules with more than three items.
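+
+                The reason can be sketched with the anti-monotonicity of support (a general property of itemsets, stated here as background rather than taken from the output; $s$ denotes support): for itemsets $X \subseteq Y$,
+                $$s(Y) \le s(X),$$
+                so a set of four movies can reach $30\%$ support only if each of its four three-movie subsets does, and even the most popular single movie above has a support of only $61.8\%$.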
+
+            \item Yes, rules with low support but high confidence are very well possible. Consider a dataset of, say, 1000 records. All but two of them contain only the item \texttt{1}; the remaining two contain the items \texttt{1}, \texttt{2} and \texttt{3}. Then the rules \texttt{2 <- 3} and \texttt{3 <- 2} have a support of only $0.2\%$, but a confidence of $100\%$.
+        \end{enumerate}
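+
+        The example in the last item can be written out as follows, assuming the usual definitions (the symbols are introduced here for illustration: $\sigma(X)$ is the number of records containing $X$, and $N = 1000$ is the number of records):
+        $$s(2 \leftarrow 3) = \frac{\sigma(\{2,3\})}{N} = \frac{2}{1000} = 0.2\%, \qquad c(2 \leftarrow 3) = \frac{\sigma(\{2,3\})}{\sigma(\{3\})} = \frac{2}{2} = 100\%.$$
+        By symmetry the same values hold for \texttt{3 <- 2}.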
+
+    \item \begin{enumerate}
+            \item \begin{enumerate}
+                    \item $\frac{\sigma(a)}{\sigma(b)} = \frac{45}{80} = 0.5625 < 0.6$, so not interesting.
+                    \item $\frac{\sigma(a,b)}{\sigma(a)\sigma(b)} = \frac{0.3}{0.45\cdot0.8} \approx 0.833$. Since this is below one, the items are negatively correlated.
+                    \item The rule is not interesting. Neither the confidence nor the interest is high enough.
+                \end{enumerate}
+
+            \item \begin{enumerate}
+                    \item Table 2:
+                        $$\frac{\nicefrac{105}{294}\cdot\nicefrac{62}{294}}{\nicefrac{87}{294}\cdot\nicefrac{40}{294}} \approx 1.871.$$
+
+                        Table 3, students:
+                        $$\frac{\nicefrac{2}{36}\cdot\nicefrac{20}{36}}{\nicefrac{5}{36}\cdot\nicefrac{9}{36}} \approx 0.889.$$
+
+                        Table 3, adults:
+                        $$\frac{\nicefrac{103}{258}\cdot\nicefrac{42}{258}}{\nicefrac{35}{258}\cdot\nicefrac{78}{258}} \approx 1.585.$$
+
+                        The (positive) association is stronger when the data is pooled together, since both $1.871 > 0.889$ and $1.871 > 1.585$.
+
+                    \item Table 2:
+
+                        $$\frac{\nicefrac{105}{294}-\nicefrac{192}{294}\cdot\nicefrac{145}{294}}{\sqrt{\nicefrac{192}{294}\cdot\nicefrac{145}{294}\cdot\left(1-\nicefrac{192}{294}\right)\left(1-\nicefrac{145}{294}\right)}} \approx 0.147.$$
+
+                        Table 3, students:
+                        $$\frac{\nicefrac{2}{36}-\nicefrac{11}{36}\cdot\nicefrac{7}{36}}{\sqrt{\nicefrac{11}{36}\cdot\nicefrac{7}{36}\cdot\left(1-\nicefrac{11}{36}\right)\left(1-\nicefrac{7}{36}\right)}} \approx -0.021.$$
+
+                        Table 3, adults:
+                        $$\frac{\nicefrac{103}{258}-\nicefrac{181}{258}\cdot\nicefrac{138}{258}}{\sqrt{\nicefrac{181}{258}\cdot\nicefrac{138}{258}\cdot\left(1-\nicefrac{181}{258}\right)\left(1-\nicefrac{138}{258}\right)}} \approx 0.105.$$
+
+                        This association is also stronger when the data is pooled together, since $0.147 > -0.021$ and $0.147 > 0.105$.
+
+                    \item Table 2:
+
+                        $$\frac{\nicefrac{105}{294}}{\nicefrac{145}{294}\cdot\nicefrac{192}{294}} \approx 1.109.$$
+
+                        Table 3, students:
+                        $$\frac{\nicefrac{2}{36}}{\nicefrac{7}{36}\cdot\nicefrac{11}{36}} \approx 0.935.$$
+
+                        Table 3, adults:
+                        $$\frac{\nicefrac{103}{258}}{\nicefrac{138}{258}\cdot\nicefrac{181}{258}} \approx 1.064.$$
+
+                        This association is also stronger when the data is pooled together, since $1.109 > 0.935$ and $1.109 > 1.064$.
+                \end{enumerate}
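+
+                The three computations above follow the same pattern, which can be summarised in terms of the counts of a $2 \times 2$ contingency table. The notation is an assumption of this sketch and is not used elsewhere in this report: $f_{11}$, $f_{10}$, $f_{01}$, $f_{00}$ are the four cell counts, $f_{1+} = f_{11} + f_{10}$ and $f_{+1} = f_{11} + f_{01}$ are the marginals, $f_{0+} = N - f_{1+}$, $f_{+0} = N - f_{+1}$, and $N$ is the total.
+                $$\mathrm{odds\ ratio} = \frac{f_{11}f_{00}}{f_{10}f_{01}}, \qquad \phi = \frac{Nf_{11} - f_{1+}f_{+1}}{\sqrt{f_{1+}f_{+1}f_{0+}f_{+0}}}, \qquad I = \frac{Nf_{11}}{f_{1+}f_{+1}}.$$
+                For Table 2, for instance, $N = 294$, $f_{11} = 105$, $f_{00} = 62$, the off-diagonal counts are $87$ and $40$, and the marginals are $192$ and $145$, so the interest is $\frac{294 \cdot 105}{192 \cdot 145} = \frac{30870}{27840} \approx 1.109$.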
+        \end{enumerate}
+
+\end{enumerate}
+
+\end{document}
+