author | Camil Staps | 2015-12-12 16:47:44 +0000 |
---|---|---|
committer | Camil Staps | 2015-12-12 16:47:44 +0000 |
commit | c6f86bdb722aac53bb39b0d78d2b538b6f07a692 (patch) | |
tree | a13212987d15f369b0a448df87b49bcca4cc7f51 /Assignment 5/report | |
parent | Finish assignment 4 (diff) |
Assignment 5
Diffstat (limited to 'Assignment 5/report')
-rw-r--r-- | Assignment 5/report/assignment5.tex | 156 |
1 files changed, 156 insertions, 0 deletions
diff --git a/Assignment 5/report/assignment5.tex b/Assignment 5/report/assignment5.tex
new file mode 100644
index 0000000..47ff27e
--- /dev/null
+++ b/Assignment 5/report/assignment5.tex
@@ -0,0 +1,156 @@
+\documentclass[10pt,a4paper]{article}
+
+\usepackage[margin=2cm]{geometry}
+\usepackage{graphicx}
+
+\let\assignment5
+
+\usepackage{enumitem}
+\setenumerate[1]{label=\assignment.\arabic*.}
+\setenumerate[2]{label=\arabic*.}
+\setenumerate[3]{label=\roman*.}
+
+\usepackage{fancyhdr}
+\renewcommand{\headrulewidth}{0pt}
+\renewcommand{\footrulewidth}{0pt}
+\fancyhead{}
+%\fancyfoot[C]{Copyright {\textcopyright} 2015 Camil Staps}
+\pagestyle{fancy}
+
+\usepackage{caption}
+\usepackage{subcaption}
+\usepackage[hidelinks]{hyperref}
+
+\usepackage{listings}
+\lstset{basicstyle=\small\ttfamily,columns=flexible,breaklines=true}
+
+\usepackage{nicefrac}
+
+\parindent0pt
+
+\title{Data Mining - assignment \assignment}
+\author{Camil Staps\\\small{s4498062}}
+
+\begin{document}
+
+\maketitle
+\thispagestyle{fancy}
+
+\begin{enumerate}
+    \item This is the output:
+
+        \begin{lstlisting}[gobble=12]
+            Mining for frequent itemsets by the Apriori algorithm
+            Mining for associations by the Apriori algorithm
+            Apriori analysis done, extracting results
+
+
+            RESULTS:
+
+            Frequent itemsets:
+            Item: 6[Sup. 100]
+            Item: 2[Sup. 83.3]
+            Item: 2 6[Sup. 83.3]
+            Item: 7[Sup. 83.3]
+            Item: 7 6[Sup. 83.3]
+
+
+            Association rules:
+            Rule: 6 <- [Conf. 100,Sup. 100]
+            Rule: 6 <- 2[Conf. 100,Sup. 83.3]
+            Rule: 6 <- 7[Conf. 100,Sup. 83.3]
+        \end{lstlisting}
+
+        In rule 1 we simply see that everyone studies physics. Either it is very popular, or the data has been selected on this basis (e.g. we could be interested only in physics students).
+
+        Rules 2 and 3 say that taking math or chemistry is a good indicator for taking physics. However, in this dataset everyone takes physics, so these rules aren't particularly interesting.
+
+    \item \begin{enumerate}[start=2]
+            \item Here are some rules with high confidence:
+
+                \begin{lstlisting}[gobble=20]
+                    [Conf. 99.7,Sup. 31.2]:
+                    Star Wars (1977) <- Empire Strikes Back, The (1980), Raiders of the Lost Ark (1981), Return of the Jedi (1983)
+                    [Conf. 91.8,Sup. 32.2]:
+                    Star Wars (1977) <- Indiana Jones and the Last Crusade (1989)
+                    [Conf. 86.6,Sup. 33.5]:
+                    Star Wars (1977) <- Star Trek: First Contact (1996)
+                    [Conf. 85.1,Sup. 30.3]:
+                    Raiders of the Lost Ark (1981) <- Fugitive, The (1993)
+                    [Conf. 81.6,Sup. 33.9]:
+                    Fargo (1996) <- Twelve Monkeys (1995)
+                \end{lstlisting}
+
+                I haven't seen any of these movies, so I can't say much about this. Some research tells me that many of them (like those in the first rule) are Star Wars movies, so it isn't surprising that they are correlated.
+
+                Some more research tells me there are connections between Star Wars and Indiana Jones, so the second rule makes sense as well.
+
+                Wikipedia\footnote{\url{https://en.wikipedia.org/wiki/Comparison_of_Star_Trek_and_Star_Wars}} says there are similarities between Star Wars and Star Trek, and that the two have influenced each other. That would explain the third rule.
+
+                Both Raiders of the Lost Ark and The Fugitive are in the categories Action and Adventure according to the IMDb. Harrison Ford was the lead in both movies.
+
+                Fargo and Twelve Monkeys are both marked as thrillers in the IMDb. Apart from that and the date of publication, I can't find anything similar in the two at first glance.
+
+            \item The movies bought by most users are:
+
+                \begin{lstlisting}[gobble=20]
+                    Star Wars (1977) (support: 61.8)
+                    Contact (1997) (support: 54.0)
+                    Fargo (1996) (support: 53.9)
+                    Return of the Jedi (1983) (support: 53.8)
+                \end{lstlisting}
+
+                There are few rules with more than three items, because there is just one itemset with support greater than or equal to 30\% (\texttt{[172, 174, 181, 50]}) that has more than three items. Apparently, there is no other set of four or more movies that have been bought together by at least 30\% of the users. Lowering the minimum support would, obviously, give more rules with more than three items.
+
+            \item Yes, it is very well possible to have rules with low support but high confidence. Consider a dataset of, say, 1000 records where every record has just one item, \texttt{1}, except for two records that contain two other items as well, \texttt{2} and \texttt{3}. Then the support for the rules \texttt{2 <- 3} and \texttt{3 <- 2} is only $0.2\%$, while the confidence is $100\%$.
+        \end{enumerate}
+
+    \item \begin{enumerate}
+            \item \begin{enumerate}
+                    \item $\frac{\sigma(a)}{\sigma(b)} = \frac{45}{80} = 0.5625 < 0.6$, so not interesting.
+                    \item $\frac{\sigma(a,b)}{\sigma(a)\sigma(b)} = \frac{0.3}{0.45\cdot0.8} \approx 0.833$. Since this is below one, the items are negatively correlated.
+                    \item The rule is not interesting. Neither the confidence nor the interest is high enough.
+                \end{enumerate}
+
+            \item \begin{enumerate}
+                    \item Table 2:
+                        $$\frac{\nicefrac{105}{294}\cdot\nicefrac{62}{294}}{\nicefrac{87}{294}\cdot\nicefrac{40}{294}} \approx 1.871.$$
+
+                        Table 3, students:
+                        $$\frac{\nicefrac{2}{36}\cdot\nicefrac{20}{36}}{\nicefrac{5}{36}\cdot\nicefrac{9}{36}} \approx 0.889.$$
+
+                        Table 3, adults:
+                        $$\frac{\nicefrac{103}{258}\cdot\nicefrac{42}{258}}{\nicefrac{35}{258}\cdot\nicefrac{78}{258}} \approx 1.585.$$
+
+                        The association is stronger (positively) when the data is pooled together, since both $1.871 > 0.889$ and $1.871 > 1.585$.
+
+                    \item Table 2:
+
+                        $$\frac{\nicefrac{105}{294}-\nicefrac{192}{294}\cdot\nicefrac{145}{294}}{\sqrt{\nicefrac{192}{294}\cdot\nicefrac{145}{294}\cdot\left(1-\nicefrac{192}{294}\right)\left(1-\nicefrac{145}{294}\right)}} \approx 0.147.$$
+
+                        Table 3, students:
+                        $$\frac{\nicefrac{2}{36}-\nicefrac{11}{36}\cdot\nicefrac{7}{36}}{\sqrt{\nicefrac{11}{36}\cdot\nicefrac{7}{36}\cdot\left(1-\nicefrac{11}{36}\right)\left(1-\nicefrac{7}{36}\right)}} \approx -0.021.$$
+
+                        Table 3, adults:
+                        $$\frac{\nicefrac{103}{258}-\nicefrac{181}{258}\cdot\nicefrac{138}{258}}{\sqrt{\nicefrac{181}{258}\cdot\nicefrac{138}{258}\cdot\left(1-\nicefrac{181}{258}\right)\left(1-\nicefrac{138}{258}\right)}} \approx 0.105.$$
+
+                        This association is also stronger when the data is pooled together, for the analogous reason.
+
+                    \item Table 2:
+
+                        $$\frac{\nicefrac{105}{294}}{\nicefrac{145}{294}\cdot\nicefrac{192}{294}} \approx 1.109.$$
+
+                        Table 3, students:
+                        $$\frac{\nicefrac{2}{36}}{\nicefrac{7}{36}\cdot\nicefrac{11}{36}} \approx 0.935.$$
+
+                        Table 3, adults:
+                        $$\frac{\nicefrac{103}{258}}{\nicefrac{138}{258}\cdot\nicefrac{181}{258}} \approx 1.064.$$
+
+                        This association is also stronger when the data is pooled together, for the analogous reason.
+                \end{enumerate}
+        \end{enumerate}
+
+\end{enumerate}
+
+\end{document}
+
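
The arithmetic in the report can be re-derived with a short script. The following is a minimal sketch and not part of the commit. The cell counts are an assumption read off from the fractions in the formulas above (the contingency tables themselves are not included in the diff), and the names `measures` and `support_confidence` are hypothetical helpers introduced only for illustration.

```python
from math import sqrt


def measures(f11, f10, f01, f00):
    """Return (odds ratio, phi coefficient, interest factor) of a 2x2 table.

    f11 counts records with both items, f10 and f01 records with only
    one of them, and f00 records with neither.
    """
    n = f11 + f10 + f01 + f00
    row = f11 + f10  # records containing the first item
    col = f11 + f01  # records containing the second item
    odds = (f11 * f00) / (f10 * f01)
    phi = (n * f11 - row * col) / sqrt(row * col * (n - row) * (n - col))
    interest = (n * f11) / (row * col)
    return odds, phi, interest


def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule consequent <- antecedent."""
    antecedent, consequent = set(antecedent), set(consequent)
    both = sum(1 for t in transactions if antecedent | consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / len(transactions), both / ante


# Cell counts assumed from the fractions used in the report.
tables = {
    "Table 2": (105, 87, 40, 62),
    "Table 3, students": (2, 9, 5, 20),
    "Table 3, adults": (103, 78, 35, 42),
}
for name, cells in tables.items():
    odds, phi, interest = measures(*cells)
    print(f"{name}: odds={odds:.3f} phi={phi:.3f} interest={interest:.3f}")

# The low-support, high-confidence example: 1000 records, two of which
# contain items 2 and 3 in addition to item 1.
records = [{1}] * 998 + [{1, 2, 3}] * 2
print(support_confidence(records, {3}, {2}))  # (0.002, 1.0)
```

Rounded to three decimals, the loop gives 1.871, 0.147 and 1.109 for Table 2; 0.889, -0.021 and 0.935 for the students; and 1.585, 0.105 and 1.064 for the adults.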