\documentclass[10pt,a4paper]{article}

\usepackage[margin=2cm]{geometry}
\usepackage{graphicx}

\let\assignment5

\usepackage{enumitem}
\setenumerate[1]{label=\assignment.\arabic*.}
\setenumerate[2]{label=\arabic*.}
\setenumerate[3]{label=\roman*.}

\usepackage{fancyhdr}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}
\fancyhead{}
%\fancyfoot[C]{Copyright {\textcopyright} 2015 Camil Staps}
\pagestyle{fancy}

\usepackage{caption}
\usepackage{subcaption}
\usepackage[hidelinks]{hyperref}

\usepackage{listings}
\lstset{basicstyle=\small\ttfamily,columns=flexible,breaklines=true}

\usepackage{nicefrac}

\parindent0pt

\title{Data Mining - assignment \assignment}
\author{Camil Staps\\\small{s4498062}}

\begin{document}

\maketitle
\thispagestyle{fancy}

\begin{enumerate}
    \item This is the output:
        \begin{lstlisting}[gobble=12]
            Mining for frequent itemsets by the Apriori algorithm
            Mining for associations by the Apriori algorithm
            Apriori analysis done, extracting results
            RESULTS:
            Frequent itemsets:
            Item: 6[Sup. 100]
            Item: 2[Sup. 83.3]
            Item: 2 6[Sup. 83.3]
            Item: 7[Sup. 83.3]
            Item: 7 6[Sup. 83.3]
            Association rules:
            Rule: 6 <- [Conf. 100,Sup. 100]
            Rule: 6 <- 2[Conf. 100,Sup. 83.3]
            Rule: 6 <- 7[Conf. 100,Sup. 83.3]
        \end{lstlisting}
        Rule 1 simply says that everyone studies physics: either the subject is very popular, or the data were selected on it (e.g.\ we might only be interested in physics students). Rules 2 and 3 say that taking math or chemistry is a good indicator for taking physics. However, since everyone in this dataset takes physics, these rules are not particularly interesting.

    \item \begin{enumerate}[start=2]
            \item Here are some rules with high confidence:
                \begin{lstlisting}[gobble=20]
                    [Conf. 99.7,Sup. 31.2]: Star Wars (1977) <- Empire Strikes Back, The (1980), Raiders of the Lost Ark (1981), Return of the Jedi (1983)
                    [Conf. 91.8,Sup. 32.2]: Star Wars (1977) <- Indiana Jones and the Last Crusade (1989)
                    [Conf. 86.6,Sup. 33.5]: Star Wars (1977) <- Star Trek: First Contact (1996)
                    [Conf. 85.1,Sup. 30.3]: Raiders of the Lost Ark (1981) <- Fugitive, The (1993)
                    [Conf. 81.6,Sup. 33.9]: Fargo (1996) <- Twelve Monkeys (1995)
                \end{lstlisting}
                I haven't seen any of these movies, so I can't say much about this. Some research tells me that many of them (like those in the first rule) are Star Wars movies, so it isn't surprising that they are correlated. Some more research tells me there are connections between Star Wars and Indiana Jones, so the second rule makes sense as well. Wikipedia\footnote{\url{https://en.wikipedia.org/wiki/Comparison_of_Star_Trek_and_Star_Wars}} says there are similarities between Star Wars and Star Trek, and that the two have influenced each other. That would explain the third rule. Both Raiders of the Lost Ark and The Fugitive are in the categories Action and Adventure according to the IMDb, and Harrison Ford played the lead in both movies. Fargo and Twelve Monkeys are both marked as thrillers in the IMDb. Apart from that and their date of publication, I can't find anything the two have in common at first glance.
            \item The movies bought by most users are:
                \begin{lstlisting}[gobble=20]
                    Star Wars (1977) (support: 61.8)
                    Contact (1997) (support: 54.0)
                    Fargo (1996) (support: 53.9)
                    Return of the Jedi (1983) (support: 53.8)
                \end{lstlisting}
                There are few rules with more than three items, because there is just one itemset with support greater than or equal to 30\% (\texttt{[172, 174, 181, 50]}) that has more than three items. Apparently, there is no other set of four or more movies that at least 30\% of the users bought together. Lowering the minimum support would, obviously, give more rules with more than three items.
            \item Yes, it is very well possible to have rules with low support but high confidence. Consider a dataset of, say, 1000 records in which every record contains the item \texttt{1}. Two of the records contain two other items as well, \texttt{2} and \texttt{3}.
                Then the support for the rules \texttt{2 <- 3} and \texttt{3 <- 2} is only $0.2\%$ (two out of 1000 records), but their confidence is $100\%$.
        \end{enumerate}

    \item \begin{enumerate}
            \item \begin{enumerate}
                    \item $\frac{\sigma(a,b)}{\sigma(a)} = \frac{0.3}{0.45} \approx 0.667 > 0.6$, so the rule $a \rightarrow b$ is interesting according to the confidence measure.
                    \item $\frac{\sigma(a,b)}{\sigma(a)\sigma(b)} = \frac{0.3}{0.45\cdot0.8} \approx 0.833$. Since this is below one, the items are negatively correlated.
                    \item The rule is not interesting. Its confidence exceeds the threshold, but the interest is below one: the items are negatively correlated, so a high confidence alone does not make the rule interesting.
                \end{enumerate}
            \item \begin{enumerate}
                    \item Table 2:
                        $$\frac{\nicefrac{105}{294}\cdot\nicefrac{62}{294}}{\nicefrac{87}{294}\cdot\nicefrac{40}{294}} \approx 1.871.$$
                        Table 3, students:
                        $$\frac{\nicefrac{2}{36}\cdot\nicefrac{20}{36}}{\nicefrac{5}{36}\cdot\nicefrac{9}{36}} \approx 0.889.$$
                        Table 3, adults:
                        $$\frac{\nicefrac{103}{258}\cdot\nicefrac{42}{258}}{\nicefrac{35}{258}\cdot\nicefrac{78}{258}} \approx 1.585.$$
                        The positive association is stronger when the data are pooled, since $1.871 > 0.889$ and $1.871 > 1.585$.
                    \item Table 2:
                        $$\frac{\nicefrac{105}{294}-\nicefrac{192}{294}\cdot\nicefrac{145}{294}}{\sqrt{\nicefrac{192}{294}\cdot\nicefrac{145}{294}\cdot\left(1-\nicefrac{192}{294}\right)\left(1-\nicefrac{145}{294}\right)}} \approx 0.147.$$
                        Table 3, students:
                        $$\frac{\nicefrac{2}{36}-\nicefrac{11}{36}\cdot\nicefrac{7}{36}}{\sqrt{\nicefrac{11}{36}\cdot\nicefrac{7}{36}\cdot\left(1-\nicefrac{11}{36}\right)\left(1-\nicefrac{7}{36}\right)}} \approx -0.021.$$
                        Table 3, adults:
                        $$\frac{\nicefrac{103}{258}-\nicefrac{181}{258}\cdot\nicefrac{138}{258}}{\sqrt{\nicefrac{181}{258}\cdot\nicefrac{138}{258}\cdot\left(1-\nicefrac{181}{258}\right)\left(1-\nicefrac{138}{258}\right)}} \approx 0.105.$$
                        This association is also stronger when the data are pooled, since $0.147 > -0.021$ and $0.147 > 0.105$.
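                        As a cross-check, the expression used for the three tables above can be rewritten in terms of raw counts. This is only a sketch under an assumed notation that does not appear in the assignment: $N$ is the table total, $f_{11}$ the number of records containing both items, $f_{1+}$ and $f_{+1}$ the two marginal counts, $f_{0+} = N - f_{1+}$ and $f_{+0} = N - f_{+1}$; the symbol $\phi$ is likewise introduced here for convenience. Multiplying numerator and denominator by $N^2$ gives
                        $$\phi = \frac{N f_{11} - f_{1+} f_{+1}}{\sqrt{f_{1+}\, f_{+1}\, f_{0+}\, f_{+0}}},$$
                        which for Table 2 (with $N = 294$, $f_{11} = 105$, $f_{1+} = 192$, $f_{+1} = 145$) evaluates to
                        $$\frac{294\cdot105 - 192\cdot145}{\sqrt{192\cdot145\cdot102\cdot149}} = \frac{3030}{\sqrt{423112320}} \approx 0.147,$$
                        in agreement with the value obtained from the relative frequencies.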
                    \item Table 2:
                        $$\frac{\nicefrac{105}{294}}{\nicefrac{145}{294}\cdot\nicefrac{192}{294}} \approx 1.109.$$
                        Table 3, students:
                        $$\frac{\nicefrac{2}{36}}{\nicefrac{7}{36}\cdot\nicefrac{11}{36}} \approx 0.935.$$
                        Table 3, adults:
                        $$\frac{\nicefrac{103}{258}}{\nicefrac{138}{258}\cdot\nicefrac{181}{258}} \approx 1.064.$$
                        This association is also stronger when the data are pooled, since $1.109 > 0.935$ and $1.109 > 1.064$.
                \end{enumerate}
        \end{enumerate}
\end{enumerate}

\end{document}