\documentclass[10pt,a4paper]{article}

\usepackage[margin=2cm]{geometry}
\usepackage{graphicx}

\newcommand{\assignment}{5}

\usepackage{enumitem}
\setenumerate[1]{label=\assignment.\arabic*.}
\setenumerate[2]{label=\arabic*.}
\setenumerate[3]{label=\roman*.}

\usepackage{fancyhdr}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}
\fancyhead{}
%\fancyfoot[C]{Copyright {\textcopyright} 2015 Camil Staps}
\pagestyle{fancy}

\usepackage{caption}
\usepackage{subcaption}
\usepackage[hidelinks]{hyperref}

\usepackage{listings}
\lstset{basicstyle=\small\ttfamily,columns=flexible,breaklines=true}

\usepackage{nicefrac}

\parindent0pt

\title{Data Mining - assignment \assignment}
\author{Camil Staps\\\small{s4498062}}

\begin{document}

\maketitle
\thispagestyle{fancy}

\begin{enumerate}
    \item This is the output:

        \begin{lstlisting}[gobble=12]
            Mining for frequent itemsets by the Apriori algorithm
            Mining for associations by the Apriori algorithm
            Apriori analysis done, extracting results
            
            
            RESULTS:
            
            Frequent itemsets:
            Item: 6[Sup. 100]
            Item: 2[Sup. 83.3]
            Item: 2 6[Sup. 83.3]
            Item: 7[Sup. 83.3]
            Item: 7 6[Sup. 83.3]
            
            
            Association rules:
            Rule: 6 <- [Conf. 100,Sup. 100]
            Rule: 6 <- 2[Conf. 100,Sup. 83.3]
            Rule: 6 <- 7[Conf. 100,Sup. 83.3]
        \end{lstlisting}

        Rule 1 simply tells us that everyone studies physics. Either the subject is very popular, or the data has been selected on this criterion (e.g. we might only be interested in physics students).

        Rules 2 and 3 say that taking math or chemistry is a good indicator for taking physics. However, since everyone in this dataset takes physics, this is not particularly interesting.
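
        To make these numbers concrete, here is a minimal sketch of how such support and confidence values are computed. It assumes a hypothetical six-transaction dataset in which item 6 (physics) occurs in every transaction and items 2 (math) and 7 (chemistry) occur in five each; the real course data may differ.

\begin{lstlisting}[language=Python]
# Hypothetical six-transaction dataset; item numbers as in the output above.
transactions = [
    {2, 6, 7}, {2, 6, 7}, {2, 6, 7}, {2, 6, 7},
    {2, 6}, {6, 7},
]

def support(itemset):
    # Fraction of transactions containing every item of `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Confidence of the rule consequent <- antecedent.
    return support(antecedent | consequent) / support(antecedent)

print(support({6}))          # 1.0   -> "Sup. 100"
print(support({2, 6}))       # 0.833 -> "Sup. 83.3"
print(confidence({2}, {6}))  # 1.0   -> "Conf. 100" for the rule 6 <- 2
\end{lstlisting}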

    \item \begin{enumerate}[start=2]
            \item Here are some rules with high confidence:
        
                \begin{lstlisting}[gobble=20]
                    [Conf. 99.7,Sup. 31.2]:
                        Star Wars (1977) <- Empire Strikes Back, The (1980), Raiders of the Lost Ark (1981), Return of the Jedi (1983)
                    [Conf. 91.8,Sup. 32.2]:
                        Star Wars (1977) <- Indiana Jones and the Last Crusade (1989)
                    [Conf. 86.6,Sup. 33.5]:
                        Star Wars (1977) <- Star Trek: First Contact (1996)
                    [Conf. 85.1,Sup. 30.3]:
                        Raiders of the Lost Ark (1981) <- Fugitive, The (1993)
                    [Conf. 81.6,Sup. 33.9]:
                        Fargo (1996) <- Twelve Monkeys (1995)
                \end{lstlisting}
        
                I haven't seen any of these movies, so I cannot say much about this from personal experience. Some research tells me that several of them (as in the first rule) are Star Wars movies, so it is not surprising that they are correlated.
        
                Some more research tells me there are connections between Star Wars and Indiana Jones, so the second rule makes sense as well.
        
                Wikipedia\footnote{\url{https://en.wikipedia.org/wiki/Comparison_of_Star_Trek_and_Star_Wars}} says there are similarities between Star Wars and Star Trek, and that the two have influenced each other. That would explain the third rule.
        
                Both Raiders of the Lost Ark and The Fugitive are in the categories Action and Adventure according to the IMDb. Harrison Ford was the lead in both movies.
        
                Fargo and Twelve Monkeys are both marked as thrillers on IMDb. Apart from that and their dates of release, I cannot find much that the two have in common at first glance.
        
            \item The movies bought by the largest number of users are:
        
                \begin{lstlisting}[gobble=20]
                    Star Wars (1977) (support: 61.8)
                    Contact (1997) (support: 54.0)
                    Fargo (1996) (support: 53.9)
                    Return of the Jedi (1983) (support: 53.8)
                \end{lstlisting}
        
                There are few rules with more than three items, because there is just one itemset with more than three items and support of at least 30\% (\texttt{[172, 174, 181, 50]}). Apparently, no other combination of four or more movies has been bought together by at least 30\% of the users. Lowering the support threshold would, of course, yield more rules with more than three items.
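
                Note that a rule with more than three items can only come from a frequent itemset of at least four items. As a sketch of the standard rule-generation step (every non-trivial split of a frequent itemset into an antecedent and a consequent is a candidate rule, which is then filtered on confidence; the confidence filter is omitted here), even this single four-item itemset yields at most $2^4 - 2 = 14$ candidates:

\begin{lstlisting}[language=Python]
from itertools import combinations

def candidate_rules(itemset):
    # Every non-trivial split of the itemset into antecedent -> consequent.
    items = set(itemset)
    rules = []
    for r in range(1, len(items)):
        for antecedent in combinations(sorted(items), r):
            antecedent = set(antecedent)
            rules.append((antecedent, items - antecedent))
    return rules

# The only frequent itemset with more than three items at 30% support
# (movie ids as reported above).
print(len(candidate_rules([172, 174, 181, 50])))  # 14
\end{lstlisting}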

            \item Yes, it is entirely possible to have rules with low support but high confidence. Consider a dataset of, say, 1000 records, each containing just the single item \texttt{1}, except that two of the records additionally contain the items \texttt{2} and \texttt{3}. Then the support of the rules \texttt{2 <- 3} and \texttt{3 <- 2} is only $0.2\%$, while their confidence is $100\%$.
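
                A minimal check of these numbers (the 998/2 split is just this constructed example):

\begin{lstlisting}[language=Python]
# 998 records containing only item 1, plus two records {1, 2, 3}.
records = [{1}] * 998 + [{1, 2, 3}] * 2

def support(itemset):
    return sum(itemset <= r for r in records) / len(records)

print(support({2, 3}))                 # 0.002 -> support of 0.2%
print(support({2, 3}) / support({3}))  # 1.0   -> confidence of 2 <- 3 is 100%
\end{lstlisting}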
        \end{enumerate}

    \item \begin{enumerate}
            \item \begin{enumerate}
                    \item $\frac{\sigma(a)}{\sigma(b)} = \frac{45}{80} = 0.5625 < 0.6$, so not interesting.
                    \item $\frac{\sigma(a,b)}{\sigma(a)\sigma(b)} = \frac{0.3}{0.45\cdot0.8} \approx 0.833$. Since this is below one, the items are negatively correlated.
                    \item The rule is not interesting: neither the confidence (below the $0.6$ threshold) nor the interest (below $1$, see the check after this list) is high enough.
                \end{enumerate}
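
                A quick check of the interest factor in ii., using the given supports $\sigma(a) = 0.45$, $\sigma(b) = 0.8$ and $\sigma(a,b) = 0.3$:

\begin{lstlisting}[language=Python]
# Interest (lift) of a and b from the given supports.
sup_a, sup_b, sup_ab = 0.45, 0.80, 0.30
print(sup_ab / (sup_a * sup_b))  # ~0.833 < 1, so negatively correlated
\end{lstlisting}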

            \item \begin{enumerate}
                    \item Table 2: 
                        $$\frac{\nicefrac{105}{294}\cdot\nicefrac{62}{294}}{\nicefrac{87}{294}\cdot\nicefrac{40}{294}} \approx 1.871.$$
                        
                        Table 3, students: 
                        $$\frac{\nicefrac{2}{36}\cdot\nicefrac{20}{36}}{\nicefrac{5}{36}\cdot\nicefrac{9}{36}} \approx 0.889.$$
                        
                        Table 3, adults:
                        $$\frac{\nicefrac{103}{258}\cdot\nicefrac{42}{258}}{\nicefrac{35}{258}\cdot\nicefrac{78}{258}} \approx 1.585.$$

                        The (positive) association is stronger when the data is pooled, since both $1.871 > 0.889$ and $1.871 > 1.585$.

                    \item Table 2:

                        $$\frac{\nicefrac{105}{294}-\nicefrac{192}{294}\cdot\nicefrac{145}{294}}{\sqrt{\nicefrac{192}{294}\cdot\nicefrac{145}{294}\cdot\left(1-\nicefrac{192}{294}\right)\left(1-\nicefrac{145}{294}\right)}} \approx 0.147.$$

                        Table 3, students:
                        $$\frac{\nicefrac{2}{36}-\nicefrac{11}{36}\cdot\nicefrac{7}{36}}{\sqrt{\nicefrac{11}{36}\cdot\nicefrac{7}{36}\cdot\left(1-\nicefrac{11}{36}\right)\left(1-\nicefrac{7}{36}\right)}} \approx -0.021.$$

                        Table 3, adults:
                        $$\frac{\nicefrac{103}{258}-\nicefrac{181}{258}\cdot\nicefrac{138}{258}}{\sqrt{\nicefrac{181}{258}\cdot\nicefrac{138}{258}\cdot\left(1-\nicefrac{181}{258}\right)\left(1-\nicefrac{138}{258}\right)}} \approx 0.105.$$

                        This association, too, is stronger when the data is pooled, for the analogous reason ($0.147 > -0.021$ and $0.147 > 0.105$).

                    \item Table 2:

                        $$\frac{\nicefrac{105}{294}}{\nicefrac{145}{294}\cdot\nicefrac{192}{294}} \approx 1.109.$$

                        Table 3, students:
                        $$\frac{\nicefrac{2}{36}}{\nicefrac{7}{36}\cdot\nicefrac{11}{36}} \approx 0.935.$$

                        Table 3, adults:
                        $$\frac{\nicefrac{103}{258}}{\nicefrac{138}{258}\cdot\nicefrac{181}{258}} \approx 1.064.$$

                        This association, too, is stronger when the data is pooled, for the analogous reason ($1.109 > 0.935$ and $1.109 > 1.064$). The sketch below recomputes all three measures directly from the counts in the tables.
                \end{enumerate}
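
                As a cross-check, here is a small sketch that recomputes the odds ratio, the $\phi$-coefficient and the interest factor from the cell counts read off the fractions above (the $(f_{11}, f_{10}, f_{01}, f_{00})$ ordering is my assumption about the orientation of the tables):

\begin{lstlisting}[language=Python]
from math import sqrt

def measures(f11, f10, f01, f00):
    # Odds ratio, phi-coefficient and interest factor of a 2x2 table.
    n = f11 + f10 + f01 + f00
    pa, pb, pab = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    odds = (f11 * f00) / (f10 * f01)
    phi = (pab - pa * pb) / sqrt(pa * (1 - pa) * pb * (1 - pb))
    interest = pab / (pa * pb)
    return odds, phi, interest

print(measures(105, 87, 40, 62))  # Table 2 (pooled):   ~(1.871, 0.147, 1.109)
print(measures(2, 9, 5, 20))      # Table 3 (students): ~(0.889, -0.021, 0.935)
print(measures(103, 78, 35, 42))  # Table 3 (adults):   ~(1.585, 0.105, 1.064)
\end{lstlisting}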
        \end{enumerate}

\end{enumerate}

\end{document}