5.6 Data Mining, Discovery, and Exploration

Notes from the text readings:

  • Mining of Massive Datasets, 3rd ed.
  • Networks, 2nd ed., Mark Newman
  • Algorithms and Data Structures for Massive Datasets, Dzejla Medjedovic and Emin Tahirovic, Manning, 2022
  • Advanced Algorithms and Data Structures, Marcello La Rocca, Manning, 2021
  • Data Mining: Concepts and Techniques, 4th ed., Jiawei Han, Jian Pei, and Hanghang Tong, Morgan Kaufmann, 2022

5.6.1 9/3 Course introduction and introduction to data mining

5.6.1.1 What is data mining?

Data mining is the science of discovering actionable knowledge by making inferences on massive datasets, using powerful hardware, programming systems, and algorithms to solve problems in science, commerce, healthcare, government, the humanities, etc.

5.6.1.2 Goals of data mining

The goal is to generate a model of the data and then use that model, via an algorithm, to make predictions.

Ex. Detecting phishing attacks: Build a model of phishing emails by examining known examples and identifying characteristic keywords or phrases. Assign weights to those words, then apply the resulting scoring algorithm to each incoming email to decide whether it is phishing.
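
A minimal sketch of this keyword-weighting idea in R (the phrases, weights, and threshold below are hypothetical, chosen only for illustration):

# Hypothetical keyword weights for a toy phishing score
phish_weights <- c("verify your account" = 2, urgent = 1, password = 1.5, `click here` = 1.5)

score_email <- function(text, weights = phish_weights, threshold = 2.5) {
    text <- tolower(text)
    # which suspicious phrases appear in the email?
    hits <- vapply(names(weights), function(p) grepl(p, text, fixed = TRUE), logical(1))
    score <- sum(weights[hits])
    list(score = score, is_phishing = score >= threshold)
}

score_email("URGENT: click here to verify your account password")
## $score
## [1] 6
## 
## $is_phishing
## [1] TRUE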

Typically the model is a statistical model that captures the underlying distribution from which the data is drawn.

Ex. If our data is a set of numbers, the model might be a Gaussian distribution with parameters mean and standard deviation.
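
For instance, a minimal sketch in R (simulated data, so the fitted values are only illustrative): the "model" is simply the estimated mean and standard deviation.

set.seed(1)
x <- rnorm(1000, mean = 5, sd = 2)  # pretend this is the observed data
c(mean = mean(x), sd = sd(x))  # the fitted Gaussian model parameters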

Data mining can be done with algorithms from machine learning, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, etc. This is often done when we know little about the problem being solved. Sometimes we cannot explain what the model is doing.

A model is the answer to a complex query about the data, such as calculating its mean and standard deviation. The general approach to modeling can be described as either (1) summarizing the data succinctly and approximately (e.g., PageRank or clustering), or (2) extracting the most prominent features of the data and ignoring the rest (feature extraction, e.g., frequent itemsets or similar items).

5.6.1.3 Limitations

An early response to detecting suspicious events was TIA (Total Information Awareness), a program focused on mining all the data that could be found in order to track suspicious activity. There is now also a focus on information integration: relating and combining different data sources into one to obtain insights is key to solving important problems. However, with large data sets you can often find "interesting" patterns that are merely artifacts with no real significance.

5.6.1.4 Bonferroni’s Principle

In data, especially large data, there is inherent stochastic variability. This means that if we look hard enough for a particular kind of pattern, there is a possibility that what we find is bogus. The Bonferroni correction gives a statistically sound way to avoid most false positives. Informally: calculate the expected number of occurrences of the events you are looking for, on the assumption that the data are i.i.d. (independent and identically distributed) random. If this expected number is significantly larger than the number of real events you hope to find, then nearly anything you find is an artifact of randomness; only when the observed count is much larger than the expected count is the result statistically significant.

Exercise 1.2.1 Using the information from Section 1.2.3, what would be the number of suspected pairs if the following changes were made to the data (and all other numbers remained as they were in that section)?

  1. The number of days of observation was raised to 2000.
  2. The number of people observed was raised to 2 billion (and there were therefore 200,000 hotels).
  3. We only reported a pair as suspect if they were at the same hotel at the same time on three different days.
# a The number of days of observation was raised to 2000.
P_2people_sameHotel_sameDay <- 0.01 * 0.01
hotels = 1e+05
P_sameHotel_sameDay <- P_2people_sameHotel_sameDay/hotels
print(P_sameHotel_sameDay)
## [1] 1e-09
P_2people_sameHotel_twoDays <- P_sameHotel_sameDay^2
print(P_2people_sameHotel_twoDays)
## [1] 1e-18
# 1 billion people might be evil-doers
n = 1e+09

Pairs_people <- (n^2)/2
print(Pairs_people)
## [1] 5e+17
# pairs of days
days <- 1000
Pairs_days <- (days^2)/2
print(Pairs_days)
## [1] 5e+05
Expected_events <- Pairs_people * Pairs_days * P_2people_sameHotel_twoDays
print(Expected_events)
## [1] 250000
# pairs of days, raise to 2000
days <- 2000
Pairs_days <- (days^2)/2
print(Pairs_days)
## [1] 2e+06
Expected_events <- Pairs_people * Pairs_days * P_2people_sameHotel_twoDays
print(Expected_events)
## [1] 1e+06
# Raising the days to 2000 increases the expected suspicious pairs from 250,000 to 1,000,000.

# b The number of people observed was raised to 2 billion
# (and there were therefore 200,000 hotels).

P_2people_sameHotel_sameDay <- 0.01 * 0.01
hotels = 2e+05
P_sameHotel_sameDay <- P_2people_sameHotel_sameDay/hotels
print(P_sameHotel_sameDay)
## [1] 5e-10
P_2people_sameHotel_twoDays <- P_sameHotel_sameDay^2
print(P_2people_sameHotel_twoDays)
## [1] 2.5e-19
# 2 billion people might be evil-doers
n = 2e+09

Pairs_people <- (n^2)/2
print(Pairs_people)
## [1] 2e+18
# pairs of days
days <- 1000
Pairs_days <- (days^2)/2
print(Pairs_days)
## [1] 5e+05
Expected_events <- Pairs_people * Pairs_days * P_2people_sameHotel_twoDays
print(Expected_events)
## [1] 250000
# The expected number of suspicious pairs stays at 250,000: doubling the people
# quadruples the pairs of people, but doubling the hotels cuts the per-pair
# probability to one quarter.

# c We only reported a pair as suspect if they were at the
# same hotel at the same time on three different days.


P_2people_sameHotel_sameDay <- 0.01 * 0.01
hotels = 1e+05
P_sameHotel_sameDay <- P_2people_sameHotel_sameDay/hotels
print(P_sameHotel_sameDay)
## [1] 1e-09
P_2people_sameHotel_threeDays <- P_sameHotel_sameDay^3
print(P_2people_sameHotel_threeDays)
## [1] 1e-27
# 1 billion people might be evil-doers
n = 1e+09

Pairs_people <- (n^2)/2
print(Pairs_people)
## [1] 5e+17
# triples of days (approximates choose(1000, 3))
days <- 1000
Triples_days <- (days^3)/6
print(Triples_days)
## [1] 166666667
Expected_events <- Pairs_people * Triples_days * P_2people_sameHotel_threeDays
print(Expected_events)
## [1] 0.08333333
# Requiring three shared days drops the expected suspicious pairs to about 0.08,
# so any pair actually observed would be statistically significant.

Exercise 1.2.2 Suppose we have information about the supermarket purchases of 100 million people. Each person goes to the supermarket 100 times in a year and buys 10 of the 1000 items that the supermarket sells. We believe that a pair of terrorists will buy exactly the same set of 10 items (perhaps the ingredients for a bomb?) at some time during the year. If we search for pairs of people who have bought the same set of items, would we expect that any such people found were truly terrorists?

n <- 1e+08              # people observed
VisitsPerPerson <- 100  # supermarket visits per person per year
ItemsPerVisit <- 10     # items bought on each visit
TotalItems <- 1000      # distinct items the supermarket sells

# probability that two random 10-item baskets are exactly the same set
P_TwoSameSet <- 1/choose(1000, 10)
print(P_TwoSameSet)
## [1] 3.796369e-24
# pairs of people
Pairs_People <- choose(1e+08, 2)

# pairs of visits
Pairs_Visits <- choose(100, 2)

Expected_Events <- Pairs_People * P_TwoSameSet * Pairs_Visits
Expected_Events
## [1] 9.396014e-05

The expected number of coincidental matches is about \(9.4 \times 10^{-5}\), essentially zero, so a pair of people actually found buying the same set of 10 items would not be a random artifact and would be worth investigating.

5.6.2 Important words

TF.IDF (term frequency times inverse document frequency) is used to find the words that best characterize a document, so documents can be classified by the words they contain. It is computed as follows:

\(f_{ij}\) = the number of occurrences of word (term) \(i\) in document \(j\)

\[ TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}} \]

That is, the term frequency of word \(i\) in document \(j\) is normalized by dividing by the maximum number of occurrences of any term in that document (perhaps excluding stop words).
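
A small sketch in R with a hypothetical term-count matrix (rows are words, columns are documents), just to show the computation:

# Hypothetical counts of 3 words in 2 documents
f <- matrix(c(3, 1, 0, 2, 2, 4), nrow = 3,
    dimnames = list(c("apple", "banana", "cherry"), c("d1", "d2")))

# TF: divide each count by the largest count in its document (column)
TF <- sweep(f, 2, apply(f, 2, max), "/")

# IDF: log(N / number of documents containing the word)
N <- ncol(f)
df <- rowSums(f > 0)
IDF <- log(N/df)

TFIDF <- TF * IDF  # 'cherry' scores highest in d2; words in every document score 0
round(TFIDF, 3)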

5.6.2.1 What is a hash function, a hash table?

A hash function converts input data (a key) into a fixed-size numeric value called a hash code or hash value. A hash table uses the hash code of a key to decide where to store the associated value, so data can later be stored or retrieved by key.

Arrays and hash tables differ in these ways (a small R sketch follows the list):

  1. Arrays
  • Require a numeric index
  • Need the exact index to retrieve the data
  • Fixed size in memory
  2. Hash tables
  • The index is calculated from the key (a string, number, etc.)
  • Only the key is needed, not the position
  • Dynamic in size: when performance degrades, the table is resized, keeping expected O(1) retrieval
  3. What does a good hash function look like?
  • Deterministic: reliably produces the same output for the same input
  • Distributes keys uniformly over buckets
  • Fast and efficient to compute
  • Minimizes collisions: different inputs should rarely produce the same hash value
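
A minimal key/value sketch in R using a hashed environment (R environments hash their names internally); this is only an illustration of the idea, not a particular library:

# a hashed environment used as a simple key -> value store
tbl <- new.env(hash = TRUE)

assign("alice", 90, envir = tbl)  # insert by key
assign("bob", 75, envir = tbl)

get("alice", envir = tbl)  # retrieve by key
## [1] 90
exists("carol", envir = tbl, inherits = FALSE)  # membership test
## [1] FALSE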

Examples

library(reticulate)  # mmh3 is a Python library; call it from R via reticulate
mmh3 <- import("mmh3")
py$mmh3 <- mmh3  # expose mmh3 to py_run_string() below

# 32-bit hash
mmh3$hash("Hello")
## [1] 316307400
mmh3$hash("Hello", seed = 5L, signed = TRUE)
## [1] -196410714
# 64-bit and 128-bit hashes
mmh3$hash64(key = "Hello", seed = 0L, x64arch = TRUE, signed = TRUE)
## [[1]]
## [1] 1440007196
## 
## [[2]]
## [1] 689067332
mmh3$hash64("Hello", seed = 0L, x64arch = FALSE, signed = TRUE)
## [[1]]
## [1] 593538630
## 
## [[2]]
## [1] -1387940876
py_run_string("hash128_str = str(mmh3.hash128('Hello'))")
hash128_str <- py$hash128_str
print(hash128_str)
## [1] "212681241822374483335035321234914329628"

5.6.2.2 Hashing for managing massive data sets

MapReduce is a system for implementing large-scale calculations on computing clusters. Efficient algorithms come from managing communication cost, which can be traded off against parallelism.
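
A minimal single-machine sketch of the MapReduce pattern (word count); the map, shuffle (group by key), and reduce steps below are what a system such as Hadoop or Spark would distribute and parallelize across a cluster:

docs <- c("the cat sat", "the cat ran", "a dog ran")

# Map: emit a (word, 1) pair for every word in every document
pairs <- unlist(lapply(docs, function(d) {
    words <- strsplit(d, " ")[[1]]
    setNames(rep(1, length(words)), words)
}))

# Shuffle: group the emitted values by key (word)
groups <- split(unname(pairs), names(pairs))

# Reduce: sum the values for each key
counts <- sapply(groups, sum)
counts[["the"]]
## [1] 2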

5.6.2.3 Scalable hypothesis test algorithms for data mining

Exercise 1.3.1
Suppose there is a repository of ten million documents. What (to the nearest integer) is the IDF for a word that appears in (a) 40 documents (b) 10,000 documents?

\[ \text{IDF} = \log \left(\frac{N}{df}\right) \]

\(N\) = total number of documents; \(df\) = number of documents containing the word. (The calculations below use the natural log; the textbook uses \(\log_2\), which gives larger values.)

N <- 1e+07
df <- 40

IDF <- round(log(N/df))
print(IDF)
## [1] 12
df <- 10000
IDF <- round(log(N/df))
print(IDF)
## [1] 7

Exercise 1.3.2
Suppose there is a repository of ten million documents, and word \(w\) appears in 320 of them. In a particular document \(d\), the maximum number of occurrences of a word is 15. Approximately what is the TF.IDF score for \(w\) if that word appears (a) once (b) five times?

\[ TF = \frac{f_{w,d}}{\max_k f_{k,d}} \]

and

\[ IDF = \log \left(\frac{N}{df}\right) \]

\[ TF \times IDF \]

N <- 1e+07  # documents in the repository
df <- 320   # documents containing word w
d <- 15     # maximum number of occurrences of any word in document d

IDF <- round(log(N/df))
print(IDF)
## [1] 10
w <- 1
TF <- w/d
TFIDF <- TF * IDF
print(TFIDF)
## [1] 0.6666667
w <- 5
TF <- w/d
TFIDF <- TF * IDF
print(TFIDF)
## [1] 3.333333

Exercise 1.3.3
Suppose hash-keys are drawn from the population of all nonnegative integers that are multiples of some constant \(c\), and hash function \(h(x)\) is \(x \bmod 15\). For what values of \(c\) will \(h\) be a suitable hash function, i.e., a large random choice of hash-keys will be divided roughly equally into buckets?

Hash function: \(h(x) = x \bmod 15\)

# gcd (greatest common divisor)
gcd <- function(a, b) {
    if (b == 0) return(a)
    gcd(b, a%%b)
}

# Check values of c from 1 to 15
c_values <- 1:15

# Compute gcd with 15 for each c
gcd_values <- sapply(c_values, function(c) gcd(c, 15))

# Find values of c where gcd == 1
coprime_c <- c_values[gcd_values == 1]

print(coprime_c)
## [1]  1  2  4  7  8 11 13 14

\(h\) is suitable when \(c\) is not divisible by 3 or 5, i.e., when \(c\) is relatively prime to 15; then the multiples of \(c\) fall evenly across all 15 buckets.

Exercise 1.3.4
In terms of \(e\), give approximations to
(a) \((1.01)^{500}\)
(b) \((1.05)^{1000}\)
(c) \((0.9)^{40}\)

# (a) (1.01)^500 = (1 + a)^b with a = 0.01, b = 500, so approximately e^(a*b) = e^5
a <- 0.01
b <- 500

1.01^500
## [1] 144.7728
exp(0.01 * 500)
## [1] 148.4132
# (b) (1.05)^1000 = (1 + a)^b with a = 0.05, b = 1000, so approximately e^(a*b) = e^50
a <- 0.05
b <- 1000

1.05^1000
## [1] 1.546319e+21
exp(0.05 * 1000)
## [1] 5.184706e+21
# (c) (0.9)^40 = (1 - 0.1)^40, so approximately e^(-0.1 * 40) = e^-4
0.9^40
## [1] 0.01478088
(1 - 0.1)^40
## [1] 0.01478088
exp(-0.1 * 40)
## [1] 0.01831564
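
These approximations rest on the identity \((1+a)^{1/a} \approx e\) for small \(a\):

\[ (1+a)^{b} = \left[(1+a)^{1/a}\right]^{ab} \approx e^{ab}, \qquad \text{so } (1.01)^{500} \approx e^{5}, \quad (1.05)^{1000} \approx e^{50}, \quad (0.9)^{40} \approx e^{-4}. \]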

Exercise 1.3.5
Use the Taylor expansion of \(e^{x}\) to compute, to three decimal places:
(a) \(e^{1/10}\)
(b) \(e^{-1/10}\)
(c) \(e^{2}\)

Taylor’s expansion

\[ e^{x} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots \]

# Taylor's function
exp_taylor <- function(x, n = 10) {
    sum <- 0
    for (i in 0:(n - 1)) {
        sum <- sum + (x^i)/factorial(i)
    }
    return(sum)
}

# a
e_x1 <- exp_taylor(1/10, 10)

# b
e_x2 <- exp_taylor(-1/10, 10)

# c
e_x3 <- exp_taylor(2, 10)
print(c(e_x1, e_x2, e_x3))
## [1] 1.1051709 0.9048374 7.3887125

5.6.3 9/10 Data mining massive and streaming data, Part 1

5.6.3.1 The nature of streaming data

5.6.3.2 Architectures for streaming data

5.6.3.3 Filters and counting for streams

5.6.3.4 Updating statistics with streams

5.6.3.5 Quantile estimation for streams

Assignment 2

5.6.4 9/17 Data mining massive and streaming data, Part 2

Assignment 3

5.6.5 9/24 Mining social-network graphs

5.6.5.1 Nature of social-network graphs

5.6.5.2 Centrality and influence

5.6.5.3 Clustering graphs

5.6.5.4 Spectral decomposition of graphs

5.6.5.5 Overlapping communities – time permitting

Assignment 4

5.6.6 10/1 Similarity search at massive scale, Part 1

5.6.6.1 Distance measures

5.6.6.2 Similarity measures

5.6.6.3 Search with KD-trees

5.6.6.4 Approximate similarity search at massive scale

5.6.6.5 Indexes for massive similarity search

5.6.6.6 Evaluation of approximate similarity search algorithms

Assignment 5

5.6.7 10/8 Similarity search at massive scale, Part 2

Assignment 6

5.6.9 10/22 Learning to rank for search and recommendation, Part 1

5.6.9.1 Embedding high cardinality features

5.6.9.2 Content based methods

5.6.9.3 Collaborative filtering

5.6.9.4 Latent variable algorithms

5.6.9.5 The PageRank algorithm

5.6.9.6 Priority queues and heaps

Assignment 8

5.6.10 10/29 Learning to rank for search and recommendation, Part 2

Assignment 9

5.6.11 11/5 Clustering Algorithms, Part 1

5.6.11.1 Finding structure in data

5.6.11.2 The k-means algorithm

5.6.11.3 Problems with evaluating clustering results

5.6.11.4 Probabilistic clustering, dealing with uncertainty

5.6.11.5 Agglomerative hierarchical clustering and k-medoids algorithms for non-Euclidean spaces

5.6.11.6 Density-based clustering algorithms

5.6.11.7 Graph-based clustering algorithms

5.6.11.8 Database scale clustering algorithms

Assignment 10

5.6.12 11/12 Clustering Algorithms, Part 2

Assignment 11

5.6.13 11/19 Dimensionality reduction

5.6.13.1 Eigenvalues and PCA

5.6.13.2 Singular value decomposition and its interpretation

5.6.13.3 Manifold learning and nonlinear spaces

5.6.13.4 Spectral embedding

5.6.13.5 The UMAP algorithm

5.6.14 11/23

Final project proposal due

5.6.15 12/3 Frequent item sets; market basket analysis

5.6.15.1 Market basket models

5.6.15.2 The A-Priori algorithm

5.6.15.3 Limited pass algorithms

5.6.15.4 Streaming algorithms – time permitting

Prepare final project

5.6.16 12/10 Additional Topics

Prepare final project

5.6.17 12/17: Final Project Due