Generate Data from a Mixture Model
- Generate 500 samples from a mixture with two normal components via a two-stage process: first generate the latent variables Z’s and then the X samples.
n=500; Z = sample(c(1:2), n, replace=TRUE) X = rnorm(n) X[Z==1] = X[Z==1] -1; X[Z==2] = X[Z==2] +1;
- Draw the histogram and density curve.
hist(X, freq=FALSE, col="gray", breaks=25) curve(0.5/sqrt(2*pi)*exp(-(x+1)^2/2)+0.5/sqrt(2*pi)*exp(-(x-1)^2/2), col="red", add=TRUE)
- The density curve still looks like unimodal. If we change the two locations to be far apart, then you’ll see the two modes.
mu=2; X = rnorm(n) X[Z==1] = X[Z==1] - mu; X[Z==2] = X[Z==2] + mu; hist(X, freq=FALSE, col="gray", breaks=25) curve(0.5/sqrt(2*pi)*exp(-(x+mu)^2/2)+0.5/sqrt(2*pi)*exp(-(x-mu)^2/2), col="red", add=TRUE)
Build an Email Spam Filter
- Check [here] for detailed information on the spam data.
- Number of Instances: 4601 (1813 Spam = 39.4%)
- Number of Attributes: 58 (57 continuous, 1 nominal class label)
- The last column ‘Y’ (class label) denotes whether the e-mail was
considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
- Load the data [spam.txt], as well as the Attribute names [spam_name], into R.
spam=read.table("spam.txt"); spam=data.frame(spam); spam.name=read.table("spam_name.txt", as.is=TRUE); for(i in 1:57) names(spam)[i]=as.character(spam.name[i,]); names(spam)[58]="Y"; spam$Y = as.factor(spam$Y)
- Take a look of the data
spam[1:5,] summary(spam) table(spam$Y)
- Fit the data with a mixture model with two normal components:
- One component is for spam emails, and
- one is for normal emails.Train a mixture model with two normal components:
For simplicity, assume the covariance matrices are the same. What are the parameters?
- Two mixing weights: w1, w2
- Two 57-dim mean vectors: mu1, mu2
- One 57-by-57 covariance matrix: S
How to estimate those parameters? (Not difficult.)
library(MASS) mymodel = lda(Y~., spam) mymodel$prior mymodel$means # Here I didn't show you the estimate of S.
- How to evaluate the effectiveness of this classification model? Divide the data into two parts: training and testing (500 test samples).
Give the 500 test samples to someone else. Train the classification model based on the training data (4701-500 = 4101 samples), and then predict each test email is a spam or not a spam based on its 57 features. Finally compare your prediction with the true labels for the 500 test samples.
# select 500 rows to be test and the remaining to be training n = dim(spam)[1] test.id = sample(1:n, 500) testdata = spam[test.id, ] traindata = spam[-test.id, ]; # train our model based on the training data mymodel = lda(Y~., traindata) # predict the test samples predictions = predict(mymodel, testdata)$class # Compare our prediction with the true label table(testdata$Y, predictions)