Rcode: Mixture Distributions

Generate Data from a Mixture Model

  • Generate 500 samples from a mixture with two normal components via a two-stage process: first generate the latent variables Z’s and then the X samples.
    n=500; 
    Z = sample(c(1:2), n, replace=TRUE)   
    X = rnorm(n)
    X[Z==1] = X[Z==1] -1; 
    X[Z==2] = X[Z==2] +1; 
    
  • Draw the histogram and density curve.
    hist(X, freq=FALSE, col="gray", breaks=25)
    curve(0.5/sqrt(2*pi)*exp(-(x+1)^2/2)+0.5/sqrt(2*pi)*exp(-(x-1)^2/2),
          col="red", add=TRUE)
    
  • The density curve still looks like unimodal. If we change the two locations to be far apart, then you’ll see the two modes.
    mu=2; 
    X = rnorm(n)
    X[Z==1] = X[Z==1] - mu; 
    X[Z==2] = X[Z==2] + mu; 
    
    hist(X, freq=FALSE, col="gray", breaks=25)
    curve(0.5/sqrt(2*pi)*exp(-(x+mu)^2/2)+0.5/sqrt(2*pi)*exp(-(x-mu)^2/2),
          col="red", add=TRUE)
    

Build an Email Spam Filter

  • Check [here] for detailed information on the spam data.
    • Number of Instances: 4601 (1813 Spam = 39.4%)
    • Number of Attributes: 58 (57 continuous, 1 nominal class label)
    • The last column ‘Y’ (class label) denotes whether the e-mail was
      considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
  • Load the data [spam.txt], as well as the Attribute names [spam_name], into R.
    spam=read.table("spam.txt");
    spam=data.frame(spam);
    spam.name=read.table("spam_name.txt", as.is=TRUE);
    for(i in 1:57) 
    names(spam)[i]=as.character(spam.name[i,]);
    names(spam)[58]="Y";
    spam$Y = as.factor(spam$Y) 
    
  • Take a look of the data
    spam[1:5,]
    summary(spam)
    table(spam$Y)
    
  • Fit the data with a mixture model with two normal components:
    • One component is for spam emails, and
    • one is for normal emails.Train a mixture model with two normal components:

    For simplicity, assume the covariance matrices are the same. What are the parameters?

    • Two mixing weights: w1, w2
    • Two 57-dim mean vectors: mu1, mu2
    • One 57-by-57 covariance matrix: S

    How to estimate those parameters? (Not difficult.)

    library(MASS)
    mymodel = lda(Y~., spam)
    mymodel$prior
    mymodel$means
    # Here I didn't show you the estimate of S. 
    
  • How to evaluate the effectiveness of this classification model? Divide the data into two parts: training and testing (500 test samples).

    Give the 500 test samples to someone else. Train the classification model based on the training data (4701-500 = 4101 samples), and then predict each test email is a spam or not a spam based on its 57 features. Finally compare your prediction with the true labels for the 500 test samples.

    # select 500 rows to be test and the remaining to be training
    n = dim(spam)[1]
    test.id = sample(1:n, 500)   
    testdata = spam[test.id, ]
    traindata = spam[-test.id, ]; 
    
    # train our model based on the training data
    mymodel = lda(Y~., traindata)
    # predict the test samples
    predictions = predict(mymodel, testdata)$class
    # Compare our prediction with the true label
    table(testdata$Y, predictions)