In this problem we are going to try out the metric unfolding version of
SMACOF
("Scaling by Maximizing a Convex Function" or "Scaling by Majorizing a Complicated Function") which is discussed in
Chapter 8 of Borg and Groenen. We are going to apply it to the 2008 NES Candidate Feeling Thermometers.
Download the R program:
# THIS IS THE ONE DIMENSION SECTION OF THE PROGRAM
# ONE DIMENSIONAL RESULTS
#
zmetric1 <- result1$conf.col
zmetric1 <- as.numeric(zmetric1)
#dim(zmetric) <- c(nq,ndim) We do not need this because we have a one-dimensional vector
xmetric1 <- result1$conf.row
xmetric1 <- as.numeric(xmetric1)
#dim(xmetric) <- c(np,ndim)
sse11 <- 0.0
sse22 <- 0.0
for (i in 1:np) {
for(j in 1:nq) {
dist_i_j <- 0.0
#
# Calculate distance between points
#
dist_i_j <- dist_i_j+ (xmetric1[i]-zmetric1[j])*(xmetric1[i]-zmetric1[j])
sse11 <- sse11 + ((TEIGHT[i,j]) - sqrt(dist_i_j))*((TEIGHT[i,j]) - sqrt(dist_i_j))
sse22 <- sse22 + ((TEIGHT[i,j]) - sqrt(dist_i_j))*((TEIGHT[i,j]) - sqrt(dist_i_j))*weightmat[i,j]
}
}
obama.voters <- xmetric1[zz[,3]==1] zz[,3] has the who the respondent voted for
mccain.voters <- xmetric1[zz[,3]==3]
non.voters <- xmetric1[NOTVOTE==1 | NOTVOTE==2 | NOTVOTE==3]
# Note the logic here
obamaShare <- length(obama.voters)/(length(obama.voters)+length(mccain.voters)+length(non.voters))
mccainShare <- length(mccain.voters)/(length(obama.voters)+length(mccain.voters)+length(non.voters))
nonShare <- length(non.voters)/(length(obama.voters)+length(mccain.voters)+length(non.voters))
#
gvdens <- density(obama.voters) Get a smoothed histogram of the voters
gvdens$y <- gvdens$y*obamaShare Adjust the heigth of the density by the share of the vote
# This makes the three densities add up to 1
bvdens <- density(mccain.voters)
bvdens$y <- bvdens$y*mccainShare
#
nvdens <- density(non.voters)
nvdens$y <- nvdens$y*nonShare
# Simple trick to make the 3 densities fit nicely in the graph
maxheight <- 1.1*(max(gvdens$y,bvdens$y,nvdens$y))
windows()
plot(gvdens,xlab="",ylab="",
main="",
xlim=c(-1.5,1.5),ylim=c(0,maxheight),type="l",lwd=3,col="red",font=2)
lines(bvdens,lwd=3,col="blue")
lines(nvdens,lwd=3,col="black")
# Main title
mtext("2008 Thermometer Scaling\nObama and McCain Voters and Non-Voters",side=3,line=1.50,cex=1.2,font=2)
# x-axis title
mtext("Liberal - Conservative",side=1,line=2.75,cex=1.2)
# y-axis title
mtext("Density",side=2,line=2.5,cex=1.2)
# This puts the proportions each candidate gets onto the graph
text(-1.0,0.53,paste("Obama Voters ",
100.0*round(obamaShare, 3)),col="red",font=2)
text(-1.0,0.5,paste("McCain Voters ",
100.0*round(mccainShare, 3)),col="blue",font=2)
text(-1.0,0.47,paste("Non Voters ",
100.0*round(nonShare, 3)),col="black",font=2)
#
- Run nes2008_thermometers_smacof_2.r and turn in the two plots
AFTER YOU HAVE NEATLY
FORMATTED THEM!! In particular, be sure that Obama is on the left side of the plots and McCain is on the
right side of the plots. Also, you should adjust the candidate names that are placed on top of the voter distributions
so that they can be easily seen.
- Report result$stress, sse1, sse2, result1$stress, sse11, sse22.
- Report zmetric1. Relative to xmetric1, do the results
make sense to you? Explain.
In this problem we are going continue trying out the metric unfolding version of
SMACOF
("Scaling by Maximizing a Convex Function" or "Scaling by Majorizing a Complicated Function") which is discussed in
Chapter 8 of Borg and Groenen. We are going to apply it to the 1968 NES Candidate Feeling Thermometers and compare it
to MLSMU6.FOR -- a metric unfolding program I wrote in 1978.
Download the R program:
# Double-Center Function for a rectangular squared distance matrix
doubleCenterRect2 <- function(T){
p <- dim(T)[1]
n <- dim(T)[2]
-(T-matrix(apply(T,1,mean),nrow=p,ncol=n)-
t(matrix(apply(T,2,mean),nrow=n,ncol=p))+ mean(T))/2
}
#
etc etc etc
TTSQDC <- doubleCenterRect2(TTSQ) TTSQ is TT squared with the squared mean in the
# Missing entries
# Perform Singular Value Decomposition
#
xsvd <- svd(TTSQDC)
#
# The Two Lines Below Put the Singular Values in a
# Diagonal Matrix -- The first one creates an
# identity matrix and the second command puts
# the singular values on the diagonal
#
Lambda <- diag(nq)
diag(Lambda) <- xsvd$d
#
# Compute U*LAMBDA*V' for check below
#
XREPRODUCE <- xsvd$u %*% Lambda %*% t(xsvd$v)
ssecheck <- sum(abs(TTSQDC-XREPRODUCE)) Check the Decomposition
#
# Starting Coordinates for MLSMU6 Algorithm
#
zz <- xsvd$v[,1:2] Use the Left Singular Vectors as the Stimuli Coordinate Starts
#
xx <- rep(0,np*ndim)
dim(xx) <- c(np,ndim) Use the Right Singular Vectors as the Stimuli Coordinate Starts
xx[,1] <- xsvd$u[,1]*sqrt(xsvd$d[1]) Multiplied by the Square Root of the Singular Values
xx[,2] <- xsvd$u[,2]*sqrt(xsvd$d[2])
###################################################
### chunk number 15:
###################################################
##Here is suma over the NQ dimensions This Function calculates the Sum of Squared Error
sumaj <- function(i){ for the ith Row
jjj <- 1:nq
j <- jjj[!is.na(T[i,])] Here is the trick to have j only range over the non-missing values
s <- (xx[i,1]-zz[j,1])^2+(xx[i,2]-zz[j,2])^2 Note that "j" is a vector so this is a vector calculation
s=sqrt(s)
sx=TEIGHT[i,j]
sum((s-sx)^2) Here is the Sum of Squared Error
}
## now get SUMA over NP dimension
#xx <- xmetric # Used as a check on the MLSMU6 Algorithm
#zz <- zmetric
sumvector <- sapply(1:np,sumaj)
suma <- sum(sumvector) Here is the Total Sum of Squared Error
xxx <- xx
zzz <- zz
kp=0
ktp=0
sumalast <- 100000.00
SAVEsumalast <- sumalast
for(loop in 1:50){
xxx[,1:2] <- 0
zzz[,1:2] <- 0
kp=kp+1
ktp=ktp+1
for(i in 1:np){ This For Loop is doing the sum of the line equations for the Stimuli
for(j in 1:nq){
if(!is.na(T[i,j])){
s=0
for(k in 1:ndim)
s <- s+(xx[i,k]-zz[j,k])^2
xc <- ifelse(s==0, 1.0, TEIGHT[i,j]/sqrt(s))
for(k in 1:ndim)
zzz[j,k]=zzz[j,k]+(xx[i,k]-xc*(xx[i,k]-zz[j,k]))/xcol[j]
}
}
}
for(k in 1:ndim){ This For Loop is doing the sum of the line equations for the Respondents
for(i in 1:np){
sw <- 0.0
for(j in 1:nq){
if(!is.na(T[i,j])){
s=0
for(kk in 1:ndim)
s <- s+(xx[i,kk]-zzz[j,kk])^2
xc <- ifelse(s==0, 1.0, TEIGHT[i,j]/sqrt(s))
xxx[i,k]=xxx[i,k]+(zzz[j,k]-xc*(zzz[j,k]-xx[i,k]))
}
}
xxx[i,k]=xxx[i,k]/xrow[i]
}
}
xx <- xxx
zz <- zzz
sumvector <- sapply(1:np,sumaj)
suma <- sum(sumvector) Calculate Current SSE
#
print(c(loop,suma))
done=((sumalast-suma)/suma)<0.0005 Compare the Current with the Previous SSE
sumalast=suma
if(done) break If Convergence Break out of the Loop
}
- Run nes68_thermometers_smacof_3.r and turn in the plot that it makes.
- Report result$stress, sse1, sse2, and sumalast (this will appear on your screen).
- Make two panel plots of the SMACOF configurations against the
MLSMU6 configurations. Namely, rotate xx to best match
xmetric and apply that same rotation matrix to zz. That is,
one two-panel plot shows the respondents and one two panel plot shows the 12 Political figures. Report the
rotation matrix, T, and the before and after r-squares by dimension.
In this problem we are going to compare probit and logit coefficients
using the analysis in my book on pages 37-40 (also, work through
Notes on the Geometry of Logit and Probit).
Download the R program:
#
# bush_v_gore_example_h106.r -- GLM Examples
#
rm(list=ls(all=TRUE))
#
library(MASS)
library(foreign)
#
setwd("C:/uga_course_homework_11_2015/")
data <- read.dta("hdmg106_2009_fixed.dta") Read in the Stata Data set
attach(data,warn.conflicts = FALSE)
# Grab all the variables
T <- cbind(bush00,gore00,black00,south,hispanic00,income,owner00,dwnom1n,dwnom2n,ybush,ygore)
TT <- T
mode(TT) <- "double" Name the variables -- handy below
colnames(TT) <- c("Bushvote","Gorevote","blackpct","southdum","hispandum","income","ownerhome",
"dwnom1n","dwnom2n","ybush","ygore")
#
nrow <- length(TT[,1])
ncol <- length(TT[1,])
#
#
# STATA OUTPUT FOR REFERENCE
# &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
#. summ
#
# Variable | Obs Mean Std. Dev. Min Max
#-------------+--------------------------------------------------------
# bush00 | 432 46.81019 14.36873 6 80
# gore00 | 432 49.71296 14.005 20 93
# black00 | 432 12.68333 16.25699 .3 96
# south | 432 .3101852 .4631056 0 1
# hispanic00 | 432 12.31597 16.25281 .6 86
#-------------+--------------------------------------------------------
# income | 432 30.75224 8.319547 15.06 57.219
# owner00 | 432 65.86875 11.92404 8 84.3
# dwnom1n | 432 .0586875 .4671698 -.815 1.269
# dwnom2n | 432 -.0381065 .4706545 -1.123 1.27
# ybush | 432 .4861111 .5003865 0 1 1 if Bush > 50% in a CD, 0 otherwise
#-------------+--------------------------------------------------------
# ygore | 432 .4375 .4966535 0 1
#. probit ybush black00 south hispanic00 income owner00 dwnom1n dwnom2n
#
#Probit regression Number of obs = 432
# LR chi2(7) = 338.29
# Prob > chi2 = 0.0000
#Log likelihood = -130.12789 Pseudo R2 = 0.5652
#
#------------------------------------------------------------------------------
# ybush | Coef. Std. Err. z P>|z| [95% Conf. Interval]
#-------------+----------------------------------------------------------------
# black00 | -.0285973 .0097913 -2.92 0.003 -.0477878 -.0094067
# south | .7695862 .2545783 3.02 0.003 .2706219 1.268551
# hispanic00 | -.0089458 .0069163 -1.29 0.196 -.0225015 .0046099
# income | -.0241489 .0126394 -1.91 0.056 -.0489217 .0006239
# owner00 | .0235461 .0143687 1.64 0.101 -.0046161 .0517083
# dwnom1n | 2.761974 .2619813 10.54 0.000 2.2485 3.275448
# dwnom2n | 1.136417 .2339829 4.86 0.000 .6778186 1.595015
# _cons | -.9697998 1.077738 -0.90 0.368 -3.082127 1.142528
#------------------------------------------------------------------------------
# How to run a Probit in R
probitbush <-glm(ybush ~ black00+south+hispanic00+income+owner00+dwnom1n+dwnom2n,family=binomial(link=probit))
sumprobitbush <- summary(probitbush) Get the Summary
#
# ---- Useful Commands To See What is in an Object
#
# > length(sumprobitbush)
# [1] 17
# > class(sumprobitbush)
# [1] "summary.glm"
# > names(sumprobitbush) The coefficients are in sumprobitbush$coefficients[1:8]
# [1] "call" "terms" "family" "deviance"
# [5] "aic" "contrasts" "df.residual" "null.deviance"
# [9] "df.null" "iter" "deviance.resid" "coefficients"
# [13] "aliased" "dispersion" "df" "cov.unscaled"
# [17] "cov.scaled"
#
# How to run a Logit in R
logitbush <-glm(ybush ~ black00+south+hispanic00+income+owner00+dwnom1n+dwnom2n,family=binomial(link=logit))
sumlogitbush <- summary(logitbush) Get the Summary
- Run the program and report all the output NEATLY FORMATTED produced by
sumprobitbush and sumlogitbush.
- Calculate Cosineq (Cosine Theta if you use a handicapped browser) between the
probit and logit coefficients (see page 40 of my book).
- The probit and logit summaries report
Null Deviance, Residual Deviance, and AIC.
Null Deviance is the deviation of a model that contains only the intercept term, that is, a
fixed probability for all observations. Residual Deviance is simply -2*Log-Likelihood.
In this instance, N1=210, the number of "1"s, N0=222, the number of "0"s, K=8, the number of coefficients, and the
formulas are:
Null Deviance = -2*N1*[log(N1/(N1+N2))] -2*N2*[log(N2/(N1+N2))]
Residual Deviance = -2*[LOG LIKELIHOOD]
Akaike Information Criterion (AIC) = -2*[LOG LIKELIHOOD] + 2*K
Calculate these three statistics for the probit output. Show all of your calculations.
In this problem we are going to compare ordered probit and
ordered logit coefficients
using the analysis in my book on pages 37-40 (also, work through
Notes on the Geometry of Logit and Probit).
Download the R program:
#
rm(list=ls(all=TRUE))
library(foreign)
library(MASS)
library(stats)
setwd("C:/uga_course_homework_11_2015")
#
# READ STATA FILE HERE
data <- read.dta("nes1968_first.dta") Read in the Stata Data set
attach(data,warn.conflicts = FALSE)
# Grab only the variables we need
T <- cbind(VAR00120,VAR00156,VAR00261,VAR00263,VAR00264,VAR00478,VAR00479,VAR00480,VAR00533)
TT <- T
mode(TT) <- "double" We do not use these variable names, we just need to keep track
colnames(TT) <- c("partyid2","education5","income2","sex2","black2","wallace2","humphrey2","nixon2","age2")
#
# partyid -- 0,1,2,3,4,5,6 The Codes we need
# education -- 00 - 22=8th grade or less, 31-61=9th-12th grades, 71=some college, 81=BS/BA, 82-87, advanced
# income -- 10 - 35
# sex -- 1=male, 2=female
# black -- 1=white, 2=black
# wallace -- 0 - 97
# humphrey -- 0 - 97
# nixon -- 0 - 97
# age -- 20 - 93
#
partyidx <- VAR00120 Recode the variables and insert NA's for missing data
partyid2 <- ifelse(partyidx > 6,NA,partyidx)
#
educationx <- VAR00156 Note the Logic here
education1 <- ifelse(educationx < 23,1,educationx)
education2 <- ifelse(((23 < educationx) & (educationx < 62)),2,education1)
education3 <- ifelse(educationx==71,3,education2)
education4 <- ifelse(((72 < educationx) & (educationx <= 87)),4,education3)
education5 <- ifelse(educationx > 87,NA,education4)
#
incomex <- VAR00261
income2 <- ifelse(incomex > 35,NA,incomex)
#
sex <- VAR00263 - 1 0=Male, 1=Female
#
racex <- VAR00264
race1 <- ifelse(racex==1,0,racex) White = 0
race2 <- ifelse(racex==2,1,race1) Black = 1
race3 <- ifelse(racex>2,NA,race2)
#
wallacex <- VAR00478 Feeling Thermometers, max=97
wallace1 <- ifelse(wallacex>97,NA,wallacex)
#
humphreyx <- VAR00479
humphrey1 <- ifelse(humphreyx>97,NA,humphreyx)
#
nixonx <- VAR00480
nixon1 <- ifelse(nixonx>97,NA,nixonx)
#
agex <- VAR00533 Maximum age in the 1968 data was 93
age1 <- ifelse(agex>93,NA,agex)
#
# VARIABLES
# partyid2, education5, income2, sex, reace3, wallace1, humphrey1, nixon1, age1
# Stick Variables in a Matrix
XDATA <- cbind(partyid2,education5,income2,sex,race3,wallace1,humphrey1,nixon1,age1)
YDATA <- na.omit(XDATA) List-Wise Deletion -- NEAT COMMAND
#
# ORDERED PROBIT
#
scarecrow <- polr(as.ordered(YDATA[,1]) ~ YDATA[,2]+YDATA[,3]+YDATA[,4]+YDATA[,5]+YDATA[,6]+YDATA[,7]+YDATA[,8]+YDATA[,9],
method="probit")
#
# ORDERED LOGIT
#
blackcat <- polr(as.ordered(YDATA[,1]) ~ YDATA[,2]+YDATA[,3]+YDATA[,4]+YDATA[,5]+YDATA[,6]+YDATA[,7]+YDATA[,8]+YDATA[,9],
method="logistic")
- Run the program and report all the output NEATLY FORMATTED produced by
summary(scarecrow) and summary(blackcat).
- Calculate Cosineq (Cosine Theta if you use a handicapped browser) between the
ordered probit and ordered logit coefficients
(see page 40 of my book).
- The ordered probit and ordered logit summaries report
Residual Deviance, and AIC.
Calculate these two statistics for both the ordered probit and
ordered logit outputs. Show all of your calculations.