Monday, 3 December 2012

Concordance and Discordance in Logistic Regression

If you run a logistic regression in SAS, you get a table that summarizes the association of predicted probabilities and observed responses. In it, SAS reports %Concordance, %Discordance, %Tied and Pairs. The question is how SAS calculates these numbers.

Let me explain with a simple example in R.
Consider the data 'admission' with 4 variables. 'admit' is the dependent variable, i.e. the variable
that we predict using the variables gre, gpa and rank.

#***R CODE FOR DATA CREATION***#
admit=c(0,0,1,0,0,1,0,0,0,1,0,1,0,0,1)
gre =c(636,660,800,640,520,760,487,890,765,345,456,675,666,546,786)
gpa=c(3.61,3.67,4,3.19,2.93,3,2.98,3.4,3.2,1.98,4,5.1,3.3,5.1,4.7)
rank=c(3,3,1,4,4,2,4,4,4,3,3,3,2,2,1)
admission=data.frame(admit,gre ,gpa,rank)
#***R CODE FOR DATA CREATION ENDS***#

Fit a logistic regression model.

#***FITTING LOGISTIC REGRESSION***#
model=glm(admit~., family="binomial", data=admission )
#***FITTING LOGISTIC REGRESSION ENDS***#

STEPS TO CALCULATE %CONCORDANCE AND %DISCORDANCE

1) Predict the dependent variable in dataset 'admission' using the model.
2) Create another dataset with only two columns. One column is the observed dependent variable and the other is the predicted probability.
3) Divide the newly created data into two datasets such that one dataset contains all observations whose observed dependent variable is 1 (call it one) and the other contains all observations whose observed dependent variable is 0 (call it Zero).
4) Compare each predicted value in dataset one with each predicted value in dataset Zero. So you have a total of n*m pairs of type (x,y) to compare.
   n: Number of observations in dataset one
   m: Number of observations in dataset Zero
   x: Predicted value of a candidate from dataset one
   y: Predicted value of a candidate from dataset Zero
5) Pairs in which x is greater than y are concordant pairs
6) Pairs in which x is less than y are discordant pairs; pairs in which x equals y are tied pairs
7) % Concordance = 100 * #(concordant pairs)/Total # pairs
   % Discordance = 100 * #(discordant pairs)/Total # pairs
   % Tied = 100 * #(tied pairs)/Total # pairs

#***FUNCTION TO CALCULATE CONCORDANCE AND DISCORDANCE***#
Association=function(ModelName)
{
# Column 1: observed response, column 2: predicted probability
# (use the argument ModelName, not the global object 'model')
Con_Dis_Data = cbind(ModelName$y, ModelName$fitted.values)
# drop=FALSE keeps a matrix even if a group has a single observation
ones = Con_Dis_Data[Con_Dis_Data[,1] == 1, , drop=FALSE]
zeros = Con_Dis_Data[Con_Dis_Data[,1] == 0, , drop=FALSE]
conc=matrix(0, dim(zeros)[1], dim(ones)[1])
disc=matrix(0, dim(zeros)[1], dim(ones)[1])
ties=matrix(0, dim(zeros)[1], dim(ones)[1])
# Compare every predicted value in 'ones' with every predicted value in 'zeros'
for (j in seq_len(dim(zeros)[1]))
{
for (i in seq_len(dim(ones)[1]))
{
if (ones[i,2]>zeros[j,2])
{conc[j,i]=1}
else if (ones[i,2]<zeros[j,2])
{disc[j,i]=1}
else if (ones[i,2]==zeros[j,2])
{ties[j,i]=1}
}
}
Pairs=dim(zeros)[1]*dim(ones)[1]
PercentConcordance=(sum(conc)/Pairs)*100
PercentDiscordance=(sum(disc)/Pairs)*100
PercentTied=(sum(ties)/Pairs)*100
return(list("Percent Concordance"=PercentConcordance,"Percent Discordance"=PercentDiscordance,"Percent Tied"=PercentTied,"Pairs"=Pairs))
}
#***FUNCTION TO CALCULATE CONCORDANCE AND DISCORDANCE ENDS***#

Code to call the above function: Association(model)
This will give you 
1) Percent Concordance
2) Percent Discordance
3) Percent Tied
4) Pairs

Note:
There is also a relation between %concordance and the Area Under the ROC Curve (AUC).
AUC (in percent) = %concordant + (0.5 * %tied)
Divide by 100 to get AUC on the usual 0 to 1 scale.
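
The relation above can be checked on a small example. The sketch below is only an illustration: the values in obs and pred are made-up toy numbers (not the admission data), and outer() is used in place of the double loop. It compares the AUC obtained from the concordant and tied pairs with the AUC obtained from the rank-based (Mann-Whitney) formula.

```r
#***VECTORISED CHECK OF CONCORDANCE AND AUC***#
# Hypothetical toy data: observed responses and predicted probabilities
obs  = c(1, 1, 1, 0, 0, 0, 0)
pred = c(0.9, 0.6, 0.4, 0.6, 0.3, 0.2, 0.1)

p1 = pred[obs == 1]          # predicted values in dataset one
p0 = pred[obs == 0]          # predicted values in dataset Zero

# d[i,j] = p1[i] - p0[j]: one cell for each (x,y) pair
d = outer(p1, p0, "-")
Pairs = length(d)

PercentConcordance = 100 * sum(d > 0) / Pairs
PercentDiscordance = 100 * sum(d < 0) / Pairs
PercentTied        = 100 * sum(d == 0) / Pairs

# AUC on the 0 to 1 scale from the pair counts
AUC = (PercentConcordance + 0.5 * PercentTied) / 100

# Rank-based (Mann-Whitney) AUC; rank() gives tied values their average rank
r = rank(pred)
n1 = length(p1)
AUC_rank = (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / Pairs
#***VECTORISED CHECK OF CONCORDANCE AND AUC ENDS***#
```

Both AUC and AUC_rank should agree, since each tied pair contributes half a pair to the rank-based formula.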

Monday, 19 November 2012

Harishchandra Gad


We started planning a week in advance and finalized the plan on Friday. I left the office at 5:30 PM and reached Vashi at 9:00 PM. I was lucky to get a seat on the bus, as there was a huge crowd at Wakad (Diwali rush). The bus had come from Satara, so I was able to board it. I had to go to Yogesh’s room in Thane. The plan was to wake up early in the morning and catch the first train to Kalyan and the first bus to Ahmednagar, which didn’t work out. We got a train at 5:30 AM, which reached Kalyan at 6:05 AM. We had samosa and tea at Kalyan station and filled our thermos with tea.
As we entered Kalyan bus station, we saw a bus to Ahmednagar that was about to leave. The conductor told us that it was a Diwali special bus. We asked him to tell us when Khubi Phata came. Khubi Phata is about 80 km from Kalyan, just after you cross Malshej Ghat. We had heard that this route runs along the dam, but after getting down at Khubi Phata we didn’t see any dam nearby.


We asked people there for the way to Khireshwar and they helped us with the route. It was shorter than the famous dam-side route to Khireshwar. On this route you pass an old temple of Lord Shiva. You must visit this temple.






We reached Khireshwar at around 10:30 AM. We confirmed the route and filled the thermos again.





We reached Tolar Khind at around 12:30 PM. You can see the Vyaghra Shilp (a sculpture of a tiger) at Tolar Khind.



After a 15-minute break we started climbing and reached a place where we met another group from Satara. On the way, you pass through a patch where railings have been provided. They are mostly for moral support; one could climb without them by taking care.
 



At this point we met a man who came with us up to the fort. He provides food on the fort. We are thankful to him, as he showed us a way that saved at least half an hour compared with the route that crosses 7 hills. We reached the fort at around 2:45 PM. We had heard that Ganesh Guha is the best place to stay, as it has a flat surface. We were the first to occupy it. We were so hungry that we immediately started eating sandwiches. We had carried all the material to make them.

Then we left to see places on the fort. First we went to bathe at Kedareshwar cave. The water was chilled. We did not have the courage to enter the water, but somehow we jumped in. We also noticed two snakes in the water.




We were feeling so fresh that we decided to visit Kokan Kada directly from there. Kokan Kada is a major attraction on Harishchandra Gad. One can see Rohidas Shikhar, Nalichi Vat, Nane Ghat and Jivdhan from Kokan Kada. People visit Kokan Kada to watch the sunset. Lots of tents are available for rent here. Rent is around Rs. 400 for a medium-sized tent, in which 4 people can easily stay.





We left Kokan Kada at 6:10 PM and were back at the cave in 25 minutes. We asked Amol (the cook) to make Zunka and Bhakar for us. Meanwhile we discussed descending via Nalichi Vat, but later decided to descend via Pachnai, as we could get a bus to Rajur from Pachnai at 11 AM.




We enjoyed stargazing at night and were surprised at the frequency of planes. There was a plane roughly every minute on the same route. During the night one group came towards our cave, but Amol, our cook, told them that there was no space inside. So I request you to check for yourself once before you turn away to find another place to stay.

We woke up at 6:30 AM and got ready to visit Taramati Peak, which takes around 45 minutes from the cave. It gives an idea of how far the fort extends and a view of nearby places. Below is Rohidas Shikhar captured from Taramati Shikhar.




We took around one and a half hours to go and come back to the cave. After coming back, we settled the dinner bill. We paid the cook Rs. 40 per plate for Zunka-Bhakar. We left the fort at 9:50 AM and started coming down via Pachnai. We reached the base in an hour. At 11:15 AM sharp, the bus came and we left for Rajur. It takes around an hour to reach Rajur. From Rajur one can get buses to Kasara. If you want to go to Pune, you should continue on the same bus, which goes to Akole. From Akole there are direct buses to Pune, but it is a very long route.

It took me around 5 hours to reach Pune from Akole via Sangamner. Yogesh and Onkar reached Thane via Kasara at 5:20 PM.

It was a nice start to the Diwali season. :)


Wednesday, 16 May 2012

Poisson Regression for Counts

When your question is how to fit a model to predict a count variable, Poisson regression, under a few assumptions, is one solution.

One of the strong assumptions in Poisson regression is that the mean and variance of the dependent variable are equal. If the observed variance is greater than the mean, it is an indication of overdispersion; a variance less than the mean indicates underdispersion. When overdispersion arises, use of negative binomial regression is suggested.
The ratio of the deviance to its degrees of freedom is a statistic used to assess overdispersion. If this ratio is close to 1, there is no overdispersion. If the ratio is clearly greater than 1, there is overdispersion, and if it is clearly less than 1, it is an indication of underdispersion.
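
As a rough illustration of this check, the sketch below fits a Poisson model to simulated data and computes the ratio. The data, the variable names x and y and the coefficients used in the simulation are all assumptions made up for this example. The Pearson chi-square divided by the same degrees of freedom is shown alongside as a commonly used alternative.

```r
#***DISPERSION CHECK FOR POISSON REGRESSION***#
# Hypothetical simulated data: y is a count that depends on x
set.seed(1)
x = runif(200)
y = rpois(200, lambda = exp(0.5 + 1.2 * x))

fit = glm(y ~ x, family = poisson)

# Ratio of residual deviance to its degrees of freedom
DispersionRatio = deviance(fit) / df.residual(fit)

# Alternative: Pearson chi-square divided by the same degrees of freedom
PearsonRatio = sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
#***DISPERSION CHECK FOR POISSON REGRESSION ENDS***#
```

Because the data here are generated from a Poisson distribution, both ratios should come out near 1; with real data, values well above 1 would point towards overdispersion.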

We use the natural log as the link function in Poisson regression. So the estimated parameters are on the log scale and hence, before interpretation, we need to exponentiate them. Poisson regression uses the maximum likelihood method to estimate parameters.
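
A minimal sketch of this transformation, again on simulated data (the variable names and simulation coefficients are assumptions for illustration): exponentiating a coefficient gives the multiplicative change in the expected count for a one-unit increase in that predictor.

```r
#***INTERPRETING POISSON COEFFICIENTS***#
# Hypothetical simulated data
set.seed(1)
x = runif(200)
y = rpois(200, lambda = exp(0.5 + 1.2 * x))

fit = glm(y ~ x, family = poisson)   # log link is the default for poisson

LogScale  = coef(fit)        # estimates on the log scale
RateRatio = exp(coef(fit))   # back-transformed estimates

# Expected counts at x = 0 and x = 1; their ratio equals exp(coefficient of x)
mu = predict(fit, newdata = data.frame(x = c(0, 1)), type = "response")
RatioFromPredictions = unname(mu[2] / mu[1])
#***INTERPRETING POISSON COEFFICIENTS ENDS***#
```

Here RatioFromPredictions matches RateRatio["x"], which is one way to convince yourself of what the exponentiated coefficient means.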

Poisson regression is also suitable for modelling rate data. For example, a traffic police department may record the number of accidents that occurred; the rate would then be the number of accidents per hour. A rate is a count per unit of time. With a log link, the log of this time variable enters the model as the offset variable.
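
A small sketch of a rate model with an offset follows. Everything in it is assumed for illustration: the accident counts are simulated, and the variables hours (observation time) and rain (a 0/1 predictor) are hypothetical. The offset is supplied as offset(log(hours)) in the glm formula, so no coefficient is estimated for it.

```r
#***POISSON REGRESSION FOR RATES WITH AN OFFSET***#
# Hypothetical simulated data: accidents counted over varying numbers of hours
set.seed(2)
hours = sample(1:10, 100, replace = TRUE)
rain = rbinom(100, 1, 0.3)
accidents = rpois(100, lambda = hours * exp(-1 + 0.7 * rain))

# log(hours) enters as an offset: its coefficient is fixed at 1, not estimated
fit = glm(accidents ~ rain + offset(log(hours)), family = poisson)

# Predicted accidents per hour (hours = 1, so the offset log(1) is 0)
newd = data.frame(rain = c(0, 1), hours = 1)
RatePerHour = predict(fit, newdata = newd, type = "response")
#***POISSON REGRESSION FOR RATES WITH AN OFFSET ENDS***#
```

The ratio of the two predicted rates equals exp(coef(fit)["rain"]), i.e. the estimated rate ratio for rain versus no rain.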

Negative binomial regression, zero-inflated regression models and OLS regression are some other techniques used to model count data.