What is logistic regression (feat. odds, odds ratio and model equation)?
Logistic regression is a type of statistical analysis used to model the relationship between a binary (yes/no) dependent variable and independent variables. The goal of logistic regression is to find a relationship between the independent variables (x) and the probability of a particular outcome for the dependent variable (y). The logistic regression model calculates the probability of a certain outcome by applying a logistic function to the linear combination of the independent variables.
Here is one example.
sulphur=c(0,5,10,12,15,20,24,26,30,35)
yield=c(4.1,6.2,7.5,8.2,8.8,9.5,10.5,10.4,10.1,10)
dataA=data.frame(sulphur,yield)
dataA
Sulphur improves plant growth, nutrient uptake and crop quality, and it is commonly applied as sulphate of potash (SOP) fertilizer. Let’s assume these data are final grain yield in response to different SOP rates.
With either linear or quadratic regression, we can “predict” the relationship between final grain yield and sulphur fertilizer. This is possible because both variables are continuous.
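As a rough sketch of the two fits mentioned above, the code below fits a linear and a quadratic model to the sulphur data with lm(). The object names fit_linear and fit_quadratic and the prediction point of 18 are my own choices for illustration, not part of the original analysis.

```r
# Sulphur rates and grain yield from the example above
sulphur <- c(0, 5, 10, 12, 15, 20, 24, 26, 30, 35)
yield   <- c(4.1, 6.2, 7.5, 8.2, 8.8, 9.5, 10.5, 10.4, 10.1, 10)
dataA   <- data.frame(sulphur, yield)

# Simple linear regression: yield = b0 + b1 * sulphur
fit_linear <- lm(yield ~ sulphur, data = dataA)

# Quadratic regression: yield = b0 + b1 * sulphur + b2 * sulphur^2
fit_quadratic <- lm(yield ~ sulphur + I(sulphur^2), data = dataA)

# Compare how much of the variation in yield each model explains
summary(fit_linear)$r.squared
summary(fit_quadratic)$r.squared

# Predicted yield at a sulphur rate of 18 (hypothetical value) from each model
predict(fit_linear, newdata = data.frame(sulphur = 18))
predict(fit_quadratic, newdata = data.frame(sulphur = 18))
```

Both models return a continuous predicted yield, which is exactly what we want when y is continuous.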
How about this case?
The final product undergoes a quality inspection, and the result is added (0: fail, 1: pass). Now, our main interest is
“How does the amount of sulphur fertilizer affect whether the crop passes the quality inspection for the market?”
To answer this question, simple linear regression was conducted as below.
Can we analyze the relationship between the amount of sulphur fertilizer and the quality (i.e., pass rate) based on this linear regression? Can we say that there is a correlation between sulphur fertilizer and the quality?
According to this regression model equation, when sulphur fertilizer is applied at a rate of 25 kg/ha, the pass rate would be 0.92, and when applied at a rate of 30 kg/ha, the pass rate would be 1.12. In the categorical data (where 0 means fail and 1 means pass), how should 0.92 or 1.12 be explained?
When a dependent variable is binary (yes/no), predicted values below 0 or above 1 cannot be interpreted. In other words, the prediction should stay between 0 and 1. Therefore, when a dependent variable is binary, linear (or quadratic) regression is not appropriate. In this case, it is more appropriate to talk about ‘probability’ rather than the relationship itself. For example, it would be more appropriate to ask ‘How does the probability of passing change when the amount of sulphur fertilizer increases?’
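To make the problem concrete, here is a small sketch that recovers the straight line from the two predictions quoted above (0.92 at 25 kg/ha and 1.12 at 30 kg/ha) and applies it to every sulphur rate in the example. Because those two numbers are rounded, the slope and intercept are only approximations of the fitted model, and the variable names are mine.

```r
# Recover the fitted line from the two predictions quoted in the text
# (0.92 at 25 kg/ha and 1.12 at 30 kg/ha). Both are rounded, so the
# slope and intercept below are approximations.
slope     <- (1.12 - 0.92) / (30 - 25)
intercept <- 0.92 - slope * 25

sulphur   <- c(0, 5, 10, 12, 15, 20, 24, 26, 30, 35)
predicted <- intercept + slope * sulphur

# Flag predictions that cannot be probabilities
outside_0_1 <- predicted < 0 | predicted > 1
data.frame(sulphur, predicted, outside_0_1)
```

Under this approximation, the lowest and the two highest sulphur rates give a "pass rate" below 0 or above 1, which is why a straight line is the wrong tool for a 0/1 outcome.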
Logistic regression is used for this type of analysis. This model combines linear regression with the logistic function. To understand the logistic function, we first need to understand ‘odds’ and ‘odds ratio’.
1) Odds, log (odds) and logit
Odds and logit are the basic concepts for understanding logistic regression. Have you ever heard of the Japanese comic book ‘Slam Dunk’? I’ll explain ‘odds’ with this story.
Now, Shohoku high school is playing against other high schools in a tournament. In the first round, Shohoku high school won 4 games and lost 6 games out of 10 games. The winning odds of Shohoku high school are 4/6 ≈ 0.67.
In the 2nd round, Shohoku high school won 8 games and lost 2 games out of 10 games. Now the winning odds are 8/2 = 4.0.
However, winning odds such as 0.67 or 4.0 are not familiar to us. So, let’s talk about them in terms of probability. In the first round, the winning probability of Shohoku high school is 4/10 = 40.0%,
and in the 2nd round, the winning probability is 8/10 = 80.0%.
Now it seems much clearer!! We can also see the difference between odds and probability.
In the first round, the winning odds and probability of Shohoku high school are 4/6 and 4/10, respectively.
In the 2nd round, the winning odds and probability of Shohoku high school are 8/2 and 8/10, respectively.
Now, I’m interested in the ratio between the probability of winning and the probability of losing.
This ratio can be written as probability of winning / (1 - probability of winning). If you have an 80% probability of winning, you have a 20% (= 1 - 80%) probability of losing.
Then, this ratio in the 1st and 2nd rounds is calculated as below.
1st round: 0.4 / (1 - 0.4) ≈ 0.67
2nd round: 0.8 / (1 - 0.8) = 4.0
Do the numbers 0.67 and 4.0 look familiar? They are the winning odds of Shohoku high school in the 1st and 2nd rounds. That is, the winning odds can be calculated as
odds = probability of winning / (1 - probability of winning)
Simply, with p as the probability of winning, let’s write it as
odds = p / (1 - p)
Eventually, odds is the ratio between p and (1-p) for a binary outcome (i.e., win/lose, pass/fail, male/female, etc.)
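The relationship between probability and odds is easy to check in R. This is only a minimal sketch; the helper names odds() and prob_from_odds() are hypothetical and not built-in R functions.

```r
# Odds: probability of the event divided by probability of no event
odds <- function(p) p / (1 - p)

# Shohoku's winning probability in the 1st and 2nd rounds
p_round1 <- 4 / 10
p_round2 <- 8 / 10

odds(p_round1)   # same as 4 wins / 6 losses
odds(p_round2)   # same as 8 wins / 2 losses

# Going back the other way: probability from odds
prob_from_odds <- function(o) o / (1 + o)
prob_from_odds(odds(p_round1))
```

Converting in both directions shows that odds and probability carry the same information, just on different scales.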
log (odds)
Here is another story about Shohoku high school. The main players, Takenori Akagi, Hisashi Mitsui, Ryota Miyagi, Kaede Rukawa and Hanamichi Sakuragi, graduated from Shohoku high school. The team has been weakened, and now they are playing games in a tournament.
A) In the 1st round, they won 1 game and lost 4 games out of 5 games.
B) In the 2nd round, they won 1 game and lost 8 games out of 9 games.
C) In the 3rd round, they won 1 game and lost 16 games out of 17 games.
D) In the 4th round, they won 1 game and lost 32 games out of 33 games.
The winning odds of Shohoku high school are
A) 1/4 = 0.25
B) 1/8 = 0.125
C) 1/16 = 0.062
D) 1/32 = 0.031
The weaker the team becomes, the closer the winning odds get to 0. In other words, whenever losses outnumber wins, the odds are squeezed between 0 and 1.
After this tournament, Shohoku high school rebuilt their strategy and had the best players again.
In the next tournament,
A) In the 1st round, they won 4 games and lost 1 game out of 5 games.
B) In the 2nd round, they won 7 games and lost 2 games out of 9 games.
C) In the 3rd round, they won 15 games and lost 2 games out of 17 games.
D) In the 4th round, they won 30 games and lost 3 games out of 33 games.
In this case, the winning odds of Shohoku high school are
A) 4/1 = 4
B) 7/2 = 3.5
C) 15/2 = 7.5
D) 30/3 = 10.0
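Let’s arrange the winning odds on a line.
When the team loses more than it wins, the odds are squeezed between 0 and 1, but when it wins more than it loses, the odds can run from 1 to infinity. This asymmetry makes it hard to compare the winning odds. How can we fix it? People say that if your data look weird, take logarithms, and most of the problems go away.
So, let’s take the logarithm (base 10 in the calculations below).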
log (odds)
When team strategy was weakened,
log (0.25) = -0.60
log (0.125) = -0.90
log (0.062) = -1.21
log (0.031) = -1.51
When team strategy was strengthened,
log (4) = 0.60
log (3.5) = 0.54
log (7.5) = 0.88
log (10.0) = 1.00
After taking the log, the values are symmetrical around 0: for example, odds of 0.25 and 4 become -0.60 and 0.60.
This process can be expressed as a formula:
log(odds) = log(p/(1-p))
We call log(p/(1-p)) the ‘logit function’, and this is the basic concept of logistic regression.
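The log(odds) values above can be reproduced in R. One thing to watch: the hand calculation uses base-10 logs (log10() in R), while R's log() is the natural log, which is the one logistic regression works with. The sketch below uses the win and loss counts from the two tournaments; the variable names are mine.

```r
# Wins and losses in the two tournaments described above
weak_wins     <- c(1, 1, 1, 1)
weak_losses   <- c(4, 8, 16, 32)
strong_wins   <- c(4, 7, 15, 30)
strong_losses <- c(1, 2, 2, 3)

weak_odds   <- weak_wins / weak_losses
strong_odds <- strong_wins / strong_losses

# Base-10 logs, as in the hand calculation above
round(log10(weak_odds), 2)
round(log10(strong_odds), 2)

# qlogis() is R's built-in logit: qlogis(p) equals log(p / (1 - p)),
# using the natural log
p_weak <- weak_wins / (weak_wins + weak_losses)
all.equal(qlogis(p_weak), log(weak_odds))

# Reciprocal odds give log(odds) of equal size and opposite sign
log(1 / 4) + log(4 / 1)
```

The choice of base only rescales the values; the symmetry around 0 holds for any base.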
Why is log(odds) important?
Kiminobu Kogure majored in data science in college and now works at Shohoku high school as a strategy analyst. He analyzed the win-loss records of Shohoku high school over several years and obtained 100 log(odds) values.
Because log(odds) is symmetrical around 0, these values would be distributed roughly like a normal distribution.
If the data follow a normal distribution, more statistical analyses become available to us.
Let's wrap up!!
1) Odds is the ratio between the probability that something happens and the probability that it does not, and the equation is p/(1-p).
2) Taking log (odds) turns the asymmetrical odds scale into a symmetrical one.
However, although odds is itself a ratio between something happening and not happening, it is not the same thing as the ‘odds ratio’.
2) Odds ratio
Then, what is odds ratio? Here is another data example.
genotype=rep(c("cv1","cv2"),each=20)
score=c(0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,1,0,1,0,1,1,1,1,0,1,1,1,0,1,1,0,1,1)
dataA=data.frame(genotype,score)
dataA
Let’s say these data show how heat stress tolerance differs between two genotypes (0: non-tolerance, 1: tolerance). Let’s check how the data are structured.
xtabs(~score + genotype, data=dataA)
I’ll calculate odds for each genotype.
Odds (CV1) for tolerance over non-tolerance
P (tolerance) = 7/20 = 0.35
P (non-tolerance) = 13/20 = 0.65
Odds (CV1) = 0.35 / 0.65 ≈ 0.54
Odds (CV2) for tolerance over non-tolerance
P (tolerance) = 14/20 = 0.70
P (non-tolerance) = 6/20 = 0.30
Odds (CV2) = 0.70 / 0.30 ≈ 2.33
Odds ratio is the ratio between those two odds.
Odds ratio (CV1 relative to CV2) = 0.54 / 2.33 ≈ 0.23
log (odds ratio) = log (0.23) ≈ -1.47 (natural log, which is what log() computes in R)
Odds ratio is 0.23. What does this 0.23 indicate?
Compared with CV2, the odds of heat stress tolerance in CV1 are 0.23 times as large. In other words, the odds of tolerance are 77% lower in CV1.
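The hand calculation can be sketched in R directly from the cross-table. The names tab, odds_cv1, odds_cv2 and odds_ratio are my own; the data are the genotype and score vectors from the example above.

```r
genotype <- rep(c("cv1", "cv2"), each = 20)
score <- c(0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,
           0,1,1,0,1,0,1,1,1,1,0,1,1,1,0,1,1,0,1,1)
dataA <- data.frame(genotype, score)

# Rows: score (0 or 1), columns: genotype
tab <- xtabs(~ score + genotype, data = dataA)

# Odds of tolerance = tolerant count / non-tolerant count
odds_cv1 <- tab["1", "cv1"] / tab["0", "cv1"]
odds_cv2 <- tab["1", "cv2"] / tab["0", "cv2"]

odds_ratio <- odds_cv1 / odds_cv2   # CV1 relative to CV2
odds_ratio
log(odds_ratio)                     # natural log of the odds ratio
```

Working from the counts avoids the small rounding error that comes from dividing the rounded odds 0.54 and 2.33.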
As explained in the odds part, we can take the log of the odds ratio, and the value is -1.47. What does this -1.47 indicate? Let’s analyze these data using a statistical program. I use R.
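# Fit the logistic regression with cv2 as the reference level
# (the result is stored as "model" so it does not mask the glm() function)
model=glm(score~genotype, family="binomial",
          data=within(dataA,genotype<-relevel(as.factor(genotype),ref="cv2")))
summary(model)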
In this output, the coefficient of CV1 is shown relative to CV2 (CV2 is the reference level, coded as 0; logistic regression is a type of generalized linear model, GLM). What is the coefficient of CV1? It’s -1.4663. Does this number look familiar? It’s the same value as log (odds ratio); log(0.23) ≈ -1.47
log(((7/20)/(13/20))/((14/20)/(6/20))) = -1.466337
That is, in logistic regression, the coefficient for a certain variable is log (odds ratio).
In other words, we always need to remember that the coefficient for a variable in the statistical output is on the log scale. So, to interpret the result, we should undo the log by taking the exponential.
exp(-1.466337) ≈ 0.23
which is the odds ratio.
Simply, we can interpret it in this way.
(exp(-1.466337) -1) *100 ≈ -77
Compared with CV2, the odds of heat stress tolerance in CV1 are 77% lower.
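The same back-transformation can be applied to the fitted model object instead of typing the coefficient by hand. This is a sketch under my own naming (model, log_or); it also adds a Wald confidence interval with confint.default(), which is not part of the original analysis and is shown only as one possible way to express uncertainty.

```r
genotype <- rep(c("cv1", "cv2"), each = 20)
score <- c(0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,
           0,1,1,0,1,0,1,1,1,1,0,1,1,1,0,1,1,0,1,1)

# Put cv2 first so it becomes the reference level
dataA <- data.frame(genotype = factor(genotype, levels = c("cv2", "cv1")),
                    score)

model <- glm(score ~ genotype, family = binomial, data = dataA)

# Coefficient of cv1 = log(odds ratio) of cv1 relative to cv2
log_or <- coef(model)["genotypecv1"]

exp(log_or)               # odds ratio
(exp(log_or) - 1) * 100   # percent change in the odds

# Wald 95% confidence interval for the odds ratio (an addition, see above)
exp(confint.default(model))["genotypecv1", ]
```

With only 20 observations per genotype the interval is wide, so the 77% figure should be read as an estimate rather than a precise value.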
3) Model equation
We know the simple linear regression model equation is
y= β0 + β1x
The model equation of logistic regression is
P(x) = 1 / (1 + e^-(β0 + β1x))
which is the same as saying log(P(x)/(1-P(x))) = β0 + β1x.
I’ll explain this model equation with another data example.
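Before moving to the data, it may help to see how the logistic model equation follows from the logit. The steps below are a standard derivation, written with P(x) for the probability of the outcome and β0, β1 for the intercept and slope: the log(odds) is set equal to the linear predictor and then solved for P(x).

```latex
\begin{align*}
\log\frac{P(x)}{1-P(x)} &= \beta_0 + \beta_1 x \\
\frac{P(x)}{1-P(x)} &= e^{\beta_0 + \beta_1 x} \\
P(x) &= \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}
      = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\end{align*}
```

So the straight line β0 + β1x lives on the log(odds) scale, and the logistic function maps it back to a probability between 0 and 1.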
library (readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/wheat_grain_grea_and_heat_tolerance.csv"
dataA=data.frame(read_csv(url(github),show_col_types=FALSE))
dataA
You can download this data;
https://www.kaggle.com/datasets/agronomy4future/heat-tolerance-in-wheat-for-logistic-regression
These data show how wheat grain area (mm2) and heat stress tolerance (0: non-tolerance, 1: tolerance) differ between two genotypes in response to increased assimilate availability. To increase assimilate availability, a thinning treatment was conducted by removing alternate rows in a plot.
That is, I increased the assimilate availability of wheat by reducing competition at pre-anthesis. This is a source-sink manipulation, a popular concept in crop physiology.
First, let’s analyze the difference in heat stress tolerance in response to assimilate availability.
xtabs(~tolerance+thinning, data=dataA)
Odds (control)
P (tolerance) = 2033/3509 ≈ 0.579
P (non-tolerance) = 1476/3509 ≈ 0.421
Odds (control) = 0.579 / 0.421 ≈ 1.375
Odds (manipulation)
P (tolerance) = 2695/4465 ≈ 0.604
P (non-tolerance) = 1770/4465 ≈ 0.396
Odds (manipulation) = 0.604 / 0.396 ≈ 1.525
Odds ratio (manipulation relative to control) = 1.525 / 1.375 ≈ 1.109
log (odds ratio) = log (1.109) ≈ 0.103
Let’s check this result using a statistical program.
logistic=glm(tolerance~thinning, family="binomial",
data=within(dataA,thinning<-relevel(as.factor(thinning),ref=1)))
summary(logistic)
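In this output, the coefficient of manipulation (increased assimilate availability) is shown relative to the control, which is the reference level. What is the coefficient of manipulation? It’s 0.10024. It’s essentially the same value as log (odds ratio); the hand calculation gives log(1.109) ≈ 0.103 only because the odds were rounded along the way. Without rounding: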
log(((2695/4465)/(1770/4465))/((2033/3509)/(1476/3509)))
= 0.1002418
Again!! We need to undo the log in this value,
exp (0.1002418) = 1.105438
When assimilate availability is increased (thinning treatment), the odds of heat stress tolerance are 1.1 times those of the control. In other words,
(exp(0.1002418) -1) * 100 = 10.54382
when assimilate availability is increased (thinning treatment), the odds of heat stress tolerance increase by 10.5% compared with the control.
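For readers who cannot download the data, the same numbers can be sketched from the four counts quoted above. The vector names are mine. The last lines also show that a 10.5% increase in odds is not a 10.5 percentage-point increase in probability.

```r
# Counts quoted above for each treatment
tolerant     <- c(control = 2033, manipulation = 2695)
non_tolerant <- c(control = 1476, manipulation = 1770)

odds <- tolerant / non_tolerant
odds_ratio <- unname(odds["manipulation"] / odds["control"])

log(odds_ratio)          # log(odds ratio), comparable to the glm coefficient
(odds_ratio - 1) * 100   # percent change in the odds

# The change in probability is much smaller than the change in odds
p <- tolerant / (tolerant + non_tolerant)
unname(p["manipulation"] - p["control"])
```

This is why the interpretation is phrased in terms of odds rather than the pass rate itself.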
What I have explained so far is the case where both x and y are categorical. Of course, in logistic regression the dependent variable is binary (yes/no), but the independent variable (x) can be continuous. Actually, it’s much easier to explain the model equation of logistic regression when the independent variable is continuous.
So, my next interest is how heat stress tolerance changes with grain area.
logistic=glm(tolerance~area, family="binomial", data=dataA)
summary(logistic)
The coefficient for grain area is 0.36. We already know how to interpret this value.
(exp(0.363665) -1) *100= 43.85922
When grain area increases by 1 unit (1 mm2), the odds of heat stress tolerance increase by 43.86%.
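With a continuous x, the odds ratio applies per unit of x, and changes of more than one unit multiply rather than add. A minimal sketch, using the coefficient reported above; odds_ratio_for() is a hypothetical helper name.

```r
b1 <- 0.363665  # coefficient for area reported in the summary above

# Odds ratio for a change of k units (mm2) in grain area
odds_ratio_for <- function(k, slope = b1) exp(k * slope)

odds_ratio_for(1)               # grain area 1 mm2 larger
(odds_ratio_for(1) - 1) * 100   # percent change in the odds
odds_ratio_for(2)               # 2 mm2 larger: the 1-unit odds ratio squared
```

So a 2 mm2 increase does not mean 2 × 43.86%; the odds are multiplied by the 1-unit odds ratio twice.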
From this output, we obtained b0 (-5.663033) and b1 (0.363665).
Now, we can calculate P(x) = 1/(1+exp(-(b0+b1x))) for each grain area. First, let’s calculate P(x) in Excel.
Then we can draw a scatter plot, with the area column as x and the P(x) column as y.
This is the graph of the model equation.
Tip 1) How to make the same graph using R?
Using the code stat_smooth(method="glm", method.args=list(family="binomial")), we can draw the logistic regression graph.
library(ggplot2)
# Open a 5.5 x 5 inch plot window (dev.new() also works outside Windows)
dev.new(width=5.5, height=5)
# Raw 0/1 observations plus the logistic curve fitted by stat_smooth()
ggplot(dataA, aes(x=area, y=tolerance)) + geom_point() +
 stat_smooth(method="glm", method.args=list(family="binomial"), se=TRUE, formula=y~x) +
 scale_x_continuous(breaks = seq(0,30,5), limits = c(0,30)) +
 scale_y_continuous(breaks = seq(0,1,0.25), limits = c(0,1)) +
 labs(x="Wheat grain area (mm2)", y="Non-tolerance to tolerance in response to heat stress") +
 theme_grey(base_size=15, base_family="serif")+
 theme(legend.position="none",
       legend.title=element_blank(),
       legend.key=element_rect(color=alpha("grey",.05), fill=alpha("grey",.05)),
       legend.background= element_rect(fill=alpha("grey",.05)),
       axis.line=element_line(linewidth=0.5, colour="black"))
Tip 2) Let’s calculate P(x) in R
We could simply accept the code stat_smooth(method="glm", method.args=list(family="binomial")), but it is important to understand the principle.
So, I’ll calculate the model equation, P(x) = 1/(1+exp(-(b0+b1x))), in R.
px=1/(1+exp(-(-5.663033+0.363665*dataA$area)))
This code represents the model equation of logistic regression.
I’ll add these values to the data.
dataA$px=1/(1+exp(-(-5.663033+0.363665*dataA$area)))
dataA
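As a small check on this formula, R has a built-in inverse logit, plogis(), that gives the same result as the hand-written expression. The sketch below uses the two coefficients reported above and a few hypothetical grain areas chosen only for illustration; it also finds the grain area at which P(x) is exactly 0.5.

```r
b0 <- -5.663033   # intercept reported above
b1 <-  0.363665   # coefficient for area reported above

# Hypothetical grain areas (mm2), used only to illustrate the calculation
area <- c(5, 10, 15, 20, 25)

px_manual <- 1 / (1 + exp(-(b0 + b1 * area)))
px_plogis <- plogis(b0 + b1 * area)   # built-in inverse logit
all.equal(px_manual, px_plogis)

# Grain area at which P(x) = 0.5, i.e. where b0 + b1 * x = 0
area_half <- -b0 / b1
area_half
plogis(b0 + b1 * area_half)
```

With the fitted model object, predict(logistic, type="response") should give the same probabilities, up to the rounding of the coefficients typed in here.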
Again!! Let’s draw a logistic regression graph. However, we no longer need the code below, because we already calculated P(x) ourselves. We can simply think of this code as the step that calculates P(x).
stat_smooth(method="glm", method.args=list(family="binomial"), se=TRUE, formula='y~x')
To make the graph, I only use geom_point().
library(ggplot2)
# Open a 5.5 x 5 inch plot window (dev.new() also works outside Windows)
dev.new(width=5.5, height=5)
# Plot the P(x) values calculated by hand; no stat_smooth() is needed
ggplot(dataA, aes(x=area, y=px)) +
 geom_point() +
 scale_x_continuous(breaks = seq(0,30,5), limits = c(0,30)) +
 scale_y_continuous(breaks = seq(0,1,0.25), limits = c(0,1)) +
 labs(x="Wheat grain area (mm2)", y="Non-tolerance to tolerance in response to heat stress") +
 theme_grey(base_size=15, base_family="serif")+
 theme(legend.position="none",
       legend.title=element_blank(),
       legend.key=element_rect(color=alpha("grey",.05), fill=alpha("grey",.05)),
       legend.background= element_rect(fill=alpha("grey",.05)),
       axis.line=element_line(linewidth=0.5, colour="black"))
4) Wrap up!!
1) Odds is the ratio between the probability that something happens and the probability that it does not, and the equation is p/(1-p).
2) log (odds or odds ratio) changes asymmetrical data to symmetrical data.
3) Odds ratio is the ratio between two odds. That is, [p1/(1-p1)] / [p2/(1-p2)].
4) log (odds ratio) is the coefficient for a certain variable.