[Data Column] Why Does Data Visualization Need Data Normalization?

May 24, 2023 JK

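Data normalization is needed in visualization for several reasons, and the most important is scale uniformity. Different variables can have very different scales and units. For example, grain yield may be expressed in Mg/ha, while nutrient content is usually expressed in %. Normalizing such data makes it possible to compare and visualize several variables with different units on the same graph.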
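As a minimal sketch of scale uniformity, the example below standardizes two made-up variables, a grain yield in Mg/ha and a nutrient content in %, with base R's scale(). The values and the names yield_mg_ha and nutrient_pct are hypothetical and are not taken from the dataset used later in this post.

```r
# Hypothetical values (not from the dataset used in this post)
yield_mg_ha  <- c(6.2, 7.8, 9.1, 10.4, 12.0)   # grain yield, Mg/ha
nutrient_pct <- c(1.1, 1.3, 1.6, 1.8, 2.1)     # nutrient content, %

# scale() centers each vector on its mean and divides by its standard deviation
yield_z    <- as.numeric(scale(yield_mg_ha))
nutrient_z <- as.numeric(scale(nutrient_pct))

# The raw ranges differ several-fold and carry different units...
range(yield_mg_ha)
range(nutrient_pct)

# ...but after standardization both variables are unitless with mean 0 and SD 1
round(c(mean = mean(yield_z), sd = sd(yield_z)), 10)
round(c(mean = mean(nutrient_z), sd = sd(nutrient_z)), 10)
```

Because both standardized variables share the same center and spread, they can be drawn against one common axis.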

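Normalization also improves visualization interpretability. Normalized data make patterns easier to read and help produce meaningful visualizations. It reduces the visual distortion that arises when variables are measured on different scales or units, which matters for understanding relationships and patterns in the data accurately. In other words, it enables fair comparison: when different variables are plotted together, normalization puts their ranges on an equal footing.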

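Normalization brings all values, including outliers, into a narrower common range, which can reduce the extent to which extreme values dominate the plot. This makes trends and patterns clearer and the visualization more effective. Overall, normalization helps produce visualizations that are both visually appealing and analytically sound, which makes it one of the important techniques in data analysis.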

Today I will show in practice how normalization makes data easier to interpret.

First, I will load a dataset into R. I will use data stored on my GitHub. Copy and paste the code below into your R script window to load the data.

library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/biomass_N_P.csv"
df= data.frame(read_csv(url(github), show_col_types=FALSE))

Let's check that the data loaded correctly. I will use head(df, 5) to look at only the first five rows.

head(df, 5)
  season cultivar treatment rep biomass nitrogen phosphorus
1   2022      cv1        N0   1    9.16     1.23       0.41
2   2022      cv1        N0   2   13.06     1.49       0.45
3   2022      cv1        N0   3    8.40     1.18       0.31
4   2022      cv1        N0   4   11.97     1.42       0.48
5   2022      cv1        N1   1   24.90     1.77       0.49
.
.
.

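Let's assume this dataset records crop biomass and the nitrogen and phosphorus content of that biomass under different nitrogen fertilizer levels (N0 to N4). From here on I will use it to draw regression graphs between crop biomass and nitrogen or phosphorus. Because I will use the mean for each cultivar and treatment, the first step is to summarize the data.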

library(dplyr)
summary = data.frame(df %>%
                   group_by(cultivar, treatment) %>%
                   dplyr::summarize(across(c(biomass, nitrogen, phosphorus), 
                      .fns= list(Mean=~mean(., na.rm= TRUE), 
                       se=~sd(.,na.rm= TRUE) / sqrt(length(.))))))
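
The standard error in the summary code above is the standard deviation divided by the square root of the number of observations. As a hedged side note, length(.) counts missing values too, so if a group contained NA values the divisor would be too large. The helper se_mean() below is a hypothetical name introduced only for this sketch; the first vector holds the four cv1/N0 biomass replicates shown in head(df, 5), and the second vector with NA is made up.

```r
# Hypothetical helper: standard error of the mean, counting only non-missing values
se_mean <- function(x) {
  n <- sum(!is.na(x))             # number of observed (non-NA) values
  sd(x, na.rm = TRUE) / sqrt(n)
}

# Biomass of cv1 under N0 (the four replicates shown in head(df, 5))
biomass_n0 <- c(9.16, 13.06, 8.40, 11.97)
se_mean(biomass_n0)

# Made-up case with one missing replicate
biomass_na <- c(9.16, 13.06, NA, 11.97)
sd(biomass_na, na.rm = TRUE) / sqrt(length(biomass_na))  # divides by sqrt(4)
se_mean(biomass_na)                                      # divides by sqrt(3)
```

When a group has no missing values, both versions give the same result.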

The data are now summarized into the mean and standard error for each cultivar and treatment. Next, I will draw the regression graphs between crop biomass and nitrogen or phosphorus.

library(ggplot2)
ggplot(data=summary, aes(x=biomass_Mean, y=nitrogen_Mean))+
  geom_smooth(method='lm', linetype=1, se=TRUE, formula=y~x, linewidth=0.5)+
  geom_errorbar(aes(xmin=biomass_Mean-biomass_se, 
                    xmax=biomass_Mean+biomass_se),
                    position=position_dodge(0.9), width=0.1) +
  geom_errorbar(aes(ymin=nitrogen_Mean-nitrogen_se, 
                    ymax=nitrogen_Mean+nitrogen_se),
                    position=position_dodge(0.9), width=0.1) +
  geom_point(aes(fill=treatment, shape=treatment), color="black", size=5)+
  scale_fill_manual(values=c("grey85","grey65","grey45","grey25","grey5"))+
  scale_shape_manual(values=rep(c(21),5))+
  scale_x_continuous(breaks=seq(0,80,20), limits=c(0,80))+
  scale_y_continuous(breaks=seq(0,5,1), limits=c(0,5))+
  labs(x="Canopy biomass (g)", y="Nitrogen (%) in canopy") +
  theme_classic(base_size=20, base_family="serif")+
  theme(legend.position=c(0.89,0.13),
        legend.title=element_blank(),
        legend.key=element_rect(color=alpha("white",.001), 
                   fill=alpha("white",.001)),
        legend.background=element_rect(fill=alpha("white",.001)),
        axis.line=element_line(linewidth=0.5, colour="black"))

library(ggplot2)
ggplot(data=summary, aes(x=biomass_Mean, y=phosphorus_Mean))+
  geom_smooth(method='lm', linetype=1, se=TRUE, formula=y~x, linewidth=0.5)+
  geom_errorbar(aes(xmin=biomass_Mean-biomass_se, 
                    xmax=biomass_Mean+biomass_se),
                    position=position_dodge(0.9), width=0.05) +
  geom_errorbar(aes(ymin=phosphorus_Mean-phosphorus_se, 
                    ymax=phosphorus_Mean+phosphorus_se),
                    position=position_dodge(0.9), width=0.05) +
  geom_point(aes(fill=treatment, shape=treatment), color="black", size=5)+
  scale_fill_manual(values=c("grey85","grey65","grey45","grey25","grey5"))+
  scale_shape_manual(values=rep(c(21),5))+
  scale_x_continuous(breaks=seq(0,80,20), limits=c(0,80))+
  scale_y_continuous(breaks=seq(0,1,0.2), limits=c(0,1))+
  labs(x="Canopy biomass (g)", y="Phosphorus (%) in canopy") +
  theme_classic(base_size=20, base_family="serif")+
  theme(legend.position=c(0.89,0.13),
        legend.title=element_blank(),
        legend.key=element_rect(color=alpha("white",.001), 
                                fill=alpha("white",.001)),
        legend.background=element_rect(fill=alpha("white",.001)),
        axis.line=element_line(linewidth=0.5, colour="black"))

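I made separate graphs for nitrogen and phosphorus, but sometimes the two graphs need to be displayed on the same scale. So I will build a single figure with facet_wrap(). To do that, I will use pivot_longer() to reshape the data from wide to long format so that the two variables (nitrogen and phosphorus) end up in the same column.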

library(dplyr)
library(tidyr)
df1= data.frame(df %>%
               pivot_longer(
               cols= c(nitrogen, phosphorus),
               names_to= "nutrient",
               values_to= "percentage"))

head(df1, 5)
  season cultivar treatment rep biomass   nutrient percentage
1   2022      cv1        N0   1    9.16   nitrogen       1.23
2   2022      cv1        N0   1    9.16 phosphorus       0.41
3   2022      cv1        N0   2   13.06   nitrogen       1.49
4   2022      cv1        N0   2   13.06 phosphorus       0.45
5   2022      cv1        N0   3    8.40   nitrogen       1.18
.
.
.
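
To make the reshape concrete, here is a dependency-free base-R sketch of what the pivot_longer() call does, using the first three rows of the data typed in by hand. It is an illustration of the wide-to-long idea and not how tidyr works internally; the object names wide and long are assumptions made for this example.

```r
# First three rows of the data, typed in by hand
wide <- data.frame(
  treatment  = c("N0", "N0", "N0"),
  rep        = 1:3,
  biomass    = c(9.16, 13.06, 8.40),
  nitrogen   = c(1.23, 1.49, 1.18),
  phosphorus = c(0.41, 0.45, 0.31)
)

# Stack the two nutrient columns: one block of rows per nutrient
nutrients <- c("nitrogen", "phosphorus")
long <- do.call(rbind, lapply(nutrients, function(nm) {
  data.frame(wide[c("treatment", "rep", "biomass")],
             nutrient   = nm,
             percentage = wide[[nm]])
}))

# Sort so that rows from the same replicate sit together, as in the output above
long <- long[order(long$rep, long$nutrient), ]
rownames(long) <- NULL
long
```

Each original row appears twice in the long table, once per nutrient, and the biomass value is repeated on both rows.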

Then I will summarize the data again to obtain the mean and standard error.

library(dplyr)
summary = data.frame(df1 %>%
                   group_by(cultivar, treatment, nutrient) %>%
                   dplyr::summarize(across(c(biomass, percentage), 
                      .fns= list(Mean=~mean(., na.rm= TRUE), 
                       se=~sd(.,na.rm= TRUE) / sqrt(length(.))))))

Let's draw the graph again.

library(ggplot2)
ggplot(data=summary, aes(x=biomass_Mean, y=percentage_Mean))+
  geom_smooth(method='lm', linetype=1, se=TRUE, formula=y~x, linewidth=0.5)+
  geom_errorbar(aes(xmin=biomass_Mean-biomass_se, xmax=biomass_Mean+biomass_se),
                position=position_dodge(0.9), width=0.05) +
  geom_errorbar(aes(ymin=percentage_Mean-percentage_se, ymax=percentage_Mean+percentage_se),
                position=position_dodge(0.9), width=0.05) +
  geom_point(aes(fill=treatment, shape=treatment), color="black", size=5)+
  scale_fill_manual(values=c("grey85","grey65","grey45","grey25","grey5"))+
  scale_shape_manual(values=rep(c(21),5))+
  scale_x_continuous(breaks=seq(0,80,20), limits=c(0,80))+
  scale_y_continuous(breaks=seq(0,5,1), limits=c(0,5))+
  labs(x="Canopy biomass (g)", y="Nutrient (%) in canopy") +
  facet_wrap(~nutrient, scales="free") +
  annotate("segment", x=20, xend=60, y=Inf,yend=Inf, color="black", lwd=1)+
  theme_classic(base_size=20, base_family="serif")+
  theme(legend.position=c(0.40,0.13),
        legend.title=element_blank(),
        legend.key=element_rect(color=alpha("white",.001), 
                                fill=alpha("white",.001)),
        legend.background=element_rect(fill=alpha("white",.001)),
        axis.line=element_line(linewidth=0.5, colour="black"),
        strip.background=element_rect(color="white",
                                      linewidth=0.5,linetype="solid"))

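With the same axis scale in both panels, the phosphorus values are much smaller than the nitrogen values, so the trend between crop biomass and phosphorus is hard to see. Normalizing the data can solve this problem. Earlier I named scale uniformity and improved visualization as advantages of normalization. Let's check whether that holds.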
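
A rough way to quantify the problem is to ask how much of the shared y-axis each variable actually uses. The sketch below uses only the five rows printed by head(df, 5), so the numbers are illustrative and do not describe the full dataset; spread() is a hypothetical helper name.

```r
# Values from the first five rows of the data
nitrogen   <- c(1.23, 1.49, 1.18, 1.42, 1.77)
phosphorus <- c(0.41, 0.45, 0.31, 0.48, 0.49)

# Shared y-axis used in the faceted plot: limits = c(0, 5)
axis_span <- 5 - 0

# Hypothetical helper: distance between the smallest and largest value
spread <- function(x) diff(range(x))

# Percent of the axis height occupied by each variable's spread
round(100 * spread(nitrogen)   / axis_span, 1)
round(100 * spread(phosphorus) / axis_span, 1)
```

The smaller that share, the flatter the points and the fitted line look, regardless of how strong the relationship is.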

Before normalizing, it is important to understand the structure of the data, and in particular to decide which grouping in df is appropriate for normalization.

df
  season cultivar treatment rep biomass nitrogen phosphorus
1   2022      cv1        N0   1    9.16     1.23       0.41
2   2022      cv1        N0   2   13.06     1.49       0.45
3   2022      cv1        N0   3    8.40     1.18       0.31
4   2022      cv1        N0   4   11.97     1.42       0.48
5   2022      cv1        N1   1   24.90     1.77       0.49
.
.
.

I want to see the trend between crop biomass and nitrogen or phosphorus across nitrogen fertilizer levels in different growing seasons. I will therefore use season and cultivar as the grouping variables for normalization.
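
To illustrate what normalizing within groups means, here is a minimal base-R sketch on made-up data. The data frame toy, its values, and the helper z_score() are assumptions introduced for this example; ave() is used here only as a dependency-free stand-in for a grouped mutate.

```r
# Hypothetical data: two cultivars with very different biomass levels
toy <- data.frame(
  season   = rep(2022, 8),
  cultivar = rep(c("cv1", "cv2"), each = 4),
  biomass  = c(9, 13, 25, 40, 30, 45, 60, 75)
)

# z-score helper: (value - mean) / standard deviation
z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# ave() applies z_score() separately within each season x cultivar group
toy$biomass_n <- ave(toy$biomass, toy$season, toy$cultivar, FUN = z_score)

# Each group is now centered on 0 with SD 1
aggregate(biomass_n ~ season + cultivar, data = toy,
          FUN = function(v) round(c(mean = mean(v), sd = sd(v)), 10))
```

The choice of grouping matters: normalizing within cultivar removes the difference in overall level between cultivars, whereas normalizing the whole column at once would keep it.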

I use the code below.

library(dplyr)
# z-score each variable within every season x cultivar group
Normalized1= data.frame(df %>%
            group_by(season, cultivar) %>%
            dplyr::mutate(
              biomass_n=(biomass-mean(biomass, na.rm=TRUE))/sd(biomass, na.rm=TRUE),
              nitrogen_n=(nitrogen-mean(nitrogen, na.rm=TRUE))/sd(nitrogen, na.rm=TRUE),
              phosphorus_n=(phosphorus-mean(phosphorus, na.rm=TRUE))/sd(phosphorus, na.rm=TRUE))
)
# drop the original columns 4 to 7 (rep, biomass, nitrogen, phosphorus)
Normalized=Normalized1[,c(-4,-5,-6,-7)]

head(Normalized,5)
  season cultivar treatment  biomass_n nitrogen_n phosphorus_n
1   2022      cv1        N0 -1.6187589 -1.9459123    0.0388260
2   2022      cv1        N0 -1.3429185 -1.1615136    0.6600419
3   2022      cv1        N0 -1.6725124 -2.0967583   -1.5142138
4   2022      cv1        N0 -1.4200123 -1.3726979    1.1259539
5   2022      cv1        N1 -0.5054952 -0.3167764    1.2812579
.
.
.

All the data are now normalized. In the code above, I normalized the data with the following calculations.

biomass_n=(biomass-mean(biomass))/sd(biomass)
nitrogen_n=(nitrogen-mean(nitrogen))/sd(nitrogen)
phosphorus_n=(phosphorus-mean(phosphorus))/sd(phosphorus)

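This is because the normalization I used is essentially the z-score (standardization) that underlies the Z distribution: each value is expressed as its distance from the mean in units of standard deviation.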
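
Written as a formula, the calculation looks as follows. The symbols are introduced here for illustration and do not appear in the code: $x_{ij}$ is the $i$-th observation in group $j$ (a season and cultivar combination), $\bar{x}_j$ is the mean of that group, and $s_j$ is its standard deviation.

```latex
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}
```

Each normalized value is therefore a unitless number of standard deviations above or below its group mean.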

Once again I will reshape the data from wide to long format and create the graph with facet_wrap().

library(dplyr)
library(tidyr)
df2= data.frame(Normalized %>%
                   pivot_longer(
                   cols= c(nitrogen_n, phosphorus_n),
                   names_to= "nutrient",
                   values_to= "percentage"))

head(df2, 5)
  season cultivar treatment biomass_n     nutrient percentage
1   2022      cv1        N0 -1.618759   nitrogen_n -1.9459123
2   2022      cv1        N0 -1.618759 phosphorus_n  0.0388260
3   2022      cv1        N0 -1.342918   nitrogen_n -1.1615136
4   2022      cv1        N0 -1.342918 phosphorus_n  0.6600419
5   2022      cv1        N0 -1.672512   nitrogen_n -2.0967583
.
.
.

Then I summarize the data again.

library(dplyr)
summary2 = data.frame(df2 %>%
                     group_by(cultivar, treatment, nutrient) %>%
                     dplyr::summarize(across(c(biomass_n, percentage), 
                              .fns= list(Mean=~mean(., na.rm= TRUE), 
                                se=~sd(.,na.rm= TRUE) / sqrt(length(.))))))

Let's draw the graph.

library(ggplot2)
ggplot(data=summary2, aes(x=biomass_n_Mean, y=percentage_Mean))+
  geom_smooth(method='lm', linetype=1, se=TRUE, formula=y~x, linewidth=0.5)+
  geom_errorbar(aes(xmin=biomass_n_Mean-biomass_n_se, 
                    xmax=biomass_n_Mean+biomass_n_se),
                    position=position_dodge(0.9), width=0.1) +
  geom_errorbar(aes(ymin=percentage_Mean-percentage_se, 
                    ymax=percentage_Mean+percentage_se),
                    position=position_dodge(0.9), width=0.1) +
  geom_point(aes(fill=treatment, shape=treatment), color="black", size=5)+
  scale_fill_manual(values=c("grey85","grey65","grey45","grey25","grey5"))+
  scale_shape_manual(values=rep(c(21),5))+
  scale_x_continuous(breaks=seq(-5,5,2.5), limits=c(-5,5))+
  scale_y_continuous(breaks=seq(-5,5,2.5), limits=c(-5,5))+
  labs(x="Normalized canopy biomass", y="Normalized nutrient in canopy") +
  facet_wrap(~nutrient, scales="free") +
  annotate("segment", x=-2.5, xend=2.5, y=Inf,yend=Inf, color="black", lwd=1) +
  theme_classic(base_size=20, base_family="serif")+
  theme(legend.position=c(0.35,0.22),
        legend.title=element_blank(),
        legend.key=element_rect(color=alpha("white",.001), 
                                fill=alpha("white",.001)),
        legend.background=element_rect(fill=alpha("white",.001)),
        axis.line=element_line(linewidth=0.5, colour="black"),
        strip.background=element_rect(color="white",
                                      linewidth=0.5,linetype="solid"))

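Now two variables with different units can be compared in the same figure, and the trends for nitrogen and phosphorus are easy to see. This is the advantage of normalization: scale uniformity and improved visualization.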
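
One reason standardized trends are comparable is that z-scoring changes only the units, not the strength of a linear relationship. The sketch below checks this on the five rows printed by head(df, 5), standardizing the whole vector at once. That is a simplification: the figures in this post normalize within season and cultivar and then average, so their slopes are not exactly correlations. The helper z() is a hypothetical name.

```r
# Values from the first five rows of the data
biomass    <- c(9.16, 13.06, 8.40, 11.97, 24.90)
phosphorus <- c(0.41, 0.45, 0.31, 0.48, 0.49)

# Hypothetical helper: z-score of a vector
z <- function(x) (x - mean(x)) / sd(x)

# Correlation is unchanged by standardization
r_raw <- cor(biomass, phosphorus)
r_z   <- cor(z(biomass), z(phosphorus))
all.equal(r_raw, r_z)

# On the standardized scale, the regression slope equals the correlation
slope_z <- unname(coef(lm(z(phosphorus) ~ z(biomass)))[2])
all.equal(slope_z, r_raw)
```

This is why slopes for variables with different units can be read against each other after standardization.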

Extra tip!!

The trends are even easier to see in a single panel, so I will combine nitrogen and phosphorus in one graph.

library(ggplot2)
ggplot(data=summary2, aes(x=biomass_n_Mean, y=percentage_Mean))+
  geom_smooth(aes(group=nutrient), method='lm', linetype=1, se=TRUE, formula=y~x, linewidth=0.5)+
  geom_errorbar(aes(xmin=biomass_n_Mean-biomass_n_se, 
                    xmax=biomass_n_Mean+biomass_n_se),
                    position=position_dodge(0.9), width=0.1) +
  geom_errorbar(aes(ymin=percentage_Mean-percentage_se, 
                    ymax=percentage_Mean+percentage_se),
                    position=position_dodge(0.9), width=0.1) +
  geom_point(aes(fill=nutrient, shape=nutrient), color="black", size=5)+
  scale_fill_manual(values=c("darkgreen","grey65")) +
  scale_shape_manual(values=c(21, 22))+
  scale_x_continuous(breaks=seq(-5,5,2.5), limits=c(-5,5))+
  scale_y_continuous(breaks=seq(-5,5,2.5), limits=c(-5,5))+
  labs(x="Normalized canopy biomass", y="Normalized nitrogen or phosphorus in canopy") +
  theme_classic(base_size=20, base_family="serif")+
  theme(legend.position=c(0.80,0.13),
        legend.title=element_blank(),
        legend.key=element_rect(color=alpha("white",.001), 
                                fill=alpha("white",.001)),
        legend.background=element_rect(fill=alpha("white",.001)),
        axis.line=element_line(linewidth=0.5, colour="black"),
        strip.background=element_rect(color="white",
                                      linewidth=0.5, linetype="solid"))

Now the trends in the data are much clearer!

Let's highlight the fact that all the data are centered in the panel.

library(ggplot2)
ggplot(data=summary2, aes(x=biomass_n_Mean, y=percentage_Mean))+
  geom_smooth(aes(group=nutrient), method='lm', linetype=1, se=TRUE, formula=y~x, linewidth=0.5)+
  geom_errorbar(aes(xmin=biomass_n_Mean-biomass_n_se, xmax=biomass_n_Mean+biomass_n_se),
                position=position_dodge(0.9), width=0.1) +
  geom_errorbar(aes(ymin=percentage_Mean-percentage_se, ymax=percentage_Mean+percentage_se),
                position=position_dodge(0.9), width=0.1) +
  geom_point(aes(fill=nutrient, shape=nutrient), color="black", size=5)+
  scale_fill_manual(values=c("darkgreen","grey65")) +
  scale_shape_manual(values=c(21, 22))+
  geom_vline(xintercept=0, linetype="dashed", color="black") +
  geom_hline(yintercept=0, linetype="dashed", color="black") +
  scale_x_continuous(breaks=seq(-5,5,2.5), limits=c(-5,5))+
  scale_y_continuous(breaks=seq(-5,5,2.5), limits=c(-5,5))+
  labs(x="Normalized canopy biomass", y="Normalized nitrogen or phosphorus\nin canopy") +
  theme_classic(base_size=20, base_family="serif")+
  theme(legend.position=c(0.85,0.13),
        legend.title=element_blank(),
        legend.key=element_rect(color=alpha("white",.001),
                                fill=alpha("white",.001)),
        legend.background=element_rect(fill=alpha("white",.001)),
        axis.line=element_line(linewidth=0.5, colour="black"),
        strip.background=element_rect(color="white",
                                      linewidth=0.5, linetype="solid"))

Full code

Copy the code below and paste it into your R script window to obtain the same graph as above.

if (require("readr") == F) install.packages("readr") 
library(readr)
if (require("dplyr") == F) install.packages("dplyr") 
library(dplyr)
if (require("tidyr") == F) install.packages("tidyr") 
library(tidyr)
if (require("ggplot2") == F) install.packages("ggplot2") 
library(ggplot2)


github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/biomass_N_P.csv"
df= data.frame(read_csv(url(github), show_col_types=FALSE))

# z-score each variable within every season x cultivar group
Normalized1= data.frame(df %>%
              group_by(season, cultivar) %>%
              dplyr::mutate(
              biomass_n=(biomass-mean(biomass, na.rm=TRUE))/sd(biomass, na.rm=TRUE),
              nitrogen_n=(nitrogen-mean(nitrogen, na.rm=TRUE))/sd(nitrogen, na.rm=TRUE),
              phosphorus_n=(phosphorus-mean(phosphorus, na.rm=TRUE))/sd(phosphorus, na.rm=TRUE))
)
# drop the original columns 4 to 7 (rep, biomass, nitrogen, phosphorus)
Normalized=Normalized1[,c(-4,-5,-6,-7)]

df2= data.frame(Normalized %>%
                pivot_longer(
                 cols= c(nitrogen_n, phosphorus_n),
                 names_to= "nutrient",
                 values_to= "percentage"))

summary2 = data.frame(df2 %>%
                      group_by(cultivar, treatment, nutrient) %>%
                      dplyr::summarize(across(c(biomass_n, percentage), 
                               .fns= list(Mean=~mean(., na.rm= TRUE), 
                                se=~sd(.,na.rm= TRUE) / sqrt(length(.))))))

ggplot(data=summary2, aes(x=biomass_n_Mean, y=percentage_Mean))+
  geom_smooth(aes(group=nutrient), method='lm', linetype=1, se=TRUE, formula=y~x, linewidth=0.5)+
  geom_errorbar(aes(xmin=biomass_n_Mean-biomass_n_se, xmax=biomass_n_Mean+biomass_n_se),
                position=position_dodge(0.9), width=0.1) +
  geom_errorbar(aes(ymin=percentage_Mean-percentage_se, ymax=percentage_Mean+percentage_se),
                position=position_dodge(0.9), width=0.1) +
  geom_point(aes(fill=nutrient, shape=nutrient), color="black", size=5)+
  scale_fill_manual(values=c("darkgreen","grey65")) +
  scale_shape_manual(values=c(21, 22))+
  geom_vline(xintercept=0, linetype="dashed", color="black") +
  geom_hline(yintercept=0, linetype="dashed", color="black") +
  scale_x_continuous(breaks=seq(-5,5,2.5), limits=c(-5,5))+
  scale_y_continuous(breaks=seq(-5,5,2.5), limits=c(-5,5))+
  labs(x="Normalized canopy biomass", y="Normalized nitrogen or phosphorus\nin canopy") +
  theme_classic(base_size=20, base_family="serif")+
  theme(legend.position=c(0.85,0.13),
        legend.title=element_blank(),
        legend.key=element_rect(color=alpha("white",.001),
                                fill=alpha("white",.001)),
        legend.background=element_rect(fill=alpha("white",.001)),
        axis.line=element_line(linewidth=0.5, colour="black"),
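        strip.background=element_rect(color="white",
                                     linewidth=0.5,linetype="solid"))
# windows() exists only on Windows; to open a 5.5 x 5 inch device on any OS,
# call dev.new(width=5.5, height=5) before running the ggplot() call above.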

Agronomy4future

Stories about cereals and statistics (plus coding). We aim to develop open-source code for agronomy.
