랜덤 포레스트를 이용한 청소년 알레르기 질환에 미치는 영향요인 예측 모델 : 2023년 청소년 건강행태조사 결과를 중심으로

Prediction Model for Influential Factors on Adolescent Allergic Diseases Using Random Forest: Focused on the 2023 Korea Youth Health Behavior Survey

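Abstract (translated from Korean)

This study is a machine learning-based secondary analysis of big data, conducted to predict the factors influencing adolescent allergic diseases and to build a predictive model by applying the random forest algorithm to data from the 2023 Korea Youth Risk Behavior Survey. Data collection and model construction proceeded in five stages: data collection, preprocessing, predictive model construction, performance evaluation, and optimal model selection. The data were downloaded from the official survey website of the Korea Disease Control and Prevention Agency (http://www.kdca.go.kr/yhs/), and the dataset, provided in SAS format (.sas7bdat), was converted to Excel format using Python. The survey population consisted of students nationwide from the first year of middle school through the third year of high school, and included students from 800 schools selected by stratified cluster sampling. The main variables comprised sociodemographic characteristics (sex, grade, region of residence, academic performance, household income level, and others), health behavior variables (smoking, drinking, physical activity, stress, and others), and the diagnosis and symptoms of allergic diseases. Of 155 variables in total, 73 that were unrelated to allergy or had a non-response rate above 20% were removed, and the remaining 82 were selected as independent variables. A model was built with the random forest algorithm, and the top 20 variables were selected as key variables for model interpretation. In predicting adolescent allergic diseases, the random forest model was judged to perform at a moderate level, with an accuracy of 57.8% and an AUC of 0.608. Key variables such as anxiety level, subjective health perception, and study time were found to have an important influence on allergy occurrence, and mental health factors in particular had a large influence. The confusion matrix showed 3,344 true negatives (TN, no allergy) and 2,771 true positives (TP, allergy present), with errors of 2,283 false negatives (FN) and 2,178 false positives (FP). The performance metrics were a precision of 0.58, a recall of 0.58, and an F1-score of 0.58, and tuning the hyperparameters to optimize performance improved the model at the optimal values. The variables with an important influence on the occurrence of adolescent allergic diseases were anxiety level, subjective health perception, and study time. In particular, higher anxiety levels tended to be associated with a higher risk of allergy, while more positive subjective health perception was associated with a lower risk. Longer study time also tended to be associated with a higher likelihood of allergy. Based on these results, the study concludes that health promotion plans including stress management and strategies to relieve academic stress are important for preventing and managing adolescent allergic diseases. Because mental health emerged as a major factor in allergic disease occurrence, preventive interventions and counseling programs that support stress relief and psychological stability in adolescents are needed. In addition, because sedentary behavior such as long hours of study may raise the risk of allergy, schools should put in place measures that ease academic burden and encourage physical activity. Integrated, multidimensional policies and programs are required to reduce allergic disease occurrence effectively through preventive interventions that jointly consider mental health, lifestyle habits, and environmental factors.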
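The variable-screening rule described in the abstract (dropping any variable whose non-response rate exceeds 20%) can be sketched in a few lines. This is only an illustration under stated assumptions, not the study's code: the column names, the toy records, and the helpers `missing_rate` and `drop_high_missing` are hypothetical, and `None` stands in for a missing answer.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]


def missing_rate(values: Column) -> float:
    """Return the fraction of entries that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def drop_high_missing(columns: Dict[str, Column],
                      threshold: float = 0.20) -> Dict[str, Column]:
    """Keep only the columns whose missing rate does not exceed `threshold`."""
    return {name: vals for name, vals in columns.items()
            if missing_rate(vals) <= threshold}


# Hypothetical toy data: three survey items answered by ten respondents.
survey = {
    "anxiety_score": [3, 5, None, 2, 4, 1, 2, 3, 4, 5],            # 10% missing
    "study_hours": [2, None, None, None, 1, 3, 2, 4, 5, 1],        # 30% missing
    "self_rated_health": [1, 2, 3, 2, 1, 2, 3, 1, 2, 3],           # 0% missing
}

kept = drop_high_missing(survey)
print(sorted(kept))
```

In this toy example only `study_hours` crosses the 20% threshold and is dropped. The abstract's second exclusion criterion, relevance to allergy, is a judgment by the researchers and is not modeled here.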

Abstract (English)

This study is a machine learning-based secondary data analysis that used the 2023 Korea Youth Risk Behavior Survey (KYRBS) data to predict factors influencing adolescent allergic diseases and to develop a predictive model with the random forest algorithm. The research process was conducted in five stages: data collection, preprocessing, predictive model construction, performance evaluation, and optimal model selection. The dataset, available in SAS format (.sas7bdat), was obtained from the KYRBS official website (http://www.kdca.go.kr/yhs/) and converted into Excel format using Python for analysis. The survey targeted middle and high school students across South Korea, including students from 800 schools selected via stratified cluster sampling. Key variables included sociodemographic characteristics (e.g., gender, grade level, residential area, academic performance, household income level), health behavior-related variables (e.g., smoking, alcohol consumption, physical activity, stress), and the diagnosis and symptoms of allergic diseases. Of the 155 variables, 73 were excluded because they were unrelated to allergic diseases or had a non-response rate above 20%, leaving 82 independent variables for analysis. A random forest model was built, and the 20 highest-ranked variables were selected for model interpretation. The predictive performance of the random forest model was evaluated as moderate, with an accuracy of 57.8% and an AUC of 0.608. Key predictors included anxiety levels, subjective health perception, and study hours, which were found to significantly influence the occurrence of allergic diseases. In particular, mental health factors were identified as major contributors to allergic disease development. Confusion matrix analysis revealed 3,344 true negatives (TN) and 2,771 true positives (TP), while false negatives (FN) and false positives (FP) accounted for 2,283 and 2,178 cases, respectively. Performance metrics showed a precision of 0.58, recall of 0.58, and F1-score of 0.58. Hyperparameter optimization further improved the model’s performance. The key factors influencing the occurrence of adolescent allergic diseases were identified as anxiety levels, subjective health perception, and study hours. Specifically, higher anxiety levels were associated with an increased risk of allergic diseases, whereas positive subjective health perception was associated with a reduced risk. Additionally, longer study hours were associated with a higher likelihood of allergic disease occurrence. These findings underscore the necessity of developing health promotion strategies that incorporate stress management and academic stress alleviation to prevent and manage adolescent allergic diseases. Given that mental health emerged as a critical factor, preventive interventions and counseling programs aimed at mitigating stress and fostering psychological stability in adolescents are essential. Moreover, prolonged sedentary behaviors, such as extended study hours, may elevate the risk of allergic diseases. Accordingly, schools should establish measures to alleviate academic burdens and encourage physical activity. To effectively reduce the prevalence of allergic diseases, it is imperative to implement integrated and multidimensional policies and programs that comprehensively address mental health, lifestyle habits, and environmental factors.
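The reported metrics can be checked against the four confusion-matrix counts given in the abstract. The sketch below recomputes accuracy, per-class precision, recall and F1, and their support-weighted averages. The helper `prf` is hypothetical, and treating the reported 0.58 values as support-weighted averages over both classes is an assumption: the abstract does not say which averaging was used.

```python
# Counts as reported in the abstract's confusion matrix.
TN, FP, FN, TP = 3344, 2178, 2283, 2771


def prf(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall and F1 for one class, given its tp/fp/fn counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = TN + FP + FN + TP
accuracy = (TP + TN) / total

# "Allergy present" treated as the positive class.
pos = prf(TP, FP, FN)
# "Allergy absent" treated as the class of interest: the roles of FP and FN swap.
neg = prf(TN, FN, FP)

# Support = number of true members of each class.
support_pos = TP + FN
support_neg = TN + FP

# Support-weighted average of each metric (assumed averaging scheme).
weighted = tuple(
    (support_pos * p + support_neg * n) / total for p, n in zip(pos, neg)
)

print(f"accuracy = {accuracy:.3f}")
print("positive class  P/R/F1 = " + "/".join(f"{m:.2f}" for m in pos))
print("negative class  P/R/F1 = " + "/".join(f"{m:.2f}" for m in neg))
print("weighted avg    P/R/F1 = " + "/".join(f"{m:.2f}" for m in weighted))
```

Under these counts the accuracy comes to about 0.578, matching the reported 57.8%. Precision and recall for the allergy-present class alone come to roughly 0.56 and 0.55, while the support-weighted averages all round to 0.58, which is why the weighted reading appears to fit the reported figures.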

Table of Contents

Ⅰ. Introduction
1. Research Background
2. Research Purpose
3. Definition of Terms
Ⅱ. Literature Review
1. Adolescent Allergic Diseases
2. Big Data Research Using Random Forest
Ⅲ. Methods
1. Research Design
2. Data Collection and Model Construction
Ⅳ. Results
1. Descriptive Statistics of Key Variables
2. Confusion Matrix
3. Predictive Model Performance Evaluation
4. SHAP Values
Ⅴ. Discussion
Ⅵ. Conclusion and Recommendations
References
Abstract
