머신러닝 기법을 적용한 간호학 연구: 체계적 문헌고찰 및 네트워크 메타분석
Application of machine learning techniques in nursing research: A systematic review and network meta-analysis
- Keywords: Nursing, Machine learning, Systematic review, Network meta-analysis, Meta-analysis
- Institution: Gangneung-Wonju National University, Graduate School
- Advisor: 김은주
- Year of publication: 2025
- Degree conferred: August 2025
- Degree: Doctorate
- Department and major: Department of Nursing, Graduate School
- Subfield: Not applicable
- URI: http://www.dcollection.net/handler/kangnung/000000012162
- UCI: I804:42001-000000012162
- Language of text: Korean
Abstract
Machine learning (ML), a subfield of artificial intelligence (AI), has recently emerged as a key technology for dramatically improving patient-centered decision-making and operational efficiency across all domains of nursing education, practice, and research. However, its practical application and dissemination have been limited by the heterogeneity of research designs and reporting methods, as well as a lack of evidence on the comparative performance of different algorithms. In particular, the existing literature has often failed to adequately describe the rationale for algorithm selection, hyperparameter tuning, external validation procedures, and ethical considerations, thereby undermining result reproducibility and clinical credibility. Furthermore, numerous studies have been reported in which nurses were not included in the research teams. To address these gaps and empirically present the current state and future directions of ML applications in nursing, this study conducted a comprehensive evaluation combining a systematic review and a network meta-analysis (NMA). We searched 13 domestic and international databases up to September 27, 2024, identifying 3,653 articles using the PICOTS-SD (Population, Intervention, Comparators, Outcomes, Timing, Setting, Study Design) strategy. After removing 2,102 duplicates, a final selection of 125 studies (101 for meta-analysis) was made following title, abstract, and full-text screening. The conceptual framework was a model that redefined the six phases of the Cross-Industry Standard Process for Data Mining (CRISP-DM) for the nursing context. The quality of reporting was assessed using the 27 items of the TRIPOD+AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis + Artificial Intelligence) checklist. For the meta-analysis, we extracted AUC-ROC and F1-scores for classification problems; R², RMSE, and MAE for regression problems; and Concordance, ICC, and Silhouette values for clustering problems. The network meta-analysis used frequentist P-scores to rank the relative performance of algorithms. Heterogeneity was assessed using Cochran's Q and I², the consistency of direct and indirect comparisons was verified with the Bucher method, and publication bias was examined using funnel plots and Egger's regression. The quality assessment revealed that the average compliance rate with the TRIPOD+AI checklist across all studies was moderate at 50.4% (±9.4%). Notably, only 18.4% of the studies specified the model type, target population, and results in the title. The reporting of AI-specific items such as missing-data imputation, sample-size justification, class-imbalance handling, and fairness assessment was generally deficient; this may act as a structural risk factor for research reproducibility and clinical translatability. In the NMA of 68 classification studies based on AUC-ROC, AdaBoost demonstrated the highest performance with a P-score of 0.987, followed by ANN (0.941), NB (0.844), XGBoost (0.719), and BN (0.710). LR (0.564), RF (0.521), DT (0.408), and SVM (0.399) belonged to the "good" performance group, while ET and Linear Regression had P-scores of 0.000, indicating both low performance and infrequent use. For the F1-score metric (59 studies), AdaBoost was also overwhelmingly superior with a P-score of 1.000, followed by NB (0.949) and RF (0.831), whereas GBM (0.037) and BN (0.005) ranked lowest. For both metrics, Cochran's Q and I² values approached zero, indicating negligible heterogeneity between studies.
The Bucher method also showed that AdaBoost was statistically significantly superior to LR, RF, DT, SVM, and GBM. In the analysis of 14 regression models, ensemble methods such as RF, XGBoost, and GBM showed high predictive accuracy with low MSE and RMSE, although the wide confidence intervals for RF indicated variability depending on data characteristics. While Linear Regression had relatively high explanatory power (R² = 0.672), GBM exhibited the most stable error distribution in terms of MAE. Among the seven clustering studies, K-means showed relatively good structural fit on the Silhouette coefficient (0.600) and ICC (0.853), as did LDA on Concordance (0.72); however, a comprehensive review of the trade-offs among these indicators was judged necessary to ensure the clinical validity of cluster interpretation. Funnel plots and Egger's regression showed no publication bias for any metric across the classification, regression, and clustering models (all p > .05, R² ≈ 0.01). The overall low heterogeneity and absence of bias support the reliability of this meta-analysis's findings and enhance comparability across nursing datasets. In conclusion, the relative performance of machine learning techniques in nursing shows clear differences depending on the algorithm family, data characteristics, and target variables. AdaBoost was confirmed to be worthy of primary consideration for prediction and classification tasks in clinical nursing practice, as it consistently demonstrated top-tier performance and low variability across multiple metrics. However, a key implication of this study is that model selection must comprehensively consider not only average performance but also variability, interpretability, ease of clinical application, data quality, and ethical and fairness factors. To advance the field, it is crucial to improve compliance with the TRIPOD+AI checklist to ensure transparency and reproducibility. Furthermore, institutionalizing collaborative structures between nurses and engineers is needed to establish data preprocessing, modeling, and validation frameworks tailored to real-world clinical problems. Future research should incorporate large-scale, multi-center clinical data and advanced deep learning and reinforcement learning techniques to simultaneously validate algorithmic performance and its tangible effects on patients and nurses, thereby advancing the level of evidence for data-driven nursing.
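To make the ranking machinery above concrete, the following is a minimal sketch of how a frequentist P-score can be computed from pooled effect estimates and their standard errors. All numbers are hypothetical placeholders, the independence assumption is a simplification, and an actual analysis would use a dedicated NMA package rather than this toy code.

```python
# Minimal sketch of a frequentist P-score ranking (toy numbers only;
# not the dissertation's actual estimates or its analysis code).
from scipy.stats import norm

# Hypothetical pooled effects (e.g., on a logit-AUC scale) and standard
# errors for three algorithms from the network estimation step.
effects = {"AdaBoost": 0.90, "ANN": 0.80, "RF": 0.60}
se = {"AdaBoost": 0.05, "ANN": 0.06, "RF": 0.05}

def p_scores(effects, se):
    """P-score of i = mean probability that i outperforms each j != i."""
    scores = {}
    for i in effects:
        probs = []
        for j in effects:
            if i == j:
                continue
            se_diff = (se[i] ** 2 + se[j] ** 2) ** 0.5  # assumes independence
            probs.append(norm.cdf((effects[i] - effects[j]) / se_diff))
        scores[i] = sum(probs) / len(probs)
    return scores

print(p_scores(effects, se))  # AdaBoost ranks first in this toy network
```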
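Likewise, Cochran's Q and Higgins' I² follow directly from study-level effects and variances under an inverse-variance model; the sketch below uses made-up inputs purely for illustration.

```python
# Cochran's Q and Higgins' I^2 for k study-level effects (toy inputs).
import numpy as np

theta = np.array([0.85, 0.88, 0.86, 0.87])    # hypothetical per-study effects
var = np.array([0.004, 0.005, 0.004, 0.006])  # hypothetical variances

w = 1.0 / var                                  # inverse-variance weights
theta_pooled = np.sum(w * theta) / np.sum(w)   # fixed-effect pooled estimate
Q = np.sum(w * (theta - theta_pooled) ** 2)    # Cochran's Q
df = len(theta) - 1
I2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0  # I^2 in percent

print(f"Q = {Q:.3f}, I^2 = {I2:.1f}%")
# Q at or below its df, and hence I^2 of 0, indicates negligible
# heterogeneity, as reported for both the AUC-ROC and F1-score networks.
```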
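The Bucher adjusted indirect comparison reported above can be expressed in a few lines: the indirect A-versus-B effect is the difference of two direct comparisons against a shared comparator C, with standard errors added in quadrature. The effect sizes below are hypothetical.

```python
# Bucher adjusted indirect comparison: A vs. B via common comparator C.
from scipy.stats import norm

def bucher(d_ac, se_ac, d_bc, se_bc):
    """Indirect A-vs-B effect and its z-test from A-vs-C and B-vs-C."""
    d_ab = d_ac - d_bc                        # indirect effect estimate
    se_ab = (se_ac ** 2 + se_bc ** 2) ** 0.5  # SEs add in quadrature
    z = d_ab / se_ab
    p = 2 * (1 - norm.cdf(abs(z)))            # two-sided p-value
    return d_ab, se_ab, p

# Hypothetical numbers: AdaBoost vs. LR, each compared directly with RF.
d_ab, se_ab, p = bucher(d_ac=0.30, se_ac=0.08, d_bc=0.05, se_bc=0.07)
print(f"indirect effect = {d_ab:.2f} (SE {se_ab:.2f}), p = {p:.4f}")
```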
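Egger's regression test, used here to assess publication bias, regresses the standardized effect on precision and tests whether the intercept departs from zero. A minimal sketch on synthetic, bias-free data (assuming statsmodels is available):

```python
# Egger's regression test: standardized effect ~ precision (synthetic data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
se = rng.uniform(0.02, 0.15, size=20)  # hypothetical standard errors
effect = 0.85 + rng.normal(0, se)      # noisy effects, no small-study bias

y = effect / se                        # standardized effects
x = sm.add_constant(1.0 / se)          # precision, plus intercept column
fit = sm.OLS(y, x).fit()

intercept, p_value = fit.params[0], fit.pvalues[0]
print(f"Egger intercept = {intercept:.3f}, p = {p_value:.3f}")
# An intercept p > .05, as reported in this review, gives no evidence of
# funnel-plot asymmetry / publication bias.
```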
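Finally, clustering indices such as the Silhouette coefficient are standard library calls; the scikit-learn sketch below runs on synthetic blobs and is unrelated to the reviewed datasets.

```python
# Silhouette coefficient for a K-means solution on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Values near 1 indicate compact, well-separated clusters; a value around
# 0.6, as in the pooled K-means result above, is a reasonably good fit.
print(f"Silhouette = {silhouette_score(X, labels):.3f}")
```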
Table of Contents
Ⅰ. Introduction ··································1
1. Need for the study ··································1
2. Purpose of the study ··································7
3. Definition of terms ··································7
Ⅱ. Literature Review ··································9
1. Types of machine learning algorithms and their applications in nursing research ··········9
2. Methodology for the systematic conduct and reporting of machine learning research ··········14
Ⅲ. Conceptual Framework ··································27
Ⅳ. Methods ··································30
1. Study design ··································30
2. Key questions and selection criteria ··································30
3. Literature search and selection process ··································32
4. Quality assessment of the literature ··································35
5. Data analysis ··································36
6. Ethical considerations ··································38
Ⅴ. Results ··································39
1. Quality assessment of the literature ··································39
2. General characteristics of the selected studies ··································46
3. Analysis of results according to the CRISP-DM conceptual framework ··········49
4. Performance of machine learning algorithms for the meta-analysis ··········138
5. Relative rankings based on the F1-score results of the network meta-analysis ··········144
6. Comparison of machine learning models using the Bucher method ··········145
7. Network meta-analysis of machine learning algorithms based on F1-score ··········149
8. F1-score distribution by algorithm ··································150
9. Publication bias assessment using funnel plot and Egger's regression test ··········151
10. Relative rankings based on the AUC-ROC results of the network meta-analysis ··········154
11. Comparison of machine learning models using the Bucher method ··········155
12. Network meta-analysis of machine learning algorithms based on AUC-ROC ··········161
13. AUC-ROC distribution by algorithm ··································163
14. Publication bias assessment using funnel plot and Egger's regression test ··········164
Ⅵ. Discussion ··································167
1. Discussion of the quality assessment of the selected studies ··········167
2. Discussion of the general characteristics of the selected studies ··········169
3. Discussion of CRISP-DM problem definition for the research objectives ··········171
4. Discussion of CRISP-DM data collection and exploration ··········174
5. Discussion of CRISP-DM data preparation ··········176
6. Discussion of CRISP-DM model building ··········178
7. Discussion of CRISP-DM evaluation and review ··········180
8. Discussion of CRISP-DM deployment ··········182
9. Performance comparison of machine learning classification models ··········185
10. Performance comparison of machine learning regression models ··········187
11. Performance comparison of machine learning clustering models ··········189
12. Discussion of the network meta-analysis (F1-score) ··········190
13. Discussion of the network meta-analysis (AUC-ROC) ··········192
Ⅶ. Conclusion ··································197
References ··································198
Supplementary Material ··································214
Abstract ··································232

