UNSUPERVISED CLUSTERING FRAMEWORK FOR EARLY IDENTIFICATION OF COMMUNICABLE LUNG DISEASES USING THE K-MODES ALGORITHM
Abstract
Communicable lung diseases continue to cause high morbidity and mortality in low-resource healthcare settings due to delayed diagnosis, reliance on manual recordkeeping and limited availability of labeled clinical data required for supervised machine learning models. As a result, existing disease detection systems are difficult to deploy effectively, leading to late identification of infectious cases and increased disease transmission. To address this challenge, an unsupervised clustering framework based on the K-Modes algorithm is proposed for the early identification of communicable lung diseases. The approach analyzes categorical hospital patient records, including symptoms, diagnostic indicators, medical history, and demographic features, to automatically group patients into communicable and non-communicable disease categories without prior labeling. K-Modes is selected for its suitability in handling categorical medical data using dissimilarity measures and iterative mode updates. The proposed framework identifies clinically coherent clusters consistent with known disease patterns. Cluster validation using purity and clinical interpretation demonstrates effective separation between infectious and non-infectious cases. The model achieves an accuracy of 95.6%, precision of 96.1%, recall of 95.2%, and an F1-score of 92.2%, indicating strong clustering performance. These results demonstrate that unsupervised learning can support early disease identification, rapid pre-screening, and improved clinical triage in resource-constrained healthcare environments