Applying transformer-based deep learning models — including BEHRT — to automatically classify nursing diagnoses from ICD-coded administrative claims, enabling scalable population-level nursing quality research.
* Contact author · ¹UNLV Department of Computer Science · ²UNLV School of Nursing
Background: Nursing diagnoses represent standardized classifications of patient conditions that guide clinical care planning. While nursing-sensitive quality indicators are increasingly important for healthcare policy, manually coding nursing diagnoses from large administrative datasets is prohibitively resource-intensive. All-payer claims databases (APCDs) and hospital discharge records encode clinical conditions through ICD (International Classification of Diseases) codes — but the mapping between these codes and formal nursing diagnoses remains largely unautomated.
Objective: This study evaluates the use of transformer-based language models — specifically BEHRT (BERT for Electronic Health Records) — for automated classification of nursing diagnoses from ICD-coded administrative claims sequences. We assess model performance across nursing diagnosis categories, compare transformer approaches against traditional machine learning baselines, and examine racial and socioeconomic fairness in model predictions.
Methods: We construct a labeled dataset from a de-identified administrative claims corpus, mapping ICD-10 diagnostic and procedure codes to standardized nursing diagnoses. BEHRT is fine-tuned on sequential ICD code inputs, with performance benchmarked against logistic regression, gradient boosting, and BioBERT baselines. Fairness analysis follows established guidelines for algorithmic equity in healthcare AI.
Results: BEHRT achieves a macro-averaged F1 score of 0.847 across nursing diagnosis categories, outperforming all baselines. Performance is strongest for high-frequency diagnoses with clear ICD-code correlates (F1 > 0.91) and lowest for complex, multi-factorial diagnoses requiring contextual clinical judgment. Racial disparities in prediction accuracy are identified and discussed.
Conclusions: Transformer-based models can reliably automate nursing diagnosis classification from administrative claims data — opening the door to scalable, reproducible nursing quality measurement at population scale. Identified fairness gaps highlight the need for careful validation across demographic subgroups before clinical deployment.
A labeled dataset was constructed from de-identified administrative claims using established ICD-to-nursing-diagnosis crosswalk tables and clinical expert review. The corpus includes hospital discharge claims and All-Payer Claims Database records spanning multiple states.
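The crosswalk step described above can be sketched as a simple lookup from ICD-10 codes to nursing-diagnosis labels. This is a minimal illustration only: the code-to-label pairs below are invented for the example and are not the study's actual crosswalk table.

```python
# Illustrative crosswalk: ICD-10 code -> nursing-diagnosis label.
# These mappings are hypothetical examples, not the study's validated table.
CROSSWALK = {
    "M62.81": "impaired_physical_mobility",  # muscle weakness (generalized)
    "E86.0":  "deficient_fluid_volume",      # dehydration
    "R53.83": "activity_intolerance",        # other fatigue
}

def map_claims_to_labels(icd_codes):
    """Return the set of nursing-diagnosis labels triggered by a claim's ICD codes."""
    labels = set()
    for code in icd_codes:
        if code in CROSSWALK:
            labels.add(CROSSWALK[code])
    return labels

# Example claim: two codes match the crosswalk, one (I10, hypertension) does not.
example_labels = map_claims_to_labels(["M62.81", "I10", "E86.0"])
```

In the study itself, crosswalk output was further reviewed by clinical experts; a pure table lookup like this is only the automated first pass.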
BEHRT (Li et al., 2020) adapts the BERT transformer architecture for sequential ICD code inputs, treating each patient's claim history as a "language" sequence. The model encodes temporal patterns in diagnosis and procedure codes that traditional ML models miss.
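The "claims as language" idea can be sketched as follows: each visit's ICD codes become tokens, with a separator between visits and a per-token visit index standing in for BEHRT's segment/position embeddings. This is a simplified assumption of the input format (BEHRT additionally embeds patient age per visit); the vocabulary below is illustrative.

```python
# Simplified BEHRT-style input construction (assumed, not the exact pipeline):
# visits are ordered in time; each code is a token; segment ids mark visit index.

def build_behrt_input(visits, vocab):
    """visits: list of visits, each a list of ICD code strings, in temporal order."""
    tokens, segments = ["[CLS]"], [0]
    for i, visit in enumerate(visits):
        for code in visit:
            tokens.append(code)
            segments.append(i)
        tokens.append("[SEP]")   # visit boundary, like sentence breaks in BERT
        segments.append(i)
    token_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    return token_ids, segments

# Toy vocabulary (illustrative)
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "I10": 4, "E86.0": 5}
ids, segs = build_behrt_input([["I10"], ["I10", "E86.0"]], vocab)
```

The segment indices are what let the transformer attend to temporal structure across visits, which is exactly the signal bag-of-codes baselines discard.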
BEHRT was fine-tuned for multi-label nursing diagnosis classification using UNLV's GPU cluster. Performance was evaluated using macro-averaged F1, precision, and recall — with stratified analysis by nursing diagnosis category, race/ethnicity, and insurance type.
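The macro-averaged F1 used for evaluation averages per-label F1 scores with equal weight, so rare nursing diagnoses count as much as common ones. A stdlib-only sketch for the multi-label case (label vectors below are illustrative):

```python
# Macro-F1 for multi-label classification, computed from scratch.
# Convention: a label with no true or predicted positives scores F1 = 1.0.

def macro_f1(y_true, y_pred, n_labels):
    """y_true, y_pred: lists of 0/1 vectors (one per record), length n_labels each."""
    f1s = []
    for j in range(n_labels):
        tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
        fp = sum((not t[j]) and p[j] for t, p in zip(y_true, y_pred))
        fn = sum(t[j] and (not p[j]) for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0)
    return sum(f1s) / n_labels

# Toy example: 3 records, 2 labels
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [1, 0], [0, 1]]
score = macro_f1(y_true, y_pred, 2)
```

In practice this is equivalent to scikit-learn's `f1_score(..., average="macro")`; the stratified analyses simply repeat the computation within each diagnosis category or demographic subgroup.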
Prediction performance was disaggregated across demographic subgroups (race, ethnicity, insurance status) following established algorithmic fairness frameworks. Disparities were quantified using equalized odds and demographic parity metrics.
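The two metrics named above can be sketched concretely for a single label and two subgroups. Demographic parity compares positive-prediction rates across groups; equalized odds compares error rates (here, the true-positive rate) across groups. The group labels and data below are illustrative, not study data.

```python
# Fairness-gap sketch for one binary label across two subgroups (illustrative).

def rate(vals):
    return sum(vals) / len(vals) if vals else 0.0

def fairness_gaps(y_true, y_pred, groups):
    """Return (demographic-parity gap, TPR gap) between the two groups present."""
    gs = sorted(set(groups))
    pos_rates, tprs = [], []
    for g in gs:
        idx = [i for i, x in enumerate(groups) if x == g]
        pos_rates.append(rate([y_pred[i] for i in idx]))                    # P(pred=1 | group)
        tprs.append(rate([y_pred[i] for i in idx if y_true[i] == 1]))      # P(pred=1 | true=1, group)
    return abs(pos_rates[0] - pos_rates[1]), abs(tprs[0] - tprs[1])

y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1]
groups = ["A", "A", "A", "B", "B", "B"]
dp_gap, tpr_gap = fairness_gaps(y_true, y_pred, groups)
```

Full equalized-odds auditing also compares false-positive rates; the study disaggregates these gaps per nursing-diagnosis category and subgroup.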
Most NLP-for-healthcare models are trained on EHR notes. We demonstrate that ICD-coded administrative claims — far more abundant and standardized — are sufficient to train high-quality nursing diagnosis classifiers when using transformer architectures.
Rather than generic NLP benchmarks, we evaluate model outputs against clinically meaningful nursing quality indicators — with validation by a certified nurse-midwife (Dr. Vanderlaan) to ensure clinical face validity alongside statistical performance.
Fig. 1 — BEHRT F1 scores by nursing diagnosis category. Bars above 85% (blue) represent high-frequency diagnoses with clear ICD correlates. Complex multi-factorial diagnoses (red) show lower but clinically meaningful performance.
The transformer-based BEHRT model achieves a macro F1 of 0.847 — a 7-point improvement over BioBERT and 21 points over logistic regression — demonstrating the value of sequential ICD code modeling.
For well-defined nursing diagnoses with strong ICD correlates (impaired mobility, fluid imbalance, activity intolerance), BEHRT achieves F1 scores exceeding 0.91 — sufficiently high for production quality measurement systems.
Prediction performance varies by race/ethnicity and insurance status — consistent with known biases in administrative data. We identify specific nursing diagnosis categories where gaps are largest and recommend targeted re-sampling strategies.
End-to-end inference on a 500,000-record dataset completes in under 4 hours on UNLV's GPU cluster — demonstrating viability for statewide APCD-scale deployment.
Manual nursing diagnosis coding for population studies costs tens of thousands of dollars in expert labor per dataset. Automated classification reduces this to GPU compute costs — typically less than $100 per million records.
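A back-of-envelope check of the compute-cost claim, using the 500,000-records-in-4-hours throughput reported above. The hourly GPU rate is an assumption chosen for illustration, not a measured figure.

```python
# Rough cost-per-million-records estimate (GPU rate is an assumed value).
gpu_cost_per_hour = 2.00          # assumed cloud GPU rate, USD/hour (illustrative)
records_per_run = 500_000         # throughput reported above
hours_per_run = 4

cost_per_million = (1_000_000 / records_per_run) * hours_per_run * gpu_cost_per_hour
```

Even with a GPU rate several times higher than the assumed $2/hour, the per-million-record cost stays well under the $100 figure cited, orders of magnitude below expert manual coding.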
Automated diagnosis classification unlocks the ability to compute nursing-sensitive quality indicators at state or national scale — enabling the kind of comparative effectiveness research that has previously been logistically impossible.
Population-scale nursing diagnosis classification makes it possible — for the first time — to systematically identify disparities in nursing care delivery across race, insurance status, and geography using administrative data.
This research is currently progressing through the peer review process. We'll update this page with publication details as they become available. Check back for the full findings when the work is formally published.
Interested in this work or looking to collaborate? Reach out to the team at appdev@unlv.edu — we welcome conversations with researchers, clinicians, and healthcare organizations working in this space.
The nursing diagnosis classification methodology explored in this research is being applied in the APCD Maternal Health project to classify and analyze nursing diagnoses in Virginia's All-Payer Claims Database.
Learn about APCD project

This study is one output of the broader NCSBN research initiative led by Vanderlaan and Fonseca, developing AI tools for nursing quality measurement and practice improvement at national scale.
Explore the initiative