Rizkala T., Muench N., Hassan C., Dinis-Ribeiro M.

Background: This study assessed the effectiveness of large language models (LLMs) in generating lay summaries for patient education on the management of precancerous lesions and early neoplasia in the stomach.

Methods: In this pilot study, we used a two-period, crossover, blinded design to compare a summary generated by ChatGPT-4o with a summary produced by Digestive Cancers Europe (DiCE). Two panels rated the materials: expert physicians and members of the DiCE Patient Advisory Committee. Experts scored accuracy, completeness, comprehensibility, and satisfaction (across five sections); patients rated overall completeness, comprehensibility, and satisfaction. Paired comparisons used mixed-effects estimates. Readability was assessed with the Flesch–Kincaid grade level (FKGL) and the SMOG index.

Results: Median expert ratings were similar between the two materials across all metrics. For the overall summary, median (range; IQR) scores for ChatGPT-4o vs. DiCE were: accuracy 5 (4–6; 1) vs. 5 (3–6; 1) (P = 0.10); completeness 4 (3–5; 1) vs. 4 (2–5; 1) (P = 0.27); comprehensibility 4 (3–5; 1) vs. 4 (2–5; 1) (P = 0.33); and satisfaction 4 (2–5; 1) vs. 3 (1–5; 2) (P = 0.53). Patient ratings mirrored the expert ratings. Neither summary met guideline-recommended readability levels on either the FKGL or the SMOG index.

Conclusion: ChatGPT-4o produced patient materials comparable to those of DiCE, but both require readability optimization; a human-in-the-loop workflow and future tests across prompts and models are warranted.
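The two readability metrics named in the Methods are standard published formulas. The sketch below shows how each is computed from simple text counts; it is an illustration only, not the authors' analysis pipeline. The function names and the example counts are hypothetical, and the counting of words, sentences, and syllables (which varies between tools) is assumed to have been done beforehand.

```python
import math


def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from precomputed counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog(polysyllables: int, sentences: int) -> float:
    """SMOG index from the number of words with three or more syllables,
    normalised to a 30-sentence sample."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


# Hypothetical counts for illustration; not taken from the study.
example_fkgl = fkgl(words=100, sentences=5, syllables=150)
example_smog = smog(polysyllables=30, sentences=30)
```

Both scores approximate the years of schooling needed to understand a text, so lower values indicate easier reading.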

Generative artificial intelligence for patient education material on gastric cancer prevention

