TUESDAY, Nov. 20, 2018 (HealthDay News) -- Artificial intelligence tools trained to detect pneumonia on chest X-rays perform worse when tested on data from outside health systems, according to a study published online Nov. 6 in PLOS Medicine.
John R. Zech, M.D., from the California Pacific Medical Center in San Francisco, and colleagues assessed how well convolutional neural networks (CNNs) generalized across three hospital systems (National Institutes of Health [NIH], 112,120 chest radiographs from 30,805 patients; Mount Sinai Hospital [MSH], 42,396 chest radiographs from 12,904 patients; and Indiana University [IU], 3,807 chest radiographs from 3,683 patients) for a simulated pneumonia screening task.
The researchers found that pneumonia prevalence differed so sharply between MSH (34.2 percent) and the NIH and IU cohorts (1.2 and 1.0 percent, respectively) that merely sorting radiographs by hospital system achieved an area under the receiver operating characteristic curve (AUC) of 0.861 on the joint MSH-NIH dataset. Models trained at NIH or MSH performed comparably on IU data but worse on each other's data than on their own internal test sets. Pooling training and test data from MSH and NIH yielded the highest internal performance (AUC, 0.931), but that model's external performance at IU was significantly lower (AUC, 0.815). Artificially introducing a 10-fold difference in pneumonia rate between sites improved internal test performance (10x MSH risk, P < 0.001; 10x NIH risk, P = 0.002) but failed to generalize to IU (MSH 10x, P < 0.001; NIH 10x, P = 0.027). CNNs were also able to directly identify the hospital system of origin of a radiograph, with >99.9 percent accuracy for NIH and MSH radiographs.
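To see why a site-level shortcut alone can score so well, consider a sketch (our illustration, not the authors' code) of a degenerate "classifier" that scores every MSH radiograph 1 and every NIH radiograph 0. Plugging in the cohort sizes and pneumonia prevalences reported above, its AUC on the joint MSH-NIH dataset comes out near the 0.861 the study reports:

```python
# Illustrative sketch: AUC of a classifier that only knows the hospital system.
# Cohort sizes and prevalences are taken from the study; the function name and
# closed-form AUC calculation are our own.
def site_indicator_auc(n_msh=42396, p_msh=0.342, n_nih=112120, p_nih=0.012):
    pos_msh, neg_msh = n_msh * p_msh, n_msh * (1 - p_msh)
    pos_nih, neg_nih = n_nih * p_nih, n_nih * (1 - p_nih)
    pos, neg = pos_msh + pos_nih, neg_msh + neg_nih
    # AUC = P(random positive outscores random negative) + 0.5 * P(tie).
    # Only an MSH positive paired with an NIH negative is strictly concordant;
    # same-site pairs tie because both get the same score.
    concordant = (pos_msh / pos) * (neg_nih / neg)
    ties = (pos_msh / pos) * (neg_msh / neg) + (pos_nih / pos) * (neg_nih / neg)
    return concordant + 0.5 * ties

print(round(site_indicator_auc(), 3))  # → 0.857
```

Because roughly 92 percent of pneumonia cases in the joint dataset come from MSH, simply flagging the site recovers most of the apparent discrimination, which is why external validation at a site like IU is so revealing.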
"Our findings should give pause to those considering rapid deployment of artificial intelligence platforms without rigorously assessing their performance in real-world clinical settings reflective of where they are being deployed," a coauthor said in a statement.