Artificial intelligence (AI) and machine learning have enabled computer systems to do many remarkable things, such as diagnose skin cancer, identify a stroke on a CT scan, or flag possible cancers during a colonoscopy. These digital doctors promise faster, cheaper, and more effective diagnosis and care. But there is a worry that these technologies could also entrench and worsen biases in medicine.
As the country confronts systemic bias in key social institutions, we need technology to reduce health inequities rather than worsen them. It is well established that AI algorithms trained on data that do not represent the whole population tend to perform poorly for underrepresented groups. For example, algorithms trained on gender-imbalanced data read chest X-rays less accurately for the underrepresented gender. Similarly, there is concern that skin cancer detection algorithms, trained mainly on images of lighter-skinned patients, will be worse at detecting skin cancer in patients with darker skin.
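This effect is easy to reproduce. Below is a minimal sketch in Python (NumPy and scikit-learn) on purely synthetic data; the groups, features, and sample sizes are invented for illustration and stand in for, say, two patient populations. A classifier is trained on a sample dominated by one group and then audited separately on each group: the underrepresented group typically scores markedly worse.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def make_group(n, coef):
    """Synthetic cases: 5 features, labels driven by a group-specific signal."""
    X = rng.normal(size=(n, 5))
    y = (X @ coef + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

coef_a = np.array([1.0, 0.8, 0.0, 0.0, 0.0])  # signal for the majority group
coef_b = np.array([0.0, 0.0, 1.0, 0.8, 0.0])  # different signal for the minority

# Imbalanced training set: 950 majority cases, only 50 minority cases.
Xa, ya = make_group(950, coef_a)
Xb, yb = make_group(50, coef_b)
model = LogisticRegression().fit(np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

# Audit: evaluate on a fresh, equal-sized test set for each group.
for name, coef in [("majority", coef_a), ("minority", coef_b)]:
    X_test, y_test = make_group(2000, coef)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name} group AUC: {auc:.2f}")  # minority AUC comes out lower
```

The same stratified audit applies to real clinical models: performance should be reported per subgroup, not only in aggregate.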
Given the serious consequences of a wrong decision, medical AI algorithms need to be trained on data sets that represent diverse populations. But that kind of training is not happening. A study in JAMA reviewed more than 70 publications that compared how well physicians and digital systems diagnosed disease across several areas of clinical medicine. It found that most of the data used to train these AI algorithms came from just three states: California, New York, and Massachusetts.
The Challenge of Data Availability
Data availability remains a problem in medicine. One of our patients, a veteran, frustrated by the difficulty of obtaining his old medical records, asked, “Doc, why can we see a specific car in a moving convoy on the other side of the world, but we can’t see my CT scan from the hospital across the street?” Sharing medical data, even for a single patient, is difficult and labor-intensive; collecting the hundreds or thousands of cases needed to train machine learning algorithms is harder still. Medical data are often siloed, which limits their usefulness both for patient care and for building AI tools.
Medical data should be shared more often, but privacy laws and the need to protect sensitive information make that difficult. Economic factors also work against sharing: hospitals may withhold data for fear of losing patients to competitors. Incompatible medical records systems add a technical barrier. And public unease over how big tech companies handle personal data has bred distrust of any effort to collect it, even a well-intentioned one.
The Importance of Data Diversity
The lack of diversity in medical data is not a new problem. Women and minority groups have historically been underrepresented in clinical studies, and as a result these groups have received less benefit and suffered more side effects from approved drugs. Fixing this has required a joint effort by the NIH, the FDA, researchers, and industry, as well as an act of Congress in 1993. Although things have improved, the problem persists. Indeed, one company developing a COVID vaccine said it would slow its trial to recruit more diverse participants, underscoring how important diversity is in medical research.
AI is increasingly being deployed as an expert in consequential domains beyond medicine. AI tools can help judges make sentencing decisions, guide law enforcement actions, or advise bank officers on loan applications. But before algorithms become an integral part of such high-stakes decisions, it is essential to find and fix the biases they carry.
The problem of bias in AI is multifaceted, and diversifying training data alone may not eliminate it. Other concerns include a lack of diversity among developers and funders, the framing of problems from the majority group's perspective, biased assumptions baked into the data, and the use of AI outputs to perpetuate existing biases.
The Future of Medical AI
Because high-quality data are hard to obtain, researchers are building algorithms that can do more with less. These innovations may make AI less dependent on big data sets. But ensuring that the data used to train algorithms are diverse remains essential to understanding and addressing AI biases.
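One common way to do more with less is transfer learning: start from a model pretrained on a large generic data set and fine-tune only a small part of it on the scarce medical data. The sketch below is a hypothetical illustration using PyTorch and torchvision; the two-class task and the data loader are assumptions, not a real clinical pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on a large generic data set (ImageNet).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained feature extractor ...
for param in model.parameters():
    param.requires_grad = False

# ... and replace only the classification head for a small, hypothetical
# two-class medical task (e.g., benign vs. malignant lesion images).
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def finetune(loader, epochs=5):
    """Train only the new head on a small labeled data set.

    `loader` is an assumed torch.utils.data.DataLoader yielding
    (images, labels) batches; it is a stand-in, not a real pipeline.
    """
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```

Because only the small classification head is trained, a few hundred labeled cases can suffice where training from scratch might need many thousands.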
Building robust and fair algorithms requires a sound framework for resolving the technical, regulatory, economic, and privacy issues that stand between us and the large, diverse data sets needed to train them. It is no longer acceptable to build and deploy tools with whatever data happen to be available, without regard for the consequences, and hope for the best. We have to anticipate what can go wrong and work to prevent it.