Online medical consultation: is this wave of AI up to the task?
Have you ever searched online for questions like "Why does this part of me hurt?" or "What illness could I have?" The answers are often unsatisfying. But with the rise of large language models such as ChatGPT, people have begun trying to use them to answer medical questions.
But are they reliable?
Much of the time, the answers artificial intelligence gives are accurate. But James Davenport, a professor at the University of Bath in the UK, points to the gap between answering medical questions and actual medical practice. As he puts it, "Practicing medicine is not just answering medical questions. If it were purely about answering medical questions, we would not need teaching hospitals, and doctors would not need years of training after their academic coursework."
Against this backdrop of doubt, in a paper recently published in the journal Nature, leading artificial intelligence researchers presented a benchmark for evaluating how well large language models can answer people's medical questions.
Existing models are not yet perfect
The latest assessment comes from Google Research and DeepMind. The authors believe artificial intelligence models hold great potential in medicine, including for knowledge retrieval and clinical decision support. But existing models remain imperfect: they can, for example, fabricate convincing medical misinformation or absorb biases that worsen health inequities. Hence the need to evaluate their clinical knowledge.
Relevant evaluations have been done before, but past automated evaluations often relied on narrow benchmarks, such as scores on individual medical exams, which offer limited reliability and real-world value.
Moreover, when people turn to the Internet for medical information, they can be hit with "information overload" and end up fixating on the worst of ten possible diagnoses, bearing a great deal of unnecessary stress.
The research team hopes that language models can provide concise, unbiased expert answers, indicate the sources they cite, and express uncertainty appropriately.
How does an LLM with 540 billion parameters perform?
To evaluate how well LLMs encode clinical knowledge, Google Research's Shekoofeh Azizi and colleagues examined their ability to answer medical questions. The team proposed a benchmark called MultiMedQA: it combines six existing question-answering datasets covering professional medicine, research, and consumer queries with HealthSearchQA, a new dataset of 3,173 medical questions commonly searched online.
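In code, combining several question-answering datasets into a single benchmark is straightforward. The sketch below is illustrative only: the dataset names mirror those mentioned in the paper, but the record format and the `make_benchmark` helper are hypothetical simplifications, not the paper's actual data schema.

```python
# Illustrative sketch of assembling a MultiMedQA-style benchmark.
# The dict-based record format is an assumption for illustration.

def make_benchmark(datasets):
    """Merge several QA datasets, tagging each item with its source."""
    combined = []
    for name, items in datasets.items():
        for item in items:
            combined.append({"source": name, **item})
    return combined

datasets = {
    "MedQA": [{"question": "…", "options": ["A", "B", "C", "D"], "answer": "B"}],
    "HealthSearchQA": [{"question": "…", "answer": None}],  # open-ended consumer query
}
combined = make_benchmark(datasets)
```

Tagging each item with its source lets evaluators report results per dataset as well as overall.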
The team then evaluated PaLM, a 540-billion-parameter LLM, and its variant Flan-PaLM. They found that Flan-PaLM reached state-of-the-art performance on several datasets. On MedQA, a dataset of questions in the style of the US medical licensing exam, Flan-PaLM surpassed the previous best LLM by 17%.
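For multiple-choice datasets like MedQA, "state of the art" is measured by simple accuracy: the fraction of items where the model picks the keyed option. A minimal sketch, with a dummy stand-in for the model (the items and the `pick_first` function are hypothetical, not from the paper):

```python
# Sketch of scoring a model on multiple-choice medical exam questions.
# A real evaluation would query an LLM for each item instead of pick_first.

def accuracy(answer_fn, questions):
    """Fraction of items where the chosen option matches the answer key."""
    correct = sum(answer_fn(q) == q["answer"] for q in questions)
    return correct / len(questions)

questions = [
    {"question": "…", "options": ["A", "B", "C", "D"], "answer": "B"},
    {"question": "…", "options": ["A", "B", "C", "D"], "answer": "A"},
]

def pick_first(q):  # a dummy "model" that always picks the first option
    return q["options"][0]

print(accuracy(pick_first, questions))  # 0.5
```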
However, while Flan-PaLM performed well on multiple-choice questions, further evaluation revealed gaps in its ability to answer consumers' medical questions.
An LLM specialized in medicine shows promise
To address this, the researchers used a technique called instruction prompt tuning to further adapt Flan-PaLM to the medical domain. The result was Med-PaLM, an LLM specialized for medicine.
Instruction prompt tuning is an efficient way to adapt a general-purpose LLM to a new specialist domain, and the resulting model, Med-PaLM, performed encouragingly in a pilot evaluation. For example, a panel of physicians judged only 61.9% of Flan-PaLM's long-form answers to be consistent with scientific consensus, versus 92.6% for Med-PaLM, comparable to answers given by physicians themselves. Similarly, 29.7% of Flan-PaLM's answers were rated as potentially harmful, versus only 5.8% for Med-PaLM, again comparable to physician answers.
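Figures like 61.9% or 92.6% are simply the share of answers the physician panel flagged as consensus-consistent. A toy sketch of that arithmetic, with invented ratings (the numbers below are illustrative, not the paper's):

```python
# Toy sketch: turning a panel's per-answer verdicts into a percentage.
# Ratings are hypothetical booleans (True = judged consensus-consistent).

def pct(flags):
    """Percentage of True flags."""
    return 100.0 * sum(flags) / len(flags)

flan_palm = [True, True, False, True, False]  # 3 of 5 judged consensus-consistent
med_palm = [True, True, True, True, False]    # 4 of 5

print(pct(flan_palm), pct(med_palm))  # 60.0 80.0
```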
The research team noted that although the results are promising, further evaluation is necessary, especially regarding safety, fairness, and bias.
In other words, many limitations remain to be overcome before clinical application of LLMs becomes feasible.