June 22, 2024

Medical Trend

Medical News and Medical Resources

Google launched the “medical version of ChatGPT” and “AI doctor”


Google has published a paper in Nature introducing a “medical version of ChatGPT”, and an “AI doctor” based on a large language model has begun clinical testing.

At the end of 2022, ChatGPT, OpenAI’s chatbot built on a large language model (LLM), demonstrated impressive capabilities, but the bar for applying large language models in clinical settings is high.

Medicine is a human enterprise in which language is key to communication between clinicians, researchers and patients. Artificial intelligence (AI) models, especially the recent advances in large language models (LLMs), bring new hope for the application of AI in medicine. They have many potential uses in the field, including knowledge retrieval and clinical decision support.


Existing medical AI models, while useful to a certain extent, are primarily single-task systems that lack expressiveness and interactive capabilities, and they may fabricate convincing medical misinformation or incorporate biases that exacerbate health inequalities. As a result, there is a mismatch between what existing AI models can do and what is expected of them in real-world clinical workflows, which makes it difficult to translate them into real-world reliability or value.

On July 12, 2023, researchers from Google and its artificial intelligence company DeepMind published a research paper entitled "Large language models encode clinical knowledge" in the leading international academic journal Nature.

The study presents a benchmark for evaluating how well a large language model (LLM) can answer medical questions, and also introduces a large language model specialized in medicine: Med-PaLM.





Recent advances in large language models (LLMs) offer an opportunity to rethink AI systems, with language as a tool for mediating human-AI interactions.

Large language models serve as “foundation models”: large pretrained AI systems that can be repurposed with minimal effort across numerous domains and different tasks.

These expressive and interactive models offer great promise for learning generally useful representations, at scale, from the knowledge encoded in medical corpora.

These models have several exciting potential applications in medicine, including knowledge retrieval, clinical decision support, summarization of key findings, patient triage, addressing primary care concerns, and more.


However, the safety-critical nature of the field requires the development of thoughtful assessment frameworks that allow researchers to meaningfully measure progress and capture and mitigate potential harms.

This is especially important for large language models, which may generate information that is inconsistent with clinical and societal values. For example, they may generate convincing medical misinformation, or contain biases that could exacerbate health inequalities.


To assess the ability of large language models (LLMs) to encode clinical knowledge, the research team explored their ability to answer medical questions.

This task is very challenging because providing high-quality answers to medical questions requires understanding medical context, recalling appropriate medical knowledge, and reasoning with expert information.


The study presents a benchmark called MultiMedQA. It combines 6 existing question answering datasets covering professional medicine, research and consumer queries with HealthSearchQA, a new dataset of 3,173 medical questions commonly searched online.

The benchmark is used to assess large language models' answers to medical questions along several axes: factuality, use of expertise in reasoning, helpfulness, accuracy, health equity, and potential harm.






Encouraging performance

The research team then evaluated PaLM (a large language model with 540 billion parameters) and its instruction-tuned variant Flan-PaLM. They found that Flan-PaLM achieves state-of-the-art performance on some datasets.

On the MedQA dataset, which consists of USMLE-style questions, Flan-PaLM surpassed the previous state of the art by more than 17%, achieving an accuracy of 67.6%, which is above the passing standard for the exam (60%).
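As a rough illustration of how a multiple-choice score like this is computed and compared with a pass mark, here is a minimal sketch. The function name, the five-question toy data, and the `PASS_THRESHOLD` constant are assumptions made for this example; they are not taken from the paper's evaluation code.

```python
def accuracy(predictions, answers):
    """Fraction of multiple-choice questions answered correctly."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Passing standard quoted in the article (60%); treated here as a constant.
PASS_THRESHOLD = 0.60

# Hypothetical toy data: five questions with answer options A-D.
answers = ["A", "C", "B", "D", "A"]
predictions = ["A", "C", "B", "A", "A"]  # the fourth answer is wrong

score = accuracy(predictions, answers)
passed = score >= PASS_THRESHOLD
print(f"accuracy={score:.1%}, passed={passed}")
```

With the toy data above, four of the five answers match, so the score clears the 60% mark. Real benchmark scoring also has to handle answer extraction from free-form model output, which this sketch leaves out.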

However, while Flan-PaLM performed well on multiple-choice questions, further evaluation revealed gaps in its ability to answer consumers’ medical questions.





To address this issue, the research team further adapted Flan-PaLM to the medical domain using a method called instruction prompt tuning. Instruction prompt tuning is an efficient way to adapt general large language models to new specialized domains.
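The general idea behind prompt tuning can be sketched as follows: the model's weights stay frozen, and only a small set of "soft prompt" vectors, prepended to the embedded instructions and examples, is learned. The toy below is an assumption-laden illustration in plain Python; the linear "frozen model", the vector sizes, and all names are hypothetical and do not reflect Med-PaLM's actual architecture or training code.

```python
# Toy illustration only: a "frozen model" that scores a sequence of
# embedding vectors. Nothing here reflects Med-PaLM's real architecture.

FROZEN_W = [0.5, -0.25, 1.0]  # frozen model weights, never updated


def frozen_model(vectors):
    """Score = dot(FROZEN_W, mean of the input vectors)."""
    n = len(vectors)
    mean = [sum(v[d] for v in vectors) / n for d in range(len(FROZEN_W))]
    return sum(w * m for w, m in zip(FROZEN_W, mean))


def tune_soft_prompt(soft_prompt, hard_prompt, target, lr=0.5, steps=200):
    """Gradient descent on the soft prompt only, minimising squared error."""
    soft = [list(v) for v in soft_prompt]  # copy; the input is not mutated
    for _ in range(steps):
        seq = soft + hard_prompt  # soft prompt is prepended to the hard prompt
        err = frozen_model(seq) - target
        n = len(seq)
        # d(err**2)/d(soft[i][d]) = 2 * err * FROZEN_W[d] / n
        for v in soft:
            for d in range(len(FROZEN_W)):
                v[d] -= lr * 2 * err * FROZEN_W[d] / n
    return soft


# Hypothetical embedded instruction and example tokens (the "hard prompt").
hard_prompt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
initial_soft = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

tuned_soft = tune_soft_prompt(initial_soft, hard_prompt, target=1.0)
final_score = frozen_model(tuned_soft + hard_prompt)
print(f"score before: {frozen_model(initial_soft + hard_prompt):.4f}")
print(f"score after:  {final_score:.4f}")
```

The point of the design is that only the handful of soft prompt values changes during training, which is why this family of methods needs far fewer trainable parameters and examples than retraining the whole model.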

The resulting model, Med-PaLM, performed encouragingly in pilot evaluations. For example, a panel of physicians judged only 61.9% of Flan-PaLM's long-form answers to be in agreement with the scientific consensus, compared with 92.6% of Med-PaLM's answers, which is on par with answers given by physicians (92.9%). Similarly, 29.7% of Flan-PaLM's answers were rated as likely to lead to harmful outcomes, compared with only 5.9% of Med-PaLM's answers, which is comparable to answers given by physicians (6.5%).






An upgraded version is available


It is worth mentioning that the Med-PaLM model described in the Nature paper was launched in December 2022. In May 2023, Google published a paper on a preprint platform and launched an upgraded version, Med-PaLM 2.



The paper shows that Med-PaLM 2 is the first large language model to achieve expert-level performance on US Medical Licensing Examination (USMLE)-style questions. It answers multiple-choice and open-ended questions correctly and gives reasoning for its answers, with an accuracy as high as 86.5%, greatly surpassing Med-PaLM and GPT-3.5.




Med-PaLM 2 was tested against 14 criteria, including scientific factuality, accuracy, medical consensus, reasoning, bias, and harm, and was assessed by clinicians and non-clinicians from a variety of backgrounds and countries.

The research team also found some gaps in the model’s ability to answer medical questions, although it did not specify what they were. Google said it will further develop and improve the model to address these gaps and to understand how large language models can improve healthcare.





Clinically tested at Mayo Clinic



According to reports, Med-PaLM 2 is currently undergoing preliminary trials at Mayo Clinic, one of the world’s top medical institutions.

According to Google, the model is especially useful in countries with “limited access to medical care.” The company also said that user data submitted during the Med-PaLM 2 trial would be encrypted, inaccessible to Google, and under the control of the users themselves.


In general, Med-PaLM is a powerful large language model specialized for medicine, and instruction prompt tuning is a data- and parameter-efficient alignment technique that can improve factors such as accuracy, factuality, consistency, safety, harm, and bias. This helps bridge the gap between models and clinical experts and brings these models closer to real-world clinical application.








