Medical Graph Knowledge Representation Probing
In this article, we will discuss the recent MedG-KRP paper (Rosenbaum et al., 2024).
Motivation
LLMs are being increasingly used for all sorts of patient management tasks, such as diagnosis and mortality prediction.
However, it is still not very clear how and why they can perform these tasks, as their knowledge is parametric, i.e., stored implicitly in the model's weights rather than in an explicit, inspectable form.
Also, medical question-answering benchmarks may show how often LLMs produce the right answer, but strong benchmark performance could reflect pattern recognition and memorization rather than true clinical reasoning.
How can LLM reasoning become more reliable, especially in settings where patient care is involved?
The model
The main idea of MedG-KRP is to get a grasp of LLM medical reasoning by visualizing and evaluating how a model understands the causal relations surrounding a given medical concept.
To visualize these relations, the models are probed to generate knowledge graphs (KGs). The intuition is that the relations captured in such a graph form the basis for the model's clinical reasoning about that concept.
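As a toy illustration (not from the paper), such a graph can be stored as a directed graph whose edges point from cause to effect; the concepts and edges below are made up:

```python
# Toy illustration (not from the paper): a causal knowledge graph around a
# root medical concept, stored as a directed graph where edge (u, v) means
# "u causes v". The specific concepts and edges here are made up.
import networkx as nx

kg = nx.DiGraph()
root = "myocardial infarction"

# Concepts a model might claim cause the root...
kg.add_edge("atherosclerosis", root)
kg.add_edge("hypertension", root)
# ...and concepts the root might cause.
kg.add_edge(root, "heart failure")
kg.add_edge(root, "cardiogenic shock")

print("claimed causes:", list(kg.predecessors(root)))
print("claimed effects:", list(kg.successors(root)))
```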
Methodology
Each LLM is probed to generate nodes and edges starting from a root medical concept, for 20 different diseases.
They use two algorithms to build each graph, node expansion and edge refinement (a minimal sketch of both follows the list):
- Node expansion: Starting from the root node, recursively prompt the LLM for concepts that either cause or are caused by each node already in the graph.
- Edge refinement: Check for additional causal connections the LLM infers between concepts already present in the graph.
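Below is a minimal sketch of how these two steps could look in code, assuming the LLM sits behind three hypothetical callables (`ask_causes`, `ask_effects`, `is_causal`) that parse its answers; these names, signatures, and the depth limit are my own, not the paper's.

```python
# Minimal sketch (not the authors' code) of the two graph-building steps.
import itertools
from typing import Callable

import networkx as nx


def build_kg(
    root: str,
    ask_causes: Callable[[str], list[str]],   # "what directly causes X?"
    ask_effects: Callable[[str], list[str]],  # "what does X directly cause?"
    is_causal: Callable[[str, str], bool],    # "does A directly cause B?"
    max_depth: int = 2,
) -> nx.DiGraph:
    kg = nx.DiGraph()
    kg.add_node(root)

    # Step 1: node expansion -- recursively query causes and effects of each
    # newly discovered concept, up to a fixed depth.
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for concept in frontier:
            for cause in ask_causes(concept):
                if cause not in kg:
                    next_frontier.append(cause)
                kg.add_edge(cause, concept)   # edge direction: cause -> effect
            for effect in ask_effects(concept):
                if effect not in kg:
                    next_frontier.append(effect)
                kg.add_edge(concept, effect)
        frontier = next_frontier

    # Step 2: edge refinement -- check every ordered pair of existing nodes
    # for causal links the expansion step may have missed.
    for a, b in itertools.permutations(list(kg.nodes), 2):
        if not kg.has_edge(a, b) and is_causal(a, b):
            kg.add_edge(a, b)

    return kg
```

Note that the refinement pass in this sketch queries every ordered pair of nodes, so its cost grows quadratically with graph size; the expansion therefore has to stay fairly shallow.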
Prompt engineering guides the LLMs to distinguish direct from indirect causality and to apply counterfactual reasoning (evaluating causal relationships through hypothetical scenarios).
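The exact prompts are not reproduced here, but an illustrative template (my own wording, not the paper's) for enforcing direct-only causality with a counterfactual check might look like this:

```python
# Illustrative prompt template (my own wording, not the paper's prompt).
CAUSE_PROMPT = """You are a careful medical expert.
List concepts that DIRECTLY cause "{concept}".
Rules:
- Include X only if X directly causes {concept}, with no intermediate step
  (if X causes Y and Y causes {concept}, list Y, not X).
- Counterfactual check: would {concept} be less likely to occur if X were
  absent, all else being equal? If not, do not list X.
Answer as a comma-separated list of concepts, nothing else.
"""

print(CAUSE_PROMPT.format(concept="myocardial infarction"))
```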
The paper tests three LLMs: GPT-4 (proprietary), Llama3-70b (open source), and PalmyraMed-70b (an open-source medical specialist), yielding 60 graphs in total (3 models × 20 diseases).
Results and discussion
MedG-KRP evaluates the graphs using two approaches.
The first is human annotation by medical students, who assess accuracy (the medical correctness of all concepts, relationships, and causal pathways) and comprehensiveness (coverage of the concepts relevant to understanding the disease).
The second is a ground-truth comparison, measuring the overlap between the generated KGs and BIOS, a validated biomedical knowledge graph.
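As a rough sketch of what such an overlap comparison could look like: once both graphs are mapped to a shared vocabulary (the real pipeline first has to link the LLM's free-text concepts to BIOS entities, which is the hard part), edge-level precision and recall reduce to set operations. The function and the toy edges below are illustrative, not taken from the paper.

```python
# Rough sketch of an edge-overlap comparison against a reference KG.
# Assumes both edge sets already use a shared vocabulary; edges are
# (cause, effect) pairs. All example data below is made up.

def edge_overlap(generated: set[tuple[str, str]],
                 reference: set[tuple[str, str]]) -> dict[str, float]:
    matched = generated & reference
    precision = len(matched) / len(generated) if generated else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return {"precision": precision, "recall": recall}


gen = {("atherosclerosis", "myocardial infarction"),
       ("myocardial infarction", "heart failure")}
ref = {("atherosclerosis", "myocardial infarction"),
       ("hypertension", "myocardial infarction")}
print(edge_overlap(gen, ref))  # {'precision': 0.5, 'recall': 0.5}
```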
In the human evaluation, GPT-4 performed best on both metrics. PalmyraMed generated more specific nodes, but it struggled to differentiate between direct and indirect causality and also produced hallucinated nodes.
In the ground-truth comparison, the situation reversed: PalmyraMed was the best, reproducing BIOS relations with the highest fidelity, likely because it was trained on similar resources. In contrast, GPT-4 had the worst match, a consequence of its broader causal reasoning.
Conclusion
Generalist models may lean more on broad public knowledge, while specialized models tend to align more closely with curated medical sources. This highlights the challenge of striking a balance between domain specificity and robust causal reasoning in medical AI.
Moreover, graphs generated by LLMs could further improve model interpretability and help expand current biomedical KGs. Future work could focus on training domain-specific models on causal inference data, refining prompting strategies (for example, chain-of-thought reasoning), and investigating how training data as a whole shapes LLM knowledge representation.