MSc Thesis: Activation Steering for Alignment of Medical Large Language Models

 

Background

Large language models (LLMs) are increasingly being used in medical contexts, from diagnostic assistance to patient communication. However, LLMs can exhibit unpredictable behavioral shifts that pose serious risks in healthcare settings - from overconfident medical advice to biased treatment recommendations.

Recent advances in mechanistic interpretability have shown that specific linear directions in the activations of LLM transformers correspond to distinct behavioral traits. Due to superposition, these traits are represented along directions that are largely orthogonal to those encoding the downstream task.

Activation steering techniques allow us to directly monitor and control the internal representations of these traits, offering a promising route to controlling LLM behavior, which is of utmost importance in sensitive domains like medicine.
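
The core idea can be sketched in a few lines of PyTorch. The toy sketch below uses a stand-in module in place of a real transformer block (in practice one would hook a residual-stream layer of an actual LLM); all names and the toy data are illustrative assumptions, not part of any specific library. A steering vector is extracted as the mean activation difference between contrastive prompt sets (as in Rimsky et al., 2023), then added back at inference time via a forward hook:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer block's residual stream.
# In practice this would be a layer of a real LLM (e.g. a HuggingFace model).
class ToyBlock(nn.Module):
    def __init__(self, d_model=16):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return torch.relu(self.proj(x))

block = ToyBlock()

# Contrastive extraction: mean activation difference between inputs that
# exhibit a trait and inputs that do not (here: synthetic stand-in data).
pos_inputs = torch.randn(32, 16) + 1.0   # "trait present" prompts
neg_inputs = torch.randn(32, 16) - 1.0   # "trait absent" prompts

with torch.no_grad():
    steering_vec = block(pos_inputs).mean(0) - block(neg_inputs).mean(0)
    steering_vec = steering_vec / steering_vec.norm()  # unit-norm direction

def make_steering_hook(vec, alpha):
    # Shift the block's output activations by alpha * vec at inference time.
    def hook(module, inputs, output):
        return output + alpha * vec
    return hook

handle = block.register_forward_hook(make_steering_hook(steering_vec, alpha=4.0))

x = torch.randn(1, 16)
with torch.no_grad():
    steered = block(x)
handle.remove()
with torch.no_grad():
    unsteered = block(x)

# The steered output is shifted exactly along the extracted direction.
shift = (steered - unsteered).squeeze(0)
print(torch.allclose(shift, 4.0 * steering_vec, atol=1e-5))  # True
```

The same direction can also be used for monitoring: projecting activations onto `steering_vec` gives a scalar trait score per input, which is the basis for drift detection during deployment.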

This project aims to explore how activation steering can be applied to ensure medical AI systems remain safe, reliable, and appropriately calibrated throughout their deployment.

Your tasks

  • Literature review on activation steering methods and applications (especially in the medical domain).
  • Experimental Design: Develop evaluation frameworks for testing steering effectiveness across medical scenarios and edge cases, focusing on reducing sycophancy, hallucination, and bias in medical LLMs.
  • Steering Vector Extraction: Implement methods to identify activation patterns corresponding to medically relevant behaviors, demonstrating the presence or absence of these traits.
  • Safety Implementation: Create monitoring systems to detect problematic behavioral drift during model deployment and to intervene (adaptively) via activation steering.

What we offer

  • Opportunity to contribute to an exciting and impactful medical AI project, with the goal of publishing at a top-tier conference.
  • Close supervision and opportunities to collaborate with LLM and medical experts.
  • Access to computational resources, including a dedicated cluster with multiple A40 and A100 nodes, fast SSD-backed storage, and custom tools for efficient job scheduling and data handling.

Details

Duration: 6 months
Required Background:
  • Strong programming skills (Python/PyTorch), machine learning fundamentals, and an interest in healthcare applications (preferably experience with large language models, transformer architectures, or AI alignment research).
  • Strong ability to work independently and as part of a team, with an interest in interdisciplinary research.
  • Interest in large language models in a clinical setting.

Please send a short motivation, CV and a recent transcript of records to k.schwethelm@tum.de and johannes.kaiser@tum.de.

References

Chen, R., Arditi, A., Sleight, H., Evans, O. & Lindsey, J. Persona Vectors: Monitoring and Controlling Character Traits in Language Models. Preprint at https://doi.org/10.48550/arXiv.2507.21509 (2025).
(see related work section of paper above)
Code: https://github.com/safety-research/persona_vectors
Rimsky, N. et al. Steering Llama 2 via Contrastive Activation Addition. Preprint at https://arxiv.org/html/2312.06681v2 (2023).
Anthropic Blog Post: https://www.anthropic.com/research/evaluating-feature-steering

Contact
Kristian Schwethelm