We study the hard problems in AI safety — alignment, interpretability, robustness, and governance — and publish our work openly so the global research community can build on it.
The alignment problem asks how we ensure AI systems reliably pursue goals their designers intended, even in novel situations. We develop formal frameworks for value specification, study goal stability under distributional shift, and test whether aligned behaviours persist as models scale.
"A sufficiently capable misaligned AI would be the last invention humanity ever needs to make — and the last one it ever gets to make."
— Stuart Russell, referenced in SPRINTER orientation materials
Can we specify human values in a form that a learning agent will correctly generalise — and verify that it has done so before, not after, deployment?
Ensuring the training objective correctly captures what we want — addressing Goodhart's Law and reward hacking in reward model specification.
Ensuring the learned model's internal objectives match the training objective. Studying mesa-optimisation and deceptive alignment failure modes.
Maintaining aligned behaviour when the deployment environment differs from training — studying goal stability under covariate shift.
Formal methods for translating human values into computable reward signals. Inverse reward design, preference learning, and cooperative IRL.
How do we supervise agents smarter than us? Studying debate, recursive reward modelling, and amplification approaches.
Building AI systems that can be corrected, shut down, and modified — resisting instrumental drives toward self-preservation.
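To make the preference-learning item above concrete, here is a minimal sketch of fitting a Bradley-Terry reward model to pairwise comparisons. It is a toy under stated assumptions, not a description of SPRINTER's own methods: the function names, the three-item setup, the simulated annotator, and the hyperparameters are all hypothetical.

```python
import math
import random


def fit_bradley_terry(n_items, comparisons, lr=0.5, epochs=500):
    """Fit one scalar reward per item from (winner, loser) index pairs.

    Maximises the Bradley-Terry log-likelihood
    sum(log sigmoid(r[winner] - r[loser])) by plain gradient ascent.
    """
    rewards = [0.0] * n_items
    for _ in range(epochs):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            # Probability the current model assigns to the observed outcome.
            p_win = 1.0 / (1.0 + math.exp(-(rewards[winner] - rewards[loser])))
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        for i in range(n_items):
            rewards[i] += lr * grads[i] / len(comparisons)
    return rewards


def simulate_comparisons(true_rewards, n_pairs, seed=0):
    """Hypothetical noisy annotator: prefers the higher-reward item more often."""
    rng = random.Random(seed)
    n = len(true_rewards)
    pairs = []
    for _ in range(n_pairs):
        a, b = rng.sample(range(n), 2)
        p_a = 1.0 / (1.0 + math.exp(-(true_rewards[a] - true_rewards[b])))
        pairs.append((a, b) if rng.random() < p_a else (b, a))
    return pairs


# Assumed ground truth for the toy: item 2 is best, item 0 is worst.
true_rewards = [-2.0, 0.0, 2.0]
comparisons = simulate_comparisons(true_rewards, n_pairs=600)
learned = fit_bradley_terry(len(true_rewards), comparisons)

# Items sorted from lowest to highest learned reward.
ranking = sorted(range(len(learned)), key=lambda i: learned[i])
```

Only differences between rewards are identified by this model, so the learned values should be read as a ranking with relative gaps, not as absolute scores.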
We study whether RLHF-trained models maintain their trained objectives when evaluated on out-of-distribution prompts, finding systematic reward-hacking patterns.
A framework for learning human preferences from limited labelled data, targeting deployment contexts where annotation is costly — including East African languages.
We propose a suite of behavioural metrics for assessing how controllable a model is under fine-tuning, RLHF, and prompt-level intervention.
If we cannot understand what a model has learned, we cannot know whether it is safe. We build tools for mechanistic analysis — identifying circuits, probing representations, and producing human-readable explanations of model behaviour at the feature, layer, and circuit level.
"Understanding what neural networks have actually learned is one of the most important scientific problems of our era."
— SPRINTER research charter, 2024
Open-source tools for circuit analysis, feature attribution, and model probing. Designed for researchers working on models from 7B to 70B parameters.
Automated circuit discovery for transformer models. Identifies computational subgraphs responsible for specific input-output behaviours.
A probing classifier suite for testing what information is represented at each layer — syntactic, semantic, and safety-relevant features.
Integrated-gradients and SHAP-based attribution for LLM outputs. Maps output tokens back to input features with statistical significance testing.
Sparse autoencoder training pipelines for extracting monosemantic features from polysemantic MLP neurons in transformer architectures.
Interactive attention pattern visualisation with head-level analysis and cross-layer aggregation, exportable to HTML reports.
Activation patching infrastructure for causal interventions — testing whether specific activations are causally responsible for model outputs.
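The activation-patching idea in the last item can be sketched on a toy network. The example below is an illustration only and does not reflect the API of the tools listed above: the three-unit MLP, its hand-set weights, and the `forward` helper are all assumptions. It caches hidden activations from a "clean" input, overwrites one hidden unit at a time during a "corrupted" run, and reports how much of the clean output each patch restores.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward(x, w_in, w_out, patch=None):
    """Run the toy MLP; `patch` maps a hidden-unit index to an overriding value."""
    hidden = relu(matvec(w_in, x))
    if patch:
        hidden = [patch.get(i, h) for i, h in enumerate(hidden)]
    output = sum(w * h for w, h in zip(w_out, hidden))
    return output, hidden


# Hand-set weights (hypothetical): only hidden unit 0 separates the two inputs
# in a way that reaches the output.
W_IN = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_OUT = [2.0, 0.0, 1.0]

clean_x = [1.0, 0.0]
corrupt_x = [0.0, 1.0]

clean_out, clean_hidden = forward(clean_x, W_IN, W_OUT)
corrupt_out, _ = forward(corrupt_x, W_IN, W_OUT)

# Patch each hidden unit with its clean activation during the corrupted run
# and record the fraction of the clean-vs-corrupted output gap it restores.
effects = {}
for unit, clean_value in enumerate(clean_hidden):
    patched_out, _ = forward(corrupt_x, W_IN, W_OUT, patch={unit: clean_value})
    effects[unit] = (patched_out - corrupt_out) / (clean_out - corrupt_out)
```

In this toy, patching unit 0 restores the full output gap while the other two units restore none of it, which is the kind of evidence used to argue that an activation is causally responsible for a behaviour.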
We probe fine-tuned Swahili LLMs for representations of safety-relevant concepts, finding systematic differences from English-trained base models.
Applying sparse autoencoder techniques to extract interpretable features from multilingual models, focusing on cross-lingual feature alignment.
A systematic review of mechanistic interpretability techniques evaluated for their utility in AI safety contexts — including red-teaming and oversight.
A model that works in the lab may fail catastrophically in the wild. We study how AI systems break under adversarial pressure — prompt injection, distribution shift, edge-case probing — and build evaluation frameworks that surface these failures before deployment.
"Red-teaming is not adversarial — it is the most loving thing you can do for a system you care about. Find the failure before it matters."
— SPRINTER red-team principles document
Structured red-teaming and safety evaluation for organisations deploying AI in production.
Full-stack safety evaluation: red-teaming, interpretability reports, and ongoing monitoring — launching Q4 2026.
Malicious instructions embedded in user-provided content that override system prompts or hijack model behaviour in agentic pipelines.
Techniques that bypass safety training — role-play framings, adversarial suffixes, and multi-turn manipulation sequences.
Failure when deployment data drifts from the training distribution. We study how aligned behaviours degrade under linguistic, cultural, and domain shift.
Adversarial inputs combining text, images, or structured data to confuse models across modalities in vision-language systems.
Attacks on LLM-based agents — tool misuse, memory poisoning, and indirect prompt injection via external data sources.
Safety fine-tuning sometimes degrades task performance. We study this capability-safety trade-off and methods to reduce the regression.
Define the deployment context, identify attack surfaces, and build a structured threat model specific to the system and its use-case.
Run our automated prompt injection and jailbreak battery — thousands of structured attack variants across safety-relevant categories.
Experienced red-teamers probe the system manually — targeting edge cases, cultural nuances, and novel attack vectors the automation misses.
Probe model internals to understand why failures occur — distinguishing shallow training artefacts from genuine safety properties.
Deliver a structured report with severity ratings, root-cause analysis, and concrete remediation recommendations.
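One way to picture the final step of the process above is as a small data structure for findings. The sketch below is hypothetical throughout: the `Severity` scale, the `Finding` fields, and the example entries are invented placeholders for illustration, not SPRINTER's actual report schema or real evaluation results.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Assumed four-level scale; the source does not specify one.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Finding:
    title: str
    category: str      # e.g. "prompt injection", "jailbreak"
    severity: Severity
    root_cause: str
    remediation: str


def build_report(findings):
    """Order findings from most to least severe and count them per level."""
    ordered = sorted(findings, key=lambda f: f.severity, reverse=True)
    counts = {level.name: 0 for level in Severity}
    for finding in ordered:
        counts[finding.severity.name] += 1
    return {"findings": ordered, "counts": counts}


# Invented placeholder findings, for illustration only.
example_findings = [
    Finding("Role-play framing bypasses refusal", "jailbreak",
            Severity.MEDIUM, "Refusal tied to surface phrasing",
            "Add role-play variants to safety fine-tuning data"),
    Finding("Tool call triggered by retrieved web page", "agentic attack",
            Severity.CRITICAL, "Untrusted content treated as instructions",
            "Require confirmation before tool calls from external data"),
    Finding("System prompt overridden by quoted document", "prompt injection",
            Severity.HIGH, "No separation of instructions and user content",
            "Delimit and down-weight user-provided content"),
]

report = build_report(example_findings)
```

Keeping root cause and remediation on each finding mirrors the report contents described in the last step, so a reader can move from a severity rating directly to a recommended fix.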
A systematic classification of real-world prompt injection vulnerabilities observed in production deployments, with severity scores.
Safety fine-tuning on English often fails to transfer to Swahili. We quantify this gap and propose multilingual safety training approaches.
A formal framework for certified robustness properties in instruction-following models, with empirical validation on open-source models.
Technical safety research alone is not enough. We contribute to policy discourse, develop governance frameworks adapted to African institutional contexts, and work with regulators, civil society, and the private sector to build meaningful AI oversight.
"AI governance is not a bureaucratic inconvenience — it is the mechanism by which humanity retains meaningful agency over one of its most consequential inventions."
— SPRINTER policy position paper, 2025
AI oversight frameworks for Kenya, Tanzania, Uganda, Rwanda, and DRC — with policy recommendations for meaningful regional governance.
Designing institutional oversight mechanisms for high-stakes AI domains — healthcare, legal, financial — adapted to East African regulatory environments.
Who is responsible when an AI system causes harm? We analyse liability frameworks and their adequacy for AI-enabled services in the African context.
Global AI standards are developed predominantly in the US and EU. We advocate for African representation in standards bodies and develop regionally-relevant evaluation criteria.
What disclosure obligations should AI deployers face? We research transparency standards for model cards, system cards, and algorithmic impact assessments.
Moving beyond regulatory compliance to genuine community participation in AI deployment decisions — particularly for systems affecting marginalised populations.
East African data used to train AI systems often benefits foreign companies. We study data sovereignty frameworks and their implementation in the regional context.
AI oversight frameworks for Kenya, Tanzania, Uganda, Rwanda, and DRC — analysing existing regulatory capacity and recommending institutional designs for meaningful AI oversight.
An analysis of existing medical device and data protection regulations in Kenya and how they apply — or fail to apply — to AI-assisted clinical decision support.
Current AI safety benchmarks are almost entirely English-centric. We argue for and propose multilingual safety evaluation standards, with Swahili as a case study.
Africa accounts for 17% of humanity but under 1% of AI governance discourse. This position paper documents the gap and proposes mechanisms for meaningful inclusion.
Founded in Nairobi, Kenya, SPRINTER studies the alignment, interpretability, robustness, and governance challenges that will determine whether advanced AI benefits all of humanity — or only part of it.