Research — SPRINTER AI Safety & Research Institute

Research Area 01

AI Alignment

The alignment problem asks how we ensure AI systems reliably pursue the goals their designers intended, even in novel situations. We develop formal frameworks for value specification, study goal stability under distributional shift, and test whether aligned behaviours persist as models scale.

"A sufficiently capable misaligned AI would be the last invention humanity ever needs to make — and the last one it ever gets to make."

— Stuart Russell, referenced in SPRINTER orientation materials

RLHF variants · Constitutional AI · Reward modelling · Debate & amplification · Formal verification · Value learning

Area at a glance

Active projects: 7

Papers in progress: 3

Target publication year: 2026

Problems remaining: ∞

Central research question

Can we specify human values in a form that a learning agent will correctly generalise — and verify that it has done so before, not after, deployment?

Open to collaboration

Academic research partners
Safety-focused AI labs
Funding bodies & grants
Graduate student researchers

Core Concepts

What alignment research covers

🎯

Outer Alignment

Ensuring the training objective correctly captures what we want — addressing Goodhart's Law and reward hacking in reward model specification.
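As a toy illustration of the Goodhart's Law failure described above, the sketch below scores a few hypothetical responses with a "true" utility and with a mis-specified proxy reward. Every name and number here (`Response`, `true_utility`, `proxy_reward`, the weights) is an assumption made up for illustration, not part of any SPRINTER codebase.

```python
# Toy illustration of Goodhart's Law: a proxy reward that looks reasonable on
# ordinary candidates stops tracking the true objective once it is maximised.
# All names and numbers are hypothetical.

from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    useful_facts: int   # what the designers actually care about
    filler_words: int   # padding that adds length but no value


def true_utility(r: Response) -> float:
    # Designers want useful content and mildly dislike padding.
    return r.useful_facts - 0.1 * r.filler_words


def proxy_reward(r: Response) -> float:
    # A mis-specified reward model that has learned "longer looks better".
    return r.useful_facts + 0.5 * r.filler_words


candidates = [
    Response(useful_facts=5, filler_words=0),
    Response(useful_facts=4, filler_words=2),
    Response(useful_facts=1, filler_words=40),  # reward-hacking candidate
]

best_by_proxy = max(candidates, key=proxy_reward)
best_by_truth = max(candidates, key=true_utility)

print("proxy picks:", best_by_proxy, "true utility:", true_utility(best_by_proxy))
print("truth picks:", best_by_truth, "true utility:", true_utility(best_by_truth))
```

Selecting on the proxy picks the padded response, whose true utility is the lowest of the three. That gap between what is rewarded and what is wanted is the outer alignment problem in miniature.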

🔄

Inner Alignment

Ensuring the learned model's internal objectives match the training objective. Studying mesa-optimisation and deceptive alignment failure modes.

🌍

Distributional Robustness

Maintaining aligned behaviour when the deployment environment differs from training — studying goal stability under covariate shift.

📐

Value Specification

Formal methods for translating human values into computable reward signals. Inverse reward design, preference learning, and cooperative IRL.

🔬

Scalable Oversight

How do we supervise agents smarter than us? Studying debate, recursive reward modelling, and amplification approaches.

⚡

Corrigibility

Building AI systems that can be corrected, shut down, and modified, and that do not resist such intervention out of instrumental drives toward self-preservation.

Publications

Alignment papers & working notes

ALIGNMENT · 2026

Goal Stability Under Distributional Shift in Reward-Trained Agents

We study whether RLHF-trained models maintain their trained objectives when evaluated on out-of-distribution prompts, finding systematic reward-hacking patterns.

R. Oduor, S. Kamau

ALIGNMENT · 2025

Value Learning from Sparse Human Feedback: A Framework for Low-Resource Settings

A framework for learning human preferences from limited labelled data, targeting deployment contexts where annotation is costly, such as East African languages.

R. Oduor

ALIGNMENT · 2026

Corrigibility Metrics: Towards Quantitative Measures of AI Controllability

We propose a suite of behavioural metrics for assessing how controllable a model is under fine-tuning, RLHF, and prompt-level intervention.

R. Oduor, I. Ondabu

Research Area 02

Model Interpretability

If we cannot understand what a model has learned, we cannot know whether it is safe. We build tools for mechanistic analysis — identifying circuits, probing representations, and producing human-readable explanations of model behaviour at the feature, layer, and circuit level.

"Understanding what neural networks have actually learned is one of the most important scientific problems of our era."

— SPRINTER research charter, 2024

Mechanistic analysis · Activation patching · Sparse autoencoders · Probing classifiers · Attention visualisation · Circuit-level tracing

Area at a glance

Active tools in development: 5

Open-source releases planned: 2

Toolkit launch: Q2 2026

Collaborating labs: 3

Q2 2026 · Upcoming release

SPRINTER Interpretability Toolkit

Open-source tools for circuit analysis, feature attribution, and model probing. Designed for researchers working on models from 7B to 70B parameters.

Toolkit Components

Tools we're building & releasing

CircuitTrace

In development

Automated circuit discovery for transformer models. Identifies computational subgraphs responsible for specific input-output behaviours.

LayerProbe

Alpha

A probing classifier suite for testing what information is represented at each layer — syntactic, semantic, and safety-relevant features.
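To make the idea of a probing classifier concrete, here is a minimal sketch that is not the LayerProbe API. It trains a logistic-regression probe on synthetic "activations" from two hypothetical layers, one that encodes a binary feature linearly and one that does not, and compares held-out accuracy. The data generator, layer names, and hyperparameters are all assumptions chosen for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the toy data is reproducible


def make_activations(labels, signal):
    # Hypothetical 4-dimensional "activations": Gaussian noise, plus a shift
    # along the first two coordinates whose sign depends on the label.
    rows = []
    for y in labels:
        shift = signal * (1 if y == 1 else -1)
        rows.append([random.gauss(0, 1) + shift,
                     random.gauss(0, 1) + shift,
                     random.gauss(0, 1),
                     random.gauss(0, 1)])
    return rows


def train_probe(xs, ys, epochs=30, lr=0.05):
    # Logistic-regression probe trained with plain stochastic gradient descent.
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))          # avoid overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - y  # d(log loss)/dz
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def accuracy(probe, xs, ys):
    w, b = probe
    hits = sum((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
               for x, y in zip(xs, ys))
    return hits / len(ys)


labels = [i % 2 for i in range(400)]
layers = {
    "early": make_activations(labels, signal=0.0),  # feature not encoded
    "late": make_activations(labels, signal=2.0),   # feature linearly encoded
}
results = {}
for name, acts in layers.items():
    probe = train_probe(acts[:300], labels[:300])          # first 300 for training
    results[name] = accuracy(probe, acts[300:], labels[300:])  # last 100 held out

print(results)
```

In this toy setup the probe should score near chance on the "early" layer and near perfect on the "late" one. With real models, high probe accuracy shows only that the information is linearly decodable at that layer, not that the model uses it.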

FeatureAttr

In development

Integrated-gradients and SHAP-based attribution for LLM outputs. Maps output tokens back to input features with statistical significance testing.
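The integrated-gradients part of this description can be sketched on a tiny differentiable function. This is a hedged illustration, not FeatureAttr itself: `toy_model`, the all-zeros baseline, and the finite-difference gradient are assumptions standing in for a real model and autodiff, and the SHAP and significance-testing pieces are not shown.

```python
def numerical_gradient(f, x, h=1e-5):
    # Central finite differences; a real pipeline would use autodiff instead.
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    # Average the gradient along the straight path from baseline to x
    # (midpoint Riemann sum), then scale by the input difference.
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def toy_model(x):
    # Hypothetical scalar "model output": one interaction term, one linear term.
    return x[0] * x[1] + 3.0 * x[2]


x = [2.0, 3.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)

# Completeness: attributions sum to f(x) - f(baseline), up to numerical error.
gap = abs(sum(attributions) - (toy_model(x) - toy_model(baseline)))
print(attributions, gap)
```

The completeness check at the end is a useful sanity test for any integrated-gradients implementation: if the attributions do not sum to the output difference, the path integral is under-resolved or the gradient is wrong.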

SparseAE

Research

Sparse autoencoder training pipelines for extracting monosemantic features from polysemantic MLP neurons in transformer architectures.
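As a rough sketch of the objective such a pipeline typically minimises, the snippet below computes reconstruction error plus an L1 sparsity penalty for one input vector. It is an assumption-laden illustration rather than the SparseAE code: the weights are hand-picked, the dimensions are tiny, and no training loop is shown.

```python
def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, v) for v in vec]


def sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    # Encode into an overcomplete hidden layer, decode back, and penalise
    # both reconstruction error and the total hidden activation.
    hidden = relu([a + b for a, b in zip(matvec(w_enc, x), b_enc)])
    recon = [a + b for a, b in zip(matvec(w_dec, hidden), b_dec)]
    mse = sum((r - xi) ** 2 for r, xi in zip(recon, x)) / len(x)
    sparsity = sum(hidden)  # equals the L1 norm because ReLU outputs are >= 0
    return mse + l1_coeff * sparsity, hidden, recon


# Hypothetical 2-dimensional input with a 3-unit (overcomplete) hidden layer.
x = [1.0, 0.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

loss, hidden, recon = sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=0.01)
print(loss, hidden, recon)
```

Here only one hidden unit fires and the input is reconstructed exactly, so the loss is just the sparsity term. Raising `l1_coeff` trades reconstruction quality for fewer active features.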

AttentionViz

Alpha

Interactive attention pattern visualisation with head-level analysis and cross-layer aggregation, exportable to HTML reports.

ActPatch

In development

Activation patching infrastructure for causal interventions — testing whether specific activations are causally responsible for model outputs.
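The causal-intervention idea can be shown on a two-unit toy network. This is a minimal sketch under stated assumptions, not the ActPatch interface: `hidden_layer`, `readout`, `run`, and the clean and corrupted inputs are all hypothetical.

```python
def hidden_layer(x):
    # Two hypothetical hidden units with ReLU activations.
    return [max(0.0, x[0] + x[1]), max(0.0, x[0] - x[1])]


def readout(h):
    return 3.0 * h[0] + 0.5 * h[1]


def run(x, patch=None):
    # Forward pass; `patch` maps hidden-unit index -> activation to overwrite.
    h = hidden_layer(x)
    for idx, value in (patch or {}).items():
        h[idx] = value
    return readout(h)


clean_input = [1.0, 1.0]
corrupted_input = [0.0, -1.0]

clean_hidden = hidden_layer(clean_input)   # activations cached from the clean run
corrupted_output = run(corrupted_input)

# Patch one clean activation at a time into the corrupted run and record how
# far the output moves relative to the unpatched corrupted run.
effects = [
    run(corrupted_input, patch={i: clean_hidden[i]}) - corrupted_output
    for i in range(len(clean_hidden))
]
print(effects)
```

A large effect for one unit and a small effect for the other suggests the first carries most of the causal responsibility for the difference between clean and corrupted outputs in this toy setting.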

Publications

Interpretability papers

INTERP · 2026

Mechanistic Analysis of Safety-Relevant Features in East African Language Models

We probe fine-tuned Swahili LLMs for representations of safety-relevant concepts, finding systematic differences from English-trained base models.

R. Oduor

INTERP · 2026

Sparse Autoencoders for Multilingual Feature Decomposition

Applying sparse autoencoder techniques to extract interpretable features from multilingual models, focusing on cross-lingual feature alignment.

R. Oduor, S. Kamau

SURVEY · 2025

A Survey of Interpretability Methods for Safety-Critical Applications

A systematic review of mechanistic interpretability techniques evaluated for their utility in AI safety contexts — including red-teaming and oversight.

R. Oduor

Research Area 03

Adversarial Robustness

A model that works in the lab may fail catastrophically in the wild. We study how AI systems break under adversarial pressure — prompt injection, distribution shift, edge-case probing — and build evaluation frameworks that surface these failures before deployment.

"Red-teaming is not adversarial — it is the most loving thing you can do for a system you care about. Find the failure before it matters."

— SPRINTER red-team principles document

Prompt injection testing · Adversarial suffixes · Jailbreak analysis · OOD generalisation · Certified defences · Behavioural red-teaming

Red-teaming services

Structured red-teaming and safety evaluation for organisations deploying AI in production.

Prompt injection audit
Distribution shift analysis
Behavioural red-team report
Ongoing monitoring setup

Q4 2026 · Upcoming product

Enterprise Safety Audit Suite

Full-stack safety evaluation: red-teaming, interpretability reports, and ongoing monitoring — launching Q4 2026.

Attack Taxonomy

Categories of failure we study

💉

Prompt Injection

Malicious instructions embedded in user-provided content that override system prompts or hijack model behaviour in agentic pipelines.

🎭

Jailbreaking

Techniques that bypass safety training — role-play framings, adversarial suffixes, and multi-turn manipulation sequences.

📦

Distributional Shift

Failure when deployment data drifts from training distribution — studying how aligned behaviours degrade under linguistic, cultural, and domain shift.

🔀

Multi-modal Attacks

Adversarial inputs combining text, images, or structured data to confuse models across modalities in vision-language systems.

🤖

Agent Exploitation

Attacks on LLM-based agents — tool misuse, memory poisoning, and indirect prompt injection via external data sources.

📉

Capability Regression

Safety fine-tuning sometimes degrades performance. We study the capability-safety trade-off and methods to reduce regression.

Our evaluation approach

How we test AI systems

Step 01

Scope & threat model

Define the deployment context, identify attack surfaces, and build a structured threat model specific to the system and its use-case.

Step 02

Automated red-teaming

Run our automated prompt injection and jailbreak battery — thousands of structured attack variants across safety-relevant categories.

Step 03

Human red-team exercises

Experienced red-teamers probe the system manually — targeting edge cases, cultural nuances, and novel attack vectors the automation misses.

Step 04

Interpretability audit

Probe model internals to understand why failures occur — distinguishing shallow training artefacts from genuine safety properties.

Step 05

Report & remediation

Deliver a structured report with severity ratings, root-cause analysis, and concrete remediation recommendations.

Publications

Robustness papers

ROBUSTNESS · 2026 · In progress

Prompt Injection Vectors in Production LLM Applications: A Taxonomy

A systematic classification of real-world prompt injection vulnerabilities observed in production deployments, with severity scores.

ROBUSTNESS · 2025 · Preprint

Cross-lingual Robustness of Safety Training: Evidence from Swahili

Safety fine-tuning on English often fails to transfer to Swahili. We quantify this gap and propose multilingual safety training approaches.

ROBUSTNESS · 2026 · In progress

Certified Defences Against Adversarial Suffixes for Instruction-Tuned LLMs

A formal framework for certified robustness properties in instruction-following models, with empirical validation on open-source models.

Research Area 04

AI Policy & Governance

Technical safety research alone is not enough. We contribute to policy discourse, develop governance frameworks adapted to African institutional contexts, and work with regulators, civil society, and the private sector to build meaningful AI oversight.

"AI governance is not a bureaucratic inconvenience — it is the mechanism by which humanity retains meaningful agency over one of its most consequential inventions."

— SPRINTER policy position paper, 2025

Policy analysis · Regulatory mapping · Stakeholder engagement · Standards development · Impact assessment

Q3 2026 · Upcoming publication

East Africa AI Governance White Paper

AI oversight frameworks for Kenya, Tanzania, Uganda, Rwanda, and the DRC, with policy recommendations for meaningful regional governance.

Policy engagements

Communications Authority of Kenya · Regulator engagement

Kenya ICT Authority · Policy consultation

AU Digital Transformation Strategy · Continental alignment

East Africa Law Society · Legal framework review

Policy Focus Areas

Governance issues we are working on

AI Oversight Frameworks

Designing institutional oversight mechanisms for high-stakes AI domains — healthcare, legal, financial — adapted to East African regulatory environments.

Liability & Accountability

Who is responsible when an AI system causes harm? We analyse liability frameworks and their adequacy for AI-enabled services in the African context.

Inclusive AI Standards

Global AI standards are developed predominantly in the US and EU. We advocate for African representation in standards bodies and develop regionally relevant evaluation criteria.

Transparency Requirements

What disclosure obligations should AI deployers face? We research transparency standards for model cards, system cards, and algorithmic impact assessments.

Participatory AI Governance

Moving beyond regulatory compliance to genuine community participation in AI deployment decisions — particularly for systems affecting marginalised populations.

Data Sovereignty

East African data used to train AI systems often benefits foreign companies. We study data sovereignty frameworks and their implementation in the regional context.

Policy Outputs

Papers, white papers & policy briefs

WHITE PAPER · Q3 2026 · In progress

East Africa AI Governance White Paper

AI oversight frameworks for Kenya, Tanzania, Uganda, Rwanda, and the DRC, analysing existing regulatory capacity and recommending institutional designs for meaningful AI oversight.

POLICY BRIEF · 2025 · Preprint

Large Language Models in Kenyan Healthcare: Regulatory Gaps and Recommendations

An analysis of existing medical device and data protection regulations in Kenya and how they apply — or fail to apply — to AI-assisted clinical decision support.

RESEARCH PAPER · 2026 · In progress

AI Safety Standards for Low-Resource Language Contexts

Current AI safety benchmarks are almost entirely English-centric. We argue for and propose multilingual safety evaluation standards, with Swahili as a case study.

POSITION PAPER · 2025 · Published

African Representation in Global AI Governance: A Call to Action

Africa accounts for 17% of humanity but under 1% of AI governance discourse. This position paper documents the gap and proposes mechanisms for meaningful inclusion.

An AI safety and research institute
