Research Approach SwahiliAI Work Roadmap Contact
SPRINTER · Research Division

Research &
Publications

We study the hard problems in AI safety — alignment, interpretability, robustness, and governance — and publish our work openly so the global research community can build on it.

Collaborate View docs
Alignment Interpretability Robustness Governance
Research Area 01

AI Alignment

The alignment problem asks how we ensure AI systems reliably pursue goals their designers intended, even in novel situations. We develop formal frameworks for value specification, study goal stability under distributional shift, and test whether aligned behaviours persist as models scale.

"A sufficiently capable misaligned AI would be the last invention humanity ever needs to make — and the last one it ever gets to make."

— Stuart Russell, referenced in SPRINTER orientation materials
RLHF variants Constitutional AI Reward modelling Debate & amplification Formal verification Value learning
Area at a glance
Active projects 7
Papers in progress 3
Target publication year 2026
Problems remaining
Central research question

Can we specify human values in a form that a learning agent will correctly generalise — and verify that it has done so before, not after, deployment?

Open to collaboration
  • Academic research partners
  • Safety-focused AI labs
  • Funding bodies & grants
  • Graduate student researchers
Core Concepts

What alignment
research covers

🎯

Outer Alignment

Ensuring the training objective correctly captures what we want — addressing Goodhart's Law and reward hacking in reward model specification.

🔄

Inner Alignment

Ensuring the learned model's internal objectives match the training objective. Studying mesa-optimisation and deceptive alignment failure modes.

🌍

Distributional Robustness

Maintaining aligned behaviour when the deployment environment differs from training — studying goal stability under covariate shift.

📐

Value Specification

Formal methods for translating human values into computable reward signals. Inverse reward design, preference learning, and cooperative IRL.

🔬

Scalable Oversight

How do we supervise agents smarter than us? Studying debate, recursive reward modelling, and amplification approaches.

Corrigibility

Building AI systems that can be corrected, shut down, and modified — resisting instrumental drives toward self-preservation.

Publications

Alignment papers
& working notes

ALIGNMENT2026

Goal Stability Under Distributional Shift in Reward-Trained Agents

We study whether RLHF-trained models maintain their trained objectives when evaluated on out-of-distribution prompts, finding systematic reward-hacking patterns.

R. Oduor, S. Kamau
ALIGNMENT2025

Value Learning from Sparse Human Feedback: A Framework for Low-Resource Settings

A framework for learning human preferences from limited labelled data, targeting deployment contexts where annotation is costly — including East African languages.

R. Oduor
ALIGNMENT2026

Corrigibility Metrics: Towards Quantitative Measures of AI Controllability

We propose a suite of behavioural metrics for assessing how controllable a model is under fine-tuning, RLHF, and prompt-level intervention.

R. Oduor, I. Ondabu
Research Area 02

Model Interpretability

If we cannot understand what a model has learned, we cannot know whether it is safe. We build tools for mechanistic analysis — identifying circuits, probing representations, and producing human-readable explanations of model behaviour at the feature, layer, and circuit level.

"Understanding what neural networks have actually learned is one of the most important scientific problems of our era."

— SPRINTER research charter, 2024
Mechanistic analysis Activation patching Sparse autoencoders Probing classifiers Attention visualisation Circuit-level tracing
Area at a glance
Active tools in dev 5
Open-source releases planned 2
Toolkit launch Q2 2026
Collaborating labs 3
Q2 2026 · Upcoming release

SPRINTER Interpretability Toolkit

Open-source tools for circuit analysis, feature attribution, and model probing. Designed for researchers working on models from 7B to 70B parameters.

Toolkit Components

Tools we're
building & releasing

CircuitTrace

In development

Automated circuit discovery for transformer models. Identifies computational subgraphs responsible for specific input-output behaviours.

LayerProbe

Alpha

A probing classifier suite for testing what information is represented at each layer — syntactic, semantic, and safety-relevant features.

FeatureAttr

In development

Integrated-gradients and SHAP-based attribution for LLM outputs. Maps output tokens back to input features with statistical significance testing.

SparseAE

Research

Sparse autoencoder training pipelines for extracting monosemantic features from polysemantic MLP neurons in transformer architectures.

AttentionViz

Alpha

Interactive attention pattern visualisation with head-level analysis and cross-layer aggregation, exportable to HTML reports.

ActPatch

In development

Activation patching infrastructure for causal interventions — testing whether specific activations are causally responsible for model outputs.

Publications

Interpretability papers

INTERP2026

Mechanistic Analysis of Safety-Relevant Features in East African Language Models

We probe fine-tuned Swahili LLMs for representations of safety-relevant concepts, finding systematic differences from English-trained base models.

R. Oduor
INTERP2026

Sparse Autoencoders for Multilingual Feature Decomposition

Applying sparse autoencoder techniques to extract interpretable features from multilingual models, focusing on cross-lingual feature alignment.

R. Oduor, S. Kamau
SURVEY2025

A Survey of Interpretability Methods for Safety-Critical Applications

A systematic review of mechanistic interpretability techniques evaluated for their utility in AI safety contexts — including red-teaming and oversight.

R. Oduor
Research Area 03

Adversarial Robustness

A model that works in the lab may fail catastrophically in the wild. We study how AI systems break under adversarial pressure — prompt injection, distribution shift, edge-case probing — and build evaluation frameworks that surface these failures before deployment.

"Red-teaming is not adversarial — it is the most loving thing you can do for a system you care about. Find the failure before it matters."

— SPRINTER red-team principles document
Prompt injection testing Adversarial suffixes Jailbreak analysis OOD generalisation Certified defences Behavioural red-teaming
Red-teaming services

Structured red-teaming and safety evaluation for organisations deploying AI in production.

  • Prompt injection audit
  • Distribution shift analysis
  • Behavioural red-team report
  • Ongoing monitoring setup
Request an evaluation
Q4 2026 · Upcoming product

Enterprise Safety Audit Suite

Full-stack safety evaluation: red-teaming, interpretability reports, and ongoing monitoring — launching Q4 2026.

Attack Taxonomy

Categories of
failure we study

💉

Prompt Injection

Malicious instructions embedded in user-provided content that override system prompts or hijack model behaviour in agentic pipelines.

🎭

Jailbreaking

Techniques that bypass safety training — role-play framings, adversarial suffixes, and multi-turn manipulation sequences.

📦

Distributional Shift

Failure when deployment data drifts from training distribution — studying how aligned behaviours degrade under linguistic, cultural, and domain shift.

🔀

Multi-modal Attacks

Adversarial inputs combining text, images, or structured data to confuse models across modalities in vision-language systems.

🤖

Agent Exploitation

Attacks on LLM-based agents — tool misuse, memory poisoning, and indirect prompt injection via external data sources.

📉

Capability Regression

Safety fine-tuning sometimes degrades performance. We study the capability-safety trade-off and methods to reduce regression.

Our evaluation approach

How we test
AI systems

Step 01

Scope & threat model

Define the deployment context, identify attack surfaces, and build a structured threat model specific to the system and its use-case.

Step 02

Automated red-teaming

Run our automated prompt injection and jailbreak battery — thousands of structured attack variants across safety-relevant categories.

Step 03

Human red-team exercises

Experienced red-teamers probe the system manually — targeting edge cases, cultural nuances, and novel attack vectors the automation misses.

Step 04

Interpretability audit

Probe model internals to understand why failures occur — distinguishing shallow training artefacts from genuine safety properties.

Step 05

Report & remediation

Deliver a structured report with severity ratings, root-cause analysis, and concrete remediation recommendations.

Publications

Robustness papers

ROBUSTNESS
2026 In progress

Prompt Injection Vectors in Production LLM Applications: A Taxonomy

A systematic classification of real-world prompt injection vulnerabilities observed in production deployments, with severity scores.

Read paper →
ROBUSTNESS
2025 Preprint

Cross-lingual Robustness of Safety Training: Evidence from Swahili

Safety fine-tuning on English often fails to transfer to Swahili. We quantify this gap and propose multilingual safety training approaches.

Read paper →
ROBUSTNESS
2026 In progress

Certified Defences Against Adversarial Suffixes for Instruction-Tuned LLMs

A formal framework for certified robustness properties in instruction-following models, with empirical validation on open-source models.

Read paper →
Research Area 04

AI Policy & Governance

Technical safety research alone is not enough. We contribute to policy discourse, develop governance frameworks adapted to African institutional contexts, and work with regulators, civil society, and the private sector to build meaningful AI oversight.

"AI governance is not a bureaucratic inconvenience — it is the mechanism by which humanity retains meaningful agency over one of its most consequential inventions."

— SPRINTER policy position paper, 2025
Policy analysis Regulatory mapping Stakeholder engagement Standards development Impact assessment
Q3 2026 · Upcoming publication

East Africa AI Governance White Paper

AI oversight frameworks for Kenya, Tanzania, Uganda, Rwanda, and DRC — with policy recommendations for meaningful regional governance.

Express interest in review
Policy engagements
Communications Authority of Kenya Regulator engagement
Kenya ICT Authority Policy consultation
AU Digital Transformation Strategy Continental alignment
East Africa Law Society Legal framework review
Policy Focus Areas

Governance issues
we are working on

AI Oversight Frameworks

Designing institutional oversight mechanisms for high-stakes AI domains — healthcare, legal, financial — adapted to East African regulatory environments.

Liability & Accountability

Who is responsible when an AI system causes harm? We analyse liability frameworks and their adequacy for AI-enabled services in the African context.

Inclusive AI Standards

Global AI standards are developed predominantly in the US and EU. We advocate for African representation in standards bodies and develop regionally-relevant evaluation criteria.

Transparency Requirements

What disclosure obligations should AI deployers face? We research transparency standards for model cards, system cards, and algorithmic impact assessments.

Participatory AI Governance

Moving beyond regulatory compliance to genuine community participation in AI deployment decisions — particularly for systems affecting marginalised populations.

Data Sovereignty

East African data used to train AI systems often benefits foreign companies. We study data sovereignty frameworks and their implementation in the regional context.

Policy Outputs

Papers, white papers
& policy briefs

WHITE PAPER
Q3 2026 In progress

East Africa AI Governance White Paper

AI oversight frameworks for Kenya, Tanzania, Uganda, Rwanda, and DRC — analysing existing regulatory capacity and recommending institutional designs for meaningful AI oversight.

Read paper →
POLICY BRIEF
2025 Preprint

Large Language Models in Kenyan Healthcare: Regulatory Gaps and Recommendations

An analysis of existing medical device and data protection regulations in Kenya and how they apply — or fail to apply — to AI-assisted clinical decision support.

Read paper →
RESEARCH PAPER
2026 In progress

AI Safety Standards for Low-Resource Language Contexts

Current AI safety benchmarks are almost entirely English-centric. We argue for and propose multilingual safety evaluation standards, with Swahili as a case study.

Read paper →
POSITION PAPER
2025 Published

African Representation in Global AI Governance: A Call to Action

Africa accounts for 17% of humanity but under 1% of AI governance discourse. This position paper documents the gap and proposes mechanisms for meaningful inclusion.

Read paper →
About SPRINTER

An AI safety and research institute

Founded in Nairobi, Kenya, SPRINTER studies the alignment, interpretability, robustness, and governance challenges that will determine whether advanced AI benefits all of humanity — or only part of it.

Read our story →
Company
About
Who we are and what we stand for
Work
Projects and clients we have served
Careers
Open roles and how to apply
Blog
Research notes, essays, and updates
Resources
Documentation
Technical docs and API reference
Papers
All published and forthcoming research
Contact
Get in touch with the team
BASED IN NAIROBI
airesearch@sprinter.co.ke
+254 704 445 453