Enterprise AI Analysis: Probing for Representation Manifolds in Superposition

This paper introduces the Manifold Probe, a supervised method for discovering representation manifolds in superposition. The method generalizes linear regression probes by learning the space of features of a concept that can be linearly predicted from the representations, and then learning the directions used to encode them. We demonstrate the probe on representations of time and space in Llama 2-7b, finding manifolds which linearly represent an interpretable set of features in each case. In the case of time, we show that by steering along the manifold, we can influence the model's completions about the years in which famous songs, movies and books were released, providing evidence that the Manifold Probe can discover manifolds which are causally involved in model behaviour.

Executive Impact Summary

The Manifold Probe is a supervised method for uncovering continuous, multi-dimensional concept representations inside large language models; the paper demonstrates it on Llama 2-7b. Unlike a standard linear probe, which tracks a single fixed quantity such as a year, it learns both which features of a concept can be linearly predicted from the representations and the residual-stream directions along which those features are encoded. This exposes the geometry of the representation manifold rather than a single direction. The probe also supports causal steering: moving activations along a discovered manifold changed the model's completions about the release years of songs, movies and books. These results make the Manifold Probe a useful tool for mechanistic interpretability and for building more controllable AI systems.

0.7676 Time Manifold R² (Feature 1)
0.35 Peak Steering Efficacy (probability)
7B Parameters (Llama 2)
46,884 Data Points Analyzed

Deep Analysis & Enterprise Applications

The Manifold Probe identified a multi-dimensional representation of time in Llama 2-7b's residual stream. Different decades are linearly separable on this manifold, giving a fine-grained view of how the model encodes temporal information. The learned features show how the model internally represents the release dates of songs, movies, and books.

Manifold Probe Methodology

  • Learn the space of features f(z)
  • Optimize the regression parameters
  • Learn the encoding directions u_k, c_k
  • Discover the representation manifolds
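The two learning stages above can be illustrated with a minimal sketch. It is built on our own assumptions and is not the paper's implementation. The candidate feature basis (a linear term plus Fourier terms with a hypothetical 100-year period), the R² selection threshold, and the toy activations are all hypothetical. Stage 1 replaces the probe's learned feature space with a simpler screen: it keeps the candidate features that a ridge-regularized linear probe predicts well. Stage 2 regresses the activation on the kept features to recover one encoding direction u_k per feature. The per-dimension offset that stage 2 also fits is our own addition and is not meant to correspond to c_k.

```python
import math

# HYPOTHETICAL setup: an assumed 100-year period and a small candidate basis f(z).
PERIOD = 100.0
CANDIDATES = {
    "linear": lambda z: z,
    "cos": lambda z: math.cos(2 * math.pi * z / PERIOD),
    "sin": lambda z: math.sin(2 * math.pi * z / PERIOD),
    "cos2": lambda z: math.cos(4 * math.pi * z / PERIOD),
}


def solve(a, b):
    """Solve the square system a w = b by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(inputs, targets, lam=1e-6):
    """Ridge-regularized least squares: targets ~ w . [inputs, 1]."""
    rows = [list(v) + [1.0] for v in inputs]
    d = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    return solve(gram, rhs)


def predict(w, v):
    return sum(wi * vi for wi, vi in zip(w, list(v) + [1.0]))


def r_squared(targets, preds):
    mean = sum(targets) / len(targets)
    ss_tot = sum((t - mean) ** 2 for t in targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    return 1.0 - ss_res / ss_tot


def select_features(xs, zs, threshold=0.9):
    """Stage 1 (simplified screen): keep candidates a linear probe on x predicts well."""
    scores = {}
    for name, f in CANDIDATES.items():
        targets = [f(z) for z in zs]
        w = fit_linear(xs, targets)
        scores[name] = r_squared(targets, [predict(w, x) for x in xs])
    return [n for n, s in scores.items() if s >= threshold], scores


def learn_directions(xs, zs, names):
    """Stage 2: regress each coordinate of x on the kept features to get u_k."""
    feats = [[CANDIDATES[n](z) for n in names] for z in zs]
    per_dim = [fit_linear(feats, [x[i] for x in xs]) for i in range(len(xs[0]))]
    # per_dim[i] = [u_1[i], ..., u_K[i], offset[i]]; transpose into one vector per feature.
    return {n: [w[k] for w in per_dim] for k, n in enumerate(names)}


# Toy "activations": cos/sin features written along HYPOTHETICAL directions U1, U2,
# plus an unrelated distractor coordinate. Real inputs would be residual-stream vectors.
U1 = [1.0, 0.0, 0.5, 0.0, 0.0]
U2 = [0.0, 1.0, 0.0, -0.5, 0.0]


def toy_activation(z):
    f1, f2 = CANDIDATES["cos"](z), CANDIDATES["sin"](z)
    x = [f1 * a + f2 * b for a, b in zip(U1, U2)]
    x[4] = 0.1 * math.sin(0.37 * z)
    return x


years = list(range(1900, 2000))
xs = [toy_activation(z) for z in years]
selected, scores = select_features(xs, years)
directions = learn_directions(xs, years, selected)
print(selected)  # ['cos', 'sin']
```

On this toy data, the linear "year" feature is only partly predictable, with R² around 0.6, while the cos/sin pair is recovered along with its directions. With real activations, R² should be measured on held-out examples. Here it is computed on the training points to keep the sketch short.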
0.7676 Highest R² for Time Feature 1 in Llama 2-7b (Layer 16)
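The page does not define these scores. Assuming they are the standard coefficient of determination for a linear probe, the R² for feature k is computed over n examples from the activation x_i, the concept value z_i, and the probe's (hypothetical) weights w_k and offset b_k:

```latex
R^2_k = 1 - \frac{\sum_{i=1}^{n} \bigl(f_k(z_i) - \hat{f}_k(x_i)\bigr)^2}{\sum_{i=1}^{n} \bigl(f_k(z_i) - \bar{f}_k\bigr)^2},
\qquad \hat{f}_k(x) = w_k^\top x + b_k,
\qquad \bar{f}_k = \frac{1}{n}\sum_{i=1}^{n} f_k(z_i)
```

Under that reading, an R² of 0.7676 means the linear probe explains about 77% of the variance of time Feature 1 across examples.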

Causal Steering of Time Beliefs in Llama 2-7b

We tested whether the discovered time manifold is causally involved in the model's behavior by steering Llama 2-7b's internal belief about release dates. Adding steering vectors that trace the time manifold shifted the model toward completing prompts with chosen target years. Interventions were most effective around layers 8 and 14. At those layers they could steer predictions to within a 2-year window of the target. This is evidence that the manifold directly shapes the model's temporal predictions.

Key Results:

  • Steered model completions to target release years.
  • Peak efficacy observed at specific intermediate layers (8 and 14).
  • Evidence that the discovered manifold is causally involved in model behavior.
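The following is a minimal sketch of steering along a manifold, built on stated assumptions. The features f(z), the directions U, and the "within ±2 years" scoring helper are hypothetical stand-ins. The sketch does not reproduce the paper's choice of layer, scale, or prompts. The steering vector is the displacement along the manifold between a source year and a target year, Σ_k (f_k(target) − f_k(source)) u_k, and it is added to a single activation.

```python
import math

# HYPOTHETICAL time features f(z) (cos/sin with an assumed 100-year period),
# standing in for features a Manifold Probe might learn.
PERIOD = 100.0


def features(z):
    angle = 2 * math.pi * z / PERIOD
    return [math.cos(angle), math.sin(angle)]


def encode(directions, z):
    """Idealized toy activation x(z) = sum_k f_k(z) u_k (no offset, no noise)."""
    f = features(z)
    return [sum(f[k] * u[i] for k, u in enumerate(directions))
            for i in range(len(directions[0]))]


def steering_vector(directions, z_source, z_target, scale=1.0):
    """Displacement along the manifold: scale * sum_k (f_k(target) - f_k(source)) u_k."""
    fs, ft = features(z_source), features(z_target)
    return [scale * sum((ft[k] - fs[k]) * u[i] for k, u in enumerate(directions))
            for i in range(len(directions[0]))]


def steer(activation, vector):
    """Add the steering vector to one activation (the hook layer is left unspecified)."""
    return [a + v for a, v in zip(activation, vector)]


def within_window_rate(predicted_years, target_year, window=2):
    """Fraction of (hypothetically parsed) completion years within +/- window of target."""
    hits = sum(1 for y in predicted_years if abs(y - target_year) <= window)
    return hits / len(predicted_years)


U = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]  # hypothetical directions u_1, u_2
x_steered = steer(encode(U, 1960), steering_vector(U, 1960, 1985))
gap = max(abs(a - b) for a, b in zip(x_steered, encode(U, 1985)))
print(gap < 1e-9)  # True
print(within_window_rate([1984, 1990, 1986, 1970], target_year=1985))  # 0.5
```

In this idealized toy, the activation is exactly Σ_k f_k(z) u_k, so adding the vector lands exactly on the target year's point. In a real model, the other components of the activation and the processing in later layers mean the effect is only partial.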

Beyond time, the Manifold Probe revealed detailed 'space' manifolds in Llama 2-7b that represent the geographic coordinates of U.S. places. Many U.S. states are linearly separable on these manifolds, indicating that the model maintains a structured internal representation of spatial information. The discovered features are interpretable and can be linearly predicted from the model's representations.

0.6835 Highest R² for Space Feature 2 in Llama 2-7b (Layer 16)
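For a two-dimensional concept such as location, a probe needs a candidate space of features to search over. The sketch below shows one hypothetical choice that is not taken from the paper. It combines raw latitude and longitude with the place's Cartesian position on a unit sphere, which avoids the jump in longitude at ±180°. The function name and the feature set are our own.

```python
import math


def geo_candidate_features(lat_deg, lon_deg):
    """HYPOTHETICAL candidate features for a place's location: raw degrees plus
    its Cartesian position on a unit sphere (no jump at longitude +/-180)."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return {
        "lat": lat_deg,
        "lon": lon_deg,
        "sphere_x": math.cos(lat) * math.cos(lon),
        "sphere_y": math.cos(lat) * math.sin(lon),
        "sphere_z": math.sin(lat),
    }


feats = geo_candidate_features(40.0, -100.0)  # an arbitrary point in the central U.S.
norm = math.sqrt(feats["sphere_x"] ** 2 + feats["sphere_y"] ** 2 + feats["sphere_z"] ** 2)
print(round(norm, 12))  # 1.0
```

A probe could then score how well each candidate is linearly predicted from the activations.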

Manifold Probe vs. Standard Linear Regression

Capabilities compared:

  • Identifies multi-dimensional features
  • Discovers underlying manifold geometry
  • Reveals an interpretable feature space
  • Supports causal steering and intervention
  • Tracks simple, fixed features (e.g., year)
  • Achieves higher R² for complex concepts (e.g., time features)

Your Enterprise AI Implementation Roadmap

A structured approach to integrate advanced AI interpretability and steering capabilities into your organization, leveraging insights from cutting-edge research.

Phase 1: Discovery & Strategy

Identify critical business processes, define AI objectives, and assess data readiness for Manifold Probe application.

Phase 2: Data Preparation & Probing

Curate and prepare task-specific datasets, and apply the Manifold Probe to discover and map latent representations.

Phase 3: Analysis & Feature Engineering

Interpret discovered manifolds, apply factor analysis for feature interpretability, and engineer new model features.

Phase 4: Intervention & Optimization

Use steering vectors to causally influence model behavior and optimize for desired outcomes, while checking that interventions keep the model aligned.

Phase 5: Deployment & Monitoring

Integrate refined AI systems into operations, and continuously monitor for performance and unexpected behaviors.

Ready to Transform Your Business with AI?

Schedule a consultation with our AI experts to explore how these advanced interpretability and steering techniques can unlock new potential for your enterprise.
