# Latent Reasoning Sprint #3: Activation Difference Steering and Logit Lens
In my previous post I found evidence consistent with the scratchpad paper's compute/store alternation hypothesis: even steps showed higher intermediate-answer detection and odd steps showed higher entropy, matching the results in "Can we interpret latent reasoning using current mechanistic interpretability tools?". This post investigates activation steering applied to latent reasoning and examines the resulting performance changes.

## Quick Summary

- The tuned logit lens sometimes does not find the final answer to a prompt and instead finds a close approximation.
- The tuned logit lens does not seem to have a consistent layer or latent where the final answer is positioned.
- Tuned logit lens variants, such as one trained only on latent 3, still place "therefore" only on odd latents.
- Activation steering with the average difference between latent vectors did not increase accuracy for any specific latent pair combination; it instead closely matched the random-vector patching results from "Can we interpret latent reasoning using current mechanistic interpretability tools?".
- Steering the KV cache can increase CODI's accuracy, while steering the hidden states does not seem to have a significant effect.

## Experimental setup

### CODI model

I use the publicly available CODI Llama 3.2 1B checkpoint from "Can we interpret latent reasoning using current mechanistic interpretability tools?".

### Tuned Logit Lens

My tuned logit lens implementation reuses the training code from "Eliciting Latent Predictions from Transformers with the Tuned Lens".

### Activation Steering

**Embedding steering.** Take the average hidden state at each latent position and use the difference between latent vectors A and B to steer the hidden states. Since CODI reads the KV values at the EoT token, steering a hidden state alone is not enough.
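The mean-difference steering vector can be sketched as follows. This is a minimal toy illustration with random tensors, not the actual CODI code; the helper names (`mean_difference_vector`, `steer`) are my own:

```python
import torch

def mean_difference_vector(states_a: torch.Tensor, states_b: torch.Tensor) -> torch.Tensor:
    """Steering direction: mean hidden state at latent A minus mean at latent B.

    states_a, states_b: (num_prompts, hidden_dim) hidden states collected at
    two latent positions across a set of prompts.
    """
    return states_a.mean(dim=0) - states_b.mean(dim=0)

def steer(hidden: torch.Tensor, direction: torch.Tensor, coeff: float) -> torch.Tensor:
    """Add the scaled steering direction to a hidden state."""
    return hidden + coeff * direction

# Toy example with random "latent" activations (hidden_dim = 8).
torch.manual_seed(0)
states_a = torch.randn(16, 8)  # hidden states at latent A over 16 prompts
states_b = torch.randn(16, 8)  # hidden states at latent B
direction = mean_difference_vector(states_a, states_b)
steered = steer(torch.zeros(8), direction, coeff=0.5)
```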
To get new KV values that contain the information from the steered vector, I steer latent 1, run CODI for one additional latent step, and then take the KV values of latent 2 and inspect the output.

**KV cache steering.** Steer the KV cache and add the steered KV cache directly onto the CODI model, i.e. add the average difference in KV values directly to `past_key_values`.

## Experiments

### Confirming Previous Assumptions

PROMPT = "Out of 600 employees in a company, 30% got promoted while 10% received bonus. How many employees did not get either a promotion or a bonus?"

Answer = 360

Tuned logit lens properties:

- The tuned lens approximates but does not always find the answer, e.g. producing 720 (360 × 2) and 350 (360 − 10) at latents 0 and 1.
- These approximate answers are not GSM8K artifacts, as neither number is among the most common answers in the dataset.
- The answers appearing at latents 3 and 5 in my previous post might be prompt-specific. This suggests the tuned lens might best be used as a way to see potential outputs.

*(Figures: default and tuned logit lens outputs.)*

The following is the answer frequency for the GSM8K data used to train the tuned logit lens.

This prompted me to revisit my previous results using a tuned logit lens trained only on latent 3. Notably, "therefore" still appears only on odd latents, even with this different prompt.

### Activation Difference (Steering Embeddings)

Across all coefficient values tested, steering was applied to latents 1–4, with one additional latent step run afterward to obtain updated KV values. The steered models consistently underperform the no-steering baseline until the later latents, where they match the performance of random-vector patching from "Can we interpret latent reasoning using current mechanistic interpretability tools?".
This might be because the steering acts the same as random-vector patching: the average difference vector may be too noisy to encode meaningful directional information.

### Activation Difference (Steering the KV Cache)

Unlike the previous method, which required another CODI pass to obtain new KV values, this method steers the KV values as they are being used at the EoT token to generate the answer.

The setup: take the mean activations of latents A and B, subtract them, and apply the difference with a coefficient. Latent A is the earlier latent vector, from which another latent vector B is subtracted.

Steering the KV values, unlike steering the hidden states, did change the accuracy at latent step 5. Most steering vectors performed worse than random-latent-vector activation patching, but some performed significantly better than the baseline.

**Coefficient (0.5):** The steered vectors that improved performance are A1−B2, A1−B5, A2−B3, A2−B4, A3−B5, and A4−B5 at coefficient 1. When steering with the difference between an earlier and a later latent vector, it is interesting that the combinations with latent 2 as latent A performed best.

**Coefficient (−1):** The negative coefficient flips A−B to B−A, so A1−B4, A1−B6, A4−B6, and A5−B6 can be read as B4−A1, B6−A1, B6−A4, and B6−A5. Latents steered with latent 6 minus an earlier latent (1, 4, or 5) show a significant increase in accuracy, and the differences between latents 1 and 6 and between latents 5 and 6 gave the highest increases.
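The KV-cache steering step can be sketched as below, assuming a Hugging Face-style `past_key_values` layout (a tuple of per-layer key/value tensors); the function name and toy shapes are mine, not the actual CODI implementation:

```python
import torch

def steer_kv_cache(past_key_values, kv_diff, coeff: float):
    """Add a scaled mean KV-difference to every layer's cached keys/values.

    past_key_values: tuple of (key, value) pairs, one per layer, each of
    shape (batch, heads, seq_len, head_dim).
    kv_diff: matching tuple of (key_diff, value_diff), e.g. the mean KV
    values at latent A minus the mean KV values at latent B.
    """
    steered = []
    for (k, v), (dk, dv) in zip(past_key_values, kv_diff):
        steered.append((k + coeff * dk, v + coeff * dv))
    return tuple(steered)

# Toy example: 2 layers, batch 1, 4 heads, 3 cached positions, head_dim 8.
torch.manual_seed(0)
cache = tuple((torch.randn(1, 4, 3, 8), torch.randn(1, 4, 3, 8)) for _ in range(2))
diff = tuple((torch.randn(1, 4, 3, 8), torch.randn(1, 4, 3, 8)) for _ in range(2))
new_cache = steer_kv_cache(cache, diff, coeff=-1.0)  # coeff=-1 flips A−B to B−A
```

A negative coefficient simply subtracts the difference, which is why the coefficient −1 results can be read with A and B swapped.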
- Accuracy for all steering setups decreases as the coefficient increases.
- No activation difference improves accuracy under both positive and negative coefficients.

*(Figure legends. Positive coefficients: A1−B2, A1−B5, A2−B3, A2−B4, A3−B5, A4−B5; negative coefficients: B4−A1, B6−A1, B6−A4, B6−A5; plus the baseline.)*

For negative coefficients, A1−B4, A1−B6, A4−B6, and A5−B6 performed better than the baseline after steering; a common pattern is that runs with a negative coefficient performed significantly better than the baseline at latent 5. With positive coefficients, A1−B2, A1−B5, A2−B3, A2−B4, A3−B5, and A4−B5 performed better than the baselines.

### Activation Difference (Logit Lens)

No clear pattern emerges from applying the logit lens to activation differences. The first image shows the default logit lens and the second the tuned logit lens; the y-axis is latent A, the x-axis is latent B, the activation difference is A − B, and the logit lens is applied to the difference of the mean activations of A and B at each layer of the model.

## Future Work

- Find a setup that makes activation steering work with CODI.
- Complete the thought-anchors work with CODI.
- Understand why certain activation differences for the KV cache increased accuracy.
- Use other methods, such as PCA, to investigate why activation steering worked on the KV cache but not on the hidden states.