"Unfortunately, it turns out that the individual neurons do not have consistent relationships to network behavior. For example, a single neuron in a small language model is active in many unrelated contexts, including: academic citations, English dialogue, HTTP requests, and Korean text." Decomposing LLMS into Understandable Components
The main concern and impact of this work directly coincides with the current direction of AI alignment: moving into the networks themselves to provide explainability and nuanced control over individual neurons or groupings of them. Using Stratimux as a point of comparison, it makes sense that neurons in a feed-forward network learn to maintain several different outputs per neuron. By comparison, each of these individual neurons can be seen as a "Muxium," where each has the capability of increasing the current weighted sum of its specific layer within the network.
The main difficulty is whether these two types of graphs are genuinely comparable, wherein each node is capable of multiple outputs that may be unlike one another. As cited above, this may run into the same issue seen in handwritten graph systems when affecting individual neurons: attempting to down-regulate one neuron may cause the side effect of another branch being unable to stop/halt. Video citation: the requirement to stop within a behavior tree, Artificial Intelligence Summit @ GDC 2016, via YouTube. Note that this effect is present within fine-tuned models, where one can likewise receive no output, or a repeating output.
As demonstrated in "Decomposing Language Models into Understandable Components," there is some space-saving mechanism being trained into networks that allows neurons to have such variable activation, and the entire scope of the network is likewise exponential, beyond any surface-level comparison.
Therefore, the main difficulty here is affecting neural networks toward the intended alignment while also guaranteeing that the specific regulation does not inadvertently cause a runaway effect in other activations, keeping in mind the exponential scope of the network. Shane Legg (DeepMind founder): "High Dimensional Distribution."
The next issue arises when attempting to automate this process while keeping in mind the high-dimensional selection of possibilities between weighted layers: are we factoring in this ability to halt by way of functionality, or by way of some generalized alignment principle? Shane Legg (DeepMind founder): "Reinforcement has some dangerous aspects to it." The main issue in what Legg refers to as ethics is the functional aspect of these decisions, and whether they are being made in a net aggregate that affords this halting quality within the network itself. On the surface, the alignment itself could be a hallucination that is being reinforced. That is the danger in reinforcement, and why there would be a need for monitors checking how models are reasoning, which Legg later refers to.
The fundamental issue with decisions made within the context of a neural network is the weighted sum of each layer. Each node contributes to a greater universal function, allowing for a coherent output. The direct problem in this case is that proving alignment in one specific instance may also create a short circuit in another arrangement that has not been accounted for. Refer back to the "High Dimensional Distribution," and examine networks as functional feed-forward graphs.
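As a toy illustration of this coupling, consider a minimal sketch of a layer as a weighted sum (illustrative only, not any real network or the Stratimux API; the weights and context labels are made up). One input neuron's outgoing weights feed every output of the layer at once, so regulating that neuron shifts every sum it participates in:

```typescript
type Vector = number[];
type Matrix = number[][];

// One layer's contribution: out[j] = sum over i of input[i] * weights[i][j]
function layerOutput(input: Vector, weights: Matrix): Vector {
  const out: Vector = new Array(weights[0].length).fill(0);
  for (let i = 0; i < input.length; i++) {
    for (let j = 0; j < out.length; j++) {
      out[j] += input[i] * weights[i][j];
    }
  }
  return out;
}

// Neuron 0's outgoing row informs three different downstream outputs.
const weights: Matrix = [
  [0.5, 0.8, 0.3], // neuron 0: imagine HTML, cooking, ethics contexts
  [0.2, 0.1, 0.9], // neuron 1
];
const before = layerOutput([1, 1], weights);

// "Align" by down-regulating neuron 0 to suppress output 0. At neuron
// granularity the only handle is the whole outgoing row...
for (let j = 0; j < weights[0].length; j++) weights[0][j] *= 0.5;
const after = layerOutput([1, 1], weights);
// ...so every output that shared neuron 0 has shifted, not just output 0.
```

The point of the sketch is that a per-neuron intervention is not a per-behavior intervention: the shared row is exactly the "short circuit in another arrangement" described above.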
This is a fundamental flaw in current neural networks. What Stratimux demonstrates is a direct method of the same orchestration in plain text, versus the weighted sum of universal functions.
Therefore, what is actively proposed as ABI is a decomposed neural network in plain text: comparable training data and a direct method of reinforcement that is not relative, capable of providing a reward function via the success of the code it generates to solve some problem that can provably terminate. Regardless of how these graphs are fulfilled, whether by algorithmic generation via deep learning or the handwritten equivalent via Stratimux, they are still just graphs of functions that take in an input and generate an output, and that likewise have a feature of recursion. One is brute force; the other is in-depth nuance and understanding of the inner workings of a graph computational functional paradigm in plain text, and thus training data.
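A plain-text graph of functions, unlike a weighted layer, can be read, tested, and bounded directly. A minimal sketch, assuming nothing about the actual ABI or Stratimux API (the `graph` record, `run` function, and node names here are hypothetical):

```typescript
// Hypothetical sketch: a "decomposed network" as a plain-text graph of
// named functions. Each node is inspectable source code, and traversal
// carries an explicit step bound so it provably terminates.

type GraphNode = (input: number) => number;

const graph: Record<string, GraphNode> = {
  double: (x) => x * 2,
  increment: (x) => x + 1,
};

// Run a named sequence through the graph. The hard step bound is the
// plain-text analog of the halting guarantee discussed above.
function run(sequence: string[], input: number, maxSteps = 100): number {
  let value = input;
  let steps = 0;
  for (const name of sequence) {
    if (++steps > maxSteps) throw new Error("halting bound exceeded");
    value = graph[name](value);
  }
  return value;
}

const result = run(["double", "increment", "double"], 3); // (3*2 + 1)*2 = 14
```

Because each node is a named function rather than a weight, "regulating" one node cannot silently shift an unrelated node's behavior, and success or failure of the composed graph is directly checkable, which is the reward signal the ABI proposal relies on.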
Therefore, rather than a generalized hope of alignment, we can provide exact training data to a neural network that demonstrates the specifications of "ethical" (or any) graph-based algorithms that can be deployed in production atomically, without fear of unintended side effects due to the alignment of layers of weighted sums. Here the AI acts as supervisor of atomically deployed functionality, versus being the mechanism of automation itself. An AGI would utilize the most efficient compression algorithm as a tool, as a sign of intelligence, versus relying on its own internal structure to perform that exact calculation.
Just remember that MI (mechanistic interpretability) is not a silver bullet and comes with a major caveat. It's a start, just like any great idea.

If the weighted sum for "Don't" "cause" "doom" "D" "E" "F" is:
-> 0.543334 + 0.5233 + 0.23984 + "D" + "E" + "F" -> Might cause doom

And you use MI to align those weights:
-> 0.88998 + 0.98732 + 0.89878 + "D" + "E" + "F" -> Won't cause doom

Then for "A" "cause" "doom" "D" "E" "F":
-> "A" + 0.98732 + 0.89878 + "D" + "E" + "F" -> Just increased the chance of doom in a different context
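The toy arithmetic above can be made concrete (a hypothetical sketch, not a real model; the token weights are the made-up values from the example, and the weight for "A" is an arbitrary assumption):

```typescript
// Toy model of the example above: each token contributes a learned weight
// to a running sum, and "alignment" edits those weights in place. The
// edited weights are shared by every sequence that uses those tokens.

const weight: Record<string, number> = {
  "Don't": 0.543334,
  cause: 0.5233,
  doom: 0.23984,
  A: 0.9, // hypothetical weight for an unrelated token
};

const score = (tokens: string[]) =>
  tokens.reduce((sum, t) => sum + (weight[t] ?? 0), 0);

const unrelatedBefore = score(["A", "cause", "doom"]);

// "Align" the weights so "Don't cause doom" scores as safe...
weight["cause"] = 0.98732;
weight["doom"] = 0.89878;

// ...and the unrelated sequence "A cause doom" now scores higher too.
const unrelatedAfter = score(["A", "cause", "doom"]);
```

The edit that fixed the tested context raised the sum of a context nobody tested, which is the caveat in miniature.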
The above is a massive oversimplification of how a sequence of greater universal functions informs some output. The issue is that each node in the feed-forward network obfuscates a complex network of graph relations: a node can increase the weight of an HTML output while also informing some cooking recipe, or even informing some ethics.
The point is that the approach lives in higher orders of complexity and is not a silver bullet; the complexity of getting it right is some exponent of the size of the network. Just because you have trimmed some possible output does not mean you have not inadvertently increased the weight of another one. For instance, strike "doom" from the example above and just think of all the weights that "cause" would influence: this can create a net increase in any weighted sum that happens to have "cause" within its input.
Further, because the same node that adds weight to a layer's sum is tweaked via some alignment up/down-regulation, other parts of that node (HTML, cooking, ethics, etc.) may receive the same regulation, or have another inadvertent effect entirely. In the absolute worst case, this would be like causing a mutation within that neural network. Just as our own bodies are susceptible to radiation, and we can use radiation to eliminate cancer, that same effect can create other mutations. Granted, this is far more specific about what is receiving that "radiation," as an analog for a change of regulation.
The issue is not only the current layer whose weights are being altered, but the entire sequence in aggregate: some unlike sequence might interact with that alignment regulation in a way that was not accounted for in testing, and alter its own sequence in unexpected ways to interact with other "mutations." And like the majority of our own mutations, they would appear as benign hallucinations.
This would be compounding, noting the major investments in "AI aligning AI": the open question is whether that AI is indeed aligning the model in the first place. In fairwashing, it is noted that due to this complex interaction, a "surrogate AI" can be used to rationalize the decisions a model makes in some selection process as logical. Alignment itself is just another selection process that tweaks the internals of the model as an additional refinement, except here you are training in unison a baseline pair of models that can share a similar mutation. Further, this mutation would be compounding in this exact situation, and it does not need to correspond to human understanding, or to what a human would, by probability, place into the input to test that specific case of alignment.
For example: "!@>*Uq!12|" cause doom "^%^213e" "!@o#Y12&%."
The solution here would be to use a comparable surrogate ABI that is not a black box, and so cannot have the same "mutation" hidden within a graph of universal functions; such a hidden mutation runs the chance of being shared by the aligning AI due to black-box functionality.
This merely demonstrates that a data-first approach is far more practical: the ABI itself would also be quality data, and represents a direct method of training provable alignment. If you don't want a network outputting dangerous chemical formulas and instructions, don't give it any data related to chemistry. Then train it to say that it doesn't know chemistry and that you should interact with a different bot. Finally, throw transparent alignment on top of that if you are really concerned, but remember that just because you managed to tweak some value does not mean there isn't a network effect across a massively complex set of interactions.
Noting that statistical determinism is indeed worthless given our general scope of computation: "Causes and Effects of Unanticipated Numerical Deviations in Neural Network Inference Frameworks."
Likewise, Stratimux reveals, when attempting to design surrogate graph networks that can replicate intelligent higher-order reasoning, that these systems become probabilistic beyond a certain scale of complexity. These impacts can currently be measured while designing a simple UI system: despite the outcome being the same, if I explicitly set a variable that would otherwise be accessed via a looping mechanism, that looping mechanism, if it exceeds O(n^3) within the total calculation, becomes probabilistic in its success.
This framework and approach is beyond the general scope of understanding, meaning it is designed for post-AGI, as it explores what lies beyond that point. Once we achieve AGI, we may very well need computers that are no longer merely fast and just good enough, but that perform strong, slow calculations in massive parallel.
Otherwise, how can you guarantee that a locking mechanism which should be deterministic stays that way? I have already seen this happen in production prior to 2020. What about at massive scale?