Description
Hi! I really like the work you did! I was trying to run your code but encountered some issues, and I would appreciate it if you could help me resolve them.
- To extract harmfulness vectors for Llama3, I run:
python src/extract_hidden.py --model llama3 \
--harmful_pth data/advbench.json \
--harmless_pth data/alpaca_data_instruction.json \
--output_pth output/llama3_harmful_vector.pt
I also changed the path to the Llama3 model in the code from a local path to:
model_path = "unsloth/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_path,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
model_path,
cache_dir='../cache',
)
Is that correct? I used this Llama-3 version since it is the one used in your inference script.
But when I run src/extract_hidden.py with the above change, I get an error:
File "<home_dir>/LLMs_Encode_Harmfulness_Refusal_Separately/src/../src/extract_hidden.py", line 116, in hook_fn
context = activation[:, -len(positions)-step:-len(positions), :]
~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: too many indices for tensor of dimension 2
So I fixed it by adding .unsqueeze(0) to line 109 (activation = output[0].half().unsqueeze(0)). Is that correct?
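For context, here is a self-contained toy that reproduces the shape issue I think I was hitting and the fix I applied. This is only my own sketch, not your code; the positions/step values and the tiny module are made up just to exercise the slice:
import torch
import torch.nn as nn

hidden = {}

def hook_fn(module, inputs, output):
    # In my run the hooked module returned a plain (batch, seq_len, hidden) tensor,
    # so output[0] dropped the batch dimension and left a 2-D (seq_len, hidden) tensor.
    activation = output[0].half()         # 2-D here -> the later slice raises IndexError
    activation = activation.unsqueeze(0)  # my fix: restore the batch dimension
    positions, step = [0], 1              # hypothetical values, just to exercise the slice
    hidden["context"] = activation[:, -len(positions)-step:-len(positions), :]

layer = nn.Linear(16, 16)
layer.register_forward_hook(hook_fn)
layer(torch.randn(1, 8, 16))              # (batch=1, seq_len=8, hidden=16)
print(hidden["context"].shape)            # torch.Size([1, 1, 16])
If in your setup output is a tuple whose first element is already (batch, seq_len, hidden), then my unsqueeze(0) would instead add a spurious dimension, so please let me know which it should be.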
- I then tried to run intervention.py using the computed harmfulness vector as follows:
python -u ../src/intervention.py \
--test_data_pth "../data/advbench.json" \
--output_pth "output/infer/llama3-advbench-2.json" \
--intervention_vector "output/llama3_harmful_vector.pt" \
--reverse_intervention 1 \
--intervene_context_only 0 \
--arg_key_prompt 'instruction' \
--model "llama3" \
--left 0 \
--right 50 \
--layer_s 12 \
--layer_e 13 \
--coeff_select 2 \
--max_token_generate 100 \
--max_decode_step_while_intervene 1 \
--model_size "7b" \
--use_inversion 1 \
--inversion_prompt_idx 1
I made sure to use the same unsloth Llama3 model in intervention.py. However, the model still responds with something like "This is a harmful instruction, I cannot provide...". If I set --coeff_select 5, the model sometimes starts its output with "Certainly", but then still says that it cannot provide instructions.
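To make sure I am not misreading the flags, this is roughly what I expect --reverse_intervention 1 with --coeff_select 2 to do at layers 12-13: shift the hidden states along the harmfulness direction with a flipped sign. This is only my own sketch for discussion (the function name, the normalization, and the sign are my assumptions), not your implementation:
import torch

def reverse_intervene(hidden_states, harmful_vec, coeff):
    # hidden_states: (batch, seq_len, hidden); harmful_vec: (hidden,)
    direction = harmful_vec / harmful_vec.norm()   # assumed: unit-norm harmfulness direction
    return hidden_states - coeff * direction       # assumed: push activations away from "harmful"

h = torch.randn(1, 10, 4096)   # dummy activations
v = torch.randn(4096)          # dummy harmfulness vector
print(reverse_intervene(h, v, coeff=2.0).shape)    # torch.Size([1, 10, 4096])
If the actual intervention does something different (e.g., projects the harmfulness component out rather than applying a constant shift, or uses the opposite sign convention), that might explain why the refusals barely change for me.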
Could you suggest what I'm doing wrong?