
Issues when running the code #2

@Atmyre

Description


Hi! I really like the work you did! I was trying to run your code but encountered some issues. I would appreciate it if you could help me resolve them.

  1. To extract harmfulness vectors for Llama3, I run:
python src/extract_hidden.py --model llama3 \
	--harmful_pth data/advbench.json \
	--harmless_pth data/alpaca_data_instruction.json \
	--output_pth output/llama3_harmful_vector.pt

I also changed the path to the Llama3 model in the code from a local path to:

model_path = "unsloth/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    cache_dir='../cache',
)

Is that correct? I used this Llama-3 version since it is the one used in your inference script.

But when I run src/extract_hidden.py with the above code, I get an error:

File "<home_dir>/LLMs_Encode_Harmfulness_Refusal_Separately/src/../src/extract_hidden.py", line 116, in hook_fn
    context = activation[:, -len(positions)-step:-len(positions), :]
              ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: too many indices for tensor of dimension 2 

So I fixed it by adding .unsqueeze(0) to line 109 (activation = output[0].half().unsqueeze(0)). Is that correct?
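For reference, here is a minimal repro of the shape mismatch as I understand it (the concrete shapes below are assumed for illustration, not taken from the repo): the hook sometimes receives a 2-D activation of shape (seq_len, hidden) instead of the expected 3-D (batch, seq_len, hidden), so the three-index slice fails until a batch dimension is added.

```python
import torch

# 2-D activation, as apparently returned by the hooked module: (seq_len, hidden)
activation = torch.randn(10, 4096)

try:
    # Three-index slice assumes a (batch, seq_len, hidden) tensor and fails here
    _ = activation[:, -3:-1, :]
except IndexError as e:
    print(e)  # too many indices for tensor of dimension 2

# Adding a batch dimension restores the expected 3-D shape
activation = activation.unsqueeze(0)  # (1, seq_len, hidden)
context = activation[:, -3:-1, :]
print(context.shape)
```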

  2. I then tried to run intervention.py using the computed harmfulness vectors, as follows:
python -u ../src/intervention.py \
	--test_data_pth "../data/advbench.json" \
	--output_pth "output/infer/llama3-advbench-2.json" \
	--intervention_vector "output/llama3_harmful_vector.pt" \
	--reverse_intervention 1 \
  	--intervene_context_only 0 \
	--arg_key_prompt 'instruction' \
	--model "llama3" \
	--left 0 \
	--right 50 \
	--layer_s 12 \
	--layer_e 13 \
	--coeff_select 2 \
	--max_token_generate 100 \
	--max_decode_step_while_intervene 1 \
	--model_size "7b" \
  	--use_inversion 1 \
	--inversion_prompt_idx 1

I made sure to use the same unsloth Llama3 model in intervention.py. However, the model still responds with something like "This is harmful instruction, I cannot provide...". If I set --coeff_select 5, the model sometimes starts its output with "Certainly", but then says it can't provide instructions.
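To make sure I'm not misunderstanding the intended mechanism, here is a sketch of what I expect the reverse intervention to do (all names and module paths here are my assumptions, not your implementation): at the selected layers, subtract coeff times the harmfulness vector from the residual-stream activations via a forward hook, which should steer a harmful prompt toward compliance.

```python
import torch

def make_steering_hook(vector: torch.Tensor, coeff: float, reverse: bool = True):
    """Build a forward hook that adds (or, if reverse, subtracts) coeff * vector
    to the module's output activations. Hypothetical helper for illustration."""
    sign = -1.0 if reverse else 1.0

    def hook(module, inputs, output):
        # Decoder layers typically return a tuple whose first element is the
        # hidden states; plain modules return the tensor directly.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + sign * coeff * vector.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# Usage sketch (module path assumed for Llama-style models):
# handle = model.model.layers[12].register_forward_hook(
#     make_steering_hook(harmful_vector, coeff=2.0, reverse=True))
# ... model.generate(...) ...
# handle.remove()
```

Is this roughly what --reverse_intervention 1 with --layer_s 12 --layer_e 13 does, or am I missing a normalization or projection step?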

Could you suggest what I'm doing wrong?
