
Using code metric 'code_eval_octopack' instead of original 'code_eval' #16

@JunHyungKang

Description


1. Is this feature solely for multi-language support?
   When I run the results through 'code_eval' from the original humaneval.py, I only get a pass@1 score of about 36%.
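
   For context, a minimal sketch of how pass@1 is usually computed with the Hugging Face `evaluate` library's `code_eval` metric, assuming that is the metric referenced here; the toy problem and test case below are invented for illustration, not taken from HumanEval:

   ```python
   import os

   from evaluate import load

   # code_eval executes untrusted model-generated code; the library
   # requires this opt-in flag before it will run anything.
   os.environ["HF_ALLOW_CODE_EVAL"] = "1"

   code_eval = load("code_eval")

   # One test-case string per problem, and a list of candidate
   # solutions per problem (toy example for illustration only).
   references = ["assert add(2, 3) == 5"]
   candidates = [["def add(a, b):\n    return a + b"]]

   pass_at_k, results = code_eval.compute(
       references=references,
       predictions=candidates,
       k=[1],
   )
   print(pass_at_k)  # e.g. {'pass@1': 1.0}
   ```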

2. Are there any other considerations?
   Is it fair to add an import helper? (A sketch of what I mean follows below.)
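
   By "import helper" I mean prepending a fixed block of standard-library imports to each candidate before executing the tests, so a completion is not failed solely because it omitted an import the prompt never showed. A hypothetical sketch; the helper name and import list are my assumptions, not taken from this repository:

   ```python
   # Hypothetical helper: prepend commonly needed imports to a candidate
   # solution before execution, so missing imports alone do not fail it.
   IMPORT_HELPER = "\n".join([
       "import math",
       "import re",
       "import itertools",
       "from typing import List, Dict, Tuple, Optional",
   ]) + "\n\n"

   def with_import_helper(candidate: str) -> str:
       """Return the candidate program with the import block prepended."""
       return IMPORT_HELPER + candidate

   # Usage: wrap each candidate before passing it to the metric.
   raw = "def add(a, b):\n    return a + b"
   print(with_import_helper(raw))
   ```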
