178 changes: 178 additions & 0 deletions PROJECT_ISSUES.md
@@ -0,0 +1,178 @@
# CodeGeeX Project Issues List

This document lists all identified issues in the CodeGeeX project.

## 1. Hardcoded Paths (High Priority) ✅ FIXED

### MindSpore Scripts
**Status: All hardcoded paths have been replaced with configurable options**

All hardcoded paths have been fixed:

- **`codegeex/mindspore/generation_values.py`** ✅
- Now uses `--output_path` argument (defaults to `./output`)
- Output file: `output_values.npy`

- **`codegeex/mindspore/generation_humaneval.py`** ✅
- Now uses `--input_path` for dataset (with smart fallback to repository-relative paths)
- Now uses `--output_path` for save directory (defaults to `./output`)
- Language parameter is now properly used (no longer hardcoded to C++)

- **`codegeex/mindspore/generation_finetune.py`** ✅
- Now uses `--output_path` argument (defaults to `./output`)

- **`codegeex/mindspore/generation_batch.py`** ✅
- Now uses `--output_path` argument (defaults to `./output`)

- **`codegeex/mindspore/src/dataset.py`** ✅
- Now uses `eval_data_url` argument (if provided)

- **`codegeex/mindspore/train.py`** ✅
- Cache paths now use `CODEGEEX_CACHE_BASE` environment variable (defaults to `/home/work/sfs/cache`)
- `BATCH_JOB_ID` now uses `.get()` with fallback

- **`codegeex/mindspore/scripts/run_modelarts*.py`** ✅
- Temp directory now uses `MODELARTS_TEMP_DIR` environment variable (defaults to `/home/work/sfs/xx`)
- Added file existence checks before copying

**New command-line arguments added:**
- `--output_path`: Output directory for generated files (default: `./output`)
- `--input_path`: Input path for data files (optional)

**New environment variables:**
- `CODEGEEX_CACHE_BASE`: Base directory for cache files (default: `/home/work/sfs/cache`)
- `MODELARTS_TEMP_DIR`: Temp directory for ModelArts scripts (default: `/home/work/sfs/xx`)
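The fallback pattern these variables rely on can be sketched as follows. This is an illustrative sketch, not the actual code from `train.py`; only the variable names `CODEGEEX_CACHE_BASE` and `BATCH_JOB_ID` come from the document, the rest is hypothetical.

```python
import os

# os.environ.get() returns the fallback instead of raising KeyError,
# which is exactly the fix described for BATCH_JOB_ID above.
cache_base = os.environ.get("CODEGEEX_CACHE_BASE", "/home/work/sfs/cache")
batch_job_id = os.environ.get("BATCH_JOB_ID", "local")

# Hypothetical use of the two values to build a per-job cache directory.
cache_dir = os.path.join(cache_base, batch_job_id)
```

Compare with the original `os.environ["BATCH_JOB_ID"]`, which raises `KeyError` whenever the variable is unset.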

## 2. Configuration Placeholders (Medium Priority)

Multiple configuration files contain placeholder values that must be set:

- **`configs/codegeex_13b.sh`**: `CHECKPOINT_PATH` placeholder
- **`configs/codegeex_13b_parallel.sh`**: `CHECKPOINT_PATH` placeholder
- **`configs/codegeex_13b_paddle.sh`**: `CHECKPOINT_PATH` placeholder
- **`scripts/pretrain_codegeex.sh`**:
- `HOSTFILE` placeholder
- `DATA_PATH` placeholder
- `CKPT_PATH` placeholder
- `OUTPUT_DIR` placeholder
- **`scripts/finetune_codegeex.sh`**: Same placeholders as pretrain script
- **`codegeex/mindspore/configs/*.sh`**: Multiple config files with `CODE_DATA_DIR` and `<TODO>` placeholders

## 3. TODO Comments / Incomplete Code (Medium Priority)

Multiple TODO comments indicate incomplete work:

- **`codegeex/mindspore/train.py`** (lines 214, 216):
- TODO: remove after warming-up!
- TODO: add them back if not for the 1st run!

- **`codegeex/mindspore/src/sat_dataset.py`** (line 81):
- TODO ARGS comment

- **`codegeex/mindspore/src/dataset.py`** (line 122):
- TODO: set as current validation set path

- **`codegeex/mindspore/generation_values_1p.py`** (line 166):
- TODO: add them back if not for the 1st run!

- **`codegeex/mindspore/finetune.py`** (lines 216, 218):
- TODO: remove after warming-up!
- TODO: add them back if not for the 1st run!

- **`codegeex/mindspore/generation_1p.py`** (line 166):
- TODO: add them back if not for the 1st run!

- **`codegeex/mindspore/convertion_1p.py`** (lines 154, 160, 180):
- Multiple TODOs for checkpoint names and paths

- All generation scripts have TODO comments for setting paths

## 4. Security Issues (High Priority)

- **`codegeex/benchmark/execution.py`** (line 347):
- Java execution code is commented out with security warning
- Warning states: "This program exists to execute untrusted model-generated code"
- Code execution should be sandboxed
  - Currently `exec_result` is `None`, but the code tries to access `.returncode`, which raises an `AttributeError`

- **`codegeex/benchmark/execution.py`** (lines 477-546):
- `reliability_guard()` function has explicit warning: "This function is NOT a security sandbox"
- Users should not blindly execute untrusted code
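As a partial mitigation in the spirit of these warnings, untrusted code can at least be run in a subprocess with a timeout. This is a hedged sketch, not the project's actual execution harness, and as the warning above says, a timeout is not a sandbox; real isolation needs containers, seccomp, or similar.

```python
import subprocess

def run_untrusted(cmd, timeout=10):
    # A timeout guards against hangs; it does NOT contain malicious code.
    # Returns None on timeout so callers must check before using the result.
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None

result = run_untrusted(["echo", "ok"])
```

Checking the return value for `None` before touching `.returncode` also avoids the `AttributeError` described in the Java execution path above.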

## 5. Known Bugs (Documented)

- **VS Code Extension** (mentioned in README):
  - Bug: moving the cursor before generation finishes may cause issues
- Location: `vscode-extension/README.md` and `README_zh.md`
- Status: Acknowledged, team working on making generation faster

## 6. Debug/Test Code Left in Production (Low Priority)

- **`scripts/evaluate_humaneval_x.py`** (lines 47-50):
- Hardcoded debug values left in code:
```python
#Debugging
INPUT_FILE='/home/rog0d/Escritorio/CodeGeeX/generations/humaneval_rust_generations.jsonl.gz'
LANGUAGE='rust'
```
- These override command-line arguments

## 7. Incomplete Implementation (Medium Priority)

- **`tests/test_inference_paddle.py`** (line 149):
- `raise NotImplementedError("quantize")` - quantization not implemented for Paddle backend

## 8. Path Construction Bug (Low Priority)

- **`scripts/evaluate_humaneval_x.py`** (line 44):
- Incorrect path join: `os.path.join(MAIN_DIR, "/codegeex/benchmark/humaneval-x/")`
- Leading slash makes it an absolute path, ignoring MAIN_DIR
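The pitfall is easy to demonstrate: `os.path.join` discards every earlier component when a later one is absolute. The `MAIN_DIR` value below is hypothetical; the joined path is the one from `evaluate_humaneval_x.py`.

```python
import os.path

MAIN_DIR = "/opt/CodeGeeX"  # placeholder value for illustration

# The leading slash makes the second argument absolute,
# so MAIN_DIR is silently dropped (on POSIX):
broken = os.path.join(MAIN_DIR, "/codegeex/benchmark/humaneval-x/")

# Without the leading slash the components are joined as intended:
fixed = os.path.join(MAIN_DIR, "codegeex", "benchmark", "humaneval-x")

print(broken)  # /codegeex/benchmark/humaneval-x/
print(fixed)   # /opt/CodeGeeX/codegeex/benchmark/humaneval-x
```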

## 9. Deprecated/Outdated Information

- **`README.md`** (line 12):
  - Notes that CodeGeeX4 has been released
  - This codebase may be considered a legacy version

## 10. Missing Error Handling

- **`codegeex/benchmark/execution.py`** (line 348):
- Code accesses `exec_result.returncode` when `exec_result` is `None` (line 336)
- Will cause `AttributeError` - Java execution path is broken
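A minimal sketch of the missing guard, assuming the control flow described above (`exec_result` stays `None` on the Java path); the variable names stand in for the real ones in `execution.py`:

```python
exec_result = None  # the Java path currently leaves this unset

# Guard before attribute access: `None and ...` short-circuits,
# so .returncode is never touched when exec_result is None.
if exec_result is not None and exec_result.returncode == 0:
    status = "passed"
else:
    status = "failed"
```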

## 11. Hardcoded CUDA Path (Low Priority)

- **`scripts/generate_humaneval_x.sh`** (line 13):
- `export CUDA_HOME=/usr/local/cuda-11.1/` - hardcoded CUDA version
- **`scripts/translate_humaneval_x.sh`** (line 14):
- Same hardcoded CUDA path

## 12. Configuration Dependencies

- Scripts require specific environment variables:
- `BATCH_JOB_ID` (used in train.py and scripts)
- Various NCCL environment variables
- Platform-specific paths for Ascend/MindSpore

## Summary by Priority

### Critical (Must Fix Before Production Use)
1. ~~Hardcoded paths in generation scripts~~ ✅ **FIXED**
2. Security issue: Java execution code broken (None.returncode error)
3. Configuration placeholders not set

### High Priority
4. Security warnings for code execution
5. Debug code left in evaluate script
6. Known VS Code extension cursor bug

### Medium Priority
7. Multiple TODO comments indicating incomplete work
8. Hardcoded CUDA paths
9. Path construction bug

### Low Priority
10. Missing quantization implementation for Paddle
11. Deprecated version notice (CodeGeeX4 available)
12. Platform-specific hardcoded paths

18 changes: 11 additions & 7 deletions codegeex/mindspore/generation_batch.py
@@ -226,15 +226,19 @@ def run_predict(model_predict, config, args_opt, rank):
generations = []
batch_size = config.batch_size
verbose = (rank % 8 == 0)
save_path = f'/home/work/sfs/xx/pangu_alpha_code/generation_batch/{args_opt.temperature}.txt' # TODO: set as current save path
save_dir = os.path.split(save_path)[0]

# Use configurable output path
output_dir = getattr(args_opt, 'output_path', './output')
save_dir = os.path.join(output_dir, 'generation_batch')
save_path = os.path.join(save_dir, f'temp_{args_opt.temperature}.txt')

if rank == 0:
if not os.path.exists(save_dir):
os.makedirs(save_dir)
os.makedirs(save_dir, exist_ok=True)
if not os.path.exists(save_path):
f = open(save_path, 'w')
f.close()
os.system(f'sudo chmod 777 -R {save_dir}')
with open(save_path, 'w') as f:
pass # Create empty file
if os.name != 'nt': # Only on Unix-like systems
os.system(f'chmod 777 -R {save_dir}')
batch = []
input_length = []
sample_ids = []
11 changes: 7 additions & 4 deletions codegeex/mindspore/generation_finetune.py
@@ -210,11 +210,14 @@ def run_predict(model_predict, config, args_opt, rank):
generations = []
batch_size = config.batch_size
verbose = (rank % 8 == 0)
save_path = f'/home/work/sfs/xx/pangu_alpha_code/generation_finetune/code_translation/{lang}/temp_{args_opt.temperature}.txt' # TODO: set as current save path
save_dir = os.path.split(save_path)[0]

# Use configurable output path
output_dir = getattr(args_opt, 'output_path', './output')
save_dir = os.path.join(output_dir, 'generation_finetune', 'code_translation', lang)
save_path = os.path.join(save_dir, f'temp_{args_opt.temperature}.txt')

if rank == 0:
if not os.path.exists(save_dir):
os.makedirs(save_dir)
os.makedirs(save_dir, exist_ok=True)
if not os.path.exists(save_path):
f = open(save_path, 'w')
f.close()
70 changes: 60 additions & 10 deletions codegeex/mindspore/generation_humaneval.py
@@ -15,6 +15,7 @@
"""
PanGu predict run
"""
import gzip
import json
import os
import time
@@ -198,25 +199,74 @@ def run_predict(model_predict, config, args_opt, rank):
# Define tokenizer
tokenizer = CodeTokenizer(mode='6b')

# Tokenize input sentence to ids
humaneval_path = '/home/work/sfs/xx/human_eval_x/data/humaneval_cpp.jsonl' # TODO: set as current humaneval path
humaneval = open(humaneval_path, 'r').readlines()
humaneval = [json.loads(task) for task in humaneval if len(task) != 0]
# Determine language (default to cpp for backward compatibility)
lang = getattr(args_opt, 'language', 'cpp') or 'cpp'
lang_lower = lang.lower()

# Language tag mapping
lang_tags = {
'cpp': '// language: C++\n',
'c++': '// language: C++\n',
'python': '# language: Python\n',
'java': '// language: Java\n',
'javascript': '// language: JavaScript\n',
'js': '// language: JavaScript\n',
'go': '// language: Go\n',
}
tag = lang_tags.get(lang_lower, f'// language: {lang}\n')

# Determine input path
if hasattr(args_opt, 'input_path') and args_opt.input_path:
humaneval_path = args_opt.input_path
else:
# Try relative path from current script location
script_dir = os.path.dirname(os.path.abspath(__file__))
repo_root = os.path.dirname(os.path.dirname(os.path.dirname(script_dir)))
default_path = os.path.join(repo_root, 'codegeex', 'benchmark', 'humaneval-x',
lang_lower, 'data', f'humaneval_{lang_lower}.jsonl.gz')
# Check if .gz file exists, otherwise try .jsonl
if os.path.exists(default_path):
humaneval_path = default_path
else:
default_path_jsonl = default_path.replace('.jsonl.gz', '.jsonl')
if os.path.exists(default_path_jsonl):
humaneval_path = default_path_jsonl
else:
# Fallback: use input_path or raise error
humaneval_path = default_path_jsonl
if rank == 0:
print(f"Warning: Default path {humaneval_path} does not exist. Please set --input_path")

# Open file (handle .gz files)
if humaneval_path.endswith('.gz'):
with gzip.open(humaneval_path, 'rt') as f:
humaneval = [json.loads(line) for line in f if line.strip()]
else:
with open(humaneval_path, 'r') as f:
humaneval = [json.loads(line) for line in f if line.strip()]

samples = [task['prompt'] for task in humaneval]
generations = []
batch_size = config.batch_size
verbose = (rank % 8 == 0)
part = int(args_opt.part)
part = int(args_opt.part) if args_opt.part else 0
gen_times = 12 # TODO: set as generation times of current task
print(f"gen times: {gen_times}, part: {part}")
save_path = f'/home/work/sfs/xx/pangu_alpha_code/generation_humanevalx/cpp/temp_{args_opt.temperature}/samples_{args_opt.load_ckpt_epoch}_part_{part}.jsonl' # TODO: set as current save path

# Determine output path
output_dir = getattr(args_opt, 'output_path', './output')
os.makedirs(output_dir, exist_ok=True)
save_path = os.path.join(output_dir,
f'humaneval_{lang_lower}_temp_{args_opt.temperature}_samples_{args_opt.load_ckpt_epoch}_part_{part}.jsonl')

if rank == 0 and not os.path.exists(save_path):
os.makedirs(os.path.split(save_path)[0], exist_ok=True)
f = open(save_path, 'w')
f.close()
os.system(f'sudo chmod 777 {save_path}')
with open(save_path, 'w') as f:
pass # Create empty file
if os.name != 'nt': # Only on Unix-like systems
os.system(f'chmod 777 {save_path}')

for i, sample in enumerate(samples):
tag = "// language: C++\n"
sample = tag + sample
if rank % 8 == 0:
print(f"=================== prompt {i} ====================")
12 changes: 9 additions & 3 deletions codegeex/mindspore/generation_values.py
@@ -199,9 +199,15 @@ def run_predict(model_predict, config, args_opt, rank):
init, batch_valid_length)
output = output_logits.asnumpy()
if rank == 0:
np.save("/home/work/sfs/xx/pangu_alpha_code/output_6_7375_8.13.npy", output) # TODO: set as current save path
os.system(
"chmod 777 /home/work/sfs/xx/pangu_alpha_code/output_6_7375_8.13.npy") # TODO: set as current save path
# Use configurable output path
output_dir = getattr(args_opt, 'output_path', './output')
os.makedirs(output_dir, exist_ok=True)
output_file = os.path.join(output_dir, "output_values.npy")
np.save(output_file, output)
# Only try to chmod if on Unix-like system
if os.name != 'nt':
os.system(f"chmod 777 {output_file}")
print(f"== Output saved to: {output_file}")
print("== Output shape: ", output.shape)


11 changes: 8 additions & 3 deletions codegeex/mindspore/scripts/run_modelarts.py
@@ -21,11 +21,16 @@

os.environ["LOG_PATH"] = tb_path

print("=================RANK_TABLE_FILE: ", os.environ["RANK_TABLE_FILE"], flush=True)
print("=================RANK_TABLE_FILE: ", os.environ.get("RANK_TABLE_FILE", "not set"), flush=True)
print("=================ms import done", flush=True)
time.sleep(10)
os.system(
"cp /home/work/rank_table/jobstart_hccl.json /home/work/sfs/xx; sudo chmod +777 /home/work/rank_table/jobstart_hccl.json")
# Use configurable temp directory (platform-specific for ModelArts)
temp_dir = os.environ.get("MODELARTS_TEMP_DIR", "/home/work/sfs/xx")
rank_table_source = "/home/work/rank_table/jobstart_hccl.json"
if os.path.exists(rank_table_source):
os.system(f"cp {rank_table_source} {temp_dir}; sudo chmod +777 {rank_table_source}")
else:
print(f"Warning: {rank_table_source} does not exist. Skipping copy.")
ret = os.system(f"cd {log_path} && bash {args.script} 2>&1 | tee output.log")
if os.environ.get("RANK_ID") == 0:
log_dir = os.path.join(args.work_dir, "logs", os.environ.get("JOB_ID"))
11 changes: 8 additions & 3 deletions codegeex/mindspore/scripts/run_modelarts_gen_finetune.py
@@ -26,11 +26,16 @@
else:
os.environ["LANGUAGE"] = "Null"

print("=================RANK_TABLE_FILE: ", os.environ["RANK_TABLE_FILE"], flush=True)
print("=================RANK_TABLE_FILE: ", os.environ.get("RANK_TABLE_FILE", "not set"), flush=True)
print("=================ms import done", flush=True)
time.sleep(10)
os.system(
"cp /home/work/rank_table/jobstart_hccl.json /home/work/sfs/xx; sudo chmod +777 /home/work/rank_table/jobstart_hccl.json")
# Use configurable temp directory (platform-specific for ModelArts)
temp_dir = os.environ.get("MODELARTS_TEMP_DIR", "/home/work/sfs/xx")
rank_table_source = "/home/work/rank_table/jobstart_hccl.json"
if os.path.exists(rank_table_source):
os.system(f"cp {rank_table_source} {temp_dir}; sudo chmod +777 {rank_table_source}")
else:
print(f"Warning: {rank_table_source} does not exist. Skipping copy.")
ret = os.system(f"cd {log_path} && bash {args.script} 2>&1 | tee output.log")
if os.environ.get("RANK_ID") == 0:
log_dir = os.path.join(args.work_dir, "logs", os.environ.get("JOB_ID"))
11 changes: 8 additions & 3 deletions codegeex/mindspore/scripts/run_modelarts_gen_humaneval_x.py
@@ -26,11 +26,16 @@
else:
os.environ["PART"] = "-1"

print("=================RANK_TABLE_FILE: ", os.environ["RANK_TABLE_FILE"], flush=True)
print("=================RANK_TABLE_FILE: ", os.environ.get("RANK_TABLE_FILE", "not set"), flush=True)
print("=================ms import done", flush=True)
time.sleep(10)
os.system(
"cp /home/work/rank_table/jobstart_hccl.json /home/work/sfs/xx; sudo chmod +777 /home/work/rank_table/jobstart_hccl.json")
# Use configurable temp directory (platform-specific for ModelArts)
temp_dir = os.environ.get("MODELARTS_TEMP_DIR", "/home/work/sfs/xx")
rank_table_source = "/home/work/rank_table/jobstart_hccl.json"
if os.path.exists(rank_table_source):
os.system(f"cp {rank_table_source} {temp_dir}; sudo chmod +777 {rank_table_source}")
else:
print(f"Warning: {rank_table_source} does not exist. Skipping copy.")
ret = os.system(f"cd {log_path} && bash {args.script} 2>&1 | tee output.log")
if os.environ.get("RANK_ID") == 0:
log_dir = os.path.join(args.work_dir, "logs", os.environ.get("JOB_ID"))