1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed
- Build error due to base image changing.
- JavaBridge JVM conflicts when running parallel tasks on HPC.

## [1.2.0] - 2025-04-22

205 changes: 205 additions & 0 deletions JVM_CONFLICT_FIXES.md
@@ -0,0 +1,205 @@
# JVM Conflict Fixes for NDPI Tile Cropper

## Problem Description

When running the NDPI tile cropper in parallel mode using `job.sh`, you may encounter Java/javabridge errors such as:

```
org.libjpegturbo.turbojpeg.TJDecompressor.init()V
javabridge.jutil.JavaException: org.libjpegturbo.turbojpeg.TJDecompressor.init()V
```

Additionally, you may encounter pickling errors when using `ProcessPoolExecutor`:

```
Can't pickle local object 'ArgumentParser.__init__.<locals>.identity'
```

These issues occur because:
1. Each subprocess starts its own JVM instance with `javabridge.start_vm()`
2. Multiple JVMs competing for the same resources cause conflicts
3. Java library initialization fails when multiple processes access shared resources simultaneously
4. `ProcessPoolExecutor` tries to pickle (serialize) objects that contain unpicklable local functions (see the sketch below)
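
The pickling failure in point 4 can be reproduced in isolation. Below is a minimal, hypothetical sketch (the function and file names are illustrative, not taken from this repository) showing why a locally defined function cannot be shipped to a `ProcessPoolExecutor` worker, while a module-level function with simple argument types works:

```python
from concurrent.futures import ProcessPoolExecutor

def crop(path):
    # Module-level function with simple argument types: pickles fine.
    return path.upper()

def make_worker():
    def local_crop(path):
        # Defined inside another function, so it is a "local object";
        # pickle cannot serialize it for transfer to a worker process.
        return path.upper()
    return local_crop

if __name__ == "__main__":
    files = ["a.ndpi", "b.ndpi"]
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(list(pool.map(crop, files)))  # works: picklable worker
        # The next line would fail with an error like:
        #   Can't pickle local object 'make_worker.<locals>.local_crop'
        # print(list(pool.map(make_worker(), files)))
```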

## Solutions Implemented

### 1. Fixed Parallel Processing (`ndpi_tile_cropper_parallel_cli.py`)

**Changes made:**
- Fixed pickling issue by moving argument parsing outside worker function
- Created picklable `process_single_file()` function with simple data types
- Switched from `ThreadPoolExecutor` to `ProcessPoolExecutor` for better isolation
- Reduced default number of processes from 8 to 4 to minimize conflicts
- Added retry logic with exponential backoff for JVM conflicts (sketched after this list)
- Added random delays between process starts to stagger JVM initialization
- Better error detection and handling for JVM-related failures
- Added timeout handling to prevent hanging processes

**New command-line options:**
- `--retry-attempts, -r`: Number of retry attempts for failed processing (default: 3)
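
The sketch below illustrates the pattern described above: a module-level, picklable worker that combines a random start-up stagger with exponential backoff on JVM conflicts. The worker name `process_single_file` matches the function mentioned above, but its body, arguments, and the idea of shelling out to the single-file CLI (including the `-i`/`-o` flags) are illustrative assumptions, not the repository's actual implementation:

```python
import random
import subprocess
import sys
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_single_file(input_path, output_dir, retry_attempts=3):
    """Picklable worker: only simple data types cross the process boundary."""
    time.sleep(random.uniform(1, 5))  # stagger JVM start-up between workers
    for attempt in range(1, retry_attempts + 1):
        # Hypothetical: run the single-file CLI in a subprocess so each
        # attempt gets a fresh JVM; the flags shown here are assumptions.
        result = subprocess.run(
            [sys.executable, "src/ndpi_tile_cropper_cli.py",
             "-i", input_path, "-o", output_dir],
            capture_output=True, text=True, timeout=3600,
        )
        if result.returncode == 0:
            return input_path, True
        if "TJDecompressor" in result.stderr or "javabridge" in result.stderr.lower():
            time.sleep(2 ** attempt + random.uniform(0, 3))  # exponential backoff with jitter
        else:
            break
    return input_path, False

if __name__ == "__main__":
    files = ["/path/to/a.ndpi", "/path/to/b.ndpi"]  # hypothetical inputs
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(process_single_file, f, "/path/to/output", 3) for f in files]
        for future in as_completed(futures):
            print(future.result())
```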

### 2. Simple Parallel Processing (`ndpi_tile_cropper_parallel_cli_simple.py`)

**Alternative approach:**
- Uses `ThreadPoolExecutor` to avoid pickling issues entirely (see the sketch after this list)
- Maintains all JVM conflict handling features
- Simpler implementation with no serialization requirements
- Recommended for environments where pickling is problematic
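
For comparison, a minimal sketch of the thread-based approach. Because threads live in the parent process, nothing is pickled, so even locally defined callables are acceptable; the trade-off is that all workers share one process. Names and inputs are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def crop_file(path):
    # Illustrative worker; with threads it is never serialized, so it could
    # even be defined inside another function without pickling errors.
    return path, True

files = ["/path/to/a.ndpi", "/path/to/b.ndpi"]  # hypothetical inputs
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(crop_file, files))
print(results)
```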

### 3. Enhanced JVM Management (`ndpi_tile_cropper_cli.py`)

**Changes made:**
- Better JVM initialization error handling
- Improved cleanup in error scenarios
- More specific error messages for JVM conflicts
- Graceful handling of reader cleanup failures

### 4. Updated Job Script (`job.sh`)

**Changes made:**
- Reduced `--ntasks-per-node` from 4 to 2
- Added retry logic function with random delays
- Staggered process starts with 10-second delays
- Added the `-r 3` parameter for retry attempts
- Updated to use the simple parallel CLI to avoid pickling issues

## Usage

### Option 1: Use the Fixed ProcessPoolExecutor Version

```bash
# Use the fixed version with ProcessPoolExecutor
python src/ndpi_tile_cropper_parallel_cli.py -d /path/to/ndpi/files -o /path/to/output -n 2 -r 3
```

### Option 2: Use the Simple ThreadPoolExecutor Version (Recommended)

```bash
# Use the simple version that avoids pickling issues
python src/ndpi_tile_cropper_parallel_cli_simple.py -d /path/to/ndpi/files -o /path/to/output -n 2 -r 3
```

### Option 3: Using the Updated Job Script

```bash
# Submit the updated job script (uses simple version)
sbatch src/job.sh
```

The script now includes:
- Retry logic for failed processes
- Staggered starts to reduce JVM conflicts
- Better error reporting
- Uses the simple parallel CLI to avoid pickling issues

### Manual Retry Logic

If you still encounter issues, you can manually implement retry logic:

```bash
#!/bin/bash
max_retries=3
for attempt in $(seq 1 "$max_retries"); do
    if python src/ndpi_tile_cropper_parallel_cli_simple.py -d /path/to/files -o /path/to/output -n 2; then
        echo "Success on attempt $attempt"
        break
    else
        echo "Failed on attempt $attempt"
        sleep $((RANDOM % 30 + 10))  # Random delay between 10 and 39 seconds
    fi
done
```

## Monitoring and Debugging

### Check for JVM Conflicts

Look for these patterns in your logs:
- `TJDecompressor` errors
- `javabridge` exceptions
- `JVM conflict detected` messages

### Check for Pickling Issues

Look for these patterns in your logs (a combined grep for both checks follows this list):
- `Can't pickle local object` errors
- `pickle` related exceptions
- Serialization errors
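
A quick way to scan for both failure modes at once, assuming the Slurm log naming used in `job.sh` (`slurm-%j-%x.err` / `.out`):

```bash
# Case-insensitive scan of Slurm logs for JVM-conflict and pickling signatures
grep -niE "TJDecompressor|javabridge|JVM conflict detected|can't pickle local object" slurm-*.err slurm-*.out
```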

### Log Analysis

The improved logging will now show:
- Specific JVM conflict detection
- Retry attempt information
- Success/failure counts for parallel processing
- Clear error messages for both JVM and pickling issues

### Performance Considerations

**Trade-offs:**
- Reduced parallelism may increase total processing time
- Retry logic adds overhead but improves reliability
- Staggered starts reduce peak resource usage
- ThreadPoolExecutor (simple version) may have more JVM conflicts but no pickling issues
- ProcessPoolExecutor (fixed version) has better isolation but requires picklable functions

**Recommended settings:**
- Start with 2-4 parallel processes
- Use 3 retry attempts
- Add 5-15 second delays between process starts
- Use the simple version if you encounter pickling issues

## Testing

Use the provided test script to verify the fixes:

```bash
# Update paths in test_jvm_fix.py first
python test_jvm_fix.py
```

## Alternative Solutions

If the above fixes don't resolve the issue, consider:

1. **Sequential Processing**: Process files one at a time
2. **Resource Limits**: Set JVM memory limits to prevent conflicts
3. **Container Isolation**: Use separate containers for each process
4. **Different JVM**: Try different JVM versions or configurations
5. **Different Parallelism**: Use `multiprocessing.Pool` directly (sketched below)
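
A hedged sketch combining options 2 and 5: cap the JVM heap and drive workers with `multiprocessing.Pool`. `max_heap_size` is a real parameter of `javabridge.start_vm()`, but the worker body and file paths are illustrative assumptions rather than code from this repository:

```python
import multiprocessing as mp

import bioformats
import javabridge

def crop_with_capped_jvm(path):
    # Each worker process starts its own JVM with a fixed, small heap so that
    # several concurrent JVMs cannot exhaust node memory.
    javabridge.start_vm(class_path=bioformats.JARS, max_heap_size="2G", run_headless=True)
    try:
        # ... call the existing cropping code for `path` here (illustrative) ...
        return path, True
    finally:
        javabridge.kill_vm()

if __name__ == "__main__":
    files = ["/path/to/a.ndpi", "/path/to/b.ndpi"]  # hypothetical inputs
    # maxtasksperchild=1 gives every file a fresh worker process, and therefore
    # a fresh JVM, since a killed JVM cannot be restarted in the same process.
    with mp.Pool(processes=2, maxtasksperchild=1) as pool:
        print(pool.map(crop_with_capped_jvm, files))
```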

## Troubleshooting

### Common Issues

1. **Still getting JVM errors**: Reduce the number of parallel processes further
2. **Pickling errors**: Use the simple version (`ndpi_tile_cropper_parallel_cli_simple.py`)
3. **Processes hanging**: Check for timeout settings and resource limits
4. **Memory issues**: Monitor system memory usage during processing

### Debug Commands

```bash
# Check JVM processes
ps aux | grep java

# Monitor system resources
htop

# Check for file locks
lsof | grep ndpi

# Test pickling (if using ProcessPoolExecutor version)
python -c "import pickle; from src.ndpi_tile_cropper_parallel_cli import process_single_file; pickle.dumps(process_single_file)"
```

## Version History

- **v1.1.1**: Initial JVM conflict fixes
  - Added retry logic and better error handling
  - Reduced default parallelism
  - Improved job script with staggered starts
- **v1.1.2**: Fixed pickling issues
  - Created picklable worker function
  - Added simple ThreadPoolExecutor alternative
  - Updated job script to use simple version
62 changes: 62 additions & 0 deletions src/job.sh
@@ -0,0 +1,62 @@
#!/bin/bash
#SBATCH --job-name="PALYIM_TADP_dirs_4_7_ntcp_v1.1.1_s_2048_l_256_20250620"
#SBATCH --partition=cpu
#SBATCH --mem-per-cpu=2G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2 # Reduced from 4 to avoid JVM conflicts
#SBATCH --cpus-per-task=8 # spread out to use 1 core per numa, set to 64 if tasks is 1
#SBATCH --constraint="scratch"
#SBATCH --account=bdar-delta-cpu # <- match to a "Project" returned by the "accounts" command
##SBATCH --exclusive # dedicated node for this job
#SBATCH --no-requeue
#SBATCH -t 48:00:00
#SBATCH -e slurm-%j-%x.err
#SBATCH -o slurm-%j-%x.out
#SBATCH --mail-user=sandeeps@illinois.edu
#SBATCH --mail-type="BEGIN,END" # See sbatch or srun man pages for more email options

export OMP_NUM_THREADS=1 # 1 if code is not multithreaded, otherwise set to the number of CPUs allocated per task.
cd /projects/bdar/sandeeps/git/ndpi-tile-cropper-cli/src

# Function to run with retry logic using the simple parallel CLI
run_with_retry() {
    local max_retries=3
    local retry_count=0
    local success=false

    while [ $retry_count -lt $max_retries ] && [ "$success" = false ]; do
        if [ $retry_count -gt 0 ]; then
            echo "Retry attempt $retry_count for $1"
            sleep $((RANDOM % 10 + 5)) # Random delay between 5-15 seconds
        fi

        if srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK /usr/bin/apptainer run --bind /work/hdd/bdar/data:/data ndpi-tile-cropper-parallel-pr-22.sif -d "$1" -o /data/TADP_TILE_CROPS -n 2 -s 2048 -l 256 -r 3; then
            success=true
            echo "Successfully processed $1"
        else
            retry_count=$((retry_count + 1))
            echo "Failed to process $1 (attempt $retry_count)"
        fi
    done

    if [ "$success" = false ]; then
        echo "Failed to process $1 after $max_retries attempts"
        return 1
    fi
    return 0
}

# Run each directory with retry logic and staggered starts
run_with_retry /data/TADP/4 &
sleep 10 # Stagger the starts to reduce JVM conflicts

run_with_retry /data/TADP/5 &
sleep 10

run_with_retry /data/TADP/6 &
sleep 10

run_with_retry /data/TADP/7 &
sleep 10

wait
34 changes: 27 additions & 7 deletions src/ndpi_tile_cropper_cli.py
@@ -164,8 +164,14 @@ def __read_tile(self, x, y, z, width, height):
        except Exception as ex:
            logger.error(self.input_filename + ": Error reading tile: " + str(x) + "x_" + str(y) + "y_" + str(z) + "z")
            logger.error(ex, exc_info=True)
            # Check if it's a JVM-related error and provide more specific logging
            if "TJDecompressor" in str(ex) or "javabridge" in str(ex).lower():
                logger.error("JVM conflict detected. This may be due to multiple processes accessing the same JVM resources.")
        finally:
            reader.close()
            try:
                reader.close()
            except:
                pass  # Ignore errors during cleanup
        return img

@staticmethod
Expand Down Expand Up @@ -317,8 +323,16 @@ def exit_program(self, signum, frame):

if __name__ == '__main__':

    # Start the JVM
    javabridge.start_vm(class_path=bioformats.JARS, run_headless=True)
    # Start the JVM with better error handling
    try:
        javabridge.start_vm(class_path=bioformats.JARS, run_headless=True)
        logger = logging.getLogger("ndpi_tile_cropper_cli.py")
        logger.info("JVM started successfully")
    except Exception as e:
        print(f"Failed to start JVM: {e}")
        print("This may be due to JVM conflicts when running multiple instances.")
        print("Try reducing the number of parallel processes or adding delays between process starts.")
        exit(1)

    # Parse the command line arguments
    cli = NDPITileCropperCLI()
@@ -359,9 +373,15 @@ def exit_program(self, signum, frame):
        logger.error(e, exc_info=True)
    finally:
        # Write metadata before exiting
        ndpi_file_cropper.write_metadata_before_exiting()
        try:
            ndpi_file_cropper.write_metadata_before_exiting()
        except:
            pass  # Ignore errors during cleanup

        # Stop the JVM
        logger.info("Shutting down JVM.")
        javabridge.kill_vm()
        logger.info("Stopping NDPITileCropper CLI")
        try:
            logger.info("Shutting down JVM.")
            javabridge.kill_vm()
            logger.info("Stopping NDPITileCropper CLI")
        except:
            pass  # Ignore errors during JVM shutdown