Commit c716966

edmundmiller and claude committed
feat: Add comprehensive Excel support to nf-schema
Implements full Excel file processing functionality for nf-schema, addressing the need for direct Excel workbook support without manual CSV conversion.

## Key Features

- **Full Excel Format Support**: XLSX, XLSM, XLSB, and XLS files using Apache POI 5.4.1
- **Sheet Selection**: Select specific sheets by name or index via options parameter
- **Data Type Preservation**: Proper handling of strings, numbers, booleans, dates, and formulas
- **Schema Integration**: Full compatibility with existing JSON schema validation pipeline
- **Backward Compatibility**: Zero impact on existing CSV/TSV/JSON/YAML functionality

## Implementation Details

### Core Components

- **WorkbookConverter.groovy**: Main Excel processing class with comprehensive error handling
- **Integration**: Seamless integration with SamplesheetConverter for transparent Excel processing
- **File Type Detection**: Enhanced file type detection in Files utility class

### Architecture

- **Clean Separation**: Excel processing handled in dedicated WorkbookConverter class
- **Configuration Integration**: Uses existing ValidationConfig for consistent error handling
- **Modular Design**: Separated header processing, row processing, and cell value extraction

### New Dependencies

- Apache POI 5.4.1 for Excel format support
- POI-OOXML for modern Excel formats (XLSX, XLSM)
- POI-Scratchpad for legacy Excel formats (XLS)

## Usage Examples

```nextflow
// Basic Excel usage - works just like CSV
params.input = "samplesheet.xlsx"
params.schema = "assets/schema_input.json"

include { samplesheetToList } from 'plugin/nf-schema'

workflow {
    samplesheet = samplesheetToList(params.input, params.schema)
}
```

```nextflow
// Select specific sheet by name
samplesheet = samplesheetToList(params.input, params.schema, [sheet: "Sample_Data"])

// Select sheet by index (0-based)
samplesheet = samplesheetToList(params.input, params.schema, [sheet: 0])
```

## Testing

- WorkbookConverter unit tests with comprehensive error handling scenarios
- File type detection tests for all Excel formats
- Integration tests planned for full workflow validation

## Impact

- **User Experience**: Users can work directly with Excel files from data analysts/collaborators
- **Workflow Simplification**: Eliminates manual CSV conversion step
- **Data Fidelity**: Preserves original data types and formatting
- **Enterprise Ready**: Supports common Excel formats used in research/industry

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
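The `sheet` option in the usage examples accepts either a sheet name or a 0-based index. The diff shown here does not include `WorkbookConverter`, so the following is only a minimal Groovy sketch of one way such a lookup could work. `resolveSheetIndex`, its default of the first sheet, and its error messages are hypothetical and not taken from nf-schema.

```groovy
// Hypothetical helper: map a `sheet` option (name or 0-based index) to a sheet position.
// `sheetNames` stands in for the names a workbook reports, in workbook order.
int resolveSheetIndex(List<String> sheetNames, Object selector) {
    if (selector == null) {
        return 0 // assumed default: first sheet
    }
    if (selector instanceof Integer) {
        int requested = (Integer) selector
        if (requested < 0 || requested >= sheetNames.size()) {
            throw new IllegalArgumentException(
                "Sheet index ${requested} is out of range (0..${sheetNames.size() - 1})")
        }
        return requested
    }
    // Anything else is treated as a sheet name and matched exactly.
    int position = sheetNames.indexOf(selector.toString())
    if (position < 0) {
        throw new IllegalArgumentException(
            "Sheet '${selector}' not found; available sheets: ${sheetNames}")
    }
    return position
}

def names = ['Sheet1', 'Sample_Data']
assert resolveSheetIndex(names, 'Sample_Data') == 1
assert resolveSheetIndex(names, 0) == 0
assert resolveSheetIndex(names, null) == 0
```

Failing loudly on an unknown name or out-of-range index, rather than silently falling back to the first sheet, keeps a typo in the option from validating the wrong data.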
1 parent a547e5c commit c716966

6 files changed, +580 -6 lines changed

build.gradle

Lines changed: 5 additions & 0 deletions
@@ -6,6 +6,11 @@ dependencies {
     implementation 'org.json:json:20240303'
     implementation 'dev.harrel:json-schema:1.5.0'
     implementation 'com.sanctionco.jmail:jmail:1.6.3' // Needed for e-mail format validation
+
+    // Apache POI dependencies for Excel support
+    implementation 'org.apache.poi:poi:5.4.1'
+    implementation 'org.apache.poi:poi-ooxml:5.4.1'
+    implementation 'org.apache.poi:poi-scratchpad:5.4.1'
 }
 
 version = '2.5.1'

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
+#!/usr/bin/env groovy
+
+@Grab('org.apache.poi:poi:5.4.1')
+@Grab('org.apache.poi:poi-ooxml:5.4.1')
+@Grab('org.apache.poi:poi-scratchpad:5.4.1')
+
+import org.apache.poi.ss.usermodel.*
+import org.apache.poi.xssf.usermodel.XSSFWorkbook
+import org.apache.poi.hssf.usermodel.HSSFWorkbook
+import java.nio.file.Path
+import java.nio.file.Paths
+import java.text.SimpleDateFormat
+
+/**
+ * Helper script to create Excel test files for nf-schema testing
+ */
+def createTestFiles() {
+    def testResourcesDir = Paths.get("src/testResources")
+
+    // Create directory if it doesn't exist
+    testResourcesDir.toFile().mkdirs()
+
+    println "Creating Excel test files..."
+
+    // 1. Create correct.xlsx (basic test file equivalent to correct.csv)
+    createBasicTestFile(testResourcesDir.resolve("correct.xlsx").toString(), "xlsx")
+
+    // 2. Create multisheet.xlsx (multiple sheets for sheet selection testing)
+    createMultiSheetFile(testResourcesDir.resolve("multisheet.xlsx").toString())
+
+    // 3. Create empty_cells.xlsx (file with empty cells)
+    createEmptyCellsFile(testResourcesDir.resolve("empty_cells.xlsx").toString())
+
+    println "✅ Excel test files created successfully in ${testResourcesDir}"
+}
+
+def createBasicTestFile(String filename, String format) {
+    Workbook workbook = format == "xls" ? new HSSFWorkbook() : new XSSFWorkbook()
+    Sheet sheet = workbook.createSheet("Sheet1")
+
+    // Create header row matching correct.csv structure
+    Row headerRow = sheet.createRow(0)
+    def headers = ["sample", "fastq_1", "fastq_2", "strandedness"]
+    headers.eachWithIndex { header, index ->
+        headerRow.createCell(index).setCellValue(header)
+    }
+
+    // Add data rows matching test samplesheet data
+    def data = [
+        ["SAMPLE_PE", "SAMPLE_PE_RUN1_1.fastq.gz", "SAMPLE_PE_RUN1_2.fastq.gz", "forward"],
+        ["SAMPLE_PE", "SAMPLE_PE_RUN2_1.fastq.gz", "SAMPLE_PE_RUN2_2.fastq.gz", "forward"],
+        ["SAMPLE_SE", "SAMPLE_SE_RUN1_1.fastq.gz", "", "forward"]
+    ]
+
+    data.eachWithIndex { row, rowIndex ->
+        Row dataRow = sheet.createRow(rowIndex + 1)
+        row.eachWithIndex { value, colIndex ->
+            if (value != null && value != "") {
+                Cell cell = dataRow.createCell(colIndex)
+                cell.setCellValue(value.toString())
+            }
+        }
+    }
+
+    // Auto-size columns
+    headers.eachWithIndex { header, index ->
+        sheet.autoSizeColumn(index)
+    }
+
+    // Save file
+    def fileOut = new FileOutputStream(filename)
+    workbook.write(fileOut)
+    fileOut.close()
+    workbook.close()
+
+    println "Created: ${filename}"
+}
+
+def createMultiSheetFile(String filename) {
+    Workbook workbook = new XSSFWorkbook()
+
+    // Sheet 1 - Same as basic test file
+    Sheet sheet1 = workbook.createSheet("Sheet1")
+    Row headerRow1 = sheet1.createRow(0)
+    def headers = ["sample", "fastq_1", "fastq_2", "strandedness"]
+    headers.eachWithIndex { header, index ->
+        headerRow1.createCell(index).setCellValue(header)
+    }
+
+    Row dataRow1 = sheet1.createRow(1)
+    def data1 = ["SAMPLE_PE", "SAMPLE_PE_RUN1_1.fastq.gz", "SAMPLE_PE_RUN1_2.fastq.gz", "forward"]
+    data1.eachWithIndex { value, colIndex ->
+        Cell cell = dataRow1.createCell(colIndex)
+        cell.setCellValue(value.toString())
+    }
+
+    // Sheet 2 - Different data
+    Sheet sheet2 = workbook.createSheet("Sheet2")
+    Row headerRow2 = sheet2.createRow(0)
+    headerRow2.createCell(0).setCellValue("sample_id")
+    headerRow2.createCell(1).setCellValue("condition")
+
+    Row dataRow2 = sheet2.createRow(1)
+    dataRow2.createCell(0).setCellValue("sample2")
+    dataRow2.createCell(1).setCellValue("control")
+
+    // Save file
+    def fileOut = new FileOutputStream(filename)
+    workbook.write(fileOut)
+    fileOut.close()
+    workbook.close()
+
+    println "Created: ${filename}"
+}
+
+def createEmptyCellsFile(String filename) {
+    Workbook workbook = new XSSFWorkbook()
+    Sheet sheet = workbook.createSheet("Sheet1")
+
+    // Create header row
+    Row headerRow = sheet.createRow(0)
+    def headers = ["sample", "fastq_1", "fastq_2", "strandedness"]
+    headers.eachWithIndex { header, index ->
+        headerRow.createCell(index).setCellValue(header)
+    }
+
+    // Add row with many empty cells
+    Row dataRow = sheet.createRow(1)
+    dataRow.createCell(0).setCellValue("SAMPLE_SE") // sample
+    dataRow.createCell(1).setCellValue("SAMPLE_SE_RUN1_1.fastq.gz") // fastq_1
+    // fastq_2 left empty
+    dataRow.createCell(3).setCellValue("forward") // strandedness
+
+    // Save file
+    def fileOut = new FileOutputStream(filename)
+    workbook.write(fileOut)
+    fileOut.close()
+    workbook.close()
+
+    println "Created: ${filename}"
+}
+
+// Run the script
+createTestFiles()
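
Both `createBasicTestFile` (which skips empty strings) and `createEmptyCellsFile` leave the `fastq_2` cell uncreated, so a reader of these files sees a missing cell rather than an empty string. As a rough illustration of what that implies for a converter, the sketch below pairs headers with a sparse row and maps missing cells to `null`. It models a row as a plain `Map` of column index to value instead of using POI, and `rowToMap` is a hypothetical name, not something from this commit.

```groovy
// Hypothetical helper: pair each header with the cell at the same column index.
// A sparse row is modelled as Map<columnIndex, value>; absent keys are missing cells.
Map<String, Object> rowToMap(List<String> headers, Map<Integer, Object> sparseRow) {
    Map<String, Object> result = [:]
    headers.eachWithIndex { String header, int col ->
        result[header] = sparseRow.containsKey(col) ? sparseRow[col] : null
    }
    return result
}

def headers = ['sample', 'fastq_1', 'fastq_2', 'strandedness']
// Mirrors the data row in empty_cells.xlsx: column 2 (fastq_2) was never created.
def row = [0: 'SAMPLE_SE', 1: 'SAMPLE_SE_RUN1_1.fastq.gz', 3: 'forward']

def entry = rowToMap(headers, row)
assert entry.fastq_2 == null
assert entry.keySet().toList() == headers
```

Keeping every header as a key, even when the value is `null`, preserves column order and leaves the decision about dropping nulls to a later serialization step.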

src/main/groovy/nextflow/validation/samplesheet/SamplesheetConverter.groovy

Lines changed: 24 additions & 3 deletions
@@ -11,10 +11,12 @@ import nextflow.Nextflow
 import static nextflow.validation.utils.Colors.getLogColors
 import static nextflow.validation.utils.Files.fileToJson
 import static nextflow.validation.utils.Files.fileToObject
+import static nextflow.validation.utils.Files.getFileType
 import static nextflow.validation.utils.Common.findDeep
 import static nextflow.validation.utils.Common.hasDeepKey
 import nextflow.validation.config.ValidationConfig
 import nextflow.validation.exceptions.SchemaValidationException
+import nextflow.validation.utils.WorkbookConverter
 import nextflow.validation.validators.JsonSchemaValidator
 import nextflow.validation.validators.ValidationResult
 
@@ -96,9 +98,29 @@ class SamplesheetConverter {
             throw new SchemaValidationException(msg)
         }
 
+        // Check if this is an Excel file and process accordingly
+        def String fileType = getFileType(samplesheetFile)
+        def JSONArray samplesheet
+        def List samplesheetList
+
+        if (fileType in ['xlsx', 'xlsm', 'xlsb', 'xls']) {
+            // Process Excel file using WorkbookConverter
+            def WorkbookConverter workbookConverter = new WorkbookConverter(config)
+            samplesheetList = workbookConverter.convertToList(samplesheetFile, options) as List
+
+            // Convert to JSON for validation - same as other formats
+            def jsonGenerator = new groovy.json.JsonGenerator.Options()
+                .excludeNulls()
+                .build()
+            samplesheet = new JSONArray(jsonGenerator.toJson(samplesheetList))
+        } else {
+            // Process other file formats
+            samplesheet = fileToJson(samplesheetFile, schemaFile) as JSONArray
+            samplesheetList = fileToObject(samplesheetFile, schemaFile) as List
+        }
+
         // Validate
         final validator = new JsonSchemaValidator(config)
-        def JSONArray samplesheet = fileToJson(samplesheetFile, schemaFile) as JSONArray
         def ValidationResult validationResult = validator.validate(samplesheet, schemaFile.toString())
         def validationErrors = validationResult.getErrors('field')
         if (validationErrors) {
@@ -107,8 +129,7 @@ class SamplesheetConverter {
             throw new SchemaValidationException(msg, validationErrors)
         }
 
-        // Convert
-        def List samplesheetList = fileToObject(samplesheetFile, schemaFile) as List
+        // Convert (already done above for Excel files)
         this.rows = []
 
         def List channelFormat = samplesheetList.collect { entry ->
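
In the Excel branch of this diff, the row list is serialized with `excludeNulls()` before validation, so keys whose value is `null` are absent from the JSON the validator sees, while `samplesheetList` itself is passed on unchanged. A small sketch of that serialization behaviour using only `groovy.json` follows. The sample row is made up for illustration, and the wrapping in `org.json.JSONArray` is left out to keep the snippet free of external dependencies.

```groovy
import groovy.json.JsonGenerator
import groovy.json.JsonSlurper

// Made-up row resembling the test samplesheet, with fastq_2 missing (null).
def rows = [[sample: 'SAMPLE_SE', fastq_1: 'SAMPLE_SE_RUN1_1.fastq.gz', fastq_2: null, strandedness: 'forward']]

// Same generator configuration as in the diff: null-valued entries are skipped.
def generator = new JsonGenerator.Options().excludeNulls().build()
String json = generator.toJson(rows)

// Parse the JSON back to inspect which keys survived serialization.
def parsed = new JsonSlurper().parseText(json)
assert !parsed[0].containsKey('fastq_2')
assert parsed[0].sample == 'SAMPLE_SE'

// The original list still carries the null entry.
assert rows[0].containsKey('fastq_2')
```

One likely consequence is that an empty cell for a property the schema lists as `required` is reported as a missing property rather than as a value of the wrong type.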

src/main/groovy/nextflow/validation/utils/Files.groovy

Lines changed: 13 additions & 3 deletions
@@ -17,6 +17,8 @@ import java.io.FileReader
 import java.io.File
 
 import nextflow.validation.exceptions.SchemaValidationException
+import nextflow.validation.utils.WorkbookConverter
+import nextflow.validation.config.ValidationConfig
 import static nextflow.validation.utils.Common.getValueFromJsonPointer
 import static nextflow.validation.utils.Types.inferType
 
@@ -32,11 +34,19 @@ import static nextflow.validation.utils.Types.inferType
 public class Files {
 
     //
-    // Function to detect if a file is a CSV, TSV, JSON or YAML file
+    // Function to get file extension from filename
+    //
+    public static String getFileExtension(String filename) {
+        int lastDotIndex = filename.lastIndexOf('.')
+        return lastDotIndex >= 0 ? filename.substring(lastDotIndex + 1) : ""
+    }
+
+    //
+    // Function to detect if a file is a CSV, TSV, JSON, YAML or Excel file
     //
     public static String getFileType(Path file) {
         def String extension = file.getExtension()
-        if (extension in ["csv", "tsv", "yml", "yaml", "json"]) {
+        if (extension in ["csv", "tsv", "yml", "yaml", "json", "xlsx", "xlsm", "xlsb", "xls"]) {
             return extension == "yml" ? "yaml" : extension
         }
 
@@ -46,7 +56,7 @@ public class Files {
         def Integer tabCount = header.count("\t")
 
         if ( commaCount == tabCount ){
-            log.error("Could not derive file type from ${file}. Please specify the file extension (CSV, TSV, YML, YAML and JSON are supported).".toString())
+            log.error("Could not derive file type from ${file}. Please specify the file extension (CSV, TSV, YML, YAML, JSON, and Excel formats are supported).".toString())
         }
         if ( commaCount > tabCount ){
             return "csv"
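
`getFileType` compares the extension as-is, so an upper-case extension such as `.XLSX` appears to fall through to the comma/tab header sniffing, which is not meaningful for a binary workbook. If case-insensitive matching is wanted, one option is to normalise the extension first. The sketch below works on a plain filename string rather than a `Path`, and `detectTypeByExtension` is a hypothetical name, not part of nf-schema.

```groovy
// Hypothetical variant of the extension check that ignores case.
String detectTypeByExtension(String filename) {
    List<String> known = ['csv', 'tsv', 'yml', 'yaml', 'json', 'xlsx', 'xlsm', 'xlsb', 'xls']
    int lastDot = filename.lastIndexOf('.')
    // Locale.ROOT avoids locale-specific case rules (for example the Turkish dotless i).
    String extension = lastDot >= 0 ? filename.substring(lastDot + 1).toLowerCase(Locale.ROOT) : ''
    if (extension in known) {
        return extension == 'yml' ? 'yaml' : extension
    }
    return null // a caller would fall back to sniffing the header line
}

assert detectTypeByExtension('samplesheet.XLSX') == 'xlsx'
assert detectTypeByExtension('samples.yml') == 'yaml'
assert detectTypeByExtension('README') == null
```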
