|
| 1 | +<!-- |
| 2 | + Licensed to the Apache Software Foundation (ASF) under one or more |
| 3 | + contributor license agreements. See the NOTICE file distributed with |
| 4 | + this work for additional information regarding copyright ownership. |
| 5 | + The ASF licenses this file to You under the Apache License, Version 2.0 |
| 6 | + (the "License"); you may not use this file except in compliance with |
| 7 | + the License. You may obtain a copy of the License at |
| 8 | +
|
| 9 | + https://www.apache.org/licenses/LICENSE-2.0 |
| 10 | +
|
| 11 | + Unless required by applicable law or agreed to in writing, software |
| 12 | + distributed under the License is distributed on an "AS IS" BASIS, |
| 13 | + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 14 | + See the License for the specific language governing permissions and |
| 15 | + limitations under the License. |
| 16 | +--> |
| 17 | +<html> |
| 18 | +<head> |
| 19 | +<title>Apache Commons CSV Overview</title> |
| 20 | +</head> |
| 21 | +<body> |
| 22 | + <img src="../images/commons-logo.png" alt="Apache Commons CSV"> |
| 23 | + <p> |
| 24 | + You can find the Javadoc package list at the <a href="#all-packages-table">bottom of this page</a>. |
| 25 | + </p> |
| 26 | + <section> |
| 27 | + <h1>Introducing Commons CSV</h1> |
| 28 | + <p>Apache Commons CSV reads and writes files in variations of the Comma Separated Value (CSV) format.</p> |
| 29 | + <p> |
| 30 | + Common CSV formats are predefined in the <a href="org/apache/commons/csv/CSVFormat.html">CSVFormat</a> class: |
| 31 | + <table> |
| 32 | + <caption>CSV Formats</caption> |
| 33 | + <thead> |
| 34 | + <tr> |
| 35 | + <th scope="col">CSVFormat</th> |
| 36 | + <th scope="col">Description</th> |
| 37 | + <th scope="col">Since Version</th> |
| 38 | + </tr> |
| 39 | + </thead> |
| 40 | + <tbody> |
| 41 | + <tr> |
| 42 | + <td><a href="org/apache/commons/csv/CSVFormat.html#DEFAULT">DEFAULT</a></td> |
| 43 | + <td>IO for the Standard Comma Separated Value format, like <a href="https://datatracker.ietf.org/doc/html/rfc4180">RFC 4180</a> but allowing |
| 44 | + empty lines. |
| 45 | + </td> |
| 46 | + <td>1.0</td> |
| 47 | + </tr> |
| 48 | + <tr> |
| 49 | + <td><a href="org/apache/commons/csv/CSVFormat.html#EXCEL">EXCEL</a></td> |
| 50 | + <td>IO for the <a href="https://support.microsoft.com/en-us/office/import-or-export-text-txt-or-csv-files-5250ac4c-663c-47ce-937b-339e391393ba">Microsoft |
| 51 | + Excel CSV</a> format. |
| 52 | + </td> |
| 53 | + <td>1.0</td> |
| 54 | + </tr> |
| 55 | + <tr> |
| 56 | + <td><a href="org/apache/commons/csv/CSVFormat.html#INFORMIX_UNLOAD">INFORMIX_UNLOAD</a></td> |
| 57 | + <td>IO for the <a href="https://www.ibm.com/docs/en/informix-servers/14.10?topic=statements-unload-statement">Informix UNLOAD TO file_name</a> |
| 58 | + command. |
| 59 | + </td> |
| 60 | + <td>1.3</td> |
| 61 | + </tr> |
| 62 | + <tr> |
| 63 | + <td><a href="org/apache/commons/csv/CSVFormat.html#INFORMIX_UNLOAD_CSV">INFORMIX_UNLOAD_CSV</a></td> |
| 64 | + <td>IO for the <a href="https://www.ibm.com/docs/en/informix-servers/14.10?topic=statements-unload-statement">Informix UNLOAD CSV TO |
| 65 | + file_name</a> command with escaping disabled. |
| 66 | + </td> |
| 67 | + <td>1.3</td> |
| 68 | + </tr> |
| 69 | + <tr> |
| 70 | + <td><a href="org/apache/commons/csv/CSVFormat.html#MONGODB_CSV">MONGODB_CSV</a></td> |
| 71 | + <td>IO for the <a href="https://docs.mongodb.com/manual/reference/program/mongoexport/">MongoDB CSV <code>mongoexport</code></a> command. |
| 72 | + </td> |
| 73 | + <td>1.7</td> |
| 74 | + </tr> |
| 75 | + <tr> |
| 76 | + <td><a href="org/apache/commons/csv/CSVFormat.html#MONGODB_TSV">MONGODB_TSV</a></td> |
| 77 | + <td>IO for the <a href="https://docs.mongodb.com/manual/reference/program/mongoexport/">MongoDB Tab Separated Values (TSV) <code>mongoexport</code></a> |
| 78 | + command. |
| 79 | + </td> |
| 80 | + <td>1.7</td> |
| 81 | + </tr> |
| 82 | + <tr> |
| 83 | + <td><a href="org/apache/commons/csv/CSVFormat.html#MYSQL">MYSQL</a></td> |
| 84 | + <td>IO for the <a href="https://dev.mysql.com/doc/refman/8.0/en/mysqldump-delimited-text.html">MySQL CSV</a> format. |
| 85 | + </td> |
| 86 | + <td>1.0</td> |
| 87 | + </tr> |
| 88 | + <tr> |
| 89 | + <td><a href="org/apache/commons/csv/CSVFormat.html#ORACLE">ORACLE</a></td> |
| 90 | + <td>IO for the <a href="https://docs.oracle.com/database/121/SUTIL/GUID-D1762699-8154-40F6-90DE-EFB8EB6A9AB0.htm#SUTIL4217">Oracle CSV</a> format |
| 91 | + of the SQL*Loader utility. |
| 92 | + </td> |
| 93 | + <td>1.6</td> |
| 94 | + </tr> |
| 95 | + <tr> |
| 96 | + <td><a href="org/apache/commons/csv/CSVFormat.html#POSTGRESQL_CSV">POSTGRESQL_CSV</a></td> |
| 97 | + <td>IO for the <a href="https://www.postgresql.org/docs/current/static/sql-copy.html">PostgreSQL CSV</a> format used by the <code>COPY</code> |
| 98 | + operation. |
| 99 | + </td> |
| 100 | + <td>1.5</td> |
| 101 | + </tr> |
| 102 | + <tr> |
| 103 | + <td><a href="org/apache/commons/csv/CSVFormat.html#POSTGRESQL_TEXT">POSTGRESQL_TEXT</a></td> |
| 104 | + <td>IO for the <a href="https://www.postgresql.org/docs/current/static/sql-copy.html">PostgreSQL Text</a> format used by the <code>COPY</code> |
| 105 | + operation. |
| 106 | + </td> |
| 107 | + <td>1.5</td> |
| 108 | + </tr> |
| 109 | + <tr> |
| 110 | + <td><a href="org/apache/commons/csv/CSVFormat.html#RFC4180">RFC4180</a></td> |
| 111 | + <td>IO for the RFC-4180 format defined by <a href="https://datatracker.ietf.org/doc/html/rfc4180">RFC 4180</a>. |
| 112 | + </td> |
| 113 | + <td>1.0</td> |
| 114 | + </tr> |
| 115 | + <tr> |
| 116 | + <td><a href="org/apache/commons/csv/CSVFormat.html#TDF">TDF</a></td> |
| 117 | + <td>IO for the <a href="https://en.wikipedia.org/wiki/Tab-separated_values">Tab Delimited Format</a> (also known as Tab Separated Values). |
| 118 | + </td> |
| 119 | + <td>1.0</td> |
| 120 | + </tr> |
| 121 | + </tbody> |
| 122 | + </table> |
| 123 | + <p>Custom formats can be created using a fluent style API.</p> |
| 124 | + </section> |
| 125 | + <section> |
| 126 | + <h1>Parsing Standard CSV Files</h1> |
| 127 | + <p> |
| 128 | + Parsing files with Apache Commons CSV is relatively straightforward. Pick a |
| 129 | + <code>CSVFormat</code> |
| 130 | + and go from there. |
| 131 | + </p> |
| 132 | + <section> |
| 133 | + <h2>Parsing an Excel CSV File</h2> |
| 134 | + <p>To parse an Excel CSV file, write:</p> |
| 135 | + <pre> |
| 136 | + <code> |
| 137 | +Reader in = new FileReader("path/to/file.csv"); |
| 138 | +Iterable&lt;CSVRecord&gt; records = CSVFormat.EXCEL.parse(in); |
| 139 | +for (CSVRecord record : records) { |
| 140 | + String lastName = record.get("Last Name"); |
| 141 | + String firstName = record.get("First Name"); |
| 142 | +} |
| 143 | + </code> |
| 144 | + </pre> |
| 145 | + </section> |
| 146 | + </section> |
| 147 | + <section> |
| 148 | + <h1>Parsing Custom CSV Files</h1> |
| 149 | + <p> |
| 150 | + You can define your own IO rules by building your own <code>CSVFormat</code> instance. Calling |
| 151 | + <code>builder()</code> |
| 152 | + on a predefined format lets you start from that format and customize it. For example: |
| 153 | + </p> |
| 154 | + <pre> |
| 155 | + <code> |
| 156 | +CSVFormat myFormat = CSVFormat.DEFAULT.builder() |
| 157 | + .setCommentMarker('#') |
| 158 | + .setEscape('+') |
| 159 | + .setIgnoreSurroundingSpaces(true) |
| 160 | + .setQuote('"') |
| 161 | + .setQuoteMode(QuoteMode.ALL) |
| 162 | + .get(); |
| 163 | + </code> |
| 164 | + </pre> |
| 165 | + </section> |
| 166 | + <section> |
| 167 | + <h1>Handling Byte Order Marks</h1> |
| 168 | + <p> |
| 169 | + To handle files that start with a Byte Order Mark (BOM), like some Excel CSV files, you need an extra step to deal with the optional BOM bytes. Using the |
| 170 | + <a href="https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html"> BOMInputStream </a> class from <a |
| 171 | + href="https://commons.apache.org/proper/commons-io/">Apache Commons IO</a> simplifies this task; for example: |
| 172 | + </p> |
| 173 | + <pre> |
| 174 | + <code> |
| 175 | +try (Reader reader = new InputStreamReader(BOMInputStream.builder() |
| 176 | + .setPath(path) |
| 177 | + .get(), StandardCharsets.UTF_8); |
| 178 | + CSVParser parser = CSVFormat.EXCEL.builder() |
| 179 | + .setHeader() |
| 180 | + .get() |
| 181 | + .parse(reader)) { |
| 182 | + for (CSVRecord record : parser) { |
| 183 | + String string = record.get("ColumnA"); |
| 184 | + // ... |
| 185 | + } |
| 186 | +} |
| 187 | + </code> |
| 188 | + </pre> |
| 189 | + <p>You might find it handy to create something like this:</p> |
| 190 | + <pre> |
| 191 | + <code> |
| 192 | +/** |
| 193 | + * Creates a reader capable of handling BOMs. |
| 194 | + * |
| 195 | + * @param path The path to read. |
| 196 | + * @return a new InputStreamReader for UTF-8 bytes. |
| 197 | + * @throws IOException if an I/O error occurs. |
| 198 | + */ |
| 199 | +public InputStreamReader newReader(final Path path) throws IOException { |
| 200 | + return new InputStreamReader(BOMInputStream.builder() |
| 201 | + .setPath(path) |
| 202 | + .get(), StandardCharsets.UTF_8); |
| 203 | +} |
| 204 | + </code> |
| 205 | + </pre> |
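<p>To make the "optional BOM bytes" concrete, here is a minimal sketch that uses only the JDK rather than Commons IO. It assumes UTF-8 input and checks only for the UTF-8 BOM (bytes EF BB BF). The names <code>Utf8BomSketch</code>, <code>skipUtf8Bom</code>, and <code>decode</code> are hypothetical and exist only for illustration. <code>BOMInputStream</code> covers more encodings and is the simpler route described above.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class Utf8BomSketch {

    // The UTF-8 encoding of U+FEFF, which some tools write at the start of a file.
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present (hypothetical helper). */
    public static InputStream skipUtf8Bom(final InputStream in) throws IOException {
        final PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        final byte[] head = new byte[UTF8_BOM.length];
        int n = 0;
        // Read up to three bytes; short inputs simply yield fewer bytes.
        while (n < head.length) {
            final int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == UTF8_BOM.length;
        for (int i = 0; isBom && i < UTF8_BOM.length; i++) {
            isBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }
        // No BOM: push the bytes back so the caller sees the stream unchanged.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    /** Decodes the bytes as UTF-8 after skipping an optional BOM (hypothetical helper). */
    public static String decode(final byte[] bytes) throws IOException {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(bytes))) {
            final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                buffer.write(b);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

<p>Without this step, the BOM would be decoded as part of the first header name, so a lookup such as <code>record.get("ColumnA")</code> could fail to find the column.</p>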
| 206 | + </section> |
| 207 | + <section> |
| 208 | + <h1>Using Headers</h1> |
| 209 | + <p> |
| 210 | + Apache Commons CSV provides several ways to access record values. The simplest way is to access values by their index in the record. However, columns in |
| 211 | + CSV files often have a name, for example: ID, CustomerNo, Birthday, etc. The <code>CSVFormat</code> class provides an API for specifying these <i>header</i> names, and |
| 212 | + <code>CSVRecord</code> has methods to access values by their corresponding header name. |
| 213 | + </p> |
| 214 | + <section> |
| 215 | + <h2>Accessing column values by index</h2> |
| 216 | + <p>To access a record value by index, no special configuration of the CSVFormat is necessary:</p> |
| 217 | + <pre> |
| 218 | + <code> |
| 219 | +Reader in = new FileReader("path/to/file.csv"); |
| 220 | +Iterable&lt;CSVRecord&gt; records = CSVFormat.RFC4180.parse(in); |
| 221 | +for (CSVRecord record : records) { |
| 222 | + String columnOne = record.get(0); |
| 223 | + String columnTwo = record.get(1); |
| 224 | +} |
| 225 | + </code> |
| 226 | + </pre> |
| 227 | + </section> |
| 228 | + <section> |
| 229 | + <h2>Defining a header manually</h2> |
| 230 | + <p>Indices may not be the most intuitive way to access record values. For this reason it is possible to assign names to each column in the file:</p> |
| 231 | + <pre> |
| 232 | + <code> |
| 233 | +Reader in = new FileReader("path/to/file.csv"); |
| 234 | +Iterable&lt;CSVRecord&gt; records = CSVFormat.RFC4180.builder() |
| 235 | + .setHeader("ID", "CustomerNo", "Name") |
| 236 | + .get() |
| 237 | + .parse(in); |
| 238 | +for (CSVRecord record : records) { |
| 239 | + String id = record.get("ID"); |
| 240 | + String customerNo = record.get("CustomerNo"); |
| 241 | + String name = record.get("Name"); |
| 242 | +} |
| 243 | + </code> |
| 244 | + </pre> |
| 245 | + <p>Note that column values can still be accessed using their index.</p> |
| 246 | + </section> |
| 247 | + <section> |
| 248 | + <h2>Using an enum to define a header</h2> |
| 249 | + <p>Using String values all over the code to reference columns can be error prone. For this reason, it is possible to define an enum to specify header |
| 250 | + names. Note that the enum constant names are used to access column values. This may lead to enum constant names which do not follow the Java coding |
| 251 | + standard of defining constants in upper case with underscores:</p> |
| 252 | + <pre> |
| 253 | + <code> |
| 254 | +public enum Headers { |
| 255 | + ID, CustomerNo, Name |
| 256 | +} |
| 257 | +Reader in = new FileReader("path/to/file.csv"); |
| 258 | +Iterable&lt;CSVRecord&gt; records = CSVFormat.RFC4180.builder() |
| 259 | + .setHeader(Headers.class) |
| 260 | + .get() |
| 261 | + .parse(in); |
| 262 | +for (CSVRecord record : records) { |
| 263 | + String id = record.get(Headers.ID); |
| 264 | + String customerNo = record.get(Headers.CustomerNo); |
| 265 | + String name = record.get(Headers.Name); |
| 266 | +} |
| 267 | + </code> |
| 268 | + </pre> |
| 269 | + <p>Again, it is possible to access values by their index or by using a String (for example, "CustomerNo").</p> |
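<p>The link between enum constant names and header names can be illustrated with a small JDK-only sketch. This is not the library's implementation; <code>EnumHeaderSketch</code> and <code>headerNames</code> are hypothetical names, and the sketch assumes the mapping is based on <code>Enum.name()</code>, as the text above describes.</p>

```java
public class EnumHeaderSketch {

    // Constant names deliberately match the column names in the CSV file.
    public enum Headers {
        ID, CustomerNo, Name
    }

    /** Derives header names from the constant names of an enum class (hypothetical helper). */
    public static String[] headerNames(final Class<? extends Enum<?>> enumClass) {
        final Enum<?>[] constants = enumClass.getEnumConstants();
        final String[] names = new String[constants.length];
        for (int i = 0; i < constants.length; i++) {
            // name() returns the identifier exactly as declared, for example "CustomerNo".
            names[i] = constants[i].name();
        }
        return names;
    }

    public static void main(final String[] args) {
        System.out.println(String.join(",", headerNames(Headers.class)));
    }
}
```

<p>Under that assumption, a constant renamed to <code>CUSTOMER_NO</code> would no longer match a column called <code>CustomerNo</code>, which is why the constant names here depart from the usual upper-case convention.</p>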
| 270 | + </section> |
| 271 | + <section> |
| 272 | + <h2>Header auto detection</h2> |
| 273 | + <p>Some CSV files define header names in their first record. If configured, Apache Commons CSV can parse the header names from the first record:</p> |
| 274 | + <pre> |
| 275 | + <code> |
| 276 | +Reader in = new FileReader("path/to/file.csv"); |
| 277 | +Iterable&lt;CSVRecord&gt; records = CSVFormat.RFC4180.builder() |
| 278 | + .setHeader() |
| 279 | + .setSkipHeaderRecord(true) |
| 280 | + .get() |
| 281 | + .parse(in); |
| 282 | +for (CSVRecord record : records) { |
| 283 | + String id = record.get("ID"); |
| 284 | + String customerNo = record.get("CustomerNo"); |
| 285 | + String name = record.get("Name"); |
| 286 | +} |
| 287 | + </code> |
| 288 | + </pre> |
| 289 | + <p>This will use the values from the first record as header names and skip the first record when iterating.</p> |
| 290 | + </section> |
| 291 | + <section> |
| 292 | + <h2>Printing with headers</h2> |
| 293 | + <p>To print a CSV file with headers, you specify the headers in the format:</p> |
| 294 | + <pre> |
| 295 | + <code> |
| 296 | +Appendable out = ...; |
| 297 | +CSVPrinter printer = CSVFormat.DEFAULT.builder() |
| 298 | + .setHeader("H1", "H2") |
| 299 | + .get() |
| 300 | + .print(out); |
| 301 | + </code> |
| 302 | + </pre> |
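<p>As a fuller sketch of the same idea, the following prints a header and two records into a <code>StringBuilder</code>. It assumes Commons CSV 1.13.0 or later on the classpath (for <code>CSVFormat.Builder.get()</code>); the class name <code>PrintWithHeaders</code>, the method <code>render</code>, and the sample values are made up for illustration.</p>

```java
import java.io.IOException;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;

public class PrintWithHeaders {

    /** Prints a header line and two records, returning the resulting CSV text (hypothetical helper). */
    public static String render() throws IOException {
        final StringBuilder out = new StringBuilder();
        // The header is written when the printer is created.
        try (CSVPrinter printer = CSVFormat.DEFAULT.builder()
                .setHeader("H1", "H2")
                .get()
                .print(out)) {
            printer.printRecord("a", "b");
            // A value containing the delimiter is quoted by the default format.
            printer.printRecord("c", "x,y");
        }
        return out.toString();
    }

    public static void main(final String[] args) throws IOException {
        System.out.print(render());
    }
}
```

<p>With <code>CSVFormat.DEFAULT</code>, records are separated by CRLF, so the output has three lines: the header, then one line per <code>printRecord</code> call.</p>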
| 303 | + <p>To print a CSV file with JDBC column labels, you specify the ResultSet in the format:</p> |
| 304 | + <pre> |
| 305 | + <code> |
| 306 | +try (ResultSet resultSet = ...) { |
| 307 | + CSVPrinter printer = CSVFormat.DEFAULT.builder() |
| 308 | + .setHeader(resultSet) |
| 309 | + .get() |
| 310 | + .print(out); |
| 311 | +} |
| 312 | + </code> |
| 313 | + </pre> |
| 314 | + </section> |
| 315 | + </section> |
| 316 | +</body> |
| 317 | +</html> |