Excel provides a way to save spreadsheets as a CSV file, but it seems to fail at handling UTF-8 characters. See how a real programmer deals with the problem.
CSV stands
for Comma-Separated-Values and is a very common format used for
exchanging data between diverse applications. While the Excel
Spreadsheet file format is complex (since it has to accommodate a lot
more!), CSV is a simpler format representing just tabular data.
In this article, we show you a way of exporting the data from
an Excel spreadsheet to CSV. We use the Apache POI library for the
purpose.
But Why?
Excel directly provides for exporting CSV data using the Save As functionality in the File menu. While it gets the job done in most cases, it leaves something to be desired.
Specifically, if your spreadsheet contains Unicode data, you
are out of luck. It appears that Excel uses Windows default character
set Windows-1252
(or cp-1252) to perform the export. This character set is a very
limited set and cannot represent characters from most foreign languages.
This leaves your CSV output from Excel severely broken if it contains
such characters.
The proper solution is to use Unicode (specifically UTF-8) for encoding the CSV file so all your data is preserved.
It is indeed a surprise that the Excel team has not yet
figured out in 2017 that people may have Unicode data that need to be
exported properly to CSV.
Maybe the Excel developers have not read this article by Joel Spolsky stressing
the need for programmers to be aware of Unicode. Ironically, Joel was
(until 1995?) the project manager for the Excel team. Joel penned the
article in 2003, long after leaving the Excel group.
Using Apache POI
We use Apache POI to read the Excel spreadsheet. Building the program requires this dependency to be declared in pom.xml (assuming Maven for the build) as follows:
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>${poi.version}</version>
</dependency>
Take 1 – The Naive Approach
The first cut of the approach to convert Excel spreadsheet data to CSV is shown in the block below.
Workbook wb = new XSSFWorkbook(new File(xlsxFile));
DataFormatter formatter = new DataFormatter();
PrintStream out = new PrintStream(new FileOutputStream(csvFile),
true, "UTF-8");
for (Sheet sheet : wb) {
for (Row row : sheet) {
boolean firstCell = true;
for (Cell cell : row) {
if ( ! firstCell ) out.print(',');
String text = formatter.formatCellValue(cell);
out.print(text);
firstCell = false;
}
out.println();
}
}
Use the UTF-8 BOM
There are several problems with the code above. While the
code above correctly outputs UTF-8 and encodes characters properly,
Excel cannot load the generated CSV file. The reason is that Excel needs
the Byte-Order-Marker to indicate that the file is encoded in UTF-8. With that modification, the code is now:
PrintStream out = new PrintStream(new FileOutputStream(csvFile),
true, "UTF-8");
byte[] bom = {(byte)0xEF, (byte)0xBB, (byte)0xBF};
out.write(bom);
for (Sheet sheet : wb) {
...
Exporting Formulas
When Excel exports CSV, it evaluates all the formulas and
writes out the data. In the code below, we have an option of exporting
the formula itself (starting with a "
=
") to CSV, or evaluating the formula and exporting the result.Workbook wb = new XSSFWorkbook(new File(xlsxFile));
FormulaEvaluator fe = wb.getCreationHelper().createFormulaEvaluator();
...
if ( fe != null ) cell = fe.evaluateInCell(cell);
String value = formatter.formatCellValue(cell);
if ( cell.getCellTypeEnum() == CellType.FORMULA ) value = "=" + value;
...
Escape Quotes and Commas
When exporting a field value, it is necessary to properly
escape certain characters such as double quotes, commas, and line
separators. This is handled as follows:
static private Pattern rxquote = Pattern.compile("\"");
static private String encodeValue(String value) {
boolean needQuotes = false;
if ( value.indexOf(',') != -1 || value.indexOf('"') != -1 ||
value.indexOf('\n') != -1 || value.indexOf('\r') != -1 )
needQuotes = true;
Matcher m = rxquote.matcher(value);
if ( m.find() ) needQuotes = true; value = m.replaceAll("\"\"");
if ( needQuotes ) return "\"" + value + "\"";
else return value;
}
Including Empty Rows and Cells
When a formula is exported as-is, it needs the cell
references to remain intact. We achieve this by exporting empty rows and
cells to CSV to maintain these references.
To ensure that empty rows and cells are also output to CSV,
we use row and cell numbers explicitly since the for-each loop shown
above skips empty rows and cells.
Sheet sheet = wb.getSheetAt(sheetNo);
for (int r = 0, rn = sheet.getLastRowNum() ; r <= rn ; r++) {
Row row = sheet.getRow(r);
if ( row == null ) { out.println(','); continue; }
for (int c = 0, cn = row.getLastCellNum() ; c < cn ; c++) {
Cell cell = row.getCell(c,Row.MissingCellPolicy.RETURN_BLANK_AS_NULL);
}
}
Export Specified Sheet Only
If the spreadsheet contains multiple sheets, a sheet number must be explicitly specified for export.
Sheet sheet = wb.getSheetAt(sheetNo);
The Final Cut
The core exporting program segment with all these updates now looks like this.
Workbook wb = new XSSFWorkbook(new File(xlsxFile));
int sheetNo = Integer.parseInt(args[index++]);
FormulaEvaluator fe = null;
if ( index < args.length ) {
fe = wb.getCreationHelper().createFormulaEvaluator();
}
DataFormatter formatter = new DataFormatter();
PrintStream out = new PrintStream(new FileOutputStream(csvFile),
true, "UTF-8");
byte[] bom = {(byte)0xEF, (byte)0xBB, (byte)0xBF};
out.write(bom);
{
Sheet sheet = wb.getSheetAt(sheetNo);
for (int r = 0, rn = sheet.getLastRowNum() ; r <= rn ; r++) {
Row row = sheet.getRow(r);
if ( row == null ) { out.println(','); continue; }
boolean firstCell = true;
for (int c = 0, cn = row.getLastCellNum() ; c < cn ; c++) {
Cell cell = row.getCell(c, Row.MissingCellPolicy.RETURN_BLANK_AS_NULL);
if ( ! firstCell ) out.print(',');
if ( cell != null ) {
if ( fe != null ) cell = fe.evaluateInCell(cell);
String value = formatter.formatCellValue(cell);
if ( cell.getCellTypeEnum() == CellType.FORMULA ) {
value = "=" + value;
}
out.print(encodeValue(value));
}
firstCell = false;
}
out.println();
}
}
Summary
Exporting data in Excel spreadsheet to CSV is quite simple;
it is also provided directly in Excel. However exporting Unicode data
within the spreadsheet is not done correctly by Excel. In this article,
we covered some issues arising from this process including exporting
formulas, maintaining cell references and proper escaping of data.