Translate

Thursday, March 23, 2017

The Programmer's Way to Convert Excel to CSV (With UTF-8)

Excel provides a way to save spreadsheets as a CSV file, but it seems to fail at handling UTF-8 characters. See how a real programmer deals with the problem.


· Java Zone

CSV stands for Comma-Separated-Values and is a very common format used for exchanging data between diverse applications. While the Excel Spreadsheet file format is complex (since it has to accommodate a lot more!), CSV is a simpler format representing just tabular data.
In this article, we show you a way of exporting the data from an Excel spreadsheet to CSV. We use the Apache POI library for the purpose.

But Why?

Excel directly provides for exporting CSV data using the Save As functionality in the File menu. While it gets the job done in most cases, it leaves something to be desired.
Specifically, if your spreadsheet contains Unicode data, you are out of luck. It appears that Excel uses Windows default character set Windows-1252 (or cp-1252) to perform the export. This character set is a very limited set and cannot represent characters from most foreign languages. This leaves your CSV output from Excel severely broken if it contains such characters.
The proper solution is to use Unicode (specifically UTF-8) for encoding the CSV file so all your data is preserved.
It is indeed a surprise that the Excel team has not yet figured out in 2017 that people may have Unicode data that need to be exported properly to CSV.
Maybe the Excel developers have not read this article by Joel Spolsky stressing the need for programmers to be aware of Unicode. Ironically, Joel was (until 1995?) the project manager for the Excel team. Joel penned the article in 2003, long after leaving the Excel group.

Using Apache POI

We use Apache POI to read the Excel spreadsheet. Building the program requires this dependency to be declared in pom.xml (assuming Maven for the build) as follows:
<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-ooxml</artifactId>
  <version>${poi.version}</version>
</dependency>

Take 1 – The Naive Approach

The first cut of the approach to convert Excel spreadsheet data to CSV is shown in the block below.
Workbook wb = new XSSFWorkbook(new File(xlsxFile));
DataFormatter formatter = new DataFormatter();
PrintStream out = new PrintStream(new FileOutputStream(csvFile),
                                  true, "UTF-8");
for (Sheet sheet : wb) {
    for (Row row : sheet) {
        boolean firstCell = true;
        for (Cell cell : row) {
            if ( ! firstCell ) out.print(',');
            String text = formatter.formatCellValue(cell);
            out.print(text);
            firstCell = false;
        }
        out.println();
    }
}

Use the UTF-8 BOM

There are several problems with the code above. While the code above correctly outputs UTF-8 and encodes characters properly, Excel cannot load the generated CSV file. The reason is that Excel needs the Byte-Order-Marker to indicate that the file is encoded in UTF-8. With that modification, the code is now:
PrintStream out = new PrintStream(new FileOutputStream(csvFile),
                                  true, "UTF-8");
byte[] bom = {(byte)0xEF, (byte)0xBB, (byte)0xBF};
out.write(bom);
for (Sheet sheet : wb) {
...

Exporting Formulas

When Excel exports CSV, it evaluates all the formulas and writes out the data. In the code below, we have an option of exporting the formula itself (starting with a "=") to CSV, or evaluating the formula and exporting the result.
Workbook wb = new XSSFWorkbook(new File(xlsxFile));
FormulaEvaluator fe = wb.getCreationHelper().createFormulaEvaluator();
...
if ( fe != null ) cell = fe.evaluateInCell(cell);
String value = formatter.formatCellValue(cell);
if ( cell.getCellTypeEnum() == CellType.FORMULA ) value = "=" + value;
...

Escape Quotes and Commas

When exporting a field value, it is necessary to properly escape certain characters such as double quotes, commas, and line separators. This is handled as follows:
static private Pattern rxquote = Pattern.compile("\"");
static private String encodeValue(String value) {
    boolean needQuotes = false;
    if ( value.indexOf(',') != -1 || value.indexOf('"') != -1 ||
         value.indexOf('\n') != -1 || value.indexOf('\r') != -1 )
        needQuotes = true;
    Matcher m = rxquote.matcher(value);
    if ( m.find() ) needQuotes = true; value = m.replaceAll("\"\"");
    if ( needQuotes ) return "\"" + value + "\"";
    else return value;
}

Including Empty Rows and Cells

When a formula is exported as-is, it needs the cell references to remain intact. We achieve this by exporting empty rows and cells to CSV to maintain these references.
To ensure that empty rows and cells are also output to CSV, we use row and cell numbers explicitly since the for-each loop shown above skips empty rows and cells.
Sheet sheet = wb.getSheetAt(sheetNo);
for (int r = 0, rn = sheet.getLastRowNum() ; r <= rn ; r++) {
    Row row = sheet.getRow(r);
    if ( row == null ) { out.println(','); continue; }
    for (int c = 0, cn = row.getLastCellNum() ; c < cn ; c++) {
        Cell cell = row.getCell(c,Row.MissingCellPolicy.RETURN_BLANK_AS_NULL);
    }
}

Export Specified Sheet Only

If the spreadsheet contains multiple sheets, a sheet number must be explicitly specified for export.
Sheet sheet = wb.getSheetAt(sheetNo);

The Final Cut

The core exporting program segment with all these updates now looks like this.
Workbook wb = new XSSFWorkbook(new File(xlsxFile));
int sheetNo = Integer.parseInt(args[index++]);
FormulaEvaluator fe = null;
if ( index < args.length ) {
    fe = wb.getCreationHelper().createFormulaEvaluator();
}
DataFormatter formatter = new DataFormatter();
PrintStream out = new PrintStream(new FileOutputStream(csvFile),
                                  true, "UTF-8");
byte[] bom = {(byte)0xEF, (byte)0xBB, (byte)0xBF};
out.write(bom);
{
    Sheet sheet = wb.getSheetAt(sheetNo);
    for (int r = 0, rn = sheet.getLastRowNum() ; r <= rn ; r++) {
        Row row = sheet.getRow(r);
        if ( row == null ) { out.println(','); continue; }
        boolean firstCell = true;
        for (int c = 0, cn = row.getLastCellNum() ; c < cn ; c++) {
            Cell cell = row.getCell(c, Row.MissingCellPolicy.RETURN_BLANK_AS_NULL);
            if ( ! firstCell ) out.print(',');
            if ( cell != null ) {
                if ( fe != null ) cell = fe.evaluateInCell(cell);
                String value = formatter.formatCellValue(cell);
                if ( cell.getCellTypeEnum() == CellType.FORMULA ) {
                    value = "=" + value;
                }
                out.print(encodeValue(value));
            }
            firstCell = false;
        }
        out.println();
    }
}

Summary

Exporting data in Excel spreadsheet to CSV is quite simple; it is also provided directly in Excel. However exporting Unicode data within the spreadsheet is not done correctly by Excel. In this article, we covered some issues arising from this process including exporting formulas, maintaining cell references and proper escaping of data.

Source: DZone The Programmer's Way to Convert Excel to CSV (With UTF-8)