less than 1 minute read

When you are dealing with opendata or any sort of dataset, you should reprocess your dataset including cleansing. It can be separating or merging columns or deleting some miss-placed data. CSV and TSV are the most well-known formats but what if you have to deal with xls and xlsx?

For sake of ease formatting, let’s convert them into csv (with a package :))

  • The dev environment is Ubuntu.
$ sudo apt-get install gnumeric

First of all, install gnumeric first. It will provide us a fancy tool, ssconvert. ssconvert or SpreadSheetconvert, is a converting tool for spreadsheet data.

Next,

$ ssconvert something.xlsx something.csv

Just a piece of cake.

If you want to know what types of format supported, try this. Or just checkout the manual page.

$ ssconvert --list-importers
$ ssconvert --list-exporters

Great job. But… what if your files crashed because of bad unicodes..

Try this.

# ex) $ iconv -f "from this encode" -t "to this encode" filename.csv > filename.csv

$ iconv -f euc-kr -t utf-8 2015_broken_encode_set.csv > 2015_dataset.csv

Happy Hacking!

Leave a comment