Fixing the weird characters in your files

Published

I was looking for a dataset of all the neighborhoods in Puerto Rico, and I was really excited to see that Puerto Rico has an Open Data Portal with a bunch of datasets available for public use. Including one with the information I needed! But this is where my excitement ended.

Side Note: If you are also interested in such a dataset, I published a version of it in CSV and JSON as a gist.

The problem

I found that some of the rows had some weird characters in some columns, and upon closer inspection I found a pattern. Words like "Añasco" appeared as "Añasco" or "Bayamón" instead of "Bayamón". In Spanish we use letters with diacritics (i.e. á, é, í, ó, ú, ü, ñ, etc.) to determine things like word stress, so I wanted them back!

Import-Csv "data.csv" -Encoding:UTF8

municipio: Manatí
#...
municipio: Añasco
# ...
municipio: Bayamón
# ...

After some light research I found that the problem probably was from trying to encode Latin (aka Latin1 or ISO-8859-1) content to UTF8. So I tried every single encoding available via the cmdlet and they all returned an undesirable output. I then thought "Ok, I'll do the encoding myself." and there I went trying to encode every value of every column of every row. This didn't take me anywhere either.

The Solution

I was defeated and all the answers online were complex (hence why I'm writing this). So as a last, and desperate, attempt I tried piping different commands together to see if some combination of them would solve all my problems. And that's when I accidentally stumbled into this:

$path = "data.csv"
Import-Csv $path -Encoding:UTF8 `
| Export-Csv $path -Encoding:Default

# OR (if your data is not in csv)

Get-Content $path -Encoding:UTF8 `
| Set-Content $path -Encoding:Default

That was it, that fixed all the encoding issues. So, if you're ever in a similar situation, try this out!