Fixing the weird characters in your files

Published 16 May 2021

I was looking for a dataset of all the neighborhoods in Puerto Rico, and I was really excited to see that Puerto Rico has an Open Data Portal with a bunch of datasets available for public use. Including one with the information I needed! But this is where my excitement ended.

Side Note: If you are also interested in such a dataset, I published a version of it in CSV and JSON as a gist.

The problem

I found that some of the rows had some weird characters in some columns, and upon closer inspection I found a pattern. Words like "Añasco" appeared as "AÃ±asco" or "BayamÃ³n" instead of "Bayamón". In Spanish we use letters with diacritics (i.e. á, é, í, ó, ú, ü, ñ, etc.) to determine things like word stress, so I wanted them back!

Import-Csv "data.csv" -Encoding:UTF8

municipio: ManatÃ
#...
municipio: AÃ±asco
# ...
municipio: BayamÃ³n
# ...

After some light research I found that the problem probably was from trying to encode Latin (aka Latin1 or ISO-8859-1) content to UTF8. So I tried every single encoding available via the cmdlet and they all returned an undesirable output. I then thought "Ok, I'll do the encoding myself." and there I went trying to encode every value of every column of every row. This didn't take me anywhere either.

The Solution

I was defeated and all the answers online were complex (hence why I'm writing this). So as a last, and desperate, attempt I tried piping different commands together to see if some combination of them would solve all my problems. And that's when I accidentally stumbled into this:

$path = "data.csv"
Import-Csv $path -Encoding:UTF8 `
| Export-Csv $path -Encoding:Default

# OR (if your data is not in csv)

Get-Content $path -Encoding:UTF8 `
| Set-Content $path -Encoding:Default

That was it, that fixed all the encoding issues. So, if you're ever in a similar situation, try this out!