I was looking for a dataset of all the neighborhoods in Puerto Rico, and I was really excited to see that Puerto Rico has an Open Data Portal with a bunch of datasets available for public use, including one with the information I needed! But that's where my excitement ended.
Side Note: If you are also interested in such a dataset, I published a version of it in CSV and JSON as a gist.
The Problem
I found that some of the rows had weird characters in some columns, and upon closer inspection I found a pattern. Words like "Añasco" appeared as "AÃ±asco", and "Bayamón" as "BayamÃ³n". In Spanish we use letters with diacritics (i.e. á, é, í, ó, ú, ü, ñ, etc.) to indicate things like word stress, so I wanted them back!
Import-Csv "data.csv" -Encoding:UTF8
municipio: ManatÃ
# ...
municipio: AÃ±asco
# ...
municipio: BayamÃ³n
# ...
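If you want to see how widespread the damage is, a quick filter like this one helps. It's just a sketch: it assumes a stray "Ã" never appears in a legitimate value, and that the column is named municipio, as above.
Import-Csv "data.csv" -Encoding:UTF8 `
| Where-Object { $_.municipio -match 'Ã' }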
After some light research, I found that the problem probably came from content that was already UTF8 being treated as Latin-1 (aka Latin1 or ISO-8859-1) and encoded to UTF8 a second time. So I tried every single encoding available via the cmdlet, and every one of them returned undesirable output. I then thought, "Ok, I'll do the encoding myself", and off I went trying to re-encode every value of every column of every row. That didn't take me anywhere either.
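That theory is easy to reproduce with the .NET encoding classes: take a correctly encoded word, grab its UTF8 bytes, and decode those bytes as if they were Latin-1. This is only a sketch of the suspected round trip, not anything the portal documents:
$utf8   = [System.Text.Encoding]::UTF8
$latin1 = [System.Text.Encoding]::GetEncoding("ISO-8859-1")

# UTF-8 bytes of "Añasco" read back as if they were Latin-1
$latin1.GetString($utf8.GetBytes("Añasco"))
# AÃ±asco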
The Solution
I was defeated, and all the answers I found online were complex (which is why I'm writing this). So, as a last and desperate attempt, I tried piping different commands together to see if some combination of them would solve my problem. And that's when I accidentally stumbled onto this:
$path = "data.csv"

# The parentheses force the whole file to be read before it is
# rewritten in place, and -NoTypeInformation stops Export-Csv from
# prepending a "#TYPE ..." header line to the output.
(Import-Csv $path -Encoding:UTF8) `
| Export-Csv $path -Encoding:Default -NoTypeInformation

# OR (if your data is not in csv)
(Get-Content $path -Encoding:UTF8) `
| Set-Content $path -Encoding:Default
That was it; that one pipeline fixed all the encoding issues. So, if you're ever in a similar situation, try this out!
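In case you're curious why this works (my understanding, not official documentation): reading the file as UTF8 decodes the double-encoded bytes into the mojibake characters, and writing them back out as Default (the ANSI code page in Windows PowerShell 5.1, Windows-1252 on most machines, which matches Latin-1 for these characters) turns each mojibake character back into the single byte it came from. What lands on disk is plain, correctly encoded UTF8. Here's the same round trip on a single string with the .NET encoding classes, again just a sketch:
$latin1 = [System.Text.Encoding]::GetEncoding("ISO-8859-1")

# Mojibake -> Latin-1 bytes -> decoded as UTF-8 -> original word
[System.Text.Encoding]::UTF8.GetString($latin1.GetBytes("AÃ±asco"))
# Añasco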