Converting all text files with multiple encodings in a directory into UTF-8-encoded text files
09:17 30 Nov 2020

I am new to Python and to coding in general, so any help is greatly appreciated.

I have more than 3000 text files with multiple encodings in a single directory, and I need to convert them to a single encoding (e.g. UTF-8) for further NLP work. When I checked the file types with the shell's file utility, I identified the following:

Algol 68 source text, ISO-8859 text, with very long lines
Algol 68 source text, Little-endian UTF-16 Unicode text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
ASCII text
ASCII text, with very long lines
data
diff output text, ASCII text
ISO-8859 text, with very long lines
ISO-8859 text, with very long lines, with LF, NEL line terminators
Little-endian UTF-16 Unicode text, with very long lines
Non-ISO extended-ASCII text
Non-ISO extended-ASCII text, with very long lines
Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
UTF-8 Unicode (with BOM) text, with CRLF line terminators
UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators
UTF-8 Unicode text, with very long lines, with CRLF line terminators
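As a first step, I was thinking of guessing each file's encoding from Python with the chardet library, roughly like the sketch below. This is only an idea, not tested code: chardet's labels are statistical guesses that won't match the file utility's output one-to-one, and the directory name "texts" is a placeholder for my actual directory.

import chardet
from collections import Counter
from pathlib import Path

# Tally chardet's best guess for every file in the directory
# ("texts" is a placeholder for the real path).
counts = Counter()
for path in Path("texts").iterdir():
    if path.is_file():
        guess = chardet.detect(path.read_bytes())
        counts[guess["encoding"]] += 1  # "encoding" may be None for undetectable files

print(counts)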

Any ideas on how to convert text files with the above-mentioned encodings into UTF-8-encoded text files?
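My rough plan, sketched below, would be to detect each file's encoding, decode the raw bytes with it, and write the text back out as UTF-8. The fallback to latin-1 is my own assumption for files chardet cannot classify (such as the ones the file utility reports as just "data"), since latin-1 can decode any byte sequence; the directory names are again placeholders. Is this a reasonable approach, or is there a more robust way?

import chardet
from pathlib import Path

src_dir = Path("texts")        # placeholder: directory with the mixed-encoding files
dst_dir = Path("texts_utf8")   # placeholder: where the UTF-8 copies go
dst_dir.mkdir(exist_ok=True)

for path in src_dir.iterdir():
    if not path.is_file():
        continue
    raw = path.read_bytes()
    # chardet returns its best guess plus a confidence score;
    # "encoding" can be None when it has no idea.
    guess = chardet.detect(raw)
    encoding = guess["encoding"] or "latin-1"
    try:
        text = raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # latin-1 maps every byte to a character, so this never fails,
        # though it may mangle characters if the guess was wrong.
        text = raw.decode("latin-1")
    (dst_dir / path.name).write_text(text, encoding="utf-8")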

python encoding utf-8