Lately I’ve grown tired of converting each file in my project from Shift-JIS to UTF-8 by hand, so I wanted a script to do the job automatically. After searching the Internet, I found that it’s easy to do with Python.
Python (version 2.6 at least) includes a library called codecs, and all we have to do is use it to read and write files in different encodings.
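To make the idea concrete, here is a minimal in-memory sketch of the decode-then-encode step using codecs. It is written in Python 3 syntax, and the helper name sjis_to_utf8 is just an illustrative choice, not something from the script in this post.

```python
import codecs

def sjis_to_utf8(data):
    """Decode Shift-JIS bytes to text, then encode that text as UTF-8 bytes."""
    text = codecs.decode(data, "shift_jis")   # bytes -> str
    return codecs.encode(text, "utf-8")       # str -> bytes

# Build some Shift-JIS bytes to convert (sample text chosen for illustration).
sample = codecs.encode("日本語", "shift_jis")
converted = sjis_to_utf8(sample)
print(converted == "日本語".encode("utf-8"))
```

Transcoding a file is the same two steps applied to the file's bytes: read and decode with the source encoding, then encode and write with the target encoding.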
Here is the code that converts every matching file, including files in sub-folders, from Shift-JIS to UTF-8. It starts from the folder that contains the script:
#!/usr/bin/python
import sys, os, codecs, re

# Created by Liang Sun on March, 5, 2011
# Transcode all files under the script's folder from
# shift-jis encoding to utf-8 encoding

errList = []

def transcode(infile, outfile, incoding="shift-jis", outcoding="utf-8"):
    print "infile = " + infile
    print "outfile = " + outfile
    fin = codecs.open(infile, "rb", incoding)
    fout = codecs.open(outfile, "wb", outcoding)
    try:
        fout.write(fin.read())
    except UnicodeError:
        # Decoding failed: restore the original bytes unchanged
        errList.append(outfile)
        print "!!!" + outfile + " is not encoded in shift-jis."
        fout.close()
        fi = open(infile, "rb")
        fo = open(outfile, "wb")
        fo.write(fi.read())
        fo.close()
        fi.close()
    fin.close()
    fout.close()

path = os.path.abspath(os.path.dirname(sys.argv[0]))
print "Current Path: " + path

for dirpath, dirs, files in os.walk(path):
    for filename in files:
        # Escape the dot so that only real file extensions match
        if re.search(r"\.(h|m|mm|cpp|inl|def|txt)$", filename):
            print "----" + filename + "...."
            # Back up the original, then transcode the backup over it
            fi = open(os.path.join(dirpath, filename), "rb")
            fo = open(os.path.join(dirpath, filename + ".bk"), "wb")
            fo.write(fi.read())
            fo.close()
            fi.close()
            transcode(os.path.join(dirpath, filename + ".bk"),
                      os.path.join(dirpath, filename))
            print "Done."
            os.remove(os.path.join(dirpath, filename + ".bk"))

if errList:
    print "--------------------------------------------------------"
    print "These files are not encoded in shift-jis:"
    for err in errList:
        print "\t" + err
    print "--------------------------------------------------------"
else:
    print
    print "All files have been translated successfully."
    print

print "Created for you by Liang Sun on March, 5, 2011."
raw_input()
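The script above targets Python 2 (print statements, raw_input). For readers on Python 3, the same walk-and-transcode idea could be sketched roughly as follows. The function name convert_tree and the EXTENSIONS tuple are hypothetical names chosen for this example. Instead of writing a .bk backup, this variant reads each file's bytes into memory and only rewrites the file when decoding succeeds.

```python
import os

# Hypothetical default; mirrors the extensions matched by the script above.
EXTENSIONS = (".h", ".m", ".mm", ".cpp", ".inl", ".def", ".txt")

def convert_tree(root, extensions=EXTENSIONS, src="shift_jis", dst="utf-8"):
    """Transcode matching files under root in place.

    Returns the paths that could not be decoded with the source encoding.
    """
    failed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                raw = f.read()
            try:
                text = raw.decode(src)
            except UnicodeDecodeError:
                failed.append(path)  # leave the file untouched
                continue
            with open(path, "wb") as f:
                f.write(text.encode(dst))
    return failed
```

Calling convert_tree(".") would convert matching files under the current folder and return the ones it skipped. Because nothing is written until decoding has succeeded, a file in some other encoding stays as it was.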
Thank you very much. I used to write HTML in Shift_JIS and explicitly specify Shift_JIS as the character set. This works OK in Japan, but a friend of mine in the USA (who is learning Japanese) told me that the characters on my homepage do not look right. Maybe Shift_JIS is not supported in his web browser. So I decided to switch from Shift_JIS to UTF-8. I slightly modified the Python script (to include .html in the file name extensions) and successfully converted all my HTML files. This is good.
I’m glad to hear that it’s helpful.
This looks like a great script, but I have a question. I have mp3 files whose titles or ID3 tags are a mixture of UTF-8 and Shift-JIS. What I’d like to do is read each tag and the file name, check whether it’s Shift-JIS, and if so convert it to UTF-8.
Can this be done using the codecs module? It looks like it based on this code.
That sounds like an interesting job. Although I know little about the mp3 format, I think the codecs module can be used to do this.
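As a rough sketch of the "check, then convert" step only, one possible heuristic is to try UTF-8 first and fall back to Shift-JIS when that fails. This example (Python 3) assumes the tag or file name is already available as raw bytes. Reading ID3 tags out of an mp3 file would need a separate tag library and is not shown. The names decode_guess and to_utf8 are made up for this example.

```python
def decode_guess(raw):
    """Decode bytes that are expected to be either UTF-8 or Shift-JIS.

    Heuristic: strict UTF-8 decoding rejects most Shift-JIS text, so try it
    first and fall back to Shift-JIS. Raises UnicodeDecodeError if the bytes
    are valid in neither encoding.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("shift_jis")

def to_utf8(raw):
    """Return the same text re-encoded as UTF-8 bytes."""
    return decode_guess(raw).encode("utf-8")
```

This is only a guess based on byte patterns. Plain ASCII is identical in both encodings, and some short Shift-JIS sequences can also happen to be valid UTF-8, so results are worth spot-checking on real tags.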