Writing a Unicode file via perl ...

06/07/2006

Several months ago someone filed bugs across
Windows Vista to make sure all performance monitoring .ini files were
Unicode, so the files could be properly localized ("translated") to
various languages (so we could have Korean, or Hindi descriptions). A noble goal to be sure.

For most people this was as easy as checking out the file for editing,
opening it in Notepad, doing a "Save As", and picking Unicode. ESE,
however, has A LOT of perf counters (esp. when you Squeaky Lobster a
machine - more on that later), so we use a perl script to generate
several parts of the performance monitoring files: the .ini, the .hxx,
and some fairly repetitive .cxx code that gets compiled into the ESE binaries ... I know some of you are saying,
"You can use perl on Windows?" ...

Anyway, I looked all over the internet and couldn't find help, even
when I scoped the search to Mr. Unicode's blog ... Then I posted to an internal
Microsoft alias on perl, and someone came to my rescue. Since he
said he didn't mind, and I couldn't find this on the internet (at least at
the time), I figured I'd post his comments ...

I included his comments wholesale in our perl code
(I had to read them twice to really grok how the ":raw"-style parts are
like a pipeline of text-conversion commands, but in reverse, so you read
them right to left):

#
# Some notes from someone smarter than me about Perl and Unicode ...
# ----
#
# Which encoding do you want to use? UTF16-LE is the standard on Windows (nearly
# all characters are encoded as 2 bytes), UTF8 is the standard everywhere else 
# (characters are variable length and all ASCII characters are a single byte).
#
# Here's what I've figured out after lots of experimentation. To get UTF16-LE 
# output you need to play a few games with perl...
#
#   open my $FH, ">:raw:encoding(UTF16-LE):crlf:utf8", "e:\\test.txt";
#   print $FH "\x{FEFF}";
#   print $FH "hello unicode world!\nThis is a test.\n";
#   close $FH;
#
# Reading the IO layers from right to left (the order that they will be applied
# as text passes from perl to the file) ...
#
# Apply the :utf8 layer first. This doesn't do much except tell perl that we're 
# going to pass "characters" to this file handle instead of bytes so that it 
# doesn't give us "Wide character in print ..." warnings.
#
# Next, apply the :crlf layer as text goes from perl out to the file. This 
# transforms \n (0x0A) into \r\n (0x0D 0x0A) giving you DOS line endings. Perl 
# normally applies this by default on Windows but it would do it at the wrong 
# stage of the pipeline so we removed it (see below), this is where it ought to 
# be.
#
# Next apply the UTF16-LE (little endian) encoding. This takes the characters
# and transforms them to that encoding. So 0x0A turns into 0x0A 0x00. Note that
# if you just say 'UTF16' the default endianness is big endian, which is
# backwards from how Windows likes it. However, because we're explicitly
# specifying the endianness, perl will not write a BOM (byte order mark) at the
# beginning of the file. We have to make up for that later.
#
# Finally, the :raw pseudo layer just removes the default (on Windows) :crlf
# layer that transforms \n into \r\n for DOS style line endings. This is
# necessary because otherwise it would be applied at the wrong place in the
# pipeline. Without this the encoding layer would turn 0x0A into 0x0A 0x00 and
# then the crlf layer would turn that into 0x0D 0x0A 0x00 and that's just goofy.
#
# Now that we've got the file opened with the right IO layers in place we can 
# almost write to it. First we need to manually write the BOM that will tell 
# readers of this file what endianness it is in. That's what the 
# print $FH "\x{FEFF}" does.
#
# Finally we can just print text out.
#
# If you want UTF8, I'm pretty sure it's a lot easier. This is also a lot
# easier on unix. The CRLF ordering problem is definitely a bug, but the default
# to big endian (and the ensuing games to get the BOM to output without a warning)
# are by design. I'm pretty sure that none of the core perl maintainers use perl
# on Windows (even though at least one keeps perl on VMS working...).
#

#
# Until Exchange decides it wants a Unicode eseperf.ini, we're going to generate
# the old ASCII one.  Also, if Exchange wants one, it will have to update its
# version of Perl to understand the open modes we're using below.  Currently we
# get this error:
#   1>Unknown open() mode '>:raw:encoding(UTF16-LE):crlf:utf8' at .\perfdata.pl line 325,  line 6189.
#


if ( $ESENT ) { #ifdef ESENT

    # Unicode (UTF-16LE) output; see the notes above for the layer order
    open( INIFILE, ">:raw:encoding(UTF16-LE):crlf:utf8", $INIFILE ) || die "Cannot open $INIFILE: $!";
    print INIFILE "\x{FEFF}";  # print BOM (Byte Order Mark) for the unicode file

} else { #else

    # plain ASCII output for perls that do not understand the layered mode
    open( INIFILE, ">$INIFILE" ) || die "Cannot open $INIFILE: $!";

} #endif

The code worked like a charm: yay, a Unicode esentprf.ini! Well,
until I sync'd the code to
Exchange, where it broke. That is the source of the "if ( $ESENT )" check;
$ESENT is only defined when we build the ESE sources for Windows. I
should mention in closing that I know this code works with perl 5.8.7,
and I know it does not work with perl 5.6.1. I've heard perl's Unicode support got much better in 5.8 or so...
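
In hindsight, the check could key off the interpreter rather than the build flavor. Here is a minimal sketch, assuming the missing piece in the older perl is PerlIO layer support (which arrived in 5.8.0). The name `ini_open_mode` is made up for this example and is not something in perfdata.pl, and I have spelled the encoding UTF-16LE, the name Encode documents.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): choose the open()
# mode from the perl version instead of from a build flag like $ESENT.
sub ini_open_mode {
    my ($perl_version) = @_;
    $perl_version = $] unless defined $perl_version;   # default: the running perl

    # Assumption: the layered UTF-16LE mode needs perl 5.8.0 or later;
    # older perls fall back to the plain (ASCII) write mode.
    return $perl_version >= 5.008
        ? '>:raw:encoding(UTF-16LE):crlf:utf8'
        : '>';
}

print "open mode for this perl: ", ini_open_mode(), "\n";
```

The caller would pass the returned mode as the second argument of a three-argument open() and print the BOM only when the layered mode was chosen.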

Oh I guess that's code, so I'm required to say something like:

Use of included script samples is subject to the terms specified at
https://www.microsoft.com/info/cpyright.htm (I'm having a hard time
imagining how such a small snippet could be subject to that, but
whatever).

Oh here is what the BOM is, and more on the BOM.
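
To make the BOM concrete, here is a small sketch using the core Encode module (my choice for illustration; the script above relies on IO layers instead). It encodes U+FEFF both ways and shows what the endian-less 'UTF-16' name does. `hex_bytes` is a helper invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Helper invented for this example: render a byte string as spaced hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# The BOM is just U+FEFF run through the chosen encoding.
my $bom_le = encode('UTF-16LE', "\x{FEFF}");   # little endian: FF FE
my $bom_be = encode('UTF-16BE', "\x{FEFF}");   # big endian:    FE FF

# With no endianness in the name, Encode prepends a BOM itself and
# writes the data big endian.
my $auto = encode('UTF-16', "A");              # FE FF 00 41

print "UTF-16LE BOM: ", hex_bytes($bom_le), "\n";
print "UTF-16BE BOM: ", hex_bytes($bom_be), "\n";
print "UTF-16 'A':   ", hex_bytes($auto), "\n";
```

A reader that sees FF FE at the start of a file can therefore assume little endian, which is why the script prints "\x{FEFF}" by hand after opening with an explicit UTF-16LE layer.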

Update 2006/08/20: It turned out Exchange wanted a Unicode
eseperf.ini after all and has updated their version of perl, so, good
news, the NT and Ex code bases grow that much more similar.
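
If you want to convince yourself that the layers do what the notes say, here is a minimal, self-contained sketch that writes a tiny file and dumps its bytes. The helper names (`write_utf16le_file`, `file_bytes_hex`) and the temp file are mine, not part of perfdata.pl, and the encoding is spelled UTF-16LE, the name Encode documents. Treat it as an illustration under those assumptions rather than our production code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text as UTF-16LE with a BOM and CRLF line
# endings, using the layer stack described in the notes above.
sub write_utf16le_file {
    my ($path, $text) = @_;
    open(my $fh, '>:raw:encoding(UTF-16LE):crlf:utf8', $path)
        or die "Cannot open $path: $!";
    print {$fh} "\x{FEFF}";   # BOM; the explicit-endian encoding adds none
    print {$fh} $text;
    close($fh) or die "Cannot close $path: $!";
    return;
}

# Hypothetical helper: read a file back as raw bytes, returned as spaced hex.
sub file_bytes_hex {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;                 # slurp the whole file
    my $bytes = <$fh>;
    close($fh);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Write a two-character line to a temp file and show what landed on disk.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
close($tmp_fh);
write_utf16le_file($tmp_path, "hi\n");
print file_bytes_hex($tmp_path), "\n";
```

If the layers are stacked as described, the dump should start with the BOM (FF FE), each character should take two bytes, and the newline should show up as 0D 00 0A 00.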

Comments

Anonymous
June 14, 2006
this saved me a pile of time today as i was getting lost in the perl unicode documentation.

a fellow MSFTer
Anonymous
October 07, 2006
PingBack from http://jeremy.marzhillstudios.com/index.php/software-development/perl-tip-chained-encodings-and-binmode-magic/
Anonymous
October 19, 2006
Echo the "saving a ton of time" comment. Thanks for documenting what should be in the standard perl docs.
Anonymous
January 05, 2007
Well, for UTF 8, you can do this: open FH, ">:utf8", "file"; and it works fine. I just wanted to make sure that's here so people don't slog along the hard road unless they really need 16LE!
Anonymous
July 10, 2007
PingBack from http://yftsai.wordpress.com/2007/07/11/perlio-layers/
Anonymous
February 20, 2008
what about file names? open FH, ">:utf8", "file"; is storing my files fine in utf8 but the dam filenames are turned into gobbledegook if the file name is utf8. Anyone know how to solve that?
Anonymous
June 28, 2008
Was tearing out what remains of my hair for several hours. Found this article, problem solved in seconds. Wonderfully useful - Thank You!!!
Anonymous
September 08, 2008
Very useful, thanks a lot! I especially liked that the snippet has all those explanations!
Anonymous
April 17, 2009
That should be "UTF-16LE" not "UTF16-LE". Surprised it worked as listed, "like a charm"!
Anonymous
May 09, 2009
Do you see any chance to open files with a name/path that can only be represented in unicode? The only way I found is using the Win32 APIs CreateFileW, but it is not compatible with the Perl open() api and therefore would require a major platform dependent rewrite of existing I/O code. Any clues would be much appreciated!
Anonymous
November 10, 2009
wow, thank you for this... my issue is over...
Anonymous
September 11, 2010
Not sure if you gave File::BOM a try too. :)
Anonymous
December 18, 2010
Thank you for making this so accessible. You saved me a bunch of time, and I'm not even using any Microsoft technologies (I'm on FreeBSD). I owe you a beer.
Anonymous
December 19, 2010
How can we create files with names that contain Unicode characters?
