Downloading content from the web using different encodings

The other day, somebody asked me: how do I download a web page, or other content from a web server, when the content is stored in a specific encoding? They wanted to do this using, for example, System.Net.HttpWebRequest.

Why is this necessary?

Well, for starters, web servers around the world store their content in various encodings. For example, web admins in Japan often serve their pages in the Shift-JIS encoding to accommodate the Japanese characters in those pages.

If you just attach a StreamReader to the stream returned by HttpWebResponse.GetResponseStream(), you will most likely get bad characters in your data, or the text might appear truncated in the middle. This is because StreamReader uses a default encoding (UTF-8), which might not match the encoding of the bytes you are reading.
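To make the problem concrete, here is a minimal sketch of what happens when bytes written in one encoding are decoded as UTF-8. The class and method names are made up for illustration, and ISO-8859-1 stands in for whatever encoding the server actually used.

```csharp
using System;
using System.Text;

public static class MismatchDemo
{
    // Encodes the text as ISO-8859-1, then decodes those bytes as UTF-8,
    // which is what a StreamReader with its default encoding would do.
    public static string DecodeLatin1BytesAsUtf8(string text)
    {
        byte[] raw = Encoding.GetEncoding("iso-8859-1").GetBytes(text);
        return Encoding.UTF8.GetString(raw);
    }
}
```

Plain ASCII text survives this round trip unchanged, but an accented character such as the "é" in "café" comes back as the Unicode replacement character (U+FFFD), because its single ISO-8859-1 byte is not a valid UTF-8 sequence.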

So, let's get down to coding.

There are two places where a server can indicate the encoding of the entity in the response. The first is the response header. The second is the entity body itself, if the entity is an HTML page (this is indicated by the “content-type: text/html” response header).

The response header you need to look at is:

“Content-Type: foo/bar; charset=<charset encoding>”

If the Content-Type header exists, and the value for this header contains a charset=<value>, then the <value> portion gives the encoding of the response entity.
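As a rough sketch of this step, the charset can be pulled out of a Content-Type value with the System.Net.Mime.ContentType class instead of hand-written string searching. The helper name FromContentType is hypothetical, and returning null for a missing or malformed header is my own choice here.

```csharp
using System;
using System.Net.Mime;

public static class HeaderCharset
{
    // Returns the charset parameter of a Content-Type header value,
    // or null if the header is absent, malformed, or has no charset.
    public static string FromContentType(string headerValue)
    {
        if (string.IsNullOrWhiteSpace(headerValue))
        {
            return null;
        }
        try
        {
            return new ContentType(headerValue).CharSet;
        }
        catch (FormatException)
        {
            return null; // the header value could not be parsed
        }
    }
}
```

Calling HeaderCharset.FromContentType(response.Headers["content-type"]) would then give you either the charset name or null, in which case you fall back to looking inside the entity.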

If this header is not present, or if a “charset=” token is not present in the header value, then you need to look at the head of the HTML page (if the entity contains HTML). There may be a meta tag near the beginning of the entity which indicates the charset of the entity:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1" />
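One way to pick the charset out of such a tag is a small regular expression. This is only a sketch: the class name, the pattern, and the set of characters allowed in a charset name are my assumptions, and it takes the first charset= it finds anywhere in the text rather than parsing the HTML properly.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaCharset
{
    // Matches charset=value, with or without quotes around the value.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_\-:.]+)",
        RegexOptions.IgnoreCase);

    // Returns the first charset name found in the markup, or null.
    public static string FromHtml(string html)
    {
        Match m = CharsetPattern.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Because the opening quote is optional in the pattern, the same helper also copes with the shorter form <meta charset="utf-8">.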

What you need to do is read the entity as ASCII into a string. Then you extract the encoding information from the head of the entity. Once you know the encoding, you can reprocess the raw entity using the correct encoding. Of course, you should make sure to store the raw entity in a MemoryStream or other buffer, so that it is still available when you want to read the entity using its actual encoding.
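The two-pass idea can be tried out on an in-memory byte array before involving the network at all. The sketch below assumes the raw bytes are already buffered; the class name and the choice to fall back to the ASCII text are mine, and it only handles the unquoted charset=value form shown in the meta tag above.

```csharp
using System;
using System.Text;

public static class TwoPassDecode
{
    public static string Decode(byte[] raw)
    {
        // Pass 1: ASCII is enough to read the markup; other bytes become '?'.
        string probe = Encoding.ASCII.GetString(raw);
        int i = probe.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
        if (i < 0) return probe;

        int start = i + "charset=".Length;
        int end = probe.IndexOfAny(new[] { '"', '\'', ' ', '>' }, start);
        if (end <= start) return probe; // no usable value found

        // Pass 2: decode the same buffered bytes with the declared encoding.
        Encoding declared;
        try
        {
            declared = Encoding.GetEncoding(probe.Substring(start, end - start));
        }
        catch (ArgumentException)
        {
            return probe; // unknown charset name, keep the ASCII text
        }
        return declared.GetString(raw);
    }
}
```

The important point is that both passes work on the same bytes. Nothing is read from the network twice, and the lossy ASCII string is only used to find the charset name.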

Here is the code which demonstrates this:


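// Requires: using System; using System.IO; using System.Net; using System.Text;
private static String DecodeData(WebResponse w) {

    //
    // first see if the Content-Type header has a charset=value token
    //
    String charset = null;
    String ctype = w.Headers["content-type"];
    if(ctype != null) {
        // parameter names in the header are case-insensitive
        int ind = ctype.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
        if(ind != -1) {
            charset = ctype.Substring(ind + 8);
            // drop any parameters that follow, plus surrounding spaces and quotes
            int semi = charset.IndexOf(';');
            if(semi != -1) {
                charset = charset.Substring(0, semi);
            }
            charset = charset.Trim(' ', '"', '\'');
            if(charset.Length == 0) {
                charset = null;
            }
            Console.WriteLine("CT: charset=" + charset);
        }
    }

    // save the raw bytes to a MemoryStream so they can be decoded later
    MemoryStream rawdata = new MemoryStream();
    byte [] buffer = new byte[1024];
    Stream rs = w.GetResponseStream();
    int read = rs.Read(buffer,0,buffer.Length);
    while(read > 0) {
        rawdata.Write(buffer,0,read);
        read = rs.Read(buffer,0,buffer.Length);
    }

    rs.Close();

    //
    // if Content-Type is null, or did not contain a charset, we search in the body
    //
    if(charset == null) {
        rawdata.Seek(0,SeekOrigin.Begin);

        // ASCII is enough here: the markup we are looking for is plain ASCII
        StreamReader srr = new StreamReader(rawdata,Encoding.ASCII);
        String meta = srr.ReadToEnd();

        int start_ind = meta.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
        if(start_ind != -1) {
            int start = start_ind + 8;
            // skip an opening quote, as in <meta charset="utf-8">
            if(start < meta.Length && (meta[start] == '"' || meta[start] == '\'')) {
                start++;
            }
            // the value ends at the next quote, space, slash or '>'
            int end_ind = meta.IndexOfAny(new Char[] { '"','\'',' ','/','>' }, start);
            if(end_ind > start) {
                charset = meta.Substring(start, end_ind - start);
                Console.WriteLine("META: charset=" + charset);
            }
        }
    }

    Encoding e = null;
    if(charset == null) {
        e = Encoding.ASCII; //default encoding
    } else {
        try {
            e = Encoding.GetEncoding(charset);
        } catch(Exception ee) {
            // unknown or unsupported charset name: fall back to the default
            Console.WriteLine("Exception: GetEncoding: " + charset);
            Console.WriteLine(ee.ToString());
            e = Encoding.ASCII;
        }
    }

    // decode the buffered bytes again, this time with the right encoding
    rawdata.Seek(0,SeekOrigin.Begin);

    StreamReader sr = new StreamReader(rawdata, e);

    String s = sr.ReadToEnd();

    // return the text as decoded; lowercasing it here would alter the content
    return s;
}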