Loren on the Art of MATLAB

Turn ideas into MATLAB

Working with Low Level File I/O and Encodings

I'm pleased to introduce Vadim Teverovsky, our guest blogger this week, who gives us his take on MATLAB's low-level file I/O and how it works with encodings.

It is fairly common for users to read and write character data where the characters are not 7-Bit ASCII (see ASCII wiki). Such characters include both characters from languages other than English and various symbols, such as a pound sign. Unfortunately, users may run into trouble when such files are shared across platform or language boundaries. To read and write such data reliably, there are certain things you should know.


Common Problem #1: Platform Differences

Starting in R2006a, MATLAB's low-level file I/O has consistently taken a "character" to mean an actual character, as opposed to a single byte, which had often been the case in the past. In today's multi-language environment, this is simply a necessity.

In order to understand what MATLAB writes out and reads in, we need to understand the concept of an "encoding". An encoding is a way of representing symbols, such as letters or numbers, as bytes in a language- and locale-specific way. An example of an encoding is 7-Bit ASCII, which many people are familiar with. Another example is Shift-JIS, which is commonly used in Japan. Yet others include windows-1252 and ISO-8859-1, which are both slightly different variants of what is commonly known as Extended ASCII. Each computer (user) will typically have a default locale. Thus, a user running Windows in the US will typically be running with the windows-1252 encoding. If the locale is changed, the encoding may change as well. Typically, all of the default encodings you may encounter will have the ASCII set of values in common, but after that, all bets are off.
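One quick way to see an encoding in action is the unicode2native function (available since R2006a), which converts MATLAB characters into the bytes a given encoding would use. The specific characters here are just illustrative:

```matlab
% The pound sign is a single byte in windows-1252...
unicode2native('£', 'windows-1252')   % 163

% ...but two bytes in UTF-8 (0xC2 0xA3)
unicode2native('£', 'UTF-8')          % 194 163

% Plain ASCII characters encode identically in both
isequal(unicode2native('abc', 'windows-1252'), ...
        unicode2native('abc', 'UTF-8'))   % 1
```

Notice that the ASCII subset agrees across the two encodings, while the pound sign does not; this is exactly the "after that, all bets are off" behavior described above.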

MATLAB, unless you specify a particular encoding (more on that later), will use the computer's (user's) default encoding. Thus, when you are working on your Windows computer in Natick, Massachusetts, you could write the following code:

fid = fopen('sample1.txt', 'w','l');
fwrite(fid, 'abcdefg', 'char');
fclose(fid);

The result will be a file written in the windows-1252 encoding. When this file is read in again, assuming you are still running on the same machine in the same environment, MATLAB will know how to translate this encoded data into its internal representation, and will read the characters properly.

type('sample1.txt');
abcdefg

But what happens if you write the file out on Windows, and try to read it on a Solaris machine? Well, it turns out that Solaris has, as its default, the 7-Bit ASCII encoding, which cannot represent all of the symbols found in the windows-1252 encoding. As long as you stick to the set of ASCII characters (values <= 127), everything will look exactly the same. But what if you try to write out and read back a pound sign (163 in windows-1252)?

char(163)
ans =

£

That value is not part of the 7-Bit ASCII encoding, and will therefore not be read in successfully on your Solaris computer. It will likely manifest itself as the ASCII value 26, the SUB (substitute) character, which effectively means "I don't know". What would the file contain? For now, trust me that providing the last argument to fopen below will result in a file very similar to what you would see on Solaris:

fid = fopen('sample2.txt', 'w','l', 'US-ASCII');
fwrite(fid, 'abcdefg £¥§©', 'char');
fclose(fid);
type('sample2.txt');
abcdefg 

What happened? The characters that fall outside the ASCII range did not get written out, because MATLAB tried to convert them to US-ASCII and could not do so.

Similarly, what if you are in Japan, working on a computer set to a Japanese environment, and write out a file that you wish to read on a machine with a German environment? The same problem can occur, because the default encodings differ.
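Anticipating the solution described below, a sketch of one way around this is to name the encoding explicitly at write time, so the reader can name the same one regardless of its own locale. The file name and the sample text here are made up, and I am assuming your MATLAB accepts 'Shift_JIS' as an encoding name:

```matlab
% Japanese sample text, built from Unicode code points so this
% script itself stays plain ASCII
japaneseText = char([12371 12435 12395 12385 12399]);

% Write with an explicit Shift-JIS encoding ('n' = native byte order)
fid = fopen('sample_jp.txt', 'w', 'n', 'Shift_JIS');
fwrite(fid, japaneseText, 'char');
fclose(fid);
```

A reader on the German machine would then pass 'Shift_JIS' as the same fourth argument to fopen, rather than relying on its own default encoding.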

Common Problem #2: Files Coming from an Outside Source

Yet another manifestation of this kind of issue is data that looks like this:

fid = fopen('sample3.txt', 'r', 'l');
str = fscanf(fid, '%s')
abs(str)
fclose(fid);
str =

ÿþa b c d e f g h 


ans =

  Columns 1 through 14

   255   254    97     0    98     0    99     0   100     0   101     0   102     0

  Columns 15 through 18

   103     0   104     0

Looks odd, doesn't it? First there are some strange characters at the beginning, and then there are extra zeros inserted everywhere. What happened? In this case, the sample file was saved from Notepad, using the Save As... menu item and choosing the "Unicode" encoding. (By the way, when Notepad says "Unicode", it really means little-endian UTF-16.) Since the file was opened with the default encoding, MATLAB interpreted the data as windows-1252, and what you see was the result. The first two bytes are a Byte Order Mark (BOM), which we simply need to skip, since they do not actually represent data.
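If you are not sure whether an incoming file carries a BOM, you can peek at the first two bytes before deciding how to open it. This sketch reuses the sample file from above:

```matlab
% Read the first two bytes as raw uint8 values
fid = fopen('sample3.txt', 'r');
firstBytes = fread(fid, 2, 'uint8')';
fclose(fid);

if isequal(firstBytes, [255 254])        % 0xFF 0xFE
    disp('UTF-16, little-endian BOM');
elseif isequal(firstBytes, [254 255])    % 0xFE 0xFF
    disp('UTF-16, big-endian BOM');
else
    disp('no UTF-16 BOM found');
end
```

For the Notepad file above, this displays 'UTF-16, little-endian BOM', which tells you which encoding to pass to fopen and that the first two bytes should be skipped.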

A Possible Solution

So much for the problems; what can you do about them? If you wish to be more robust to platform and language/locale differences, you can specify an encoding to use when you fopen a file. For example, for the first problem described above:

fid = fopen('sample4.txt', 'w','l', 'ISO-8859-1');
fwrite(fid, 'abcdefg £¥§©', 'char');
fclose(fid);
type('sample4.txt');
abcdefg £¥§©

For the second problem, we will open the file with a Unicode encoding, skip the Byte Order Mark, and also turn off a warning which indicates that not all functionality is supported for this encoding. For our purposes, which are reading text, the warning can be ignored:

warning off MATLAB:iofun:UnsupportedEncoding;
fid = fopen('sample3.txt', 'r', 'l', 'UTF16-LE');
fseek(fid, 2, 0);
str = fscanf(fid, '%s')
abs(str)
fclose(fid);
str =

abcdefgh


ans =

    97    98    99   100   101   102   103   104

As you can see, the string is read correctly.

In general, if your program specifies the encoding for both reading and writing, then you don't have to worry about the default encoding, since you are specifying it explicitly. The character data is saved in a file with the specified encoding, MATLAB reads it back with that same encoding, and consistency is preserved. You just need to make sure that the character data you are saving is representable in the encoding you have chosen. For example, if you chose 'US-ASCII', you would not be able to write out a pound sign. If you are dealing with values in the range from 128 to 255, I would suggest using ISO-8859-1 as above. If you are writing out Japanese, Shift-JIS may be a good one to use.
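Putting that advice together, a round trip where both the writer and the reader name the encoding explicitly might look like this (the file name is just for illustration):

```matlab
enc = 'ISO-8859-1';
original = 'abcdefg £¥§©';

% Write with an explicit encoding ('n' = native byte order)...
fid = fopen('sample5.txt', 'w', 'n', enc);
fwrite(fid, original, 'char');
fclose(fid);

% ...and read back with the same one
fid = fopen('sample5.txt', 'r', 'n', enc);
roundTrip = fscanf(fid, '%c');
fclose(fid);

isequal(original, roundTrip)   % 1: the round trip preserved every character
```

Because both fopen calls name the same encoding, the result does not depend on the default encoding of whichever machine runs the code.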

Some Helpful Links

You can find out much more about Unicode, encodings, languages, and locales at the following references:

What sorts of other file encoding issues do you run into? Post here.


Published with MATLAB® 7.3
