[OpenIndiana-discuss] Cannot open: Illegal byte sequence with a file containing a question mark

James Carlson carlsonj at workingcode.com
Thu Apr 26 15:31:54 UTC 2012


Flo wrote:
>> If that shows that the property is set "on", then that's what's causing
>> the failure.  Sadly, it's configurable only when creating a file system,
>> so if you wanted to change it, you'd have to create a new file system
>> and copy everything over.
> 
> utf8only is on. I created a new folder with utf8only=off and this worked!
> 
> Are there any disadvantages with utf8only disabled?
> I use Napp-It and Napp-It enables it automatically

You'd probably want to talk with the author of "Napp-It" to find out why
he set that parameter.

More generally speaking, there are a few file-system-level choices that
you can make that determine how names are treated.  Allowing only UTF8
is one of them.  Selecting case-insensitive matches is another.

Which one you choose depends mostly on what you're doing with those
files.  UTF8 has some great advantages -- it's an unambiguous encoding
of Unicode characters, so it fixes the usual national language character
set problems you have with something like ISO 8859.  And because the
byte values are identical to ASCII for the ASCII characters, it mostly
works without having to think too much about it.

One of the downsides, as you've found, is that it's a somewhat
restrictive format.  UNIX has traditionally allowed you to use any
arbitrary byte value other than hex 00 (NUL) and 2F (/) in the name of a
file (obviously, 2F is used for path separation), and in any sequence.
Because UNIX allows "anything" here, two users with different LANG
settings will see different characters when they look at the same files.
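
To make that concrete, here is a small Python sketch.  The file name
and the three character sets are made up for illustration; each codec
just stands in for a different user's LANG setting.

```python
# Hypothetical file name, stored on disk as raw bytes with no charset attached.
raw_name = b"caf\xe9.txt"

# Illustrative stand-ins for three users' locale character sets.
charsets = ["latin_1", "iso8859_5", "iso8859_7"]

# Decode the very same bytes the way each user's tools would.
views = {cs: raw_name.decode(cs) for cs in charsets}

for cs, text in views.items():
    print(f"{cs}: {text}")
```

The bytes on disk never change; only the interpretation of byte 0xE9
does (a Latin, Cyrillic, or Greek letter, depending on the charset).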

UTF8, though, has rules for how multibyte characters are formed, and
those rules mean that some arbitrary sequences of bytes are not legal
encodings.
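
A minimal Python sketch of that check follows.  It uses Python's strict
UTF-8 decoder as the validator, which is an assumption on my part and
not necessarily the exact test ZFS applies; the helper name and sample
names are hypothetical.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical file names as raw bytes.
samples = [
    b"plain.txt",        # pure ASCII, always valid
    b"caf\xc3\xa9.txt",  # 0xC3 0xA9 is a well-formed two-byte character
    b"caf\xe9.txt",      # 0xE9 starts a three-byte character, but '.' follows
    b"\xc3",             # lead byte with its continuation byte missing
]

for name in samples:
    print(name, is_valid_utf8(name))
```

The last two names are perfectly good UNIX file names, yet neither is a
legal UTF8 sequence.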

That leads to an application compatibility problem.  If an application
issues an open(2) (or creat(2)) system call with a file name that is a
legal UNIX name but contains an illegal UTF8 sequence, what do you do?
Failing the system call means a break in compatibility.  Allowing the
access means that the integrity of the file names is compromised.
That's why there's an option, and why the normal ZFS default for the
option is "off" -- to preserve compatibility.
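
From the application's side, the two outcomes can be sketched in Python
roughly as follows.  The helper name is hypothetical, and I'm assuming
the rejection surfaces as EILSEQ, which is the errno behind the
"Illegal byte sequence" message; on a file system that doesn't enforce
UTF8, the same call simply succeeds.

```python
import errno
import os
import tempfile


def try_create(directory: str, name: bytes) -> str:
    """Try to create a file with a raw byte name; report what happened."""
    path = os.path.join(os.fsencode(directory), name)
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except OSError as exc:
        if exc.errno == errno.EILSEQ:
            # The file system refused the name as an illegal byte sequence.
            return "rejected"
        raise
    os.close(fd)
    return "created"


if __name__ == "__main__":
    # A scratch directory; the result depends on the file system under it.
    with tempfile.TemporaryDirectory() as scratch:
        print(try_create(scratch, b"caf\xe9.txt"))
```

An application written with traditional UNIX naming in mind usually has
no handling for the "rejected" case, which is the compatibility break.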

There's probably a deeper issue here concerning what was going on with
the 'tar' program you were running.  I had _thought_ that file names
inside the tar format were encoded using UTF8, which would imply that
the problem is that 'tar' erroneously translated that to a national
language code point when trying to create the file.  If so, then that
could just be a configuration problem on your part -- e.g., attempting
to use a national language character set when the rest of your world is
set up for UTF8.
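
If that guess is right, the failure would look something like this
Python sketch.  The file name is made up, and the Latin-1 translation
step is my hypothesis about what might be happening, not something
I've confirmed 'tar' actually does.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical name as it would sit in the archive if stored as UTF-8.
stored_name = "caf\u00e9.txt".encode("utf-8")

# The suspected (unconfirmed) translation to a national character set
# before the file is created.
translated = stored_name.decode("utf-8").encode("latin_1")

print(stored_name, is_valid_utf8(stored_name))
print(translated, is_valid_utf8(translated))
```

The translated name contains the single byte 0xE9, which a file system
that accepts only UTF8 would refuse.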

But maybe I'm wrong about that.  Someone who knows the internals of tar
better should probably look at it.

-- 
James Carlson         42.703N 71.076W         <carlsonj at workingcode.com>


