CONCEPT unicode DESCRIPTION LPC strings come in two flavors: As byte sequences and as unicode strings. For both types almost the full range of string operations is available, but the types are not to be mixed. So for example you cannot add a byte sequence to an unicode string or vice versa. Byte sequences can store only bytes (values from 0 to 255), but unicode strings can store the full unicode character set (values from 0 to 1114111). There are two conversion functions to convert between byte sequences and unicode strings: to_text() which will return a unicode string, and to_bytes() which returns a byte sequence. Both take either a string or an array, and when converting between bytes and unicode also the name of the encoding (to be) used for the byte sequence. -- File handling -- When a file is accessed either by compiling, read_file(), write_file() (not read_bytes() or write_bytes(), or when an explicit encoding was given), the master is asked via the driver hook H_FILE_ENCODING for the encoding of the file. If none is given, 7 bit ASCII is assumed. Whenever codes are encounted that are not valid in the given encoding a compile or runtime error will be raised. -- File names -- The filesystem encoding can be set with a call to configure_driver(DC_FILESYSTEM_ENCODING, ). The default encoding is derived from the LC_CTYPE environment setting. If there is no environment setting (or it is set to the default "C" locale), then UTF-8 is used. -- Interactives -- Each interactive has its own encoding. It can be set with configure_interactive(IC_ENCODING, ). The default is "ISO-8859-1//TRANSLIT" which maps each incoming byte to the first 256 unicode characters and uses transliteration to encode characters that are not in this character set. If an input or output character can not be converted to/from the configured encoding it will be silently discarded. -- Encoding error handling -- When reading from a specific encoding or writing to specific encodings several errors may happen: The source material might contain invalid codes or the target encoding might not support all characters. The specified encoding can contain a suffix to determine how errors should be handled. Without any suffix runtime errors will be raised. For reading these are: //IGNORE Ignore any illegal codes (skip them). //REPLACE Replace any illegal codes with '\ufffd'. For writing the following suffixes are available (these may depend on the local iconv library, the suffixes listed here are commonly available): //IGNORE Ignore any unsupported characters (skip them). //TRANSLIT Transliterate any unsupported characters if possible. (If no transliteration is available, replace them with '?'.) For interactives only the ouput suffixes can be specified. If no suffix is specified, //IGNORE is the default handling. For the input encoding //IGNORE is the fixed setting. -- ERQ / UDP -- Only byte sequences can be sent to the ERQ or via UDP, and only byte sequences can be received from them. -- Formatting -- The efuns sprintf() and terminal_colour() work not just on unicode codepoints but on extended grapheme clusters. Those are several unicode characters that will form a single user-perceived character. These clusters will not be split by those efuns. Also those efuns will try to guess the width of those characters and apply formatting modifiers accordingly. But as this is dependent upon the client's rendering of such characters it will never be perfect. The grapheme cluster handling only applies to sprintf() and terminal_colour(). Other efun and operators like sizeof(), explode() and the index operator still work on single unicode codepoints and thus allow to split such grapheme clusters. However the regular expressions support \X to match a whole grapheme cluster. PCRE and the traditional package differ in the handling of ANSI escape sequences which the traditional package will treat also as a cluster. HISTORY Introduced in LDMud 3.6. SEE ALSO to_text(E), to_bytes(E), configure_driver(E)