Libparserutils
codec.h File Reference
#include <inttypes.h>
#include <parserutils/errors.h>
#include <parserutils/functypes.h>

Data Structures

union  parserutils_charset_codec_optparams
 Charset codec option parameters.
 

Macros

#define PARSERUTILS_CHARSET_CODEC_NULL   (0xffffffffU)
 

Typedefs

typedef struct parserutils_charset_codec parserutils_charset_codec
 
typedef enum parserutils_charset_codec_errormode parserutils_charset_codec_errormode
 Charset codec error mode.
 
typedef enum parserutils_charset_codec_opttype parserutils_charset_codec_opttype
 Charset codec option types.
 
typedef union parserutils_charset_codec_optparams parserutils_charset_codec_optparams
 Charset codec option parameters.
 

Enumerations

enum  parserutils_charset_codec_errormode { PARSERUTILS_CHARSET_CODEC_ERROR_STRICT = 0 , PARSERUTILS_CHARSET_CODEC_ERROR_LOOSE = 1 , PARSERUTILS_CHARSET_CODEC_ERROR_TRANSLIT = 2 }
 Charset codec error mode.
 
enum  parserutils_charset_codec_opttype { PARSERUTILS_CHARSET_CODEC_ERROR_MODE = 1 }
 Charset codec option types.
 

Functions

parserutils_error parserutils_charset_codec_create (const char *charset, parserutils_charset_codec **codec)
 Create a charset codec.
 
parserutils_error parserutils_charset_codec_destroy (parserutils_charset_codec *codec)
 Destroy a charset codec.
 
parserutils_error parserutils_charset_codec_setopt (parserutils_charset_codec *codec, parserutils_charset_codec_opttype type, parserutils_charset_codec_optparams *params)
 Configure a charset codec.
 
parserutils_error parserutils_charset_codec_encode (parserutils_charset_codec *codec, const uint8_t **source, size_t *sourcelen, uint8_t **dest, size_t *destlen)
 Encode a chunk of UCS-4 data into a codec's charset.
 
parserutils_error parserutils_charset_codec_decode (parserutils_charset_codec *codec, const uint8_t **source, size_t *sourcelen, uint8_t **dest, size_t *destlen)
 Decode a chunk of data in a codec's charset into UCS-4.
 
parserutils_error parserutils_charset_codec_reset (parserutils_charset_codec *codec)
 Clear a charset codec's encoding state.
 

Macro Definition Documentation

◆ PARSERUTILS_CHARSET_CODEC_NULL

#define PARSERUTILS_CHARSET_CODEC_NULL   (0xffffffffU)

Definition at line 23 of file codec.h.

Typedef Documentation

◆ parserutils_charset_codec

typedef struct parserutils_charset_codec parserutils_charset_codec

Definition at line 21 of file codec.h.

◆ parserutils_charset_codec_errormode

Charset codec error mode.

A codec's error mode determines its behaviour in the face of:

  • characters which are unrepresentable in the destination charset (if encoding data) or which cannot be converted to UCS-4 (if decoding data).
  • invalid byte sequences (both encoding and decoding)

The options provide a choice between the following approaches:

  • draconian, "stop processing" ("strict")
  • "replace the unrepresentable character with something else" ("loose")
  • "attempt to transliterate, or replace if unable" ("translit")

The default error mode is "loose".

In the "loose" case, the replacement character will depend upon:

  • Whether the operation was encoding or decoding
  • If encoding, what the destination charset is.

If decoding, the replacement character will be:

U+FFFD (REPLACEMENT CHARACTER)

If encoding, the replacement character will be:

U+003F (QUESTION MARK) if the destination charset is not UTF-(8|16|32)
U+FFFD (REPLACEMENT CHARACTER) otherwise.

In the "translit" case, the codec will attempt to transliterate into the destination charset, if encoding. If decoding, or if transliteration fails, this option is identical to "loose".
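
The "loose" replacement rules above can be summarised in a few lines of C. This is only an illustrative sketch: `names_equal`, `is_utf_charset` and `loose_replacement` are hypothetical helpers, not part of libparserutils, and matching on literal charset names is an assumption made here for brevity.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: case-insensitive string equality. */
static bool names_equal(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0') {
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return false;
		a++;
		b++;
	}
	/* Equal only if both strings ended at the same position. */
	return *a == *b;
}

/* Hypothetical helper: true for the charsets written as UTF-(8|16|32). */
static bool is_utf_charset(const char *charset)
{
	return names_equal(charset, "UTF-8") ||
	       names_equal(charset, "UTF-16") ||
	       names_equal(charset, "UTF-32");
}

/* Replacement code point for "loose" mode, following the rules above. */
static uint32_t loose_replacement(bool encoding, const char *charset)
{
	if (!encoding)
		return 0xFFFD; /* decoding: always U+FFFD */

	/* encoding: U+FFFD for UTF destinations, U+003F otherwise */
	return is_utf_charset(charset) ? 0xFFFD : 0x003F;
}
```

For example, `loose_replacement(true, "ISO-8859-1")` yields U+003F, while `loose_replacement(true, "utf-8")` yields U+FFFD.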

◆ parserutils_charset_codec_optparams

typedef union parserutils_charset_codec_optparams parserutils_charset_codec_optparams

Charset codec option parameters.

◆ parserutils_charset_codec_opttype

Charset codec option types.

Enumeration Type Documentation

◆ parserutils_charset_codec_errormode

Charset codec error mode.


Enumerator
PARSERUTILS_CHARSET_CODEC_ERROR_STRICT 

Abort processing if an unrepresentable character is encountered.

PARSERUTILS_CHARSET_CODEC_ERROR_LOOSE 

Replace unrepresentable characters with a single alternate.

PARSERUTILS_CHARSET_CODEC_ERROR_TRANSLIT 

Transliterate unrepresentable characters, if possible.

Definition at line 62 of file codec.h.

◆ parserutils_charset_codec_opttype

Charset codec option types.

Enumerator
PARSERUTILS_CHARSET_CODEC_ERROR_MODE 

Set codec error mode.

Definition at line 74 of file codec.h.

Function Documentation

◆ parserutils_charset_codec_create()

parserutils_error parserutils_charset_codec_create ( const char * charset,
parserutils_charset_codec ** codec )

Create a charset codec.

Parameters
charset  Target charset
codec    Pointer to location to receive codec instance
Returns
PARSERUTILS_OK on success, PARSERUTILS_BADPARM on bad parameters, PARSERUTILS_NOMEM on memory exhaustion, PARSERUTILS_BADENCODING on unsupported charset

Definition at line 38 of file codec.c.

References parserutils_charset_codec::errormode, handler_table, parserutils_charset_aliases_canon::mib_enum, parserutils_charset_codec::mibenum, parserutils_charset_aliases_canon::name, parserutils__charset_alias_canonicalise(), PARSERUTILS_BADENCODING, PARSERUTILS_BADPARM, PARSERUTILS_CHARSET_CODEC_ERROR_LOOSE, and PARSERUTILS_OK.

Referenced by filter_set_encoding(), and parserutils__filter_create().

◆ parserutils_charset_codec_decode()

parserutils_error parserutils_charset_codec_decode ( parserutils_charset_codec * codec,
const uint8_t ** source,
size_t * sourcelen,
uint8_t ** dest,
size_t * destlen )

Decode a chunk of data in a codec's charset into UCS-4.

Parameters
codec      The codec to use
source     Pointer to pointer to source data
sourcelen  Pointer to length (in bytes) of source data
dest       Pointer to pointer to output buffer
destlen    Pointer to length (in bytes) of output buffer
Returns
PARSERUTILS_OK on success, appropriate error otherwise.

On exit, source, sourcelen, dest and destlen are updated to reflect the input consumed and the output written.

Call this with a source length of 0 to flush any buffers.
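
To make the in/out parameter convention concrete, here is a toy stand-in with the same parameter shape as the decode call, minus the codec argument. It is a sketch only: `toy_decode_latin1`, its return codes, and the choice of big-endian UCS-4 output are assumptions of this example, not libparserutils code. It also keeps no internal buffers, so it has nothing to flush.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch only. */
enum toy_result { TOY_OK = 0, TOY_NOMEM = 1, TOY_BADPARM = 2 };

/*
 * Decode ISO-8859-1 bytes into UCS-4, four output bytes per input byte.
 * On return, *source / *sourcelen describe the unconsumed input and
 * *dest / *destlen describe the unused part of the output buffer.
 */
static enum toy_result toy_decode_latin1(const uint8_t **source,
		size_t *sourcelen, uint8_t **dest, size_t *destlen)
{
	if (source == NULL || *source == NULL || sourcelen == NULL ||
			dest == NULL || *dest == NULL || destlen == NULL)
		return TOY_BADPARM;

	while (*sourcelen > 0) {
		/* Latin-1 maps 1:1 onto U+0000..U+00FF */
		uint32_t ucs4 = **source;

		if (*destlen < 4)
			return TOY_NOMEM; /* output full: caller drains, retries */

		/* Assumed byte order for this sketch: big-endian */
		(*dest)[0] = (uint8_t) (ucs4 >> 24);
		(*dest)[1] = (uint8_t) (ucs4 >> 16);
		(*dest)[2] = (uint8_t) (ucs4 >> 8);
		(*dest)[3] = (uint8_t) ucs4;

		/* Advance both cursors and shrink both remaining lengths */
		*source += 1;
		*sourcelen -= 1;
		*dest += 4;
		*destlen -= 4;
	}

	return TOY_OK;
}
```

Because the cursors and lengths are updated in place, a caller can drain the output buffer, reset dest and destlen, and call again with the same source and sourcelen variables to continue where the previous call stopped.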

Definition at line 163 of file codec.c.

References parserutils_charset_codec::decode, parserutils_charset_codec::handler, and PARSERUTILS_BADPARM.

Referenced by parserutils__filter_process_chunk().

◆ parserutils_charset_codec_destroy()

parserutils_error parserutils_charset_codec_destroy ( parserutils_charset_codec * codec)

Destroy a charset codec.

Parameters
codec  The codec to destroy
Returns
PARSERUTILS_OK on success, appropriate error otherwise

Definition at line 86 of file codec.c.

References parserutils_charset_codec::destroy, parserutils_charset_codec::handler, PARSERUTILS_BADPARM, and PARSERUTILS_OK.

Referenced by filter_set_encoding(), parserutils__filter_create(), and parserutils__filter_destroy().

◆ parserutils_charset_codec_encode()

parserutils_error parserutils_charset_codec_encode ( parserutils_charset_codec * codec,
const uint8_t ** source,
size_t * sourcelen,
uint8_t ** dest,
size_t * destlen )

Encode a chunk of UCS-4 data into a codec's charset.

Parameters
codec      The codec to use
source     Pointer to pointer to source data
sourcelen  Pointer to length (in bytes) of source data
dest       Pointer to pointer to output buffer
destlen    Pointer to length (in bytes) of output buffer
Returns
PARSERUTILS_OK on success, appropriate error otherwise.

On exit, source, sourcelen, dest and destlen are updated to reflect the input consumed and the output written.
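
How the error mode interacts with encoding can be sketched with a toy US-ASCII encoder. Everything here is hypothetical: `toy_encode_ascii`, `toy_errormode`, the return codes, and the big-endian UCS-4 input are assumptions of this example. It only mirrors the behaviour described for "strict" and "loose" and leaves transliteration out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for this sketch; not the library's enumerations. */
enum toy_errormode { TOY_STRICT = 0, TOY_LOOSE = 1 };
enum toy_status { TOY_ENC_OK = 0, TOY_ENC_NOMEM = 1, TOY_ENC_INVALID = 2 };

/*
 * Encode UCS-4 (assumed big-endian here) into US-ASCII, one output byte
 * per four input bytes. Trailing input shorter than four bytes is left
 * unconsumed.
 */
static enum toy_status toy_encode_ascii(enum toy_errormode mode,
		const uint8_t **source, size_t *sourcelen,
		uint8_t **dest, size_t *destlen)
{
	while (*sourcelen >= 4) {
		const uint8_t *s = *source;
		uint32_t ucs4 = ((uint32_t) s[0] << 24) |
				((uint32_t) s[1] << 16) |
				((uint32_t) s[2] << 8) |
				(uint32_t) s[3];

		if (ucs4 > 0x7F) {
			/* Not representable in US-ASCII */
			if (mode == TOY_STRICT)
				return TOY_ENC_INVALID; /* stop at this character */
			ucs4 = 0x3F; /* loose, non-UTF destination: U+003F */
		}

		if (*destlen < 1)
			return TOY_ENC_NOMEM;

		**dest = (uint8_t) ucs4;
		*source += 4;
		*sourcelen -= 4;
		*dest += 1;
		*destlen -= 1;
	}

	return TOY_ENC_OK;
}
```

With the input U+0068, U+20AC, U+0069, the loose mode of this sketch writes `h?i`, while the strict mode writes `h` and then stops with the source cursor still pointing at U+20AC.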

Definition at line 136 of file codec.c.

References parserutils_charset_codec::encode, parserutils_charset_codec::handler, and PARSERUTILS_BADPARM.

Referenced by parserutils__filter_process_chunk().

◆ parserutils_charset_codec_reset()

parserutils_error parserutils_charset_codec_reset ( parserutils_charset_codec * codec)

Clear a charset codec's encoding state.

Parameters
codec  The codec to reset
Returns
PARSERUTILS_OK on success, appropriate error otherwise

Definition at line 182 of file codec.c.

References parserutils_charset_codec::handler, PARSERUTILS_BADPARM, and parserutils_charset_codec::reset.

Referenced by parserutils__filter_reset().

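◆ parserutils_charset_codec_setopt()

parserutils_error parserutils_charset_codec_setopt ( parserutils_charset_codec * codec,
parserutils_charset_codec_opttype type,
parserutils_charset_codec_optparams * params )

Configure a charset codec.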

Parameters
codec   The codec to configure
type    The codec option type to configure
params  Option-specific parameters
Returns
PARSERUTILS_OK on success, appropriate error otherwise

Definition at line 107 of file codec.c.

References parserutils_charset_codec_optparams::error_mode, parserutils_charset_codec::errormode, parserutils_charset_codec_optparams::mode, PARSERUTILS_BADPARM, PARSERUTILS_CHARSET_CODEC_ERROR_MODE, and PARSERUTILS_OK.