Welcome to fastavro’s documentation!


The current Python avro package is packed with features but dog slow.

On a test case of about 10K records, it takes about 14sec to iterate over all of them. In comparison, the Java avro SDK does it in about 1.9sec.

fastavro is less feature-complete than avro, but it's much faster. It iterates over the same 10K records in 2.9sec, and if you use it with PyPy it'll do it in 1.5sec (to be fair, the Java benchmark is doing some extra JSON encoding/decoding).

If the optional C extension (generated by Cython) is available, then fastavro will be even faster. For the same 10K records it’ll run in about 1.7sec.

You can also use the fastavro script from the command line to dump avro files. Each record will be dumped to standard output in one line of JSON.

fastavro weather.avro
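
For example, for a file containing the weather records used in the examples below, the output looks roughly like this (one JSON object per line; field order may differ):

{"station": "011990-99999", "time": 1433269388, "temp": 0}
{"station": "011990-99999", "time": 1433270389, "temp": 22}
{"station": "011990-99999", "time": 1433273379, "temp": -11}
{"station": "012650-99999", "time": 1433275478, "temp": 111}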

You can also dump the avro schema:

fastavro --schema weather.avro

fastavro script

usage: fastavro [-h] [--schema] [--version] file [file ...]

iter over avro file, emit records as JSON

positional arguments:
  file        file(s) to parse

optional arguments:
  -h, --help  show this help message and exit
  --schema    dump schema instead of records
  --version   show program's version number and exit

fastavro module

Fast Avro file iteration.

Most of the code here is ripped off the Python avro package. It’s missing a lot of features in order to get speed.

The main interface function is iter_avro (exposed as fastavro.reader); example usage:

# Reading
import fastavro

with open('some-file.avro', 'rb') as fo:
    reader = fastavro.reader(fo)
    schema = reader.schema

    for record in reader:
        process_record(record)


# Writing
from fastavro import writer

schema = {
    'doc': 'A weather reading.',
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'time', 'type': 'long'},
        {'name': 'temp', 'type': 'int'},
    ],
}

# 'records' can be an iterable (including generator)
records = [
    {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
    {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
    {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
    {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
]

with open('weather.avro', 'wb') as out:
    writer(out, schema, records)

class fastavro.iter_avro(fo, reader_schema=None)

Iterator over avro file.
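
The optional reader_schema argument enables standard Avro schema resolution, i.e. reading a file with a schema other than the one it was written with. A minimal sketch, assuming a compatible reader schema that keeps only a subset of the Weather fields (the subset chosen here is made up for illustration):

import fastavro

# hypothetical reader schema: same record name and namespace, fewer fields
reader_schema = {
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'temp', 'type': 'int'},
    ],
}

with open('weather.avro', 'rb') as fo:
    for record in fastavro.reader(fo, reader_schema):
        print(record['station'], record['temp'])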

fastavro.load(fo, writer_schema, reader_schema=None)

Read data from file object according to schema.

fastavro.writer(fo, schema, records, codec='null', sync_interval=16000, metadata=None, validator=None)

Write records to fo (stream) according to schema.

fo: file-like
    Output stream
records: iterable
    Records to write
codec: string, optional
    Compression codec, can be ‘null’, ‘deflate’ or ‘snappy’ (if installed)
sync_interval: int, optional
    Size of sync interval
metadata: dict, optional
    Header metadata
validator: None, True or a function
    Validator function. If None (the default), no validation is performed. If True, fastavro.writer.validate will be used. If it’s a function, it should have the same signature as fastavro.writer.validate and raise an exception on error.
>>> from fastavro import writer
>>> schema = {
...     'doc': 'A weather reading.',
...     'name': 'Weather',
...     'namespace': 'test',
...     'type': 'record',
...     'fields': [
...         {'name': 'station', 'type': 'string'},
...         {'name': 'time', 'type': 'long'},
...         {'name': 'temp', 'type': 'int'},
...     ],
... }
>>> records = [
...     {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
...     {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
...     {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
...     {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
... ]
>>> with open('weather.avro', 'wb') as out:
...     writer(out, schema, records)
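
The codec and validator arguments slot into the same call. A minimal sketch using the schema and records from above, with deflate compression and the built-in validation enabled (the output file name is arbitrary):

>>> with open('weather_deflate.avro', 'wb') as out:
...     writer(out, schema, records, codec='deflate', validator=True)
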
fastavro.dump(fo, datum, schema)

Write a single datum to the output stream according to the schema.

fo: file-like
    Output stream
datum: object
    Data to write
schema: dict
    Schema to use
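
dump and load are the low-level counterparts of writer and reader; a minimal round-trip sketch through an in-memory buffer, assuming both functions accept any binary file-like object and a simple primitive schema:

import io
import fastavro

buf = io.BytesIO()

# write a single long value with no container header or embedded schema
fastavro.dump(buf, 1433269388, {'type': 'long'})

buf.seek(0)
value = fastavro.load(buf, {'type': 'long'})  # -> 1433269388
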
fastavro.schemaless_reader(fo, schema)

Read a single record written using the schemaless_writer.

fo: file-like
    Input stream
schema: dict
    Reader schema
fastavro.reader

alias of fastavro._read_py.iter_avro

fastavro.is_avro(path_or_buffer)

Return True if path (or buffer) points to an Avro file.

path_or_buffer: path to file or file-like object
    Path to the file (or a file-like object) to check
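
For example, assuming weather.avro was produced by the writer example above:

import fastavro

print(fastavro.is_avro('weather.avro'))   # True for an Avro container file

with open('weather.avro', 'rb') as fo:    # a file-like object works as well
    print(fastavro.is_avro(fo))
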
fastavro.schemaless_writer(fo, schema, record)

Write a single record without the schema or header information.

fo: file-like
    Output file
schema: dict
    Schema
record: dict
    Record to write
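
schemaless_writer and schemaless_reader are meant to be used together; a minimal round-trip sketch through an in-memory buffer (the buffer and the record shown are illustrative):

import io
from fastavro import schemaless_writer, schemaless_reader

schema = {
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'time', 'type': 'long'},
        {'name': 'temp', 'type': 'int'},
    ],
}

record = {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388}

buf = io.BytesIO()
schemaless_writer(buf, schema, record)        # no header, no embedded schema

buf.seek(0)
same_record = schemaless_reader(buf, schema)  # the reader needs the same schema
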
fastavro.acquaint_schema(schema)

Add a new schema to the schema repo.

schema: dict
    Schema to add to repo
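
This is useful when one schema refers to another named type: register the named type first, then reference it by name. A minimal sketch, assuming the repo is consulted when named types are resolved (the test.Location / test.Station schemas are made up for illustration):

import fastavro

# register a named record so other schemas can refer to it by name
fastavro.acquaint_schema({
    'name': 'test.Location',
    'type': 'record',
    'fields': [
        {'name': 'lat', 'type': 'double'},
        {'name': 'lon', 'type': 'double'},
    ],
})

station_schema = {
    'name': 'test.Station',
    'type': 'record',
    'fields': [
        {'name': 'id', 'type': 'string'},
        {'name': 'location', 'type': 'test.Location'},  # resolved from the repo
    ],
}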
