Welcome to fastavro’s documentation!
The current Python avro package is packed with features but dog slow.
On a test case of about 10K records, it takes about 14 sec to iterate over all of them. In comparison, the Java avro SDK does it in about 1.9 sec.
fastavro is less feature complete than avro, but it’s much faster. It iterates over the same 10K records in 2.9 sec, and if you use it with PyPy it’ll do it in 1.5 sec (to be fair, the Java benchmark is doing some extra JSON encoding/decoding).
If the optional C extension (generated by Cython) is available, then fastavro will be even faster. For the same 10K records it’ll run in about 1.7 sec.
You can also use the fastavro script from the command line to dump avro files. Each record will be dumped to standard output in one line of JSON.
fastavro weather.avro
You can also dump the avro schema:
fastavro --schema weather.avro
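Because each record is emitted as one line of JSON, the dump is easy to post-process. The following is a minimal sketch using only the standard library; the helper name parse_dump and the sample lines are illustrative assumptions (the sample is modelled on the weather records used in this documentation, not captured from a real run).

```python
import json


def parse_dump(lines):
    """Yield one dict per non-empty line of JSON text."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Illustrative stand-in for the script's output (not captured from a real run).
sample_output = (
    '{"station": "011990-99999", "time": 1433269388, "temp": 0}\n'
    '{"station": "011990-99999", "time": 1433270389, "temp": 22}\n'
)

records = list(parse_dump(sample_output.splitlines()))

# Keep only the readings with a temperature above 10.
warm = [r for r in records if r["temp"] > 10]
print(len(records), len(warm))
```

In practice the same generator could be fed from sys.stdin, so that the script's output is piped straight into a small filter program.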
fastavro script
usage: fastavro [-h] [--schema] [--version] file [file ...]

iter over avro file, emit records as JSON

positional arguments:
  file        file(s) to parse

optional arguments:
  -h, --help  show this help message and exit
  --schema    dump schema instead of records
  --version   show program's version number and exit
fastavro module
Fast Avro file iteration.
Most of the code here is ripped off the Python avro package. It’s missing a lot of features in order to get speed.
The main interface function is iter_avro (also available under the alias fastavro.reader). Example usage:

    # Reading
    import fastavro

    with open('some-file.avro', 'rb') as fo:
        reader = fastavro.reader(fo)
        schema = reader.schema

        for record in reader:
            process_record(record)
    # Writing
    from fastavro import writer

    schema = {
        'doc': 'A weather reading.',
        'name': 'Weather',
        'namespace': 'test',
        'type': 'record',
        'fields': [
            {'name': 'station', 'type': 'string'},
            {'name': 'time', 'type': 'long'},
            {'name': 'temp', 'type': 'int'},
        ],
    }

    # 'records' can be an iterable (including generator)
    records = [
        {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
        {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
        {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
        {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
    ]

    with open('weather.avro', 'wb') as out:
        writer(out, schema, records)
class fastavro.iter_avro(fo, reader_schema=None)
    Iterator over avro file.
fastavro.load(fo, writer_schema, reader_schema=None)
    Read data from file object according to schema.
fastavro.writer(fo, schema, records, codec='null', sync_interval=16000, metadata=None, validator=None)
    Write records to fo (stream) according to schema.

    - fo: file like
        Output stream
    - records: iterable
        Records to write
    - codec: string, optional
        Compression codec, can be ‘null’, ‘deflate’ or ‘snappy’ (if installed)
    - sync_interval: int, optional
        Size of sync interval
    - metadata: dict, optional
        Header metadata
    - validator: None, True or a function
        Validator function. If None (the default), no validation is performed. If True, fastavro.writer.validate will be used. If it’s a function, it should have the same signature as fastavro.writer.validate and raise an exception on error.
    >>> from fastavro import writer

    >>> schema = {
    ...     'doc': 'A weather reading.',
    ...     'name': 'Weather',
    ...     'namespace': 'test',
    ...     'type': 'record',
    ...     'fields': [
    ...         {'name': 'station', 'type': 'string'},
    ...         {'name': 'time', 'type': 'long'},
    ...         {'name': 'temp', 'type': 'int'},
    ...     ],
    ... }

    >>> records = [
    ...     {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
    ...     {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
    ...     {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
    ...     {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
    ... ]

    >>> with open('weather.avro', 'wb') as out:
    ...     writer(out, schema, records)
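The validator parameter accepts a custom function, so a sketch of one may help. The example below assumes the validator is called with a record and the schema, in that order, and signals failure by raising; that calling convention is an assumption based on the description of fastavro.writer.validate above, and the function name require_fields is hypothetical. The sketch itself only uses plain Python and does not call fastavro.

```python
def require_fields(record, schema):
    """Raise ValueError if the record lacks a field named in the schema.

    Assumed calling convention: validator(record, schema).
    """
    expected = [field['name'] for field in schema['fields']]
    missing = [name for name in expected if name not in record]
    if missing:
        raise ValueError('record is missing fields: %s' % ', '.join(missing))
    return True


schema = {
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'time', 'type': 'long'},
        {'name': 'temp', 'type': 'int'},
    ],
}

good = {'station': '011990-99999', 'time': 1433269388, 'temp': 0}
bad = {'station': '011990-99999', 'temp': 22}  # 'time' is absent

require_fields(good, schema)  # passes silently

try:
    require_fields(bad, schema)
except ValueError as exc:
    error = str(exc)
    print(error)
```

Under the assumption above, such a function would be passed as writer(out, schema, records, validator=require_fields), so that a malformed record stops the write instead of producing a corrupt file.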
fastavro.dump(fo, datum, schema)
    Write a single datum to the output stream.

    - fo: file like
        Output file
    - datum: object
        Data to write
    - schema: dict
        Schema to use
fastavro.schemaless_reader(fo, schema)
    Reads a single record written using the schemaless_writer.

    - fo: file like
        Input stream
    - schema: dict
        Reader schema
fastavro.reader
    alias of fastavro._read_py.iter_avro
fastavro.is_avro(path_or_buffer)
    Return True if path (or buffer) points to an Avro file.

    - path_or_buffer: path to file or file-like object
        Path to file
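To illustrate what such a check involves: the Avro specification states that object container files start with the four bytes 'O', 'b', 'j' and 0x01. The sketch below tests for that header using only the standard library. It is a stand-in written for illustration, not fastavro's own implementation, and the name looks_like_avro is hypothetical.

```python
import io

# Magic header of an Avro object container file, per the Avro specification.
AVRO_MAGIC = b'Obj\x01'


def looks_like_avro(path_or_buffer):
    """Return True if the path or binary buffer starts with the Avro magic bytes."""
    if isinstance(path_or_buffer, str):
        # Treat strings as file system paths.
        with open(path_or_buffer, 'rb') as fo:
            return fo.read(len(AVRO_MAGIC)) == AVRO_MAGIC
    # Otherwise assume a file-like object opened in binary mode.
    return path_or_buffer.read(len(AVRO_MAGIC)) == AVRO_MAGIC


avro_like = io.BytesIO(AVRO_MAGIC + b'remaining file content')
plain_json = io.BytesIO(b'{"not": "avro"}')

print(looks_like_avro(avro_like), looks_like_avro(plain_json))
```

Note that a check like this advances the buffer's position, so a caller that wants to parse the file afterwards would need to seek back to the start first.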
fastavro.schemaless_writer(fo, schema, record)
    Write a single record without the schema or header information.

    - fo: file like
        Output file
    - schema: dict
        Schema
    - record: dict
        Record to write
fastavro.acquaint_schema(schema)
    Add a new schema to the schema repo.

    - schema: dict
        Schema to add to repo