A tutorial for beginners with interactive examples
Avro is an open source framework within Apache's Hadoop project that provides data serialization and data exchange services for Hadoop. With Avro, big data can be exchanged among programs written in different languages. It also supports RPC for sending data. Programs can use it to serialize data efficiently into files or into messages.
Avro has the following key advantages:
1. Schema evolution
· Avro requires schemas when reading and writing data.
· Schema evolution, the ability to change a schema over time, is the most exciting feature of Avro.
· Avro schemas are defined using JSON.
A sample schema is defined as below (Student.avsc):
{
  "type": "record",
  "name": "student",
  "fields":
  [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "emails", "type": {"type": "array", "items": "string"}},
    {"name": "student", "type": ["student","null"]}
  ]
}
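Because an Avro schema is plain JSON, it can be inspected with any JSON parser. The following is a minimal sketch that uses only Python's standard `json` module, not an Avro library; the helper `describe_type` and the variable names are made up for illustration.

```python
import json

# The Student.avsc schema, embedded as a string so the sketch is self-contained.
STUDENT_AVSC = """
{
  "type": "record",
  "name": "student",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "emails", "type": {"type": "array", "items": "string"}},
    {"name": "student", "type": ["student", "null"]}
  ]
}
"""

schema = json.loads(STUDENT_AVSC)


def describe_type(avro_type):
    """Return a short description of an Avro type declaration (hypothetical helper)."""
    if isinstance(avro_type, str):
        # A primitive name such as "int", or a reference to a named type.
        return avro_type
    if isinstance(avro_type, list):
        # A JSON array in a schema denotes a union of the listed types.
        return "union of " + ", ".join(describe_type(t) for t in avro_type)
    if avro_type["type"] == "array":
        return "array of " + describe_type(avro_type["items"])
    return avro_type["type"]


field_summary = {f["name"]: describe_type(f["type"]) for f in schema["fields"]}
for name, description in field_summary.items():
    print(f"{name}: {description}")
```

Note that the last field refers back to the record's own name, so a student record can optionally contain another student record.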
2. Untagged data
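· Because the schema is available when data is read, values are written without field names or type tags.
· Data can be encoded in two ways: binary or JSON.
· Switching from the JSON encoding to the binary encoding gives better performance.
· The binary encoding is compact and makes data processing faster.
· The schema can also be used to validate data in a JSON protocol.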
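To see why untagged binary data is compact, here is a hand-written sketch of how Avro's binary encoding lays out a record: an `int` or `long` is written as a zig-zag variable-length integer, a `string` as its byte length followed by UTF-8 bytes, and a record as its field values in schema order. It assumes a reduced, hypothetical two-field schema (`name`, `age`) and is not a replacement for an Avro library; the function names are illustrative.

```python
import json


def encode_long(n):
    """Encode an integer the way Avro encodes int/long: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small negative numbers become small unsigned ones
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Encode a string as its UTF-8 byte length (a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_student(record):
    """Encode the hypothetical two-field record: values only, in schema order."""
    return encode_string(record["name"]) + encode_long(record["age"])


record = {"name": "Asha", "age": 21}
binary = encode_student(record)
as_json = json.dumps(record).encode("utf-8")
print(len(binary), "bytes in binary,", len(as_json), "bytes as JSON text")
```

The binary form carries no field names or type markers, which is why a reader needs the schema to make sense of it.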
3. Dynamic typing
· Serialization and deserialization work without generating code.
· A record is a set of name-value pairs, where each name is a field name and each value is of an Avro-supported type.
· This approach is referred to as “generic”, as opposed to the “static” code generation approach.
· It is very easy to use.
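The "generic" approach can be illustrated with an ordinary dictionary that is checked against a schema at run time, with no generated classes. The sketch below uses a hand-rolled checker, not an Avro library API; the function `matches`, the table `PRIMITIVE_CHECKS`, and the reduced schema are assumptions made for illustration, and only a few Avro types are covered.

```python
# Run-time checks for a subset of Avro primitive types.
PRIMITIVE_CHECKS = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool)
    and -2**31 <= v < 2**31,
    "long": lambda v: isinstance(v, int) and not isinstance(v, bool)
    and -2**63 <= v < 2**63,
    "double": lambda v: isinstance(v, float),
    "string": lambda v: isinstance(v, str),
}


def matches(schema, value):
    """Return True if value conforms to the (simplified) schema."""
    if isinstance(schema, str):
        if schema not in PRIMITIVE_CHECKS:
            raise ValueError(f"unsupported type in this sketch: {schema}")
        return PRIMITIVE_CHECKS[schema](value)
    if isinstance(schema, list):
        # A union matches if any of its branches matches.
        return any(matches(branch, value) for branch in schema)
    if schema["type"] == "array":
        return isinstance(value, list) and all(
            matches(schema["items"], item) for item in value
        )
    if schema["type"] == "record":
        names = [f["name"] for f in schema["fields"]]
        return (
            isinstance(value, dict)
            and set(value) == set(names)
            and all(matches(f["type"], value[f["name"]]) for f in schema["fields"])
        )
    raise ValueError(f"unsupported type in this sketch: {schema['type']}")


# A reduced, non-recursive variant of the student schema.
SCHEMA = {
    "type": "record",
    "name": "student",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "emails", "type": {"type": "array", "items": "string"}},
    ],
}

good = {"name": "Asha", "age": 21, "emails": ["asha@example.com"]}
bad = {"name": "Asha", "age": "twenty-one", "emails": []}
print(matches(SCHEMA, good), matches(SCHEMA, bad))
```

The record is just a dictionary keyed by field name, so the same code can handle any schema it is given.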
Avro supports various data types.
Primitive type names:
· null : no value
· boolean : a binary value
· int : 32-bit signed integer
· long : 64-bit signed integer
· float : single precision (32-bit) IEEE 754 floating-point number
· double : double precision (64-bit) IEEE 754 floating-point number
· bytes : sequence of 8-bit unsigned bytes
· string : Unicode character sequence
Sample declaration of a primitive type:
{"type": "string"}
Complex type names:
· records
· enums
· arrays
· maps
· unions
· fixed
Sample declarations of complex types:
{ "type": "enum",
"name": "Suit",
"symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]
}
{ "type": "array", "items": "string" }
{ "type": "map", "values": "long" }
Avro uses a compact binary data format, which results in fast serialization times. Its concept of a schema is similar to that of Protocol Buffers, but Avro is a better fit for Hadoop workloads because it works natively with MapReduce.
Schema Resolution: A reader of Avro data, whether from an RPC or a file, can always parse that data because its schema is provided. But that schema may not be exactly the schema that was expected. We call the schema used to write the data the writer's schema, and the schema that the application expects the reader's schema. For example, if the data was written with a different version of the software than the one reading it, then records may have had fields added or removed.
Apache Avro is a distinctive serialization framework.
In the upcoming post, we'll walk through a program that serializes and deserializes data using Avro and see how Avro works.
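The record rules of schema resolution can be sketched in a few lines: a field present only in the writer's schema is ignored, and a field present only in the reader's schema takes its declared default (or resolution fails if it has none). The sketch below works on an already-decoded dictionary for simplicity, whereas an Avro library applies these rules while decoding; `resolve_record` and both schemas are hypothetical.

```python
import copy


def resolve_record(datum, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field known to both sides: keep the written value.
            resolved[name] = datum[name]
        elif "default" in field:
            # Field added on the reader side: fall back to its default.
            resolved[name] = copy.deepcopy(field["default"])
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields that only the writer knew about are simply dropped.
    return resolved


# Older version of the software: has a "nickname" field, no "emails".
WRITER_SCHEMA = {
    "type": "record",
    "name": "student",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "nickname", "type": "string"},
    ],
}

# Newer version: "nickname" removed, "emails" added with a default.
READER_SCHEMA = {
    "type": "record",
    "name": "student",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "emails", "type": {"type": "array", "items": "string"}, "default": []},
    ],
}

written = {"name": "Asha", "age": 21, "nickname": "Ash"}
resolved = resolve_record(written, WRITER_SCHEMA, READER_SCHEMA)
print(resolved)
```

Giving new fields a default is what lets a newer reader consume data produced by an older writer.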











