File formats
Avro: row-based storage. Good when reads and writes touch whole records (most or all columns at once). Commonly used with Kafka and Druid. Writes are fast; reads for column-oriented analysis are slower.
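A minimal round-trip sketch using the fastavro library (a third-party package, assumed installed with "pip install fastavro"; Apache's own avro package works similarly):

    from fastavro import parse_schema, reader, writer

    # Avro needs an explicit schema; each record is stored as a complete row.
    schema = parse_schema({
        "name": "Event",
        "type": "record",
        "fields": [
            {"name": "user_id", "type": "long"},
            {"name": "action", "type": "string"},
        ],
    })

    records = [{"user_id": 1, "action": "click"},
               {"user_id": 2, "action": "view"}]

    with open("events.avro", "wb") as out:
        writer(out, schema, records)  # fast: rows are appended sequentially

    with open("events.avro", "rb") as inp:
        for record in reader(inp):    # reads whole rows; no column pruning
            print(record)

Because every record lands on disk as one complete row, appends are cheap, but a query that needs only a single field still has to deserialize each full record.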
Parquet: columnar storage. Reads are fast, writes can be slow. Good for querying; Spark uses it as its default format.
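A minimal sketch with the pyarrow library (an assumption; in Spark the equivalent round trip is df.write.parquet(...) and spark.read.parquet(...)), showing the columnar payoff of reading only the columns a query needs:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small table and write it as Parquet (columnar layout on disk).
    table = pa.table({
        "user_id": [1, 2, 3],
        "action": ["click", "view", "click"],
        "latency_ms": [12, 48, 7],
    })
    pq.write_table(table, "events.parquet")

    # The columnar win: load just the needed columns and skip the rest.
    subset = pq.read_table("events.parquet", columns=["user_id", "latency_ms"])
    print(subset)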
ORC: columnar storage. Reads are fast, writes can be slow. Especially used in Hive.
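The same round trip for ORC, again sketched with pyarrow's orc module (an assumption; writing ORC needs a reasonably recent pyarrow, and in practice ORC tables are more often created from Hive itself, e.g. with CREATE TABLE ... STORED AS ORC):

    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({"user_id": [1, 2], "action": ["click", "view"]})
    orc.write_table(table, "events.orc")

    # Like Parquet, ORC supports column pruning on read.
    subset = orc.ORCFile("events.orc").read(columns=["action"])
    print(subset)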
Source: https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/
Different Hadoop applications also have different affinities for the three file formats. ORC is commonly used with Apache Hive, and since Hive is essentially managed by engineers working for Hortonworks, ORC data tends to congregate in companies that run Hortonworks Data Platform (HDP). Presto is also affiliated with ORC files.
Similarly, Parquet is commonly used with Impala, and since Impala is a Cloudera project, it’s commonly found in companies that use Cloudera’s Distribution of Hadoop (CDH). Parquet is also used in Apache Drill, which is MapR’s favored SQL-on-Hadoop solution; Arrow, the file-format championed by Dremio; and Apache Spark, everybody’s favorite big data engine that does a little of everything.
Avro, by comparison, is the file format often found in Apache Kafka clusters, according to Nexla. Avro is also the favored big data file format used by Druid, the high performance big data storage and compute platform that came out of Metamarkets and was eventually picked up by Yahoo, the Nexla folks say.
What file format you use to architect your big data solution is important, but it’s just one consideration among many. For more information on the differences among ORC, Avro, and Parquet, check out Nexla’s white paper.