Alfresco: Hive Optimization Techniques

Friday, November 22, 2019

Hive Optimization Techniques

==========================================
set operations

Union, munus, Intersect
Only Union is supported in hive, minus and Intersect are not supported.
We can write the query to achieve minus and Intersect.

==========================================

Sub queries

IN , NOT IN
exists , not exists
Only these 2 are supported in hive

==========================================
Views
Logical table.
Creating the table which will have subset of the columns.

create view viewname as (query)

Will give security.
can create multiple logical table.

hiding complexity of the query.
This query will be stored in metastore.

==========================================
Normalization vs Denormalization

Normalization : Data should not be redundant
Dividing bigger table into multiple smaller table to remove with
the intent to remove redundancy

Normalization is good for transactional db.

Denormalization : keep the data in big table rather than small tables.
it will have issue of data redundancy.

It will help in reducing the number of joins.

Hive Basic Concepts:
Datawarehouse
Data
metadata : metastore
managed table, external table, temporary table
loading data from local and from hdfs
complex data types.
UDF, UDAF, UDTF
Views
Normalization and deNormalization

==========================================

Hive Optimization techniques:
1. table structure level Optimization
Partitioning and bucketing

2. Optimization of hive query

1. table structure level Optimization
Logically segreagate the data
Partitioning and bucketing

Static and Dynamic Partitioning

We can have hierarchy of the folders in partitioning.
We dont have control on the number of partitions created.
Partition is the folder
==========================================
Bucketing tables :

How bucketing works?
Bucket value has to be fixed while creating table.
Each bucket is a file.

hash function used for inserting will be used by searching also to decide
which bucket to read.

Whenever we have lot of distinct values we can use bucketing
==========================================
We can use combination of partitioning and bucketing to organize the data.

Alfresco

Friday, November 22, 2019

Hive Optimization Techniques

No comments:

Post a Comment