Powered By Blogger

Friday, November 22, 2019

Hive Optimization Techniques

==========================================
set operations

    Union, munus, Intersect
    Only Union is supported in hive, minus and Intersect are not supported.
    We can write the query to achieve minus and Intersect.

==========================================

Sub queries

    IN , NOT IN
    exists , not exists
Only these 2 are supported in hive

==========================================
Views
Logical table.
Creating the table which will have subset of the columns.

create view viewname as (query)

Will give security.
can create multiple logical table.

hiding complexity of the query.
This query will be stored in metastore.

==========================================
Normalization vs Denormalization

 Normalization : Data should not be redundant
 Dividing bigger table into multiple smaller table to remove  with
 the intent to remove redundancy

Normalization is good for transactional db.


Denormalization : keep the data in big table rather than small tables.
it will have issue of data redundancy.

It will help in reducing the number of joins.



Hive Basic Concepts:
    Datawarehouse
    Data
    metadata : metastore
    managed table, external table, temporary table
    loading data from local and from hdfs
    complex data types.
    UDF, UDAF, UDTF
    Views
    Normalization and deNormalization

==========================================

Hive Optimization techniques:
1. table structure level Optimization
        Partitioning and bucketing

2. Optimization of hive query

1. table structure level Optimization
    Logically segreagate the data
    Partitioning and bucketing

Static and Dynamic Partitioning

We can have hierarchy of the folders in partitioning.
We dont have control on the number of partitions created.
Partition is the folder
==========================================
Bucketing tables :

How bucketing works?
Bucket value has to be fixed while creating table.
Each bucket is a file.

hash function used for inserting will be used by searching also to decide
which bucket to read.

Whenever we have lot of distinct values we can use bucketing
==========================================
We can use combination of partitioning and bucketing to organize the data.

No comments:

Post a Comment