Alfresco: Hive optimizations

Join column- Hive Optimizations - Hive session 6

For each column there will be one MR will run.

to optimize we have to reduce joins and minimize MR.

bucketing helps : as it scans less number of records

join column will be the key of the mapper output.
in reduce results will be aggregated.

Hive wrapper helps in writing the job.
We can remove the the reduce phase for certain queries. This join is called
as map join as it does not require shuffling.

Inner join/common join : Only matching records from both the tables

left outer join : matching records + all records from left + padded with nulls

right outer join : matching records + all records from right + padded with nulls

full outer join : union of left outer join and right outer join

If we have 2 tables, left table 10mb and right table 100gb

============================================================
Join side Optimizations
Bucket Map join
Sort Merge Bucket join (SMB)
============================================================
map side join

small table will go to all the machines.

big table will be distributed across the machines, and on each machine
left table data will go and sit.

inner joins can be trated as map side join, if the table is small.
by default 25MB is the size of the small table. if needed we can change
the size of the table.

left outer join : matching records + table from left.
since the left table is small,we will not know what to do for
non matching records , so we cannot use map join.
left outer join does not support map side join even though the left table
is small

right outer join : matching records + records from right
If the left table is small and right is big, as per the definition
right outer join is possible.

full outer join : left outer join + right outer join, as left outer join
does not support map side join.

============================================================
If the right table is small then the above scenarios will be reversed

============================================================
Join side Optimizations
Bucket Map join
Sort Merge Bucket join (SMB)

============================================================
Bucket Map join
In the Map join, it expects one table is small.
Bucket Map join can work with 2 big tables.

Both the tables should be bucketed on join column.

Buckets in one table should be an integral multiple of number
ofbuckets in other table

2 2
2 4
2 6
3 6

Load only one bucket in memory.
If it is not integral multiple it will not be performant.

If the hash function used is same in both the tables, system will
not know which buket needs to be loaded.

============================================================

SMB : Sort merge Bucket join

Tables can be big
Both the tables should be bucketed on join column.
Number of buckets in both the table should be same.

Data is sorted based on join columns in both the tables.

one to one mapping from source and dest will happen as both are sorted.
but effort is involved in sorting data.

============================================================

Hive :
Datawarehouse
Database and Datawarehouse difference.
Data is stored in table. Data is in hdfs. /usr/hive/warehouse

metastore - schema is stored
derby - will store metadata by default

managed table/ external table
Dropping data : managed table - data will be deleted

load tables using the files, load can be data from

views
normalization and denormalization

define user defined functions

optimisations

structural optimisations
partitionsing : distinct values less , folder
bucketing : file.

Query optimisations :
use less joins
map side join : small table will be fit into memory. We can revers table
Bucket map join : bucketed on join columns, big table join possible
smb : buckets should be same and data should be sorted.

Alfresco

Saturday, November 23, 2019

Hive optimizations

No comments:

Post a Comment