
Saturday, October 19, 2019

BigData concepts in a nutshell

HDP 1.0 had only HDFS and MapReduce (MR).

From HDP 2.0 onwards we have HDFS, MR and YARN.

With Spark, YARN, Mesos or Kubernetes can be used as the resource negotiator (cluster manager).

HDFS : Addresses distributed storage issues.
Pig : It is a scripting language that can be used for ETL-style jobs. In a data pipeline we typically clean the data using Pig scripts and then load it into a Hive table.
HBase : It is a NoSQL database that provides ACID behaviour at the row level. With plain Hive tables we cannot update or delete an individual record (newer Hive versions allow this only on transactional tables). With HBase we can update or delete a record.

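Hive and HBase can be integrated: through Hive's HBase storage handler, a Hive table can be backed by an HBase table, so the same data is reachable from both.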

Oozie : It is used for scheduling jobs and workflows.

Usually, scenarios that involve processing a whole table go to Hive, while use cases that look up records by key go to HBase.


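A Pig script that only loads, filters and transforms records runs as a map-only job. Operations such as GROUP, JOIN and ORDER BY add a reduce phase, as in the other MapReduce-based components.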



The balanced approach of 2 racks and 3 copies in the rack awareness mechanism is adopted to
minimize write bandwidth and maximize redundancy.
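The 2-racks, 3-copies idea can be sketched in a few lines of Python. This is only an illustration, assuming the commonly described default HDFS policy (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack). The function name place_replicas and the topology layout are hypothetical, and the sketch assumes at least one remote rack has two or more nodes.

```python
import random


def place_replicas(topology, writer_node, rng=None):
    """Pick three nodes for one block: 1 on the local rack, 2 on one remote rack.

    topology: dict mapping rack name -> list of node names (hypothetical layout).
    """
    rng = rng or random.Random()
    # Find the rack that holds the writer.
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)

    # Replica 1: the writer's own node, so the first copy needs no network hop.
    first = writer_node

    # Replica 2: a node on a different rack, so one rack failure cannot lose the block.
    remote_racks = [
        r for r, nodes in topology.items() if r != local_rack and len(nodes) >= 2
    ]
    remote_rack = rng.choice(remote_racks)
    second = rng.choice(topology[remote_rack])

    # Replica 3: another node on the same remote rack, so only one copy
    # has to cross the rack boundary during the write.
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    return [first, second, third]
```

With three copies spread over exactly two racks, the write pipeline crosses racks once, and the block still survives the loss of either rack.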

The NameNode federation concept is meant for:
Load sharing.
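As a rough sketch of how load sharing through federation can look from the client side, the snippet below maps path prefixes to different NameNodes and resolves a path by longest-prefix match, similar in spirit to a client-side mount table. The host names, the mount table contents and the function resolve_namenode are all made up for illustration.

```python
def resolve_namenode(path, mount_table):
    """Return the NameNode responsible for a path, using longest-prefix match."""
    matches = [
        prefix
        for prefix in mount_table
        if path == prefix or path.startswith(prefix.rstrip("/") + "/")
    ]
    if not matches:
        raise KeyError(f"no NameNode mounted for {path}")
    # The most specific (longest) prefix wins.
    return mount_table[max(matches, key=len)]


# Hypothetical split of the namespace across three NameNodes.
mount_table = {
    "/user": "nn1.example.com",
    "/data": "nn2.example.com",
    "/data/logs": "nn3.example.com",
}
```

Each NameNode then holds metadata only for its own part of the namespace, which is what spreads the load.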

In which of the following scenarios can introducing a combiner lead to wrong results?
Calculating the average.
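A small Python illustration of why averaging is the problem case: if the combiner emits a per-split average and the reducer averages those, splits of different sizes get equal weight. The numbers are made up; the usual workaround, shown here as a sketch, is to have the combiner emit (sum, count) pairs instead.

```python
def mean(values):
    return sum(values) / len(values)


# Two map-side splits of unequal size (made-up values).
split_a = [10, 20, 30]
split_b = [100]

# What the job should compute: the average over all records.
true_average = mean(split_a + split_b)

# Wrong: the combiner averages each split, the reducer averages the averages.
naive_average = mean([mean(split_a), mean(split_b)])


# Safe: the combiner emits (sum, count); the reducer adds them up, then divides once.
def combine(values):
    return (sum(values), len(values))


partials = [combine(split_a), combine(split_b)]
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
correct_average = total / count
```

Sum and count are associative and commutative, so they can be combined partially in any order; the average itself cannot.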

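Consider the following table structure: students (name STRING, id INT, subjects ARRAY, feeDetails MAP, phoneNumber STRUCT). To list the subjects taken by each student, we can use the following query, which executes successfully: select name, explode(subjects) from students;
False. explode is a table-generating function (UDTF) and cannot be selected together with other columns such as name. The query needs a LATERAL VIEW, for example: select name, subject from students lateral view explode(subjects) s as subject;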
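To make the intended result concrete, here is a rough Python sketch of what exploding the subjects array next to the name column should produce: one output row per array element, paired with that student's name. The function name and the sample rows are hypothetical, and the sketch mirrors a plain LATERAL VIEW (not the OUTER variant), so a student with an empty array yields no rows.

```python
def lateral_view_explode(rows, array_field):
    """Yield (name, element) for every element of row[array_field]."""
    for row in rows:
        for element in row[array_field]:
            yield (row["name"], element)


# Made-up sample data standing in for the students table.
students = [
    {"name": "Asha", "subjects": ["Math", "Physics"]},
    {"name": "Ravi", "subjects": []},
]
```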

Which of the following workflows is valid in MR?
map -> partition -> shuffle -> sort -> reduce.
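The order of the stages can be traced with a tiny single-process word count. This is only a sketch of the data flow, not how Hadoop implements it; the function names and the choice of two reducers are assumptions for illustration.

```python
import zlib
from collections import defaultdict


def map_phase(line):
    """map: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def partition(key, num_reducers):
    """partition: choose a reducer for a key with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(lines, num_reducers=2):
    # map + partition: every mapper output pair is assigned to one reducer.
    buckets = [[] for _ in range(num_reducers)]
    for line in lines:
        for key, value in map_phase(line):
            buckets[partition(key, num_reducers)].append((key, value))

    results = {}
    # shuffle: each reducer receives its own bucket of pairs.
    for bucket in buckets:
        # sort: pairs are ordered by key so equal keys sit together.
        bucket.sort(key=lambda kv: kv[0])
        grouped = defaultdict(list)
        for key, value in bucket:
            grouped[key].append(value)
        # reduce: one call per key over all of its values.
        for key, values in grouped.items():
            results[key] = sum(values)
    return results
```

Because the partitioner sends every occurrence of a key to the same reducer, each word is counted exactly once in the final result.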

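NameNode federation : the metadata (namespace) can be divided among several NameNodes. It is for load sharing.
Secondary NameNode : used for checkpointing, i.e. periodically merging the edit log into the fsimage, which helps recovery after a NameNode failure. It is not a hot standby for the NameNode.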
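Checkpointing can be pictured as replaying the edit log on top of the last saved namespace image. The sketch below is a simplified model, assuming only two operations ("create" and "delete") and a dictionary as the image; the function name checkpoint and the data layout are hypothetical.

```python
def checkpoint(fsimage, edit_log):
    """Apply all logged edits to a copy of the image; return (new_image, empty_log)."""
    merged = dict(fsimage)  # leave the old image untouched
    for op, path in edit_log:
        if op == "create":
            merged[path] = {}
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown op: {op}")
    # After a checkpoint the log can start again from empty.
    return merged, []
```

Keeping the edit log short this way means a restarting NameNode has fewer edits to replay.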
