Powered By Blogger

Friday, March 27, 2020

Filtering set of values from the bigger dataframe

Filtering set of values from the bigger dataframe

import spark.implicits._
    val localFileDF = spark.read.orc("/tmp/Analysis-10")
        localFileDF.filter(localFileDF("key") === "1065551413044" )



var followinglist=List("1065551413044")
val df2 = localFileDF.filter(col("key").isin(followinglist:_*))

Wednesday, March 18, 2020

back to basic- sql left join producing duplicate records


CREATE TABLE `shipment` (
  `shipmentnumber` varchar(45) DEFAULT NULL,
  `order_number` varchar(45) DEFAULT NULL,
  `order_line` varchar(45) DEFAULT NULL,
  `quantity` int DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;



CREATE TABLE `order` (
  `order_number` varchar(45) DEFAULT NULL,
  `order_line` varchar(45) DEFAULT NULL,
  `prder_type` varchar(45) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;



INSERT INTO `test`.`shipment`
(`shipmentnumber`,
`order_number`,
`order_line`,
`quantity`)
VALUES
("shipmentnumber1",
"order1",
"orderline11",
2
);

insert into test.order values("order2","orderline21","ordertype2");
insert into test.order values("order1","orderline11","ordertype1");
insert into test.order values("order2","orderline21","ordertype3");


SELECT * FROM test.`order`;


# order_number, order_line, prder_type
order2, orderline21, ordertype2
order1, orderline11, ordertype1
order2, orderline21, ordertype3


'

select * from  `test`.`shipment`\

# shipmentnumber, order_number, order_line, quantity
shipmentnumber1, order1, orderline11, 1
shipmentnumber1, order2, orderline21, 1




select * from `test`.`shipment`  as shpmnt
 left join  `test`.`order` as ord
 on
shpmnt.order_number=ord.order_number
AND
shpmnt.order_line=ord.order_line

# shipmentnumber, order_number, order_line, quantity, order_number, order_line, prder_type
'shipmentnumber1', 'order2', 'orderline21', '1', 'order2', 'orderline21', 'ordertype2'
'shipmentnumber1', 'order1', 'orderline11', '1', 'order1', 'orderline11', 'ordertype1'
'shipmentnumber1', 'order2', 'orderline21', '1', 'order2', 'orderline21', 'ordertype3'

Observe since right table is having more records we get records more than left table
============

left table also having duplicates

select * from  `test`.`shipment`

Have added duplicate record in the left table

# shipmentnumber, order_number, order_line, quantity
shipmentnumber1, order1, orderline11, 1
shipmentnumber1, order2, orderline21, 1
shipmentnumber1, order1, orderline11, 2


SELECT * FROM test.`order`;

# order_number, order_line, prder_type
order2, orderline21, ordertype2
order1, orderline11, ordertype1
order2, orderline21, ordertype3

# shipmentnumber, order_number, order_line, quantity, order_number, order_line, prder_type
shipmentnumber1, order1, orderline11, 1, order1, orderline11, ordertype1
shipmentnumber1, order1, orderline11, 2, order1, orderline11, ordertype1
shipmentnumber1, order2, orderline21, 1, order2, orderline21, ordertype2
shipmentnumber1, order2, orderline21, 1, order2, orderline21, ordertype3

all the left table records will be retained, and from the right table cross product is happening