Filtering the Data: How to Filter Data in Pig

In this module, you will learn about some of the common Pig operators for filtering data: FILTER, FOREACH, DISTINCT, and SAMPLE.
How to filter out unwanted rows
----------------------------------------------------
The FILTER operator removes unwanted rows. In other words, it selects only the tuples (rows) that are needed.
Suppose that in our relation A we need to include only those tuples that have a salary greater than 6800. The FILTER operator does this with a condition on the salary field.
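The salary filter described above can be sketched in Pig Latin as follows. The input file name `employees.txt`, the comma delimiter, and the schema (name, age, salary) are assumptions made for illustration; only the relation name A and the threshold 6800 come from the description.

```pig
-- Hypothetical input: employees.txt with comma-separated name, age, salary
A = LOAD 'employees.txt' USING PigStorage(',')
    AS (name:chararray, age:int, salary:int);

-- Keep only the tuples whose salary is greater than 6800
B = FILTER A BY salary > 6800;

-- Print the filtered relation
DUMP B;
```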
How to filter out unwanted columns
----------------------------------------------------------
The FOREACH operator removes unwanted columns. In other words, it selects only the columns that are needed.
Suppose that in our relation A we need to include only the name and age columns. We can use the FOREACH operator as shown in the example below.
B = FOREACH A GENERATE name, age;
How to remove duplicates
-------------------------------------------
The DISTINCT operator removes row-level duplicates from a relation.
B = DISTINCT A;
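DISTINCT compares whole tuples, so two rows count as duplicates only when every field matches. A minimal sketch of removing duplicates on selected columns only, assuming relation A has the name and age fields used earlier, is to project those columns first. The relation names C and D are chosen here purely for illustration.

```pig
-- Project the columns to compare (assumes A has fields name and age)
C = FOREACH A GENERATE name, age;

-- Remove duplicate (name, age) tuples
D = DISTINCT C;

-- Print the de-duplicated relation
DUMP D;
```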
How to extract sample data
---------------------------------------------
The SAMPLE operator selects a random sample of data from a relation.
For example, if we need to select a random sample of tuples from our relation A,
we can use the SAMPLE operator.
B = SAMPLE A 0.3;
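In the example, 0.3 represents 30%, which means roughly three-tenths of the tuples will be returned as the sample.
Here this lists one tuple, which is about 30% of the tuples in the original relation. Because the sampling is probabilistic, the number of tuples returned is not guaranteed to be the same on every run.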
In this module, we looked at the common Pig operators for filtering data:
FILTER operator,
FOREACH operator,
DISTINCT operator, and
SAMPLE operator.