I was hunting for an article when my friend came up and asked me this question.
How to list/delete duplicate records without using ROWID?
I never asked why he do not want to use ROWID, because I was caught in some thinking.
I done some googling, and derived the method, but first the ROWID method
First how to delete records with ROWID, and why?
ROWID is the unique record identifier within Oracle, and it is easy to get the minimum of the ROWID for the duplicating criteria and delete the rest of them.
We will take the example of EMP table for simplicity. Assume that there is no primary key in the Employee_Id column due to some data migration requirements. Once the data migration is over I need to check the existense of duplicate records. This is how traditionally we used to achieve this:
Suppose the data after migration stands like this:
Employee_Id Employee_Name
10 Scott
11 Tiger
10 Amanda
10 Nagpal
20 Rawat
It is clear from this data that there are duplicate records, but in a normal data migration there will be millions of records from which identifying the duplicate records will be a nightmare. This is where the following query will save our day:
SELECT Employee_Id, Employee_Name
FROM Employee e
WHERE ROWID NOT IN (SELECT MIN(ROWID)
FROM Employee f
WHERE e.Employee_Id = f.Employee_Id);
The above query will list the following duplicate records:
Employee_Id Employee_Name
10 Amanda
10 Nagpal
We will not get into the decision of what to do with the duplicate records as it is not our discussion topic.
Now what does the MIN(ROWID) implies in the query? It implies that the first record inserted cannot be a duplicate record, and the rest of the data are. Simple isn't it?
Now how to select the same data without ROWID?
Consider the following select statement:
SELECT Employee_Id, Employee_Name,
decode(row_number() over( PARTITION BY employee_id ORDER BY employee_id),1,'Not duplicate','Dupilcate') Dupicate_Criteria
FROM employee;
The following is the data from the query:
EMPLOYEE_ID EMPLOYEE_NAME DUPICATE_CRITERIA
10 Scott Not duplicate
10 Amanda Dupilcate
10 Nagpal Dupilcate
11 Tiger Not duplicate
20 Rawat Not duplicate
In this, you can see the usage of row_number() function with over() clause. In the over() clause we are partitioning the data on basis of the duplicate criteria, that is the Employee_Id. The Order By clause is mandatory so we will use the same Employee_Id. Now the row_number() function assigns running numbers for each Employee_Id. If there are any duplicate records, the row_number() function will return any value greater than 1 (in the running order).
So it is very easy for us to identify the usage of row_number() function and to find an alternative method of getting duplicate rows without using ROWID in our query. Now it is in your hands to do more research and find out more usages.
Thanks for Subesh Kumar who prompted me for a short research in this topic, and hence this article.