Costing for a full-merge with duplicate values

If the outer table in a merge join has duplicate values, the inner table must be accessed from the root page of the index for each duplicate value. This query is the same as the previous example:

select t1.c2, t2.c2
from t1, t2
where t1.c1 = t2.c1
and t1.c1 >= 1000 and t1.c1 < 1100

If t1 is the outer table, and there are duplicate values for some of the rows in t1, so that there are 120 rows between 1000 and 1100,with 20 duplicate values, then each time one of the duplicate values is accessed, the scan of t2 is restarted from the root page of the index. If one row for t2 matches each value from t1, the I/O costs for this query are:

3 index pages to position on the first matching row of t1
Scanning 12 pages of t1
3 index pages to position on the first matching row of t2, plus an I/O to read the data page
For the remaining rows:
- If the value from t1 is a duplicate, the scan of t2 restarts from the root page of the index.
- For all values of t1 that are not duplicates, the scan remains positioned on the leaf level of t2. The scan on the inner table remains positioned on the leaf page as rows are returned until the next duplicate value in the outer table requires the scan to restart from the root page.

This formula gives the cost of the scan of the inner table for a merge join:

The scan size is the number of pages of the inner table that need to be read for each value in the outer table. For tables where multiple inner rows match, the scan size is the average number of pages that need to be read for each outer row.