Database Management Systems

York University
Winter 2018
The Tipping Point
using IBM DB2 v11

For this project, you will explore, albeit briefly, IBM DB2's query optimizer (v11) and how it optimizes SQL queries. The DB2 server you will use — db2ledu, and associated client db2leduc — is the default DB2 server set up on the “PRISM” eecs.yorku.ca domain. (It is running DB2 v11.)

This hosts the database stl. In particular, your task is to explore when the optimizer switches from one plan to another for the same templated query, as reduction factors are changed. These are the tipping points in the plan space.

You may work in teams of up to two people.

  The STL Database

The StL data-warehouse is for St Lawrence Booksellers, Inc. It records sales by customer and book from the years 1997 to 2003. StL is a small, boutique web-based bookseller.

The StL Data-warehouse Schema

The schema of StL-DW consists of three tables:

country VARCHAR(20)
province VARCHAR(20)
city VARCHAR(20)
language VARCHAR(20)
genre VARCHAR(20)
publisher VARCHAR(40)
price DECIMAL(5, 2)
when DATE
sale DECIMAL(6, 2)

The underlined attributes show the primary keys. In the language of data-warehousing, Customer and Book are dimension tables, and Purchase is the fact table. Customer has 100,000 rows, Book has 50,000 rows, and Purchase has 9,997,850 rows. There is a foreign key from Purchase to Customer and from Purchase to Book.

In Purchase, cust#, book#, and when are the dimension columns. The measure column is sale. It reports how much the sale was for. Each sale is equal to the book's price × qnty. The column qnty indicates how many copies of this book was bought in that given purchase. In most cases, qnty = 1. Note that sale has already accommodated qnty! In your queries, you do not have to multiply by qnty.

Since StL is a web-based book-seller, we mail the books to the customers' addresses. So one can think of "customer" as a location. Call that dimension location then. That dimension offers a natural roll-up: country > state > city > cust#.

The next dimension is book. Books have three attributes of interest: language (what language it is written in), genre (what category it is in, e.g., humor, politics, etc.), and publisher (what company publishes it). These form three dimensions on book, and offer another natural cube (treating Book itself as a small fact table).

The last dimension is time, which is represented by the attribute when in Purchase. It is of type DATE. So it does not show the time of a purchase, but reports the day of purchase. Notice there is no table Time, but we could pretend there is a virtual table for Time. E.g.,

year YEAR
month MONTH
day DAY

Thus, the time dimension provides for a natural roll-up. Standard SQL provides many functions for dealing with DATE and TIMESTAMP values. In fact, we can roll up when in many ways.

There are no NULL values in the data warehouse.


The clustered indexes are on the primary keys as follows.

  • customer: cust#
  • Book: book#
  • Purchase: cust#, book#, when

Table Purchase has two additional unclustered indexes:

  • book#
  • when
Accessing the database

The database is accessible via the “db2” command (which is the client) on any of the PRISM machines (departmentally managed machines in the eecs.yorku.ca) domain; e.g., red.eecs.yorku.ca.

The database is named stl. So that is what you need to connect to. E.g.,

% db2 "connect to stl"

The three tables are in a schema named stl. So to access them, one needs to prepend ‘stl.’ in front of the table name. E.g.,

% db2 "select count(*) from stl.purchase"

(See the other instruction pages for DB2 off of the class homepage for more help.)

The StL-DW is reasonably large. Most of the (correct) queries for this project take a few seconds to a minute or so to execute. Be patient. For real DWs, query execution time can be hours. Design your queries carefully. Do them in steps, and debug along the way. Also reuse (parts of) queries once you have them working in subsequent challenges.

  Profiling Query Plans

We shall use the tool db2expln to explore query plans that the DB2 optimizer creates.

You need to be able to “see” commands in DB2's install. From any shell you are using DB2, run

% source ~db2leduc/cshrc.V11

This sets environment variables and command paths to see the tools in the instance of the DB2 system running on PRISM.

The following command will dynamically prepare the query in file sql_file, and output and explanation of DB2's plan for it to the shell.

% db2expln -d stl -terminal -z \; -f sql_file

This assumes the SQL query in sql_file is terminated with a semicolon, as is convention. (The “-z \;” tells db2expln this. Otherwise, db2expln assumes the query text to be all on one line in the file. The “\” simply escapes the semicolon for shell, as semicolon has a special meaning in shell.)

To output into a file, say

% db2expln -d stl -o output_file -z \; -f sql_file

To prepare and explain a query just typed on the command line:

% db2expln -d stl -o output_file -z \; -q "query text here"

Design the following queries. For each, you are finding a query with two variations: the query is the same except the values of a predicate is — or predicates are — different between the two versions. You are wanting that DB2's query optimizer finds different query plans for the two versions. E.g.,

select sum(sale) as total
    from stl.purchase
    where when between '01/01/2001' and '12/31/2001';


select sum(sale) as total
    from stl.purchase
    where when between '12/01/2001' and '12/31/2001';

Design queries with variants for the following.

  1. Design a simple, one-relation query that with two variants that result in different access paths.

  2. Design a query that uses aggregation that requires an explicit sort for the aggregation in the plan for one variant, but does not need the sort for the aggregation in the other variant — that is, it does the aggregation on the fly.

  3. Design a query that results in a plan that uses an index-nested loop join for one variant, but not for the other variant.

  4. Design a query that results in plans with different join orders for the two variants.

  5. Design another query of your choice where the two variants result in very different plans. Explain why the optimizer is likely choosing the plan for each variant.


Make a directory named tipping for your project.

Within the directory, put the SQL for your queries into files named oneA.sql, oneB.sql, twoA.sql, twoB.sql, threeA.sql, threeB.sql, fourA.sql, fourB.sql, fiveA.sql, and fiveB.sql, where one, two, three, four, and five correspond to the queries in the above section, and A and B for each represent the two variants for each.

Save the results of db2expl for each in oneA.plan, oneB.plan, twoA.plan, twoB.plan, threeA.plan, threeB.plan, fourA.plan, fourB.plan, fiveA.plan, and fiveB.plan, for the queries, respectively.

Write a text file named report.txt Explain why each of the three query variations result in different query plans. Keep the report simple (say, under 1,000 words).

Submit the directory as follows.

% submit 4411 tipping tipping

Due before midnight on April 5.

jarek gryz