CS 3140 Session 10

Session 10

More SQL: Joins and Subqueries

CS 3140
Database Systems

Opening Exercise

Back in Session 7, we used a movie database that contained two tables with identical schema:

    Alice(name, year, length, genre)
    Kurt (name, year, length, genre)

Write SQL queries to answer these requests:

List the name, genre, and length of Alice's favorite movies, sorted by genre and length and with a column header saying "length in minutes".

Find the name of the shortest movie in Kurt's collection.

(Spoiler alert: the answers come next...).

Opening Solution

These requests require features we learned about last week. The first uses aliasing and the ORDER BY clause, as well as columns and headings:

    sqlite> .mode columns
    sqlite> SELECT name, genre, length AS 'length in minutes'
       ...> FROM Alice
       ...> ORDER BY genre, length DESC;
    name                genre    length in minutes
    ------------------  -------  -----------------
    Black Panther       action   134              
    Lethal Weapon       action   109              
    The Replacements    comedy   118              
    The Big Chill       drama    105              
    The Princess Bride  fantasy  98               
    The Princess Bride  fantasy  98               
    Star Trek           scifi    132
    Blade Runner        scifi    117

I sorted genres in alphabetical order, which is ascending by default, and movie lengths in descending order. Notice how dropping the year attribute from our tuples loses the distinction between the two versions of "The Princess Bride".

SQLite Tips: If you see headers when you don't want to, or don't see them when you do, you can toggle headers with .headers [on | off] . If you'd like use the same setting every time you use sqlite3, put the command in a file named .sqliterc in your home directory. (Notice the dot in the name!) sqlite3 loads that file every time it starts up. To see a list of sqlite3's dot commands, check out this list in the Command Line Shell guide.

The second request looks tricky. It requires that we know the length of the shortest movie so that can select that movie from the database. With a subquery, though, it's a snap:

    sqlite> .mode list
    sqlite> SELECT name FROM Kurt
       ...> WHERE length = (SELECT MIN(length) FROM Kurt);
    48 Hrs.

Notice how we can use a subquery to replace what would be a local variable in a general-purpose language. This is a design pattern for SQL queries that we will see in a variety of forms over the coming weeks.

Last session, we saw how using a subquery can make a query more efficient, often dropping an order of magnitude in space, time, or both. In this case, the space and time complexity of the query are both O(n). In a general-purpose language, we can solve this problem with one pass through the database, but our SQL query requires two passes. [Is there any way to do this on one pass? Why not?]

Where Are We?

Last week, we expanded our understanding of SQL programming by looking at some of its features that enable us to do more than simply look up tuples in a relational database. When we think of those features used in conjunction with the basic form of statements:

    SELECT attributes FROM relation WHERE condition;

we can see that SQL defines syntax that replaces the loops and conditional statements we would write in a general-purpose language. SQL is a great example of designing a new language that implements lower-level design patterns as single commands.

We also learned two more things last week: other ways that SQL supports relational algebra, such as aliases for local renaming, and subqueries. Today, we'll look at correlated subqueries, a particular kind of subquery that depends on the context of the query that contains it. Then we'll revisit the idea of a join and SQL's support for them.

For today's session, we will continue to work with the Alice/Kurt database from Session 7. Fire it up now, if you haven't already:

    $ sqlite3 homework03.db
    sqlite> _

A Quick Review of Subqueries

Our query to find the name of the shortest movie in Kurt's collection in the opening exercise:

    SELECT name FROM Kurt
    WHERE length = (SELECT MIN(length) FROM Kurt);

used a subquery. A subquery is a SELECT statement nested in another statement.

In our solution, the containing query is:

    SELECT name FROM Kurt WHERE length = _____;

And the blank is filled by the subquery:

    (SELECT MIN(length) FROM Kurt)

SQL computes the value of the subquery and then plugs that value into the containing query, where it is used to compue its value.

This is an example of a subquery that returns a single value we use in a WHERE clause. Last time, we also saw that subqueries can return a relation we use in a WHERE clause and a relation we use in a FROM clause.

Subqueries have a number of benefits. In some cases, they make the code more readable. In many cases, they result in code that is more efficient to execute. Rather than compute a join or even a cross product, which is O(m*n) in both space and time, we can create queries that run in linear time and linear space or less.

Sometimes, efficiency is not the goal...

Correlated Subqueries

In all the cases we saw last session, the subquery could be executed independent of containing query. The DBMS could execute the subquery, get its value, plug that value into the containing query, and then execute it. This worked because the subquery did not refer to any name in the containing query.

Sometimes, though, we want to ask a different sort of question.

Suppose that we would like to find all the movies in Kurt's collection that are shorter than the average length of the movies in Alice's collection. We can find the answer using a subquery similar to the one we wrote for the opening exercise:

    sqlite> SELECT name, year FROM Kurt
       ...> WHERE length < (SELECT AVG(length) FROM Alice);
    name                year
    ------------------  ----
    The Princess Bride  1987
    Lethal Weapon       1987
    48 Hrs.             1982
    The Big Chill       1983

But what if we'd like to ask a more specific question: find all the movies in Kurt's collection that are shorter than the average length of the movies in Alice's collection of the same genre. That is, we want to include "The Big Chill" only if it is shorter than the average drama in Alice's collection, and "Star Trek" only if it is shorter than the average scifi flick in Alice's collection.

To compute the desired result, our subquery needs to know the genre of the movie in Kurt's table that is currently being considered:

    sqlite> SELECT name, year FROM Kurt
       ...> WHERE length <
       ...>         (SELECT AVG(length) FROM Alice
       ...>          WHERE Alice.genre = Kurt.genre);
    name                year
    ------------------  ----
    Lethal Weapon       1987
    48 Hrs.             1982

This query rejects two movies from the previous query. (If we had larger collections, this query could include new movies, which are longer than the average film but shorter than the average film in a genre.)

Notice the new WHERE clause. It refers to the current tuple in the containing query, using the same dot notation we always use to identify an attribute in a particular table.

This is an example of correlated subquery. It refers directly to a name or value in the containing query. Correlated subqueries offer a way to read every row in a table and compare its values to data in a specific row of another table.

We use a correlated subquery whenever a subquery must return a different result for each row processed by the main query. The technique enables us to answer a multi-part question where the answer depends on the value of each row in the outer query.

A correlated subquery is not especially efficient. It is evaluated once for each row processed by the outer query. This is O(m*n) in time, like a cross product or a join. However, it does not have to store the entire product relation, as it works one tuple at a time. So it uses less space than a full product, while giving us finer control on the processing.

Using a correlated subquery is our way of simulated nested loops!

A Quick Exercise

Write an SQL query with to answer this question:

find all the movies in Kurt's collection that are shorter than the average length of the movies of the same genre in Kurt's collection

You'll need a correlated subquery. Use our previous query as your starting point:

    SELECT name, year FROM Kurt
    WHERE length <
            (SELECT AVG(length) FROM Alice
             WHERE Alice.genre = Kurt.genre);

Hint: You will need an alias!

A Solution

What is the new wrinkle here? In the subquery, we want to select movies with the same genre as the movie in the outer query... But now the two relations share the same name, which means that we cannoy disambiguate them in the usual way.

That is where an alias can help. Let's use an AS clause to rename the table in the outer query temporarily. Then, in the subquery, the twp relations will have different names:

    sqlite> SELECT name, year
       ...> FROM Kurt AS Outer
       ...> WHERE length <
       ...>         (SELECT AVG(length) FROM Kurt
       ...>          WHERE genre = Outer.genre);
    name                year
    ------------------  ----
    Lethal Weapon       1987
    Return of the Jedi  1983
    48 Hrs.             1982

Nice!

You can also rename the inner Kurt:

    sqlite> SELECT name, year
       ...> FROM Kurt
       ...> WHERE length <
       ...>         (SELECT AVG(length) FROM Kurt AS Inner
       ...>          WHERE Kurt.genre = Inner.genre);
    name                year
    ------------------  ----
    Lethal Weapon       1987
    Return of the Jedi  1983
    48 Hrs.             1982

Note that in this case you must qualify the reference Kurt.genre.

This little trick of renaming the outer query makes it possible for us to write correlated subqueries to answer all sorts of questions. For example, how can we find names that have been used in two or more movies?

Well, we know that name and year are the key in our databases. If two movies share the same name, then they must have been released in different years. If we find a movie younger than a movie of the same name, the name is a repeat. So:

    sqlite> SELECT name
       ...> FROM Kurt AS Outer
       ...> WHERE year >
       ...>         (SELECT MIN(year) FROM Kurt
       ...>          WHERE name = Outer.name);

What about Alice's collection?

    sqlite> SELECT name
       ...> FROM Alice AS Outer
       ...> WHERE year >
       ...>         (SELECT MIN(year) FROM Alice
       ...>          WHERE name = Outer.name);
    The Princess Bride

With correlated subqueries, we must now be aware of how SQL determines the scope of names in our queries. SQL is a block-structured language and has scope rules similar to Python and Java. Whenever SQL sees an unqualified attribute, it first checks to see if one of the relations listed in the FROM clause has that attribute in its schema. If not, it looks to the immediately surrounding query. This

In the query just above, the references to name and year in the subquery can be connected to the Alice relation listed in the FROM clause. The second use of name would be, too, so we qualified it as Outer.name. In the containing query, the references to name and year are connected to the Alice relation listed in that query's FROM clause.

Do you see now why we must qualify the reference Kurt.genre in the query where we renamed the inner Kurt? There is an attribute named genre in the subquery's relation, and without qualification genre would refer to the same thing as Inner.genre!

We won't always need correlated subqueries, but when we do, they will be very helpful.

Wrap-Up

In this week's on-line session, we will learn two new ideas:

There are more kinds of join in relational algebra than natural and theta. This can be a deep rabbit hole, but it's important for you to know the basic ideas and see how they work in SQL. We'll learn enough for you to understand the basic ideas and use them in SQL. If you encounter them in industry later, you'll be ready to go deeper. Joins will also give us a chance to learn about null values in SQL.

Like ORDER BY, there is an SQL clause GROUP BY that gives us more control on how our DBMS aggregates and presents our results.

See you next week!

Eugene Wallingford ..... wallingf@cs.uni.edu ..... October 22, 2020