Back in Session 7, we used a movie database that contained two tables with identical schema:
Alice(name, year, length, genre) Kurt (name, year, length, genre)
Write SQL queries to answer these requests:
(Spoiler alert: the answers come next...).
These requests require features we learned about last week. The first uses aliasing and the ORDER BY clause, as well as columns and headings:
sqlite> .mode columns sqlite> SELECT name, genre, length AS 'length in minutes' ...> FROM Alice ...> ORDER BY genre, length DESC; name genre length in minutes ------------------ ------- ----------------- Black Panther action 134 Lethal Weapon action 109 The Replacements comedy 118 The Big Chill drama 105 The Princess Bride fantasy 98 The Princess Bride fantasy 98 Star Trek scifi 132 Blade Runner scifi 117I sorted genres in alphabetical order, which is ascending by default, and movie lengths in descending order. Notice how dropping the year attribute from our tuples loses the distinction between the two versions of "The Princess Bride".
SQLite Tips: If you see headers when you don't want to, or don't see them when you do, you can toggle headers with .headers [on | off] . If you'd like use the same setting every time you use sqlite3, put the command in a file named .sqliterc in your home directory. (Notice the dot in the name!) sqlite3 loads that file every time it starts up. To see a list of sqlite3's dot commands, check out this list in the Command Line Shell guide.
The second request looks tricky. It requires that we know the length of the shortest movie so that can select that movie from the database. With a subquery, though, it's a snap:
sqlite> .mode list sqlite> SELECT name FROM Kurt ...> WHERE length = (SELECT MIN(length) FROM Kurt); 48 Hrs.
Notice how we can use a subquery to replace what would be a local variable in a general-purpose language. This is a design pattern for SQL queries that we will see in a variety of forms over the coming weeks.
Last session, we saw how using a subquery can make a query more efficient, often dropping an order of magnitude in space, time, or both. In this case, the space and time complexity of the query are both O(n). In a general-purpose language, we can solve this problem with one pass through the database, but our SQL query requires two passes. [Is there any way to do this on one pass? Why not?]
Last week, we expanded our understanding of SQL programming by looking at some of its features that enable us to do more than simply look up tuples in a relational database. When we think of those features used in conjunction with the basic form of statements:
SELECT attributes FROM relation WHERE condition;we can see that SQL defines syntax that replaces the loops and conditional statements we would write in a general-purpose language. SQL is a great example of designing a new language that implements lower-level design patterns as single commands.
We also learned two more things last week: other ways that SQL supports relational algebra, such as aliases for local renaming, and subqueries. Today, we'll look at correlated subqueries, a particular kind of subquery that depends on the context of the query that contains it. Then we'll revisit the idea of a join and SQL's support for them.
For today's session, we will continue to work with the Alice/Kurt database from Session 7. Fire it up now, if you haven't already:
$ sqlite3 homework03.db sqlite> _
Our query to find the name of the shortest movie in Kurt's collection in the opening exercise:
SELECT name FROM Kurt WHERE length = (SELECT MIN(length) FROM Kurt);used a subquery. A subquery is a SELECT statement nested in another statement.
In our solution, the containing query is:
SELECT name FROM Kurt WHERE length = _____;
And the blank is filled by the subquery:
(SELECT MIN(length) FROM Kurt)
SQL computes the value of the subquery and then plugs that value into the containing query, where it is used to compue its value.
This is an example of a subquery that returns a single value we use in a WHERE clause. Last time, we also saw that subqueries can return a relation we use in a WHERE clause and a relation we use in a FROM clause.
Subqueries have a number of benefits. In some cases, they make the code more readable. In many cases, they result in code that is more efficient to execute. Rather than compute a join or even a cross product, which is O(m*n) in both space and time, we can create queries that run in linear time and linear space or less.
Sometimes, efficiency is not the goal...
In all the cases we saw last session, the subquery could be executed independent of containing query. The DBMS could execute the subquery, get its value, plug that value into the containing query, and then execute it. This worked because the subquery did not refer to any name in the containing query.
Sometimes, though, we want to ask a different sort of question.
Suppose that we would like to find all the movies in Kurt's collection that are shorter than the average length of the movies in Alice's collection. We can find the answer using a subquery similar to the one we wrote for the opening exercise:
sqlite> SELECT name, year FROM Kurt ...> WHERE length < (SELECT AVG(length) FROM Alice); name year ------------------ ---- The Princess Bride 1987 Lethal Weapon 1987 48 Hrs. 1982 The Big Chill 1983
But what if we'd like to ask a more specific question: find all the movies in Kurt's collection that are shorter than the average length of the movies in Alice's collection of the same genre. That is, we want to include "The Big Chill" only if it is shorter than the average drama in Alice's collection, and "Star Trek" only if it is shorter than the average scifi flick in Alice's collection.
To compute the desired result, our subquery needs to know the genre of the movie in Kurt's table that is currently being considered:
sqlite> SELECT name, year FROM Kurt ...> WHERE length < ...> (SELECT AVG(length) FROM Alice ...> WHERE Alice.genre = Kurt.genre); name year ------------------ ---- Lethal Weapon 1987 48 Hrs. 1982This query rejects two movies from the previous query. (If we had larger collections, this query could include new movies, which are longer than the average film but shorter than the average film in a genre.)
Notice the new WHERE clause. It refers to the current tuple in the containing query, using the same dot notation we always use to identify an attribute in a particular table.
This is an example of correlated subquery. It refers directly to a name or value in the containing query. Correlated subqueries offer a way to read every row in a table and compare its values to data in a specific row of another table.
We use a correlated subquery whenever a subquery must return a different result for each row processed by the main query. The technique enables us to answer a multi-part question where the answer depends on the value of each row in the outer query.
A correlated subquery is not especially efficient. It is evaluated once for each row processed by the outer query. This is O(m*n) in time, like a cross product or a join. However, it does not have to store the entire product relation, as it works one tuple at a time. So it uses less space than a full product, while giving us finer control on the processing.
Using a correlated subquery is our way of simulated nested loops!
Write an SQL query with to answer this question:
find all the movies in Kurt's collection that are shorter than the average length of the movies of the same genre in Kurt's collection
You'll need a correlated subquery. Use our previous query as your starting point:
SELECT name, year FROM Kurt WHERE length < (SELECT AVG(length) FROM Alice WHERE Alice.genre = Kurt.genre);
Hint: You will need an alias!
What is the new wrinkle here? In the subquery, we want to select movies with the same genre as the movie in the outer query... But now the two relations share the same name, which means that we cannoy disambiguate them in the usual way.
That is where an alias can help. Let's use an AS clause to rename the table in the outer query temporarily. Then, in the subquery, the twp relations will have different names:
sqlite> SELECT name, year ...> FROM Kurt AS Outer ...> WHERE length < ...> (SELECT AVG(length) FROM Kurt ...> WHERE genre = Outer.genre); name year ------------------ ---- Lethal Weapon 1987 Return of the Jedi 1983 48 Hrs. 1982
Nice!
You can also rename the inner Kurt:
sqlite> SELECT name, year ...> FROM Kurt ...> WHERE length < ...> (SELECT AVG(length) FROM Kurt AS Inner ...> WHERE Kurt.genre = Inner.genre); name year ------------------ ---- Lethal Weapon 1987 Return of the Jedi 1983 48 Hrs. 1982Note that in this case you must qualify the reference Kurt.genre.
This little trick of renaming the outer query makes it possible for us to write correlated subqueries to answer all sorts of questions. For example, how can we find names that have been used in two or more movies?
Well, we know that name and year are the key in our databases. If two movies share the same name, then they must have been released in different years. If we find a movie younger than a movie of the same name, the name is a repeat. So:
sqlite> SELECT name ...> FROM Kurt AS Outer ...> WHERE year > ...> (SELECT MIN(year) FROM Kurt ...> WHERE name = Outer.name);What about Alice's collection?
sqlite> SELECT name ...> FROM Alice AS Outer ...> WHERE year > ...> (SELECT MIN(year) FROM Alice ...> WHERE name = Outer.name); The Princess Bride
With correlated subqueries, we must now be aware of how SQL determines the scope of names in our queries. SQL is a block-structured language and has scope rules similar to Python and Java. Whenever SQL sees an unqualified attribute, it first checks to see if one of the relations listed in the FROM clause has that attribute in its schema. If not, it looks to the immediately surrounding query. This
In the query just above, the references to name and year in the subquery can be connected to the Alice relation listed in the FROM clause. The second use of name would be, too, so we qualified it as Outer.name. In the containing query, the references to name and year are connected to the Alice relation listed in that query's FROM clause.
Do you see now why we must qualify the reference Kurt.genre in the query where we renamed the inner Kurt? There is an attribute named genre in the subquery's relation, and without qualification genre would refer to the same thing as Inner.genre!
We won't always need correlated subqueries, but when we do, they will be very helpful.
In this week's on-line session, we will learn two new ideas:
See you next week!