## Online Session 9-2

### More SQL: Subqueries

#### CS 3140 Database Systems

There is a small deliverable for this session.

### Opening Exercise

On Homework 3, your worked with a database of movie data consisting of four relations:

```    Movie(title, director, year, rating)
Actor(actor, yearBorn)
Director(director, yearBorn)
StarsIn(actor, movie)
```

Then, in Online Session 9-1, we worked with an SQLite database with those schema, homework03.db, which enables us to write SQL queries to solve our homework.

Write SQL queries to answer these requests:

• Which Coen movies feature McDormand?

• Which movies were made by a director younger than 40 years old?
Feel free to use your relational queries from the homework as a roadmap, if you'd like.

### Opening Solution

These requests are simple to express in natural language, but mapping them onto relational algebra and SQL take a bit of design work. As with all programming, breaking a problem into smaller problems can often help us build a solution.

For our first query, it's straightforward to find the movies Coen made:

```    \$ sqlite3 homework03.db
sqlite> SELECT title FROM Movie WHERE director='Coen';
Fargo
Raising Arizona
```
and to find the movies starring McDormand:
```    sqlite> SELECT movie FROM StarsIn WHERE actor='McDormand';
Fargo
Raising Arizona
Wonder Boys
```

The solution asks for movies that are in both sets. That's what the set intersection operator can do for us. In SQL, that is INTERSECT. So let's use it to combine our two queries:

```    sqlite> SELECT title FROM Movie WHERE director='Coen'
...> INTERSECT
...> SELECT movie FROM StarsIn WHERE actor='McDormand';
Fargo
Raising Arizona
```

The second request asks for an answer that involves data in two different relations, too. The connection between directors and movies is in the Movie relation, while the director's year of birth is in Director. We learned that the natural join operator can create a new relation pairing tuples from Movie and Director that have the same value in the 'director' column.

In relational algebra, we simply write Movie ⋈ Director. In SQL, we write Movie NATURAL JOIN Director, but that's not a legal statement. To execute the join, we need to make it part of a SELECT statement:

```    sqlite> SELECT * FROM Movie NATURAL JOIN Director;
Fargo|Coen|1996|8.2|1954
Raising Arizona|Coen|1987|7.6|1954
Spiderman|Raimi|2002|7.4|1959
Wonder Boys|Hanson|2000|7.6|1945
```
That's a little hard to read. After last session, we can make it cleaner:
```    sqlite> .mode column
sqlite> SELECT * FROM Movie NATURAL JOIN Director;
title            director  year  rating  yearBorn
---------------  --------  ----  ------  --------
Fargo            Coen      1996  8.2     1954
Raising Arizona  Coen      1987  7.6     1954
Spiderman        Raimi     2002  7.4     1959
Wonder Boys      Hanson    2000  7.6     1945
```

This relation contains the information we need to answer the question. We want to select the tuples where the movie's year is less than 40 more than the director's year of birth, and then project the title attribute. So the SQL statement we want is:

```    sqlite> SELECT title FROM Movie NATURAL JOIN Director
...> WHERE (year - yearBorn) < 40;
title
---------------
Raising Arizona
```
It would be fun to explore a more complete database of movies to see how rare this phenomenon is, and to see whether younger or older directors tend to make higher-rated movies. Data can be fun.

### Where Are We?

In recent weeks, we have learned about relational algebra. In our previous session, we transitioned to writing SQL programs that implement queries against real databases. At their heart, these programs implement expressions in relational algebra, but they can also contain pragmatic expressions of the sort we find in general purpose languages.

In this session, we will look in more detail at SQL queries that are built out of other queries. We will find that so-called subqueries often benefit from the full range of relational algebra expressions, including renaming columns and relations using aliases.

We will again work with the movie database we used last session, homework03.db, for many of our examples. Fire it up:

```    \$ sqlite3 homework03.db
sqlite>
```
and let's get started!

### Subqueries

As we solved the problems in today's opening exercise, both of our answers used simple queries from relational algebra to build relations that we then used as parts in the query that computed our final answer. In the Coen/McDormand movies case, we wrote two selection expressions and then tied their results together using intersection. In the young directors case, we used a natural join to construct a relation from which we could select the tuples and project the column we desired.

Each solution was a compound query built out of simpler queries. These simpler queries are called subqueries. In SQL, there several other ways that subqueries can be used to help us build compound queries. In this session, we will consider three:

• A subquery can return a single value that we use in a WHERE clause.
• A subquery can return a relation that we inspect in a WHERE clause.
• A subquery can return a relation that we use as the source in a FROM clause.

### Subqueries That Compute Single Values

Last time, we learned that a query can produce a scalar value, a single value in one of the datatypes supported by the language. The examples their involved aggregation operators such as COUNT and MAX. A SELECT statement can also produce a single value. Consider the answer to which movies were made by directors younger than 40 above. That query could have produced multiple tuples, but in our small database it produced exactly one.

Some queries are designed to produce a single value. If we want to know who directed "Wonder Boys", we could write:

```    sqlite> SELECT director FROM Movie WHERE title='Wonder Boys';
Hanson
```
This answer can be useful on its own, but it can also help us compute another answer.

Suppose that we wanted instead to know what year the director of "Wonder Boys" was born. As in the 'directors under 40' problem from earlier, we will need to query two relations: one to find the director and one to find the director's year of birth. As in the 'directors under 40' solution, we could use a natural join:

```    sqlite> SELECT yearBorn FROM Movie NATURAL JOIN Director
...> WHERE title='Wonder Boys';
1945
```

This works, but it is inefficient. It computes all combinations matching Movie and Director, when we know we only need the tuple for "Wonder Boys". Can we do better?

We can, with a subquery. We need the Movie relation only to find out who directed "Wonder Boys". Once we know the director, we can use that value to query the Director relation and find the birth year:

```    sqlite> SELECT yearBorn FROM Director
...> WHERE director = (SELECT director FROM Movie
...>                   WHERE title='Wonder Boys');
1945
```
The expression in parentheses is the subquery.

The new query is longer than the original, but it is much more efficient. Instead of using natural join, an O(m*n) operation, the new query operates on the two relations separately. In the worst case, this makes the new query O(m+n); in practice, the DBMS will be able to look up single tuples more quickly than linear time.

This simple example demonstrates the value of being able to write SQL queries in many ways. When we know some of the language's features, we are able to write programs that are clearer or more efficient. Readability is as valuable in database programming as it is in other forms of programming. With very large databases, so is efficiency.

### Subqueries That Compute Relations for the WHERE Clause

SQL provides us with several features that help us work with relations as values. We will use two new features today:

• It represents a tuple as an ordered sequence of scalar values. For example, ('Hanks', 1956) is a tuple in the Actor relation. We can describe a generic tuple in Actor as (actor, yearBorn).

• It includes boolean set operators that allow us to write conditions comparing a scalar to a relation.
• v IN R returns true if v is equal to one of the values in relation R
• v > ANY R returns true if v is greater than at least one value in relation R
• v > ALL R returns true if v is greater than every value in relation R
We can use any comparison operator in place of >.

Unfortunately, SQLite does not implement ANY or ALL. Even so, IN will be quite useful.

Suppose that we wanted to know all of the years in which McDormand appeared in a movie. We could compute the product or join of Movie and StarsIn and then select tuples that meet our criteria. But that would be inefficient in both time and space.

Instead, we can use a subquery. First find all the movies that McDormand stars in by querying StarsIn, then use that relation to help select the years of those movies in Movie:

```    sqlite> SELECT year FROM Movie
...> WHERE title IN (SELECT movie FROM StarsIn
...>                 WHERE actor='McDormand');
1996
1987
2000
```

We have again replaced an O(m*n) query with one that is O(m+n) or better, using a nested subquery.

### Subqueries That Compute Relations for the FROM Clause

Up to now, the relations in our FROM clauses have all been named relations stored in the database. However, SQL also allows us to use a relation that has been computed by a subquery. If we want to refer to this relation in a condition, we can rename it using an AS alias.

For example, suppose that we wanted to know the birth years for every director who has worked with McDormand. We could compute the product or join of the Movie, StarsIn, and Director relations and then select tuples that meet our criteria. Obviously, that would be rather inefficient.

Instead, we can use a subquery to peel off one of the relations. We could first find the directors of all of McDormand's movies, then use that relation to help us find those directors' years of birth. The first query will be a subquery:

```    sqlite> SELECT yearBorn
...> FROM Director, (SELECT director
...>                 FROM Movie, StarsIn
...>                 WHERE movie = title AND
...>                       actor = 'McDormand')
...>                AS McDormandDirector
...> WHERE Director.director = McDormandDirector.director;
1954
1954
1945
```
There are repeats because Coen has directed McDormand in two films. We can use the DISTINCT modifier if we would like unique answers.

This replace an O(m*n*p) query with two queries, one that is O(m*n) and one that is O(n*p). Again, we have gained some efficiency.

We can also use of a subquery in the FROM clause to solve one of our Homework 4 queries. Problem 2 asks which manufacturers produce laptops with a hard disk of at least 100 GB. The simplest relational algebra expression to compute this result is:

```    πmaker(Product ⋈ (σhd ≥ 100(Laptop)))
```

Notice that this expression computes the natural join of a named relation, Product, and a relation computed by another query. In SQL, this will give us a subquery in the FROM clause:

```    sqlite> SELECT DISTINCT maker
...> FROM Product
...>      NATURAL JOIN
...>      (SELECT * FROM Laptop WHERE hd >= 100);
E
A
B
F
G
```

The nested query finds the subset of the Laptop relation with hard drives that meet our spec, which can them be joined with Product to find the manufacturers. This is more efficient than joining Product with the entire Laptop relation.

### More Practical SQL: ORDER BY

Let's close with a little bonus SQL.

You may have noticed that several of our queries in this session have produced results with the tuples listed in no particular order. This is a side effect of the order of tuples in the underlying relations and of the way we find the answers. But when we have our final answer, we may want to present it in a more orderly fashion.

SQL accommodates us with the ORDER BY clause. When used as the last clause in a statement, it instructs the DBMS to present the resulting tuples in the order of the listed attributes.

For example, we could list all the movies sorted by rating then by year:

```    sqlite> .mode column
...> SELECT * FROM Movie
...> ORDER BY rating, year;
title            director  year  rating
---------------  --------  ----  ------
Spiderman        Raimi     2002  7.4
Raising Arizona  Coen      1987  7.6
Wonder Boys      Hanson    2000  7.6
Fargo            Coen      1996  8.2
```

By default, SQL sorts values in ascending order. If we would prefer to see our list with higher-rated movies first, we can add the DESC keyword (short for"descending"):

```    sqlite> .mode column
...> SELECT * FROM Movie
...> ORDER BY rating DESC, year;
title            director  year  rating
---------------  --------  ----  ------
Fargo            Coen      1996  8.2
Raising Arizona  Coen      1987  7.6
Wonder Boys      Hanson    2000  7.6
Spiderman        Raimi     2002  7.4
```

We can use ORDER BY to sort McDormand's movies by the year in which they were made:

```    sqlite> SELECT year FROM Movie
...> WHERE title IN (SELECT movie FROM StarsIn
...>                 WHERE actor='McDormand')
...> ORDER BY year;
year
----
1987
1996
2000
```

Next week, we'll learn another SQL feature for presenting results in a more pleasing way. We will also see more examples of subqueries that involve joins.

### Wrap-Up

There is no big deliverable for this session. Just play! As a status check, do one task for me:

1. Create a new query of your own against the movie database we have been using this week.
2. Execute it in SQLite.