CS 3140 Online Session 10

Online Session 10

Joins and SQL

CS 3140
Database Systems

There is a small deliverable for this session.
The task is decribed at the bottom of this page.

Opening Exercise

Let's return to the movie/actor/director database originally from Homework 3, which comprises four relations:

    Movie(title, director, year, rating)
    Actor(actor, yearBorn)
    Director(director, yearBorn)
    StarsIn(actor, movie)

Last week, you wrote standard SQL queries to answer these questions:

Which Coen movies feature McDormand?
Which movies were made by a director younger than 40 years old?

This time, answer these questions with SQL statements that use subqueries.

(Spoiler alert: the answers come next...).

Opening Solution

Now that we know about subqueries, we often have multiple ways at our disposal to solve any problem. Figuring out how to use a tool to solve a particular problem helps us master the tool, helps us learn the tool's limits, and gives us options later when we need them.

For our first query, it's straightforward to find the movies Coen made:

    sqlite> SELECT title FROM Movie WHERE director='Coen';
    Fargo
    Raising Arizona

Now we need to a way to narrow our condition to keep a movie only if it stars McDormand. We know how to find those movies:

    sqlite> SELECT movie FROM StarsIn WHERE actor='McDormand';
    Fargo
    Raising Arizona
    Wonder Boys

In our previous solution, we used set intersection operator to combine our two queries. But we could also make the second one a subquery of the first:

    sqlite> SELECT title FROM Movie
       ...> WHERE director='Coen'
       ...>   AND title IN (SELECT movie FROM StarsIn
       ...>                 WHERE actor='McDormand');
    Fargo
    Raising Arizona

To answer the second request, we also need data from two different relations. Our previous solution used a natural join to create a new relation containing all the data we needed. Then we learned that a subquery can help us solve such problems, sometimes more efficiently.

This query with a subquery will do the job:

    sqlite> SELECT title FROM Movie
       ...> WHERE 40 > year -
       ...>            (SELECT yearBorn FROM Director
       ...>             WHERE director = Movie.director);
    Raising Arizona

This query saves time and space by not creating a large relation via natural join, at a smaller time cost of walking down the Director relation once for each tuple in the Movie relation. Do you think the new query is more readable than our original, or less?

Where Are We?

For the last couple of weeks, we have been learning SQL as a programming language for querying relational databases. In this session, we will extend our knowledge of join, first as an operator in the relational algebra and then as an operator in SQL. At the end of the session, we will learn one more bit of SQL that is useful for writing queries that are useful out in the world: the ability to group tuples before computing totals and presenting results.

Inner and Outer Joins

A few week's back, we learned about ways in relational algebra to pair up the tuples in two relations. Consider our old friends, relations U and V:

      U            V

    A B C        B C D
    - - -        - - -
    1 2 3        2 3 4
    5 6 7        2 3 5
    6 7 8        4 3 2
    9 7 8        7 8 0

The more specific way to pair tuples, ⋈, creates a natural join, in which tuples are paired up whenever they have the same values for all the attributes common to the two relations. R and S both have an attribute C, so it drives the natutal join:

    A B C D
    - - - -  U ⋈ V
    1 2 3 4
    1 2 3 5
    6 7 8 0
    9 7 8 0

The more flexible way to pair tuples is the theta join, ⋈_C. It pairs up tuples whenever they satisfy a given condition, C. If we'd like to pair up U and V whenever the value of A in U is less than the value of D in V, we can use a theta join:

    A U.B U.C V.B V.C D
    - --- --- --- --- -  U ⋈_A<D V
    1  2   3   2   3  4
    1  2   3   2   3  5
    1  2   3   4   3  2

When we compute a natural join or a theta join, some of the tuples from the original relations may not pair up with any other tuple and thus do not appear in the result. We refer to them as dangling tuples. In our example, the tuples (5,6,7) in U and (4,3,2) in V do not naturally join with any tuple in the other relation. As such, they do not appear in the result.

The presence of dangling tuples means that the result of a join does not represent all of the data in the original relations. We can think of natural and theta join as computing an inner join, because they include only the tuples inside the condition they use to select rows.

There may be cases in which this behavior is undesirable. To counter it, a new kind of join operator was created: the outer join, or ⟗. It includes dangling tuples in its result. The outer join is also sometimes called the full outer join, for reasons that will be apparent in few moments.

There is an outer join version of both the natural join and the theta join. For simplicity, let's focus on the natural join version here.

The outer join R ⟗ S includes all of the tuples in R ⋈ S, plus any dangling tuples from R and S. The dangling tuples do not have values for all of the attributes produced by the join, so those attributes are assigned a null value. We write the null value as ⊥.

Consider again relations U and V. The natural join U ⋈ V produces two dangling tuples: (5,6,7) in U and (4,3,2) in V. The outer join includes them, padded with null values:

    A B C D
    - - - -  U ⟗ V
    1 2 3 4
    1 2 3 5
    6 7 8 0
    9 7 8 0
    5 6 7 ⊥
    ⊥ 4 3 2

The full outer join includes all dangling tuples from either of the relations. If we want to include dangling tuples only from the lefthand relation, we can compute the left outer join, denoted ⟕.

    A B C D
    - - - -  U ⟕ V
    1 2 3 4
    1 2 3 5
    6 7 8 0
    9 7 8 0
    5 6 7 ⊥

Similarly, the right outer join, denoted ⟖, includes dangling tuples only from the righthand relation:

    A B C D
    - - - -  U ⟖ V
    1 2 3 4
    1 2 3 5
    6 7 8 0
    9 7 8 0
    ⊥ 4 3 2

Everything we have just learned about outer natural joins can be applied to theta joins, too. The only difference is the condition used by the operator to select the tuples for pairing.

Outer joins can be helpful when we want a complete set of tuples from one or both of the operands. Imagine a Customer and a Phone relation. If a particular customer does not have a phone, then joining the two relations drop that customer from the result. But we probably still want to be able to see and process those customers later in a program, perhaps handling them with special purpose code.

The Database

We will continue to work with the simple database from the previous section, consisting of two simple tables:

    U(A, B, C)
    V(B, C, D)

Now that we are going to look at SQL implementations of the join operators, we need an actual database. I implemented the tables we have been using in the database uv-lite.db. Download it now and fire it up in SQLite:

    $ sqlite3 uv-lite.db
    sqlite>

Inner and Outer Joins in SQL

We have already seen SQL versions of several kinds of join. The first, CROSS JOIN, computed the cross product of two relations. As a full product, the concepts of inner and outer joins don't apply.

To implement R ⋈ S in SQL, we use the NATURAL JOIN operator. Here is U ⋈ V:

    sqlite> SELECT * FROM U NATURAL JOIN V;
    A  B  C  D
    -  -  -  -
    1  2  3  4
    1  2  3  5
    6  7  8  0
    9  7  8  0

This is the result we expect, from above. This is an inner join, as it drops tuples from U and V that are unable to pair. Indeed, the keyword INNER is optional here:

    sqlite> SELECT * FROM U NATURAL INNER JOIN V;
    A  B  C  D
    -  -  -  -
    1  2  3  4
    1  2  3  5
    6  7  8  0
    9  7  8  0

The NATURAL keyword indicates that we want to pair tuples that are identical on all shared attributes. We can also use the USING keyword if we would like to specify a subset of the shared columns:

    sqlite> SELECT * FROM U INNER JOIN V USING (C);
    A  B  C  B  D
    -  -  -  -  -
    1  2  3  2  4
    1  2  3  2  5
    1  2  3  4  2
    6  7  8  7  0
    9  7  8  7  0

As in the case of a natural join, JOIN ... USING pairs tuples that have identical values for the specified attributes and collapses the specified attributes into a single column in the result.

We can also join two relations whenever they satisfy a given condition, condition. In SQL, we can represent the theta join U ⋈_condition V using the ON keyword. To compute U ⋈_A<D V, we say:

    sqlite> SELECT * FROM U INNER JOIN V ON A < D;
    A  B  C  B  C  D
    -  -  -  -  -  -
    1  2  3  2  3  4
    1  2  3  2  3  5
    1  2  3  4  3  2

Unlike NATURAL JOIN and JOIN ... USING, JOIN ... ON includes duplicate columns from the two relations. Oddly, it does not disambiguate them as we have seen in the past!

These inner joins, whether natural or theta, drop tuple that do not pair with any tuple in the other relation. In the case of joining U and V, we have two such dangling tuples: (5,6,7) in U and (4,3,2) in V.

In this session, we saw a new kind of join operator, the outer join, that includes dangling tuples in its result. We saw three variants of the outer join:

the full outer join, which includes dangling tuples from both relations
the left outer join, which includes dangling tuples only from its lefthand operand
the right outer join, which includes dangling tuples only from its righthand operand

Different implementations of SQL provide different levels of support for outer joins. SQLite supports only the left outer join, with the compound keyword LEFT OUTER JOIN.

Unqualified, LEFT OUTER JOIN works like a cross join:

    sqlite> SELECT * FROM U LEFT OUTER JOIN V;
    A  B  C  B  C  D
    -  -  -  -  -  -
    1  2  3  2  3  4
    1  2  3  2  3  5
    ...                [twelve tuples omitted]
    9  7  8  4  3  2
    9  7  8  7  8  0

But with qualification, it provides something new. If we seek the left outer version of the natural join, we will see the tuples that did not match:

    sqlite> SELECT * FROM U NATURAL LEFT OUTER JOIN V;
    A  B  C  D
    -  -  -  -
    1  2  3  4
    1  2  3  5
    5  6  7  
    6  7  8  0
    9  7  8  0

The dangling tuple (5,6,7) is included in the answer, with the attribute D given a value of NULL. The print value of NULL is the empty string, but we can examine the value using a special boolean operator:

    sqlite> SELECT *
       ...> FROM U NATURAL LEFT OUTER JOIN V
       ...> WHERE D IS NULL;
    A  B  C  D
    -  -  -  -
    5  6  7

Left outer joins can be qualified in the same ways as an inner join. We can use the USING keyword to join on a subset of the shared columns:

    sqlite> SELECT * FROM U LEFT OUTER JOIN V USING (C);
    A  B  C  B  D
    -  -  -  -  -
    1  2  3  2  4
    1  2  3  2  5
    1  2  3  4  2
    5  6  7  
    6  7  8  7  0
    9  7  8  7  0

Notice that, in this case, the dangling tuple has null values in both of the unmatched attributes. In SQLite, null is displayed as an empty string.

Likewise, we can compute a theta join on a given condition using the ON keyword. To compute U ⟕_A<D V, we say:

    sqlite> SELECT * FROM U LEFT OUTER JOIN V ON A < D;
    A  B  C  B  C  D
    -  -  -  -  -  -
    1  2  3  2  3  4
    1  2  3  2  3  5
    1  2  3  4  3  2
    5  6  7         
    6  7  8         
    9  7  8

In this case, all but one tuple from U is unpaired. The dangling tuples are padded with null values in the unmatched attributes.

This looks like a lot of variations that are hard to keep track of. When might we use a left outer join in solving a problem? In the section on relational algebra above, I imagined a database with Customer and Phone relations. If a particular customer does not have a phone, then a natural join of the two relations would drop that customer from the result. Yet we would probably still want to be able to see and process those customers later in a program. Customer ⟕ Phone would include that customer in its result.

Or consider our movie database homework03.db, with Movie, Actor, Director, and StarsIn relations. The natural join of Actor and StarsIn, Actor ⋈ StarsIn:

    sqlite> SELECT * FROM Actor NATURAL JOIN StarsIn;
    actor      yearBorn  movie          
    ---------  --------  ---------------
    Cage       1964      Raising Arizona
    Maguire    1975      Spiderman      
    Maguire    1975      Wonder Boys    
    McDormand  1957      Fargo          
    McDormand  1957      Raising Arizona
    McDormand  1957      Wonder Boys

has no trace of Tom Hanks, because we have none of his movies in the Movie relation. The tuple (Hanks,1956) is dangling. However, the left outer join of the relations, Actor ⟕ StarsIn, includes the dangling tuple:

    sqlite> SELECT * FROM Actor NATURAL LEFT OUTER JOIN StarsIn;
    actor      yearBorn  movie          
    ---------  --------  ---------------
    Cage       1964      Raising Arizona
    Hanks      1956                     
    Maguire    1975      Spiderman      
    Maguire    1975      Wonder Boys    
    McDormand  1957      Fargo          
    McDormand  1957      Raising Arizona
    McDormand  1957      Wonder Boys

As I said in class, trying to learn everything about joins in SQL can take us down a deep rabbit hole. What I'd like you to learn from this session is:

the difference between an inner join and outer join
the differences among a full outer join, a left outer join, and a right outer join
SQLite supports only left outer join
you can use a natural left outer join to combine two relations in a way that includes dangling tuples

More Practical SQL: `GROUP BY`

Let's close with a little bonus SQL that's easier to understand. and really useful in practice.

(The following examples use our movie database homework03.db, with Movie, Actor, Director, and StarsIn relations.)

Last week, we learned how to use aggregation operators to compute summary values for a query, such as the total for a set of values:

    sqlite> SELECT COUNT(director) FROM Movie
       ...> WHERE rating > 7.5;
    3

This computes a total over the entire relation. There are times, though, when we want to apply an aggregator to subsets of the relation based on a particular attribute. For example, rather than counting how many directors have made movies rated higher than 7.5, we might want to find our how many movies rated higher than 7.5 each director has made.

SQL gives us this ability with the GROUP BY clause. When used after the WHERE clause in a statement, it instructs the DBMS to group tuples in the relation according to the listed attribute and to apply any aggregator function in the SELECT clause to values within the group.

For example, we can find the directors who have made movies rated higher than 7.5 and their counts using GROUP BY:

    sqlite> SELECT director, COUNT(director)
       ...> FROM Movie
       ...> WHERE rating > 7.5
       ...> GROUP BY director;
    director  COUNT(director)
    --------  ---------------
    Coen      2              
    Hanson    1

This query instructs the DBMS to group the tuples in the result by director before applying the count operation.

If there is no WHERE, we can use GROUP BY after the FROM clause. For example, we can find out how many movies each actor has starred in with:

    sqlite> SELECT actor, COUNT(movie)
       ...> FROM StarsIn
       ...> GROUP BY actor;
    actor      COUNT(movie)
    ---------  ------------
    Cage       1           
    Maguire    2           
    McDormand  3

We can also use GROUP BY when a relation has null values in some of its fields. If we want a report listing movie counts for all actors, even if they do not have any movies in the database, we can use an outer join:

    sqlite> SELECT actor, COUNT(movie)
       ...> FROM Actor NATURAL LEFT OUTER JOIN StarsIn
       ...> GROUP BY actor;
    actor      COUNT(movie)
    ---------  ------------
    Cage       1           
    Hanks      0           
    Maguire    2           
    McDormand  3

This shows us that null values don't count as members of a bag. The COUNT() of an empty bag is 0. Null is ignored by every aggregation operator.

When using GROUP BY, the SELECT clause should contain only two kinds of terms:

an aggregation operator applied to one of the attributes in the resulting relation, and
attributes that appear in the GROUP BY clause.

SQLite is more permissive than this, but the results of queries that violate SQL's conditions are often broken.

Wrap-Up

There is no big deliverable for this session. Just play!

Please run one experiment:

What does a JOIN ... ON statement return when we list an attribute that appears in only one of the relations?

You can use the JOIN ... ON example from the session as your starting point. Change the joining attribute and see what happens.

Email me the answer.

If you have any questions, please let me know.

Eugene Wallingford ..... wallingf@cs.uni.edu ..... October 23, 2020