CS 3140 Session 8

Session 8

Join Operators in Relational Algebra

CS 3140
Database Systems

Opening Exercise

Consider these four relations:

    Student                  Offering

    SID Name  Class          Class Subject
    --- ----- -----          ----- -------
    101 Alex  10             10    Math
    102 Maria 11             10    English
                             11    Music
                             11    Science
      U            V

    A B C        B C D
    - - -        - - -
    1 2 3        2 3 4
    5 6 7        2 3 5
    6 7 8        4 3 2
    9 7 8        7 8 0

Answer these questions:

What is Student ⋈ Offering?
What is U ⋈ V?

Bonus exercise: Can you express these queries in relational algebra without using the natural join operator?

Solutions for the problems are at the bottom of the page. No peeking until you have your own answers! People learn better when they are invested in the answer that is shown to them.

Where Are We?

Last week, we explored the basic operators of relational algebra: three set operations, two removal operators (projection and selection), two combination operators (product and natural join), and a renaming operator. We also learned ways to express some of these operators in SQL.

In this session, we'll review the renaming and natural join operators and see how to express them in SQL. We'll then learn about a second kind of join that is more general.

Review: The Rename Operator

The renaming operator, ρ, takes a relation and returns a new relation.

    ρ_R'(A1,...)(R)

It allows us to rename the relation or any of the attributes. This can be useul in compound expressions where we combine two tuples that have attributes with the same name.

Renaming is helpful in mathematical applications of relational algebra, but it can be computationally expensive inside a DBMS. As a result, we don't do it very often in our programs.

In SQL, we implement rename operations using the ALTER TABLE.

To rename a table:

    sqlite> .tables
    Alice
    sqlite> ALTER TABLE Alice RENAME TO Barbara;
    sqlite> .tables
    Barbara

To rename a column:

    sqlite> .schema Customer
    CREATE TABLE Customer (
      ssNo CHAR(11),
      name VARCHAR(30),
      PRIMARY KEY (owner)
      );
    sqlite> ALTER TABLE Customer
       ...> RENAME COLUMN ssNo TO owner;
    sqlite> .schema Customer
    CREATE TABLE Customer (
      owner CHAR(11),
      name VARCHAR(30),
      PRIMARY KEY (owner)
      );

Renaming columns was added to SQLite in v3.25. If your version is older than that, you have to simulate the action by renaming the existing table to a temporary name, creating a new table with the desired name and column names, and copying data from the temporary table into the new table. Rather than do that, you should probably just install the latest version. It's really easy. :-)

Review: The Join Operator

Last session, we learned the relational algebra operator for a common database problem: combining the tuples in different relations when they match on a shared attribute. This operator is ⋈, which computes the natural join of two relations.

Consider the Customer and Address relations from our example database in Week 5:

    ssNo          name                               CUSTOMER
    -----------   -----     
    123-45-6789   Eugene Wallingford
    321-45-6789   John Smith

    areaCode   number     owner                      PHONE
    --------   --------   -----------
    319        273-5919   123-45-6789
    517        452-9159   321-45-6789
    319        273-2618   123-45-6789

    street     city          state   owner           ADDRESS
    --------   -----------   -----   -----------
    339 WRT    Cedar Falls   IA      321-45-6789
    1 Elm St   Plainfield    IN      321-45-6789
    305 WRT    Cedar Falls   IA      123-45-6789

The Phone and Address share a common attribute 'owner', both of which have the domain of Customer social security numbers. Combining Phone with Address with a natural join:

    Phone ⋈ Address

pairs tuples from Phone with tuples from Address when they have the same value for social security number. This tells us which phone numbers correspond with which addresses.

To compute a natural join in SQL, we use the compound verb NATURAL JOIN:

    sqlite> SELECT * FROM Phone NATURAL JOIN Address;
    319|273-5919|123-45-6789|305 WRT|Cedar Falls|IA
    517|452-9159|321-45-6789|1 Elm St|Plainfield|IN
    517|452-9159|321-45-6789|339 WRT|Cedar Falls|IA
    319|273-2618|123-45-6789|305 WRT|Cedar Falls|IA

In our opening exercise, I gave you an extra challenge: how can we express R ⋈ S without using the natural join operator? We can implement the basic idea in two steps: first, find the product of R and S, then select the tuples from the resulting relation that have the same value on the shared attributes.

    σ_{R.attrib=S.attrib}(R ⨯ S)

For Phone ⋈ Address, that becomes:

    σ_{Phone.owner=S.owner}(Phone ⨯ Customer)

In SQL, we can implement this using the compound verb CROSS JOIN:

    sqlite> SELECT * FROM Phone CROSS JOIN Address
    sqlite> WHERE Phone.owner = Address.owner;
    319|273-5919|123-45-6789|305 WRT|Cedar Falls|IA|123-45-6789
    517|452-9159|321-45-6789|1 Elm St|Plainfield|IN|321-45-6789
    517|452-9159|321-45-6789|339 WRT|Cedar Falls|IA|321-45-6789
    319|273-2618|123-45-6789|305 WRT|Cedar Falls|IA|123-45-6789

The new adverb in CROSS JOIN says that we want a full cross-product, similar to the ⨯ operator we learned about last week.

This doesn't quite give us the same result, though. Notice that our new table has two columns for owner, one from the Phone relation and the other from the Address relation. The values are always, the same, of course, because we selected only the tuples for which that was true!

We need a couple of more steps in relational algebra to produce the same value as R ⋈ S. After finding the product of R and S and selecting the desired tuples, we need to drop the extra attribute and rename the common attributes.

    ρ_{R.attrib→attrib}(π_{drop S.attrib}(σ_{R.attrib=S.attrib}(R ⨯ S)))

The details of the projection and renaming steps will depend on the particular attributes of R and S.

Bonus exercise: Write the relational algebra query to compute Phone ⋈ Address without using the natural join operator. For an added challenge,

write a full SQL query to compute the result. For a full query, the project and select steps will all have their own SELECT clauses in the result.

write the simplest SQL query possible to compute the result. To do this you can take advantage of our ability to select both columns and rows in the same SELECT statement.

We have just derived a rewrite rule for relational algebra:

    R ⋈ S ==
      ρ_{R.attrib→attrib}(π_{drop S.attrib}(σ_{R.attrib=S.attrib}(R ⨯ S)))

There are other such rules, though only a few can be written without using the operator being rewritten. Here is one of each:

    R ⋂ S  == R - (R - S)

    R ⋃ S  == (R ⋂ S) ⋃ (R - S) ⋃ (S - R)

These relationships may be of interest to a mathematician, but of what value are they to us as programmers? They can be valuable to us in many ways, both as programmers and as users of tools.

Sometimes, we can use them to figure out that a query doesn't do what we think. Transform a complex query into a simpler one, a dense query into a more understandable one. Applying rewrite rules can help us debug our queries.

Sometimes, we can use to them to create a more maintainable query or a more efficient query. Applying rewrite rules can help us refactor our queries.

Our DBMS and its compiler can use rewrite rules to create more efficient programs, too. Applying rewrite rules can help the DBMS optimize our queries before executing them.

When a DBMS receives an SQL query, the first thing it does is translate it into a relational algebra expression, or something very similar. The second thing it does is optimize the query, by translating into an equivalent expression that can be evaluated more efficiently. In the compiler, simplicity and maintainability are not goals. Time and space efficiency are.

In our last session, in the example at end of the section on compound expressions, we saw two simple queries against our Movie database that compute the same result. We may prefer one to the other, or not, but our DBMS may have a strong preference due to its underlying implementation. This optimization step is often essential to making a relational database fast enough to use in practice.

Query optimization happens behind the scenes...

... efficiency with big tables and complicated queries.
... parallelization! ... analogy to FP and Google's MapReduce.
... importance of the functional nature of relational algebra.

Theta Join

In our customer database example earlier, we were able to compute the natural join of Phone and Address because they shared a common attribute: owner, the social security number of the corresponding Customer. But we might just as well want to joining Customer and Phone, to find the phone numbers associated with Eugene, or Customer and Address, to find the addresses associated with John.

Unfortunately, natural join cannot help. Both Customer and Phone have attributes storing customer SSNs, but they have different names, ssNo and owner. Natural join only pairs tuples when they have the same values on attributes with the same name.

One option would be to rename one of the attributes so that we could do a natural join:

    sqlite> ALTER TABLE Customer
       ...> RENAME COLUMN ssNo TO owner;
    sqlite> SELECT * FROM Customer NATURAL JOIN Phone;
    123-45-6789|Eugene Wallingford|319|273-5919
    321-45-6789|John Smith|517|452-9159
    123-45-6789|Eugene Wallingford|319|273-2618

However, this is a heavy-handed solution with side effects. In relational algebra, the renaming operator is a function; in SQL, it changes the state of the database.

The natural join operation requires tuples to share at least one attribute, and it pairs tuples using exactly one condition on the shared attributes. This is a valuable way to join two relations, but not the only one. Sometimes, we would like to join relations according a different kind of condition.

Relational algebra offers a second join operator for just such situations: the theta join. Theta refers to the arbitrary condition used to select tuples. Rather than define a new operator for theta joins, θ, relational algebra adapts the ⋈ operator:

    R ⋈_C S

In database theory, we usually use C to name the condition, rather the mathematician's θ.

Theta join works like this:

Compute the product of R and S.
From the result, select the tuples that satisfy C.

We can write this as: σ_C(R ⨯ S).

The schema for the result is the schema we expect from a product: the union of the schemas of R and S. If the relations have attributes in common, the attributes in the result are disambiguated with prefixes as before.

Consider again the relations from the opening exercise:

      U            V

    A B C        B C D
    - - -        - - -
    1 2 3        2 3 4
    5 6 7        2 3 5
    6 7 8        4 3 2
    9 7 8        7 8 0

Theta join allows us to join U and V on an arbitrary condition. For example, we can find all the combinations in which A<D:

    U ⋈_A<D V

The first tuple in U, (1,2,3), pairs with each of the first three tuples in V, because 1 is less than 4, 5, and 2. It can't pair with (7,8,0), though. None of the other tuples in U can pair with any tuple in V, because the value for D in each is larger than all of the D values in V. Thus, the result of U ⋈_A<D V is:

    A U.B U.C V.B V.C D
    - --- --- --- --- -
    1  2   3   2   3  4
    1  2   3   2   3  5
    1  2   3   4   3  2

Notice the repeated attributes. Theta join doesn't not require that two relations have any attributes in common, and so produces a subset of the product. In cases where the operands do share an attribute, the resulting relation will contain duplicate attributes that are disambiguated by the relation names.

This shows us a distinction between theta join and natural join: Natural join merges common attributes.

Theta join supports arbitrarily complex conditions. We can, for instance, find all the combinations of U and V in which A<D and the B values are distinct:

    U ⋈_{A<D AND U.B!=V.B} V

Two of the matches in our previous result shared values in the B, so they do not satisfy the query. The more restrictive condition limits our result to a single tuple:

    A U.B U.C V.B V.C D
    - --- --- --- --- -
    1  2   3   4   3  2

... another example, from our movie database consisting of Alice and Kurt. We can find all cases in which Alice likes a a movie older than one of Kurt's favorites using this query:

    Alice ⋈_{Alice.year<Kurt.year} Kurt

In SQL, we express a theta join using the CROSS JOIN operator we learned earlier in the session:

    sqlite> SELECT * FROM Alice CROSS JOIN Kurt
    sqlite> WHERE Alice.year < Kurt.year;
    The Big Chill|1983|105|drama|Black Panther|2018|134|action
    The Big Chill|1983|105|drama|The Princess Bride|1987|98|fantasy
    The Big Chill|1983|105|drama|The Replacements|2000|118|comedy
    ...
    Star Trek|1979|132|scifi|Return of the Jedi|1983|131|scifi
    Star Trek|1979|132|scifi|48 Hrs.|1982|96|comedy
    Star Trek|1979|132|scifi|The Big Chill|1983|105|drama

CROSS is an adverb that modifies the verb JOIN in our statement. We have now seen two such adjectives: CROSS and NATURAL. In coming sessions, we will see a few more.

Review Exam 1

Quick review.

Opening Solution

Here are the four relations again:

    Student                  Offering

    SID Name  Class          Class Subject
    --- ----- -----          ----- -------
    101 Alex  10             10    Math
    102 Maria 11             10    English
                             11    Music
                             11    Science
      U            V

    A B C        B C D
    - - -        - - -
    1 2 3        2 3 4
    5 6 7        2 3 5
    6 7 8        4 3 2
    9 7 8        7 8 0

Student and Offering have one attribute in common, Class, so Student ⋈ Offering looks for tuples that match in this column. The result is:

    Student ⋈ Offering

    SID Name   Class Subject
    --- -----  ----- -------
    101 Alex   10    Math
    101 Alex   10    English
    102 Maria  11    Music
    102 Maria  11    Science

This exercise illustrates something we learned in last week's online session but did not see in a worked example there: the same tuple in one relation can be paired with multiple tuples in the other relation.

The second exercise is different, too. U and V have two attributes in common, B and C. A natural join looks for tuples that match in all of the shared columns. It finds two matches in V for (1,2,3) in U, and two matches in U for (7,8,0) in V:

Notice that the (2,3,5) tuple in V matches a tuple in U on the C attribute, but their B values don't match. As a result, ⋈ is unable to pair the tuples.

Did you think of a way to express the natural join query without using the natural join operator? If yes, then you have discovered the theta join, which we will discuss in a few minutes!

Return to the rest of the session.

Eugene Wallingford ..... wallingf@cs.uni.edu ..... October 8, 2020