CS 3140 Online Session 7

Online Session 7

Going Deeper into Relational Algebra

CS 3140
Database Systems

Opening Exercise

Consider these two tables:

RELATION: Course            ||     RELATION: Student

course  student  uid        ||     uid     name     yearBorn
------  -------  ------     ||     ------  -------  -----------
cs1800  Frank    123464     ||     123456  Felix    2001
cs1520  Frank    123464     ||     123459  Claire   1998
cs1520  Angela   123480     ||     123464  Frank    1997
cs1800  Mona     123471     ||     123471  Mona     2002
cs1520  Felix    123456     ||     123480  Angela   1996
                            ||     123494  Papa     1980

Answer these questions:

What is π_course,uid(Course)?
What is σ_{yearBorn≥2000}(Student)?
How many tuples are in Course ⨯ Student? Attributes?
What is π_name(Student) – π_student(Course)?

Bonus exercise: Can your express the relational algebra queries in Numbers 1, 3, and 4 as simple questions in words?

Solutions for the problems are at the bottom of the page. No peeking until you have your own answers! People learn better when they are invested in the answer that is shown to them.

Where Are We?

In class this week, we explored the basic operators of relational algebra: three set operations, two 'removal' operators (projection and selection), and one 'combination' operator (product). We also looked at ways that we can express the algebra's operators in SQL. Relational algebra is the theory that underlies SQL, but SQL does not mimic the algebra directly.

In this session, we explore relational algebra's renaming operator and learn a second, even more useful combination operator. Then we look at ways to combine queries into compound expressions of the sort that reflect the kinds of questions we want to be able to ask of our data.

The session deliverable is described at the bottom of the session. It's another check-in response, not some answer you produce.

An Operation that Renames Tuples and Attributes

When we apply relational operators, we create new relations. There are times when it would be convenient to specify the name of the resulting relation or the resulting relation's attributes.

Consider the example from class this week in which we computed R ⨯ S:

    A B C           C D
    -----  R        ---  S
    1 2 3           7 8
    4 5 6           3 9 
                    6 0


    A    B   R.C  S.C   D
    ---------------------  R ⨯ S
    1    2    3    7    8
    1    2    3    3    9
    1    2    3    6    0
    4    5    6    7    8
    4    5    6    3    9 
    4    5    6    6    0

The resulting relation has two attributes that are named C in the source relations R and S. To disambiguate the clash, the operator has to invent new names for the attributes in the new relation. We did so using member-style dot notation.

Another option is to give a name to the new attributes. Relational algebra include a renaming operator, ρ, that takes a relation and returns a new relation.

    ρ_R'(A1,...)(R)

The names used for the new relation, R', and its attributes, (A1,...), are given with the operator when it is applied.

In our product example, we can use ρ to rename the result:

    ρ_{T(A,B,C1,C2,D)}(R ⨯ S)

The operation names the relation that results from the product T, with attributes named A, B, C1, C2, and D.

If we only want to rename one part of a relation, we can omit a new relation name:

    ρ_{(A,B,C1,C2,D)}(R)

or a new attribute list:

    ρ_name(R)

Renaming can usually be done at different stages of a compound expression. In our example above, we renamed the result of the cross product:

    ρ_{T(A,B,C1,C2,D)}(R ⨯ S)

We could just as easily rename the columns of S before computing the product:

    R ⨯ ρ_(E,D)(S)

This renames S's C columns as D, and then computes the product. The result is:

    A    B    C    E    D
    ---------------------  R ⨯ S
    1    2    3    7    8
    1    2    3    3    9
    1    2    3    6    0
    4    5    6    7    8
    4    5    6    3    9 
    4    5    6    6    0

We will look at compound expressions after looking at one more operator, the natural join. In general, there will be multiple ways to construct any compound expression to compute any given result.

Combining Relations with the Join Operator

Last session, we learned how to combine tables using the product operator, which creates a new relation with all combinations of the tuples in its two operands. More frequently, though, we want only the combinations that match up in some way.

For example, if two tables both have a student number column, then we might want to see combinations where the student numbers match. This would be a way to create a new table with combined information about the student.

Or think back to multi-table query in Online Session 5-2. We had one table of customer data, keyed on social security number, and we had another table of address data that had a social security number column to indicate who lived at each address. Customer ⨯ Address would contain many irrelevant combinations... What we wanted was a table that contained all the tuples with matching social security numbers combined.

Relational algebra provides an operator that does just this, ⋈, which computes the natural join of two relations. R ⋈ S, pairs only the tuples that have the same values for the attributes they have in common.

Let a1, a2, ... an be the attributes that occur in both R and S. The natural join finds tuples r from R and s from S that have identical the values for all of a1, a2, ... an. It then creates a joined tuple containing a1, a2, ... an as well as the other attributes from r and s.

This is a little hard to explain with getting all mathematical, but an example can make the idea clear.

Consider again the example from class in which we computed R ⨯ S:

    A B C           C D
    -----  R        ---  S
    1 2 3           7 8
    4 5 6           3 9 
                    6 0

The product of these relations contains six tuples. However, both relations have an attribute named C. There are tuples in the two relations with matching values for C (in bold). If this attribute means the same thing in both relations, then we may be interested only in the combinations that match these tuples.

This is the natural join of R and S. The expression R ⋈ S produces this relation:

    A    B    C    D
    ---------------------  R ⋈ S
    1    2    3    9
    4    5    6    0

It combines the tuples with matching values for C and thus needs only one column for C in the result.

There is one tuple in S that does not pair with any tuple in R: (7 8). This dangling tuple does not occur in R ⋈ S.

This example is so small that it never pairs a tuple from R with more than one tuple from S, but that is certainly possible in a natural join. Consider the Customer and Address relations from our example database in Week 5:

    ssNo          name                               CUSTOMER
    -----------   -----     
    123-45-6789   Eugene Wallingford
    321-45-6789   John Smith

    areaCode   number     owner                      PHONE
    --------   --------   -----------
    319        273-5919   123-45-6789
    517        452-9159   321-45-6789
    319        273-2618   123-45-6789

    street     city          state   resident        ADDRESS
    --------   -----------   -----   -----------
    339 WRT    Cedar Falls   IA      321-45-6789
    1 Elm St   Plainfield    IN      321-45-6789
    305 WRT    Cedar Falls   IA      123-45-6789

Joining Customer with Phone would pair Eugene with two phone numbers. Joining Customer with Address would pair John with two addresses.

One last note: All of our examples thus far have joined relations with only one common attribute. Natural join also handles cases where the overlap between R's attributes and S's attributes contains two or more attributes. In these cases, paired tuples will match on the values of all the common attributes. We'll see examples of this later.

Compound Expressions in Relational Algebra

Using relational operators to compute new relations is interesting enough, but the real power of the algebra lies in being able to write compound expressions that compute answers to more challenging queries. As in any algebra, we can use the relation computed by one operator as an argument to another operator.

Conider the movie relations from our opening exercise in class this week.

   name                 year   length   genre
   --------------------------------------------     Alice
a1 The Big Chill        1983   105      drama
a2 Blade Runner         1982   117      scifi
a3 The Princess Bride   2020   98       fantasy
a4 Lethal Weapon        1987   109      action
a5 Black Panther        2018   134      action
a6 The Princess Bride   1987   98       fantasy
a7 The Replacements     2000   118      comedy
a8 Star Trek            1979   132      scifi
   --------------------------------------------     Kurt
k1 Black Panther        2018   134      action
k2 The Princess Bride   1987   98       fantasy
k3 The Replacements     2000   118      comedy
k4 Star Trek            1979   132      scifi
k5 Lethal Weapon        1987   109      action
k6 Return of the Jedi   1983   131      scifi
k7 48 Hrs.              1982   96       comedy
k8 The Big Chill        1983   105      drama

Suppose that we wanted to know the title and year of all the scifi movies Alice and Kurt like that are less than two hours long. Let's define the Movies relation as the union of Akices and Kurt's tables: (Alice ⋃ Kurt). Then we can build a query to answer this question using our relational operators in this way:

Select the movies that where length < 120.
Select the movies that where genre = 'scifi'.
Find the intersection of these two relations.
Project the intersection onto title and year.

Each step builds a solution out of the solutions to previous steps:

    1.  σ_{length < 120}(Movies)

    2.  σ_{genre = 'scifi'}(Movies)

    3.  (σ_{length < 120}(Movies))
           ⋂
        (σ_{genre = 'scifi'}(Movies))

    4.  π_title,year
           ((σ_{length < 120}(Movies))
               ⋂
            (σ_{genre = 'scifi'}(Movies)))

This sequence of expressions forms a tree of expressions and sub-expressions:

         Movies         Movies
           |              |
           |              |
           |              |
    σ_{length < 120}     σ_{genre = 'scifi'}
            \            /
             \          /
              \        /
                   ⋂
                   |
                   |
                   |
                   π_title,year

Relational algebra builds expressions out of other expressions, without variables or sequences of statements. In this sense, it resembles functional programming, a style of programming you learn in Programming Languages. We could add variables and sequences of statements to our own notation, but that wouldn't be relational algebra -- and, worse, that would undermine many of the advantages of using relational algebra in the first place.

One of those advantages is being able to manipulate expressions using identity rules to create different queries that compute the same result. For example, consider the query we just constructed:

    π_title,year
       ((σ_{length < 120}(Movies))
           ⋂
        (σ_{genre = 'scifi'}(Movies)))

We could write a functionally equivalent query using a single selection with a compound query

    π_title,year
       (σ_{length < 120 AND  genre = 'scifi'}(Movies))

The second query describes the same result as the first.

Which one is better? That really depends. As SQL programmers, we often delegate that question to our DBMS and let it decide which query is better given its implementation. Every relational DBMS performs query optimization that translates queries into equivalent expressions that can be evaluated more efficiently. We as programmers can use the same translation ideas to help us write clearer and, yes, more efficient queries of our own.

Opening Solution

This query projects the course and UID attributes from the Course relation. The result looks exactly like Course, minus the student column.

This query selects tuples from the Student relation with a yearBorn value of at least 2000. The result is a two-tuple relation containing Felix and Mona.

The product of Course and Student is a relation whose tuples pair all combinations of the tuples in the two relations. It consists of 30 tuples, 5 (number of courses) times 6 (number of students). Each tuple has six attributes, three coming from each source.

This query creates two projections, each containing the column corersponding to a student name. It then computes the difference, subtracting the Course projection from the Student projection. The result is a relation with two tuples of attribute each. These tuples will be Claire and Papa.

Do you want more practice? Use SQLite to compute the answers to these queries.

Return to the rest of the session.

Optional Digression

If you know the name of the book from which I took the students' names in the opening exercise, let me know! If you've read that book, good for you. We have something in common. (It's one of my favorites; I have read it many times...)

Session Deliverable

When you finish working through this session, email me to let me know you are done. Let me know which part of the session is least clear. Perhaps that is something you don't think you understand well yet, or it is something you understand but feel like the notes could do a better job explaining. Feel free to include as many questions or comments as you'd like!

Eugene Wallingford ..... wallingf@cs.uni.edu ..... October 1, 2020