CS 3140 Session 11

Session 11

Basic Ideas of Database Design

CS 3140
Database Systems

Opening Exercise

Recall our NASDAQ stock database, which we last used in Week 9. Its schema is:

    EndOfDayData(symbol, date, open, high, low, close, volume)

with data that look like this:

    AMAT|20050506|15.56|15.66|15.33|15.63|20778700

Write SQL queries to answer these requests:

Compute the average closing price for each stock.

Find the companies whose average closing price is more than eight times the average closing price of all stocks.

You can use the first query as a subquery in the second.

(Spoiler alert: the answers come next...).

Opening Solution

Last session, we learned about the GROUP BY clause, which instructs the DBMS to group tuples in the relation by the listed attribute before projecting from SELECT clause. This includes applying any aggregator function to values within the groups.

This is exactly what we need to perform our first task:

    sqlite> .mode column
    sqlite> SELECT symbol, AVG(close)
       ...> FROM EndOfDayData
       ...> GROUP BY symbol;
    symbol  AVG(close)
    ------  ----------
    AABC    14.4      
    AACB    23.672    
    AACC    21.298    
    ...
    ZOOM    2.422     
    ZRAN    10.966    
    ZVXI    3.438

There are a lot stocks in our file, so I had to elide a lot of lines. To make this query useful, we would need to select a smaller subset of the tuples, or produce a voluminous report.

The second request looks trickier. But you know everything you need to solve it. We just need to break it down into to parts that we can implement piece by piece.

Our first query is one of those pieces. It compute every stock's average closing price. We also need to know the average closing price of all stocks. That's a simple query to write:

    sqlite> .mode list
    sqlite> SELECT AVG(close) FROM EndOfDayData;
    15.8546278928136

We can use these two queries as subqueries. The first query builds the relation of stocks from which we want to select our answer (the FROM clause). The second query is part of the condition we need to select our answer (in the WHERE clause):

    SELECT symbol
    FROM (SELECT symbol, AVG(close)
          FROM EndOfDayData
          GROUP BY symbol)
    WHERE ??????? > 8 * (SELECT AVG(close)
                         FROM EndOfDayData);

We do need another piece of the puzzle: To select the stocks that trade at the highest prices, we need to be able to refer to the average closing price of each stock. We can't use the expression AVG(close), which is an aggregator. But we can rename that column when we produce the relation!

    SELECT symbol
    FROM (SELECT symbol, AVG(close) AS average
          FROM EndOfDayData
          GROUP BY symbol)
    WHERE average > 8 * (SELECT AVG(close)
                         FROM EndOfDayData);

That will do it:

    sqlite> .mode list
    sqlite> SELECT symbol
       ...> FROM (SELECT symbol, AVG(close) AS average
       ...>       FROM EndOfDayData
       ...>       GROUP BY symbol)
       ...> WHERE average > 8 * (SELECT AVG(close)
       ...>                      FROM EndOfDayData);
    FCNCA
    GOOG
    MITSY
    NWLIA
    SHLD

I was curious which of these stocks was most expensive, so I modified the previous query to display the companies' stock symbols in descending order of their average closing price. All I need is an ORDER BY clause: price:

    sqlite> SELECT symbol
       ...> FROM (SELECT symbol, AVG(close) AS average
       ...>       FROM EndOfDayData
       ...>       GROUP BY symbol)
       ...> WHERE average > 8 * (SELECT AVG(close)
       ...>                      FROM EndOfDayData)
       ...> ORDER BY average DESC;
    GOOG
    MITSY
    NWLIA
    SHLD
    FCNCA

During this period in 2005, only five stocks closed at more than eight times the average stock price. Google was the most expensive of them.

This query repeats a design pattern we saw in Session 10: we can use a subquery to implement something like a local variable. As we write larger queries, we will frequently use this pattern.

This query is perhaps the most complex query we have written thus far. We write complex queries -- SQL programs -- the same way we write large programs in any language.

Break the problem into pieces we understand.
Solve the smaller problems.
Put the partial solutions to together to create a final solution.
Find small gaps and write code to fill them.

With practice, you will develop the experience you need to write complex SQL queries. For now, study the SQL basics, trust what you know, and take small steps.

Here is an SQL source file containing my two queries. It is file of code, just like a set of functions you might write and save in Python or Java. It is a text file. It is also "executable", in the sense that we can load it into an interpreter (say, SQLite's command shell) and see the results.

Where Are We?

We've spent the last couple of weeks learning to program in SQL, based in large part on the idea of relational algebra. Now we take a brief break from new SQL, while you practice what you've learned, consolidate your knowledge, and prepare for Exam 2 next week.

This week, we step back to reconsider how we design our databases. Previously, we learned techniques for doing conceptual and logical design to create a set of relations. How can we be sure that our schemas represent data in away that makes sure it can be used effectively?

Relational theory has tools to help us: the idea of functional dependencies and the technique of decomposition. By applying them, we can design databases that support our use of the data, including our programs. We will then be able to return to SQL programming with new tools.

How A Schema Can Be Sub-Optimal

In Weeks 2 through 4, we learned how to create conceptual data models using entity-relationship modeling and logical data models using relational modeling. These techniques produced a set of relations. We could even produce a set of relations directly based on our experience in a domain.

However we create our initial set of relations, it is common for these relations to have flaws that affect our ability to work with the data effectively. Consider this instance of a relation I call Movie-v0:

    title            year  rating  director  yearBorn  actor    
    ---------------  ----  ------  --------  --------  ---------
    Fargo            1996  8.2     Coen      1954      McDormand
    Raising Arizona  1987  7.6     Coen      1954      Cage     
    Raising Arizona  1987  7.6     Coen      1954      McDormand
    Spiderman        2002  7.4     Raimi     1959      Maguire  
    Wonder Boys      2000  7.6     Hanson    1945      Maguire  
    Wonder Boys      2000  7.6     Hanson    1945      McDormand

It contains much of the information we've seen in our running example of a movie database, but in one table.

One table seems simpler than three or four, but as a database it creates problems for us. Many of these problems result from redundancy in the table. Coen's year of birth is repeated for every movie he directs. Entire tuples are repeated when a movie has more than one star. This redundancy hampers our ability to use the data.

update anomalies: If the average rating for "Raising Arizona" changes, we have to remember to find all tuples for that movie and change the rating in all of them.

deletion anomalies: If we delete "Spiderman" from the database, we lose our only source of information about director Raimi. If we decide not to list McDormand as a star of "Fargo", then we either have to delete the only tuple with information about "Fargo" or leave a field blank.

insertion anomalies: If add a movie to our collection but don't know the director's birthday, we either have to leave a field blank or not add a new tuple to the relation.

These anomalies make working with the database difficult. Redundancy by itself wastes space and makes the data harder to understand. How can we redesign the database to eliminate these problems?

Rather than "hack" at the schema, trying to make arbitrary changes in order to make it better, we can use a bit of relational theory to redesign the schema in a principled way.

The key idea is that of a functional dependency... which, coincidentally, relates to the idea of a relation's key!

Functional Dependency

Relational design theory empowers us to examine a schema carefully and make small, incremental improvements based on only a few ideas. The first is to identify constraints on the values of attributes.

For example, in Movie-v0, the title and year of a movie determine the movie's rating. Both tuples for "Raising Arizona" have the same rating, because they refer to the same movie. It wouldn't make sense for one of those tuples to have a rating of 7.6 and the other a rating of 6.8.

If this constraint is true for all possible instances of the schema, then we call it a functional dependency. A functional dependency says that if two tuples have the same values for attributes I₁,...,I_j, then they must have the same values for attributes D₁,...,D_k. We write this as:

    I₁,...,I_j → D₁,...,D_k

The word "must" in that definition is important. A functional dependency applies to all possible instances of the relation. If a particular dataset has an example where two sets of attributes match up coincidentally, then the correlation is not a dependency.

To make functional dependencies easier to work with we will often write them with only one attribute on the righthand side:

    I₁,...,I_j → D₁
    I₁,...,I_j → D₂
    ...
    I₁,...,I_j → D_k

Clearly, if the I attributes determine the entire collection of D attributes, then they determine any individual D.

Consider again Movie-v0:

    title            year  rating  director  yearBorn  actor    
    ---------------  ----  ------  --------  --------  ---------
    Fargo            1996  8.2     Coen      1954      McDormand
    Raising Arizona  1987  7.6     Coen      1954      Cage     
    Raising Arizona  1987  7.6     Coen      1954      McDormand
    Spiderman        2002  7.4     Raimi     1959      Maguire  
    Wonder Boys      2000  7.6     Hanson    1945      Maguire  
    Wonder Boys      2000  7.6     Hanson    1945      McDormand

Its schema is Movie-v0(title, year, rating, director, yearBorn, actor). This functional dependency holds for Movie-v0:

    title year → rating director yearBorn

This says that if two tuples have the same values for title and year, they must also have the same values for rating, director, and yearBorn. This makes sense if we believe that no two movies of the same name will be released in the same year, every movie has one director, and every person is born in exactly one year.

However, Movie-v0 is not bound by this dependency:

    title year → actor

Movies can have more than one star, as evidenced by "Raising Arizona" and "Wonder Boys" in our instance. Even if I had not included a tuple for Nicholas Cage as the co-star of "Raising Arizona", this assertion would not be true of movies in general.

This is also not a functional dependency in Movie-v0:

    yearBorn → actor

We have counterexamples in this instance, and even if we didn't, there would be no domain reason to expect that the year a director is born determines the actors in a movie. (There may be a loose relationship, of course...).

Perhaps you can see now why these constraints are called functional dependencies: a given value for the lefthand side returns, or is associated with, a single for the righthand side. This a function in the most general mathematical sense: a set of ordered pairs. As in many contexts, the existence of a function enables us to reason in a particular way about the set.

Functional dependency is also clearly related to what we have been referring to as the "key" of a relation. We say that a set of attributes { A₁, ..., A_n }, where n ≥ 1, is a key for a relation if

{ A₁, ..., A_n } functionally determines all of the other attributes in the relation, and
no proper subset of { A₁, ..., A_n } functionally determines all the other attributes in the relation.

Informally, these rules say that it is impossible for two tuples to have the same values for all of the A attributes, and that the set A is minimal.

Consider again Movie-v0. The set { title, year, actor } is a key for the relation. As we saw earlier,

    title year → rating director yearBorn

is a functional dependency on the relation. Since the combination of title and year functionally determines all the other attributes, so does the combination { title, year, actor }. So this set satisfies the first condition.

What about the second condition? Is there a subset of { title, year, actor } that determines the rest of the tuple? We've already discussed the fact that

    title year → actor

is not a functional dependency, because movies can have more than one star. Actors can certainly appear in multiple movies during a single year, even if our instance doesn't contain examples of the phenomenon. So { year, actor } does not determine title. Finally, as rare as it may be, the same actor occasionally stars in two movies with the same name, so { title, actor } does not determine year. Thus, { title, year, actor } is minimal.

A relation can have more than one key. For example, my record in the UNI employee database may well store both my Social Security number (SSN) and my university ID number (UID). Both are unique to me and thus functionally determine all the other attributes in my record, including each other. Thus, either is a key for the relation.

In practice, we sometimes designate one of these keys as the primary key. This is especially true when we work with a DBMS. In our SQL CREATE TABLE statements, we often identify a set of attributes as the primary key. The choice of a primary key can influence how the DBMS stores and manipulates our relations. We will return to this issue in a few sessions.

However, the idea of a primary key has no particular role in the relational theory of functional dependencies. Any set of attributes that satisfies the two conditions listed above is a key and can be used as such in process of improving our schema.

A Quick Exercise

Consider this set of attributes for a relation about people in the United States:

    name               state 
    SSN                zip code
    street address     area code
    city               phone number

What functional dependencies hold for this relation?

What are the keys for this relation?

A Solution

Answers to these questions can vary depending on how we interpret domain requirements. This is part of what makes database design a challenge!

Possible functional dependencies include:

    SSN → name
    area code → state
    street address, city, state → zipcode
    zipcode → area code

Here is one possible key:

    { SSN, street address, city, state, area code, phone number }

This candidate omits the name and zip code. Its viability depends on several assumptions:

We need street address, city, and state to uniquely determine the location.
A person can have multiple addresses.
A person can have multiple phones.

But can't we omit state, too, as it is determined by area code? What other possibilities are there?

Notice how we can use the functional dependencies to help us design a key...

Wrap-Up

Functional dependencies can help us design a key because they help us to see how some attributes determine others. They also help us to find "cleavage" points in a relation. Some relations, like Movie-v0, try to do too much. Functional dependencies point out specific ways in which this is true. Sometimes, we tap on a functional dependency, and an overambitious schema breaks naturally into two or more better relations.

We can move toward our goal of being able improve our database schemas by learning three new ideas:

the connection between functional dependencies and anomalies
the idea of decomposition
the idea of normal formals

and then tying them all together in:

a technique for using functional dependencies to decompose relations in a way that ensures they are in the desired normal form.

We begin this process in this week's on-line session. We will also look at a bit of SQL that we have neglected thus far but which are relevant to the anomalies: insertions into, deletions from, and updates to an existing database.

See you next time!

Eugene Wallingford ..... wallingf@cs.uni.edu ..... November 2, 2020