CS 3140 Session 2

Session 2

The Relational Model of Data

CS 3140
Database Systems

Context

In our first session, we learned that database systems decouple code that uses data from the code that implements the database. While more abstract than the physical layer, the logical model still tends to be written in technical language and to represent data in a way that does not reflect the problem domain directly. This led database designers to tease out a third layer, the idea of conceptual design, which models the data as users of the system might think of it, expressed in the vocabulary of a domain. This leaves us with three layers:

adding the conceptual layer to our system

There are several kinds of conceptual model. In this week's online session, you begin to explore the Entity Relationship Model, still the most commonly used conceptual model in industry. Other techniques have evolved out of objected-oriented development practices, such as Unified Modeling Language (UML) and Object Description Language (ODL).

There are also several kinds of logical model. Today, we briefly consider them and then begin our study of the relational model, a powerful technique that will serve as our primary focus for the course.

Data Models

A data model is a notation for describing information. At its simplest, a dota model consists of three elements:

the kind and structure of the data,
the operations that can be performed on the data, and
any constraints on the data.

This sounds a lot like a data type! This is not an accident. Data types and data models play the same role, only at different scales. You will notice some differences in degree on all three elements: The data in a data model is usually composite, not primitive. Data models generally support a smaller number of operations than a data type, but they generally support more complex kinds of constraints. We will study constraints in more detail in several weeks.

Over time, programmer have developed many kinds of data model. These days, there are two basic classes used in practice:

the relational model, with data structured in tables
a semi-structured model, with data structured in trees or graphs

The relational model has been the primary model used in most commercial DBMS for over thirty years. Semi-structured data models have become more popular in recent years, in the form of XML and its variants, and is often offered as an add-on to commercial relational DBMS.

As mentioned last week, the relational model stores data in tables.

    title                year   length   genre
    -------------------------------------------
    Gone with the Wind   1939   231      drama
    Star Wars            1977   124      scifi
    Wayne's World        1992   95       comedy

The data structure looks like an array of records or structs, which makes it easy for humans to read and reason about. In practice, though, for efficiency reasons, relational data is almost never stored in an array. The set of operations supported by the model is called relational algebra, which we will begin studying soon. The relational model supports a wide variety of constraints beyond data types on its fields, including constraints across tables. We will consider a couple of the useful classes of constraints beginning in a few weeks.

In a semistructured data model, information is represented in a network of nodes that provides high-level structure to the data, often hierarchical. Common forms of semistructured data are XML and JSON.

The data structure resembles a set of nested, tagged nodes. The operations for manipulating such data draw on search algorithms for graphs: different ways to walk paths in the tree. Constraints here can also be more than data types, often relating data with different tags. For example, must every movie element contain a length attribute? Can the same movie contain two <year> attributes?

Since the rise of object-oriented programming in industry, there has been a trend to add OO concepts to both kinds of model. We may have a chance to look at some of these variants later in semester.

Why the relational model? Why does the relational model remain so dominant in both industry and academia? Semistructured models seem to be more flexible, subsuming full graph structures or even the tables we use in relational models. When databases become large and use cases more varied, two forces affect database modeling and programming:

efficiency, of both access and modification
programmer productivity

With the relational model, we can achieve favorable trade-offs on both dimensions, due in large part, surprisingly, to its limitations:

data is structured in simple form, but the form is versatile
there is only a small set of operations, but they are versatile

Simplicity is a virtue here. As a a high-level modeling language, programmers can be very productive. ("A few lines of SQL, a few hundred lines to process XML, a thousand lines of C.") The limited set of operations is like assembly language, amenable to manipulation and optimization by the DBMS, provising very efficient run-time behavior. (We explore more general examples of this in CS 3540 and CS 4550).

So: the relational data model remains the focus of industry, academia, and this course. For this course, it has one more advantage: well-developed theory supports it, which means that we can understand models at a high level, express ideas clearly, and see how things work inside the DBMS.

The Relational Data Model

Let's return to our example table of movie data:

    title                year   length   genre
    -------------------------------------------
    Gone with the Wind   1939   231      drama
    Star Wars            1977   124      scifi
    Wayne's World        1992   95       comedy

It provides us with all we need to learn the basic vocabulary of the relational data model.

A relation is a two-dimensional table. We might call our relation "Movies". The rows correspond to movies, and the columns denote the properties of movies. The rows are a set, not a list. Order does not matter. "Movies" contains information about three movies.

Each column is named by an attribute. In this model, movies are stored with their names, year of release, length, and genre. The attribute names describe meaning of the information in the corresponding column.

A schema is the name of a relation plus the set of attributes.

    Movies(title, year, length, genre)

The attributes also form a set, not a list. In principle, order does not matter. In practice, though, we talk about the columns more generally than we talk about the rows, so it's beneficial to specify a standard ordering.

A database consists of one or more relations. The set of schemas for those relations is the database schema.

With order attached to the columns, each row in a relation is a tuple:

    (Gone with the Wind, 1939, 231, drama)

The form of a tule does not include the column headers, so tuples are meaningful only in the context of a specific relation.

In the relational model, each component of a tuple (the value of an attribute) must be atomic: a base type, not a structured type. Each component comes from a specific domain. We often include the attributes types in the relation's schema.

    Movies(title: string, year: integer, length: integer, genre: string)

Quick Exercise. As noted earlier, a relation is a set of tuples, not a list. How many different representations of the "Movies" relation are there? Attributes are a set, too. How many representations of "Movies" can we create by changing their order? This may seem like number trivia, but it will be helpful to remember these ideas later when we program. They are important for the theory we use to manipulate relations, and even more important as a tool for DBMSes to use when they optimize storage and access to the data.

3! = 6. 4! = 24. 3!4! = 144.

A relation's schema tends to stay the same (the create operation), but we expect the relation itself to change frequently (the read, *update*, *delete*, and *insert*) operations.

An instance of a relation is the specific set of tuples that exist at any moment. The DBMS usually contains only the current instance -- no history. However, when we reason about a relation, we reason over all possible instances. Consider what that means for constraints on the data values....

That said, we sometimes do need to change a schema. This is both expensive (due to size) and impractical (due to data availability: if we add a new attribute, where do the values for existing tuples come from?). I'm a proponent of agile software development, but this is an example of how different layers of an application change at different rates. Software is more malleable than database schema. It is worth planning ahead. (Story about Charlie Richter).

One of the strengths of the relational model is that it allows many kinds of constraint. We will devote considerable kinds to two kinds later in the course. Most are optional, but the model depends on one essential constraint. A key is a set of one or more attributes that uniquely identifies a tuple. No two tuples can share the same value for a key. This constraint holds for the entire relation over all possible instances. For example, in the current instance of the "Movies" relation, any attribute can be used to distinguish the tuples. But is that true for all possible instances? Is the title of the movie sufficient to distinguish it from all others? (Batman, King Kong, ...) Perhaps title + year is enough. This is a compound key.

Typographically, we identify a relation's key by underlining the attribute(s) in the schema:

    Movies(title, year, length, genre)

What if no set of keys will do the job? We can create an artificial keys. This is a new attribute added to the tuple as a unique identifier. Social security numbers and UNI student numbers are example of artificial keys. When defining an artificial key, the designer wants to choose a domain of values that is big enough to label all the unique tuples that we might want to add to the relation. UNI IDs come of from a domain of size one million... or one billion. (Story about SSNs, UNI IDs, and a CS student.)

Bigger Example

Consider movies as part of a larger database:

This database contains examples of all the terms we have just learned. Can you identify them? Look at the keys in particular. "MovieExec" needs an artifical key. In terms of the Entity Relationship model, "StarsIn" is a relationship, not an entity! Its key contains all of its attributes. (Why?) Other relationships are embedded the other relations as attributes. We'll consider how one can convert an ER model into a relational model soon.

Eugene Wallingford ..... wallingf@cs.uni.edu ..... August 27, 2020