In our first session, we learned that database systems decouple code that uses data from the code that implements the database. While more abstract than the physical layer, the logical model still tends to be written in technical language and to represent data in a way that does not reflect the problem domain directly. This led database designers to tease out a third layer, the idea of conceptual design, which models the data as users of the system might think of it, expressed in the vocabulary of a domain. This leaves us with three layers:
There are several kinds of conceptual model. In this week's online session, you begin to explore the Entity Relationship Model, still the most commonly used conceptual model in industry. Other techniques have evolved out of objected-oriented development practices, such as Unified Modeling Language (UML) and Object Description Language (ODL).
There are also several kinds of logical model. Today, we briefly consider them and then begin our study of the relational model, a powerful technique that will serve as our primary focus for the course.
A data model is a notation for describing information. At its simplest, a dota model consists of three elements:
Over time, programmer have developed many kinds of data model. These days, there are two basic classes used in practice:
As mentioned last week, the relational model stores data in tables.
title year length genre ------------------------------------------- Gone with the Wind 1939 231 drama Star Wars 1977 124 scifi Wayne's World 1992 95 comedy
The data structure looks like an array of records or structs, which makes it easy for humans to read and reason about. In practice, though, for efficiency reasons, relational data is almost never stored in an array. The set of operations supported by the model is called relational algebra, which we will begin studying soon. The relational model supports a wide variety of constraints beyond data types on its fields, including constraints across tables. We will consider a couple of the useful classes of constraints beginning in a few weeks.
In a semistructured data model, information is represented in a network of nodes that provides high-level structure to the data, often hierarchical. Common forms of semistructured data are XML and JSON.
The data structure resembles a set of nested, tagged nodes. The operations for manipulating such data draw on search algorithms for graphs: different ways to walk paths in the tree. Constraints here can also be more than data types, often relating data with different tags. For example, must every movie element contain a length attribute? Can the same movie contain two <year> attributes?
Since the rise of object-oriented programming in industry, there has been a trend to add OO concepts to both kinds of model. We may have a chance to look at some of these variants later in semester.
Why the relational model? Why does the relational model remain so dominant in both industry and academia? Semistructured models seem to be more flexible, subsuming full graph structures or even the tables we use in relational models. When databases become large and use cases more varied, two forces affect database modeling and programming:
With the relational model, we can achieve favorable trade-offs on both dimensions, due in large part, surprisingly, to its limitations:
So: the relational data model remains the focus of industry, academia, and this course. For this course, it has one more advantage: well-developed theory supports it, which means that we can understand models at a high level, express ideas clearly, and see how things work inside the DBMS.
Let's return to our example table of movie data:
title year length genre ------------------------------------------- Gone with the Wind 1939 231 drama Star Wars 1977 124 scifi Wayne's World 1992 95 comedy
It provides us with all we need to learn the basic vocabulary of the relational data model.
A relation is a two-dimensional table. We might call our relation "Movies". The rows correspond to movies, and the columns denote the properties of movies. The rows are a set, not a list. Order does not matter. "Movies" contains information about three movies.
Each column is named by an attribute. In this model, movies are stored with their names, year of release, length, and genre. The attribute names describe meaning of the information in the corresponding column.
A schema is the name of a relation plus the set of attributes.
Movies(title, year, length, genre)
The attributes also form a set, not a list. In principle, order does not matter. In practice, though, we talk about the columns more generally than we talk about the rows, so it's beneficial to specify a standard ordering.
A database consists of one or more relations. The set of schemas for those relations is the database schema.
With order attached to the columns, each row in a relation is a tuple:
(Gone with the Wind, 1939, 231, drama)The form of a tule does not include the column headers, so tuples are meaningful only in the context of a specific relation.
In the relational model, each component of a tuple (the value of an attribute) must be atomic: a base type, not a structured type. Each component comes from a specific domain. We often include the attributes types in the relation's schema.
Movies(title: string, year: integer, length: integer, genre: string)
Quick Exercise. As noted earlier, a relation is a set of tuples, not a list. How many different representations of the "Movies" relation are there? Attributes are a set, too. How many representations of "Movies" can we create by changing their order? This may seem like number trivia, but it will be helpful to remember these ideas later when we program. They are important for the theory we use to manipulate relations, and even more important as a tool for DBMSes to use when they optimize storage and access to the data.
3! = 6. 4! = 24. 3!4! = 144.
A relation's schema tends to stay the same (the create operation), but we expect the relation itself to change frequently (the read, *update*, *delete*, and *insert*) operations.
An instance of a relation is the specific set of tuples that exist at any moment. The DBMS usually contains only the current instance -- no history. However, when we reason about a relation, we reason over all possible instances. Consider what that means for constraints on the data values....
That said, we sometimes do need to change a schema. This is both expensive (due to size) and impractical (due to data availability: if we add a new attribute, where do the values for existing tuples come from?). I'm a proponent of agile software development, but this is an example of how different layers of an application change at different rates. Software is more malleable than database schema. It is worth planning ahead. (Story about Charlie Richter).
One of the strengths of the relational model is that it allows many kinds of constraint. We will devote considerable kinds to two kinds later in the course. Most are optional, but the model depends on one essential constraint. A key is a set of one or more attributes that uniquely identifies a tuple. No two tuples can share the same value for a key. This constraint holds for the entire relation over all possible instances. For example, in the current instance of the "Movies" relation, any attribute can be used to distinguish the tuples. But is that true for all possible instances? Is the title of the movie sufficient to distinguish it from all others? (Batman, King Kong, ...) Perhaps title + year is enough. This is a compound key.
Typographically, we identify a relation's key by underlining the attribute(s) in the schema:
Movies(title, year, length, genre)
What if no set of keys will do the job? We can create an artificial keys. This is a new attribute added to the tuple as a unique identifier. Social security numbers and UNI student numbers are example of artificial keys. When defining an artificial key, the designer wants to choose a domain of values that is big enough to label all the unique tuples that we might want to add to the relation. UNI IDs come of from a domain of size one million... or one billion. (Story about SSNs, UNI IDs, and a CS student.)
Consider movies as part of a larger database:
This database contains examples of all the terms we have just learned. Can you identify them? Look at the keys in particular. "MovieExec" needs an artifical key. In terms of the Entity Relationship model, "StarsIn" is a relationship, not an entity! Its key contains all of its attributes. (Why?) Other relationships are embedded the other relations as attributes. We'll consider how one can convert an ER model into a relational model soon.