CS 3140 Session 1

Why a Course on Database Systems

A database is an organized collection of data. Why do we need a course about databases, or any new ideas at all? We could just store data in a flat text file and process it.

Yes, we could. In this approach, a database is a set of flat files, and the database system consists of the programs that manipulate them. Consider, for instance, a database that models a digital music store such as Spotify. The database will hold information about entities such as artists, albums, and songs. The system will include programs for looking up songs, finding all the albums created by an artist, deleting songs that are no longer available, and so on.

The database can be stored as comma-separated value (CSV) files that the programs manage. Each entity will be stored in its own file. Each program will open one or more files, parse them, and return information. If the user wants to make a change the program will update one or more lines and write out the entire file.

Each entity will have its own set of attributes:

artist: name, country, rating
album : name, year of release, artist, songs it contains, rating
song: name, year of release, length, artist, rating

Information about the each entity is stored in its own file. In each file, the entries (called records) are delimited by new lines, and the attributes (called fields) within a record are delimited by a comma.

The data files collectively make up a database. The programs collectively make up a database management system.

Isn't this all we need? What could go wrong?

efficiency: each program has to parse the file every time it is opened
consistency: an artist's name appears in two files but with different spellings
maintainability: each program has to duplicate code that appears in other programs

Can you think of others?

Consistency relates to a more general concern: data integrity. How do we ensure that the artist name is the same for each of the artist's albums? What if a program overwrites an album's year with an invalid value?

Maintainability is a concern about implementation. How do we find records and fields of interest? What happens if we want to scale up and have multiple threads try to modify and write the same file at the same time?

Efficiency is about implementation, too, but also affects user experience. Inefficient programs make users unhappy. So do programs that corrupt a file when the machine crashes in the middle of an update.

There are other concerns, too: redundancy, security, and integration. These affect implementors and users alike.

Flat files are simple, which is good enough for many small or short-lived applications. But they don't meet the needs of most big applications or modern users. We need something more.

Course Details

Welcome to CS 3140, Database Systems. I am Eugene Wallingford, your instructor for the course. This is one of the classic systems courses that have been part of CS curricula for as long as CS programs have existed. It will have a little theory, a little design, and a little programming.

Earlier this week, I emailed you a link to the course home page:

http://www.cs.uni.edu/~wallingf/teaching/cs3140/

The web site there includes pointers to a full syllabus and links to all the materials you'll need for this course, including lecture notes, online activities, homework assignments, and exams. Bookmark in your browser. You'll want to go there regularly!

Study the course syllabus carefully, especially if you've never had me for a course, but even if you have. It lists the policies by which we will run this course. You will need to know these policies and when they apply.

You will also find a rough schedule for the semester on the home page, which also lists very tentative dates for homework and exams. I post almost everything I say, write, and do in class on the course website -- including lecture notes such as these. The notes for the course will occasionally be a little rougher than I like. This is my first time teaching CS 3140, so I'm creating all the course materials from scratch.

With the Covid-19 pandemic in progress, this course is being taught in a hybrid fashion:

one face-to-face session each week
one online activity each week

My plan at this point is to...

post readings for the week before the week starts
post the online activity on Monday, with deliverables due on Friday
post session notes on Thursday, after class

I don't take attendance in my classes, so requiring deliverables for the online portion of the course feels odd. But without the pull of coming to a class, many students want or need an incentive to do their online work. I'll post session notes after both groups have met, because I view them as a way to augment what you do in class, not replace it.

There are plenty of textbooks we could use in this course. Some are pretty good. Unfortunately, almost all of them are similar to the one we've used in the course over the last five years: it is much bigger than what we can cover in one semester, and it is very expensive. Fortunately, there are options, including some readings on the web.

We will use two online texts: Database Design, by Adrienne Watt, and Mastering Relational Databases: from Models to Querying, by Libby Shoop. I will assign readings from the former as we learn many of the fundamental elements and techniques of database design, primarily early in the course. We will use the latter when we move into the database programming phase of the course. It has the added advantage of being interactive, so that you can run code live and see what it does.

Oh, and about those session notes: I write lots of my own notes and give them to you.

There are many more details in the syllabus. At this point, almost every implementation detail of the course is subject to change! I will always give you advance notice of substantive changes.

Back to the Story

A database is an organized collection of related data that models some aspect of the real world. Spotify's database is an example. A database management system (DBMS) is the software that manages a database. Oracle, MySQL, MongoDB, and SQLite are examples. Many people confuse databases with DBMSes, or at least use the terms interchangeably. You will want to know and appreciate the difference.

A DBMS is the software that enables application programs to store and analyze information in a database. It is almost always designed to support the definition and creation of database, as well as querying, updating, and managing existing databases.

In the world of flat-file databases, there is a tight coupling between the logic of the database system and its implementation. The logical layer is what defines the entities and attributes that make up the database. The physical layer defines and implements the data are stored in the machine. In early DBMSes, the physical layer was defined and replicated in application code. This meant that if the client wanted to change the physical layer, then all of the application code had to be changed to match the new implementation. This made database applications difficult to build and maintain.

One way that computer scientists control or eliminate couple is to create a new level of abstraction. In 1970, Ted Codd proposed the relational model of databases to separate the logical and physical layers of database systems. Using the relational model,

Systems store data in a simple data structure called a relation.
Programs access the data through a high-level language,
The details of the physical storage are left up to the DBMS's implementation.

The abstract idea of a data model becomes central to any database application. It is a mathematical representation of data. In the relational model, data are represented using tables that resemble the entries in a flat-file database. But there are other models, too. In semistructured models, data are represented using trees or graphs. Whatever model is used, we must define the values that can be stored and the operations that can be performed on the data.

If you have taken Programming Languages and Paradigms, or even if you haven't, you may recognize this pair of definitions as what makes up a data type. Over the course of the semester, I think you will notice many relationship between data models and the conventional programs we write.

A data model is created by the designer of the database. Programmers refer to the data model when they write code that accesses the database.

the data model decouples the logical from the physical — *The data model decouples database users from the physical layer.*

A database course could be about many things, including:

theory
database design (admins)
database programming (users)
database implementation (creators)

This course focuses on the three of these topics:

how to design databases using a data model
how to use database by writing programs
how theory can inform both of these tasks

We will not implement a DBMS, though that could be a lot of fun! We will, however, learn a bit about different implementations and how they affect the design and use of the databases we care about.

We will study relations as our primary data model, with consideration of other models throughout the course and especially in our final unit. The relational model is the dominant model in both industry and academia, and it has a well-understood theoretical foundation: relational algebra. Relational algebra is the set of fundamental operations for manipulating tuples in a relation. Learning a little bit of theory can make us better database designers and programmers.

There are many languages for doing database programming. With our focus on the relational model, we will focus our programming efforts on the most common relational programming language in the world, SQL. There are many variations of SQL. We will likely do all of our programming using one of them, SQLite. In terms of learning ideas, any SQL will do. The basics of SQL are the same in all of them; what changes most is surface syntax. In practice, most programming languages abstract away the database anyway, so many people end up programming to a programming language's API.

SQLite offers at least two advantages that are particularly valuable to us:

We can use the SQLite in a browser, as our interactive text does.
An SQLite database is a single file, which makes it easy to to share, replace, and backup databases.

SQLite is also useful out in the world: it underpins Apple Core Data and the Android DB framework.

Our foray into the theory of relational algebra pays of when we do database programming, too. It gives us the tools we need to optimize our programs, which turns out to be really important when working with relational databases. Why?

Database programming centers on limited programming languages. We don't need the full power of a Turing-complete language, because our data model constrains what we do. Indeed, database programming is one of the few areas where a non-Turing-complete language makes a lot of sense. This leads to very succinct programs, which is good... except that it also creates some unique query optimization problems. Relational algebra will help us solve those problems. It is the theory that supports SQL.

And yes, I say S - Q - L.

Course Goals and Outcomes

Given all that, what can you expect to learn in this course? At the end of the semester, you should be able to:

Compare and contrast several different kinds of data model.
Use relational algebra to express database queries.
Use SQL to interact with database management systems.
Design database tables, using functional dependencies and normal forms.
Identify the trade-offs among various database systems techniques, including relational systems and distributed alternatives.

So there we are. This should be fun. I said earlier that this is one of the classic systems courses in CS. But in the grand scheme of things, "classic" doesn't mean the same thing here as it does in art, music, or literature. All of computing is young and fresh and still growing.

In fact, the word "database" turns 65 years old this year.

The first citation for database comes from a 1955 economics journal article. In 1962 it was written as two separate words in quotes in the following explanation: "A 'data base' is a collection of entries containing item information that can vary in its storage media and in the characteristics of its entries and items." Now it's so common a concept it needs no explanation or scare quotes.

(As that article says, so is the term "artificial intelligence"!)

We can learn and benefit from our history in intimate ways. The database course is a great example of that.

Life in the Time of Covid-19

As the syllabus reminds us, we are all adapting to the current health crisis. A few of these adaptations require our attention now:

Prepare a seating chart in class.
The final exam is on Saturday.
We are all wearing masks. This is essential to keep the class face-to-face. Students need to help keep each other and their professors healthy.

It's also university policy. "If a student does not wear a face covering (cloth or shield) throughout the class period, faculty should ask them first to put on a face covering. If they do not comply they should be asked to leave class and be referred to the student conduct process."

Eugene Wallingford ..... wallingf@cs.uni.edu ..... August 20, 2020

CS 3140 Database Systems

Fall 2020

Session 1

A Gentle Introduction
to Database Systems

Why a Course on Database Systems

Course Details

Back to the Story

Course Goals and Outcomes

Life in the Time of Covid-19

CS 3140 Database Systems

Fall 2020

Session 1

A Gentle Introduction to Database Systems

Why a Course on Database Systems

Course Details

Back to the Story

Course Goals and Outcomes

Life in the Time of Covid-19

A Gentle Introduction
to Database Systems