CS 3140 Session 13

Session 14

The Final Query

CS 3140
Database Systems

These notes are very rough. I will improve them over time.

Opening Exercise

Recall our movie database with tables for Alice and Kurt:

    name                 year   length   genre
    --------------------------------------------     Alice
    The Big Chill        1983   105      drama
    Blade Runner         1982   117      scifi
    The Princess Bride   2020   98       fantasy
    Lethal Weapon        1987   109      action
    Black Panther        2018   134      action
    The Princess Bride   1987   98       fantasy
    The Replacements     2000   118      comedy
    Star Trek            1979   132      scifi
    --------------------------------------------     Kurt
    Black Panther        2018   134      action
    The Princess Bride   1987   98       fantasy
    The Replacements     2000   118      comedy
    Star Trek            1979   132      scifi
    Lethal Weapon        1987   109      action
    Return of the Jedi   1983   131      scifi
    48 Hrs.              1982   96       comedy
    The Big Chill        1983   105      drama

Write the SQL statements needed to...

Create an index on the length attribute of each table.
Find all pairs of an Alice film followed by a Kurt film that we can watch in less than 3 1/2 hours combined.

Opening Solution

When we create an index, it applies to a single table. So we need to create two indexes here:

    sqlite> CREATE INDEX LengthIdx1 ON Alice(length);
    sqlite> CREATE INDEX LengthIdx2 ON Kurt(length);

The second task requires a cross product:

    sqlite> .mode column
    sqlite> SELECT Alice.name, Kurt.name
            FROM Alice,Kurt
            WHERE Alice.length + Kurt.length < 210;

The result is not ideal. It contains pairs with repeats when both collections contain the same movie. The tuples are also in no particular order. So let's add a condition and a sorting clause:

    sqlite> SELECT Alice.name, Kurt.name
            FROM Alice,Kurt
            WHERE Alice.length + Kurt.length < 210 AND
                  Alice.name != Kurt.name
            ORDER BY Alice.name;

This list looks like it has duplicates, but remember that (name, year) is the key for these tables. There are two different "Princess Bride" movies in Alice's collection! So let's project the year columns, too:

    sqlite> SELECT Alice.name, Alice.year, Kurt.name, Kurt.year
            FROM Alice,Kurt
            WHERE Alice.length + Kurt.length < 210 AND
                  Alice.name != Kurt.name
            ORDER BY Alice.name;

One downside to this query is that it doesn't return back-to-back Princess Brides even when they are different versions. Does that seem reasonable? How would you remedy this situation?

Alas, this query does not use either of our indexes. It adds the two attributes before it compares the sum to a constant. There are limits to when SQL can use an index to improve performance.

Here is an SQL source file with the final version of our statements.

One Last "Where Are We?"

For fourteen weeks, we have been learning how to design, implement, and use relational databases. Most recently, we learned how to create views that simulate phyical tables and indexes that enable SQL to search tables more efficiently.

Throughout the semester, I have spoken of "relational databases" to distinguish them non-relational databases... What other kinds of database databases are there? When might we use them? Why do we focus our study on only one kind?

Alternatives to the Relational Model

... the story from Day 1: flat files. Why a database? Concerns about data integrity and consistency, efficiency, and usability.

The relational model solves these problems using a small set of operators and clever implementations. It enabled users to "scale up" by adding more and more rows to tables with little or no loss of integrity, consistency, and usability.

Then... people found that the relational model was not perfect. It was easy to "scale up" but not "scale out" to distributed implementations. But it still provided the best trade-off of integrity, efficiency, usability, and cost, so companies stuck with it.

The first big break: MapReduce (story)

    - Google wanted to index the World Wide Web
    - unstructured data (web pages)
    - data changing all the time
    - commodity hardware; cost of disk, RAM, and processor
    - parallelism

Led to Hadoop, a new kind of software infrastructure for building DBs. Replace DBMS with platform.

NoSQL Systems

NoSQL databases are databases that store data in a format other than relational tables.

These alternatives began to emerge around 2000, as the cost of storage and memory decreased rapidly. They enabled users to ensure data integrity without needing to create and maintain a complex data model. As the cost of hardware dropped, developers became the primary cost of software development. NoSQL databases attempt to optimize for developer productivity.

Other motivations include simpler database designs, finer-grained control over availability, and simpler horizontal scaling to clusters of machines.

Types of NoSQL Databases

Four major types of NoSQL databases emerged over time:

Document databases. Store data in everything from unstructured text to structured documents using formats such as JSON. The database contains pairs of fields and values. A flexible data model and powerful text-based query languages make document databases useful in a wide variety of settings, including as a general purpose database. They can scale out horizontally to accomodate large data volumes. Example: MongoDB.

Key-value databases. A simpler type of database where each entry consists of a key and a value. (Think: a Python dictionary or a Java map.) Generally, a value can be retrieved only by referencing its key, so querying is simple. But that simplicity is a limitation. This kind of database is most useful when you have a lot of data but don't need to perform complex queries to retrieve it, such as user preferences and caching. Example: Redis.

Wide-column stores. They store data in rows of a table, like the relational model, but with dynamic columns. This adds a lot of flexibility at the cost of complexity managing the data. They are most useful when you have a lot of data and can predict up front the kind of queries users need most often, such as Internet of Things data and user profiles. Example: Cassandra.

Graph databases. Store data as nodes and edges. Nodes typically store information about objects, and edges store information about the relationships between the nodes. This kind of database gives you a lot of power to traverse relationships to look for patterns across objects, such as social networks and fraud detection. Example: Neo4j.

Advantages of NoSQL

The various types of NoSQL databases tend to offer a similar set of benefits:

scalability -- As originally designed, relational databases made it easy to scale up (adding more and more tuples) but harder to scale out (adding more components in parallel to spread out a load). In contrast, NoSQL databases make it much easier to maintain performance as your workload grows beyond the limits of a single server.

distributed data -- Some core functionality of the relational model works best when everything is on one server, which isn't always possible. NoSQL databases are designed as distributed systems, even globally distributed. Good performance across servers; ability to segregate data and comply with data sovereignty and privacy regulations.

reliability -- Distribution of data and processing allows for replication and built-in failover. Adding similar features to SQL databases requires resources beyond the database itself, such as a separate clustering layer.

flexibility -- With fewer constraints on the data model, it is easier to test new ideas and to modify the database as application requirements evolve.

The Pendulum Swings Back

Scalable, reliable, flexible. That all sounds great. Why doesn't everyone use NoSQL all the time?

As many in industry moved to NoSQL, they learned that it, too, is not perfect. It was hard to match run-time efficiency of the relational DBMSs, and the constraints of relational model actually help to ensure correctness. (Think about type checking, Java v Python.) The pendulum began to swing back.

First, people implemented SQL interfaces on top of Hadoop and other NoSQL systems. Then came a new set of scalable databases that embraced SQL fully. Google led the way here, too, with a 2012 paper whose authors include the original MapReduce authors.

At the same time, the SQL community began to assimilate many of the features of NoSQL, adding new datatypes, better support for partitioning and replication, and the ability to write parallelized data definitions.

Researchers at Google wrote that application developers rely heavily on certain features in their DBMS:

a robust query language -- Developers have to write complex code to process and aggregate the data.
a robust query execution engine -- The execution of queries is tightly integrated with other architectural features, such as consistency and replication.

They concluded that NoSQL methods are optimal for retrieving specific items and scanning ranges of values in tables. They work best in scenarios where retrieval is relatively straightforward.

SQL, on the other hand, provides significant value in expressing more complex data access patterns and pushing computation down to the data.

So Here We Are

In networking, there is stack of technologies, from infrastructure such as copper and fiber wires on the bottom to applications such as the web and email on the top. In order to make the stack work, programmers end up writing a lot of "glue code" that ties things together. The thing that makes all of this work as well it does is IP, the Internet Protocol. It is the universal interface by which every other layer interacts with the other.

The domain of databases has a similar shape. There is stack of technologies, from infrastructure such as physical storage and B-trees on the bottom to applications such as student information systems and data visualization tools on the top. In order to make the stack work, programmers end up writing a lot of "glue code" that ties things together.

This is the power of SQL. It has become the universal interface for data analysis, which makes the relational model an essential element of all database systems. And, unlike IP, it's a human readable language -- a programming language.

So perhaps now you can understand the focus in this course on the relational data model and SQL. You are learning the basic ideas and skills that apply no matter what kind of database model you end up working with.

Bonus SQL: Using a Database to Solve a Problem

Knowing how to write programs gives you a new power. Whenever you encounter a problem, you can ask yourself, "How might a program help me solve this?"

The same is true for many more specialized CS skills. People who know how to implement a language interpreter can ask themselves, "How might a language help me solve this problem?"

The same is true for databases.

On Tuesday, I asked the question, "Can a database help me create a spell checker?"

On Thursday, I asked the question, "Can a database help me build a passphrase generator?"

You can use the tools you have learned this semester to represent, store, and use information. That's a power you can use to solve many problems.

If you would like to play around with files and SQL code for either demo, please do!

spell checker: zip file | directory
passphrase generator: zip file | directory

All That's Left Is The Final

There is nothing new from this week on the final.

The material on the sxam will be split roughly like this:

material on Exam 1: 30%
material on Exam 2: 60%
material since Exam 2: 10%

Eugene Wallingford ..... wallingf@cs.uni.edu ..... November 19, 2020