These notes are very rough. I will improve them over time.
Recall our movie database with tables for Alice and Kurt:
name year length genre -------------------------------------------- Alice The Big Chill 1983 105 drama Blade Runner 1982 117 scifi The Princess Bride 2020 98 fantasy Lethal Weapon 1987 109 action Black Panther 2018 134 action The Princess Bride 1987 98 fantasy The Replacements 2000 118 comedy Star Trek 1979 132 scifi -------------------------------------------- Kurt Black Panther 2018 134 action The Princess Bride 1987 98 fantasy The Replacements 2000 118 comedy Star Trek 1979 132 scifi Lethal Weapon 1987 109 action Return of the Jedi 1983 131 scifi 48 Hrs. 1982 96 comedy The Big Chill 1983 105 drama
Write the SQL statements needed to...
When we create an index, it applies to a single table. So we need to create two indexes here:
sqlite> CREATE INDEX LengthIdx1 ON Alice(length); sqlite> CREATE INDEX LengthIdx2 ON Kurt(length);
The second task requires a cross product:
sqlite> .mode column sqlite> SELECT Alice.name, Kurt.name FROM Alice,Kurt WHERE Alice.length + Kurt.length < 210;
The result is not ideal. It contains pairs with repeats when both collections contain the same movie. The tuples are also in no particular order. So let's add a condition and a sorting clause:
sqlite> SELECT Alice.name, Kurt.name FROM Alice,Kurt WHERE Alice.length + Kurt.length < 210 AND Alice.name != Kurt.name ORDER BY Alice.name;
This list looks like it has duplicates, but remember that (name, year) is the key for these tables. There are two different "Princess Bride" movies in Alice's collection! So let's project the year columns, too:
sqlite> SELECT Alice.name, Alice.year, Kurt.name, Kurt.year FROM Alice,Kurt WHERE Alice.length + Kurt.length < 210 AND Alice.name != Kurt.name ORDER BY Alice.name;
One downside to this query is that it doesn't return back-to-back Princess Brides even when they are different versions. Does that seem reasonable? How would you remedy this situation?
Alas, this query does not use either of our indexes. It adds the two attributes before it compares the sum to a constant. There are limits to when SQL can use an index to improve performance.
Here is an SQL source file with the final version of our statements.
For fourteen weeks, we have been learning how to design, implement, and use relational databases. Most recently, we learned how to create views that simulate phyical tables and indexes that enable SQL to search tables more efficiently.
Throughout the semester, I have spoken of "relational databases" to distinguish them non-relational databases... What other kinds of database databases are there? When might we use them? Why do we focus our study on only one kind?
... the story from Day 1: flat files. Why a database? Concerns about data integrity and consistency, efficiency, and usability.
The relational model solves these problems using a small set of operators and clever implementations. It enabled users to "scale up" by adding more and more rows to tables with little or no loss of integrity, consistency, and usability.
Then... people found that the relational model was not perfect. It was easy to "scale up" but not "scale out" to distributed implementations. But it still provided the best trade-off of integrity, efficiency, usability, and cost, so companies stuck with it.
The first big break: MapReduce (story)
- Google wanted to index the World Wide Web - unstructured data (web pages) - data changing all the time - commodity hardware; cost of disk, RAM, and processor - parallelismLed to Hadoop, a new kind of software infrastructure for building DBs. Replace DBMS with platform.
NoSQL Systems
NoSQL databases are databases that store data in a format other than relational tables.
These alternatives began to emerge around 2000, as the cost of storage and memory decreased rapidly. They enabled users to ensure data integrity without needing to create and maintain a complex data model. As the cost of hardware dropped, developers became the primary cost of software development. NoSQL databases attempt to optimize for developer productivity.
Other motivations include simpler database designs, finer-grained control over availability, and simpler horizontal scaling to clusters of machines.
Types of NoSQL Databases
Four major types of NoSQL databases emerged over time:
Advantages of NoSQL
The various types of NoSQL databases tend to offer a similar set of benefits:
The Pendulum Swings Back
Scalable, reliable, flexible. That all sounds great. Why doesn't everyone use NoSQL all the time?
As many in industry moved to NoSQL, they learned that it, too, is not perfect. It was hard to match run-time efficiency of the relational DBMSs, and the constraints of relational model actually help to ensure correctness. (Think about type checking, Java v Python.) The pendulum began to swing back.
First, people implemented SQL interfaces on top of Hadoop and other NoSQL systems. Then came a new set of scalable databases that embraced SQL fully. Google led the way here, too, with a 2012 paper whose authors include the original MapReduce authors.
At the same time, the SQL community began to assimilate many of the features of NoSQL, adding new datatypes, better support for partitioning and replication, and the ability to write parallelized data definitions.
Researchers at Google wrote that application developers rely heavily on certain features in their DBMS:
They concluded that NoSQL methods are optimal for retrieving specific items and scanning ranges of values in tables. They work best in scenarios where retrieval is relatively straightforward.
SQL, on the other hand, provides significant value in expressing more complex data access patterns and pushing computation down to the data.
In networking, there is stack of technologies, from infrastructure such as copper and fiber wires on the bottom to applications such as the web and email on the top. In order to make the stack work, programmers end up writing a lot of "glue code" that ties things together. The thing that makes all of this work as well it does is IP, the Internet Protocol. It is the universal interface by which every other layer interacts with the other.
The domain of databases has a similar shape. There is stack of technologies, from infrastructure such as physical storage and B-trees on the bottom to applications such as student information systems and data visualization tools on the top. In order to make the stack work, programmers end up writing a lot of "glue code" that ties things together.
This is the power of SQL. It has become the universal interface for data analysis, which makes the relational model an essential element of all database systems. And, unlike IP, it's a human readable language -- a programming language.
So perhaps now you can understand the focus in this course on the relational data model and SQL. You are learning the basic ideas and skills that apply no matter what kind of database model you end up working with.
Knowing how to write programs gives you a new power. Whenever you encounter a problem, you can ask yourself, "How might a program help me solve this?"
The same is true for many more specialized CS skills. People who know how to implement a language interpreter can ask themselves, "How might a language help me solve this problem?"
The same is true for databases.
On Tuesday, I asked the question, "Can a database help me create a spell checker?"
On Thursday, I asked the question, "Can a database help me build a passphrase generator?"
You can use the tools you have learned this semester to represent, store, and use information. That's a power you can use to solve many problems.
If you would like to play around with files and SQL code for either demo, please do!
There is nothing new from this week on the final.
The material on the sxam will be split roughly like this: