I have a database of stock price data that looks like this:
AMAT|20050506|15.56|15.66|15.33|15.63|20778700
The data is in MetaStock ASCII 7-columns format. The seven columns represent the following:
Let's practice both relational algebra and SQL.
(Spoiler alert: the answer comes next...).
Here is a CREATE TABLE statement to create the EndOfDayData table:
$ sqlite3 sqlite> CREATE TABLE EndOfDayData( ...> symbol VARCHAR(5), ...> date CHAR(8), ...> open REAL, ...> high REAL, ...> low REAL, ...> close REAL, ...> volume INTEGER, ...> PRIMARY KEY (symbol, date) ...> );
I used VARCHAR(5) for the symbol's type because I noticed that, while most of the symbols in the file are four characters, a few are five. SQL and SQLite have several different formats for dates, but a string is actually simpler to work with in many cases.
The relational algebra query to extra all of Microsoft's closing prices requires two steps: a selection of the tuples for MSFT and a projection of the date and price attributes:
πdate, close(σsymbol="MSFT"(EndOfDayData)))
In general, the π and σ operators are commutative, since one drops tuple and the other drops columns. In this case, though, we need one of the dropped columns (symbol) to select the desired tuples. So we have to compute σ first and π second.
It would be cool if we had the data in our database, too, so that we could find the actual data for Microsoft. But my datafile contains 16,420 lines and over 680,000 bytes. You've seen me type... We need a better solution than entry by hand...
Over the course of the rest of the session, you will create a database using a text datafile named "NASDAQ_200505.txt". Your submission for the exercise is the database you create. Download the datafile using this link now and save it to your computer. Start SQLite in the same directory. If you are doing your SQLite work on student.cs., copy the file to student.cs. and log in before proceeding.
The database I referred to in the exercise does not exist yet. What I have is a comma-separated values (CSV) file named NASDAQ_200505.txt containing end-of-day stock prices for the week of May 2-May 6, 2005. A company that sells stock data services on the web gives away a week's worth of old data as part of its sales pitch. I like free data to play with, so I grabbed a week's worth once!
The data in the file actually looks like this:
AMAT,20050506,15.56,15.66,15.33,15.63,20778700
CSV is a widely-used format out in the world, so it would be convenient if we could load tables from CSV files. Fortunately, SQLite supports reading and writing data in a number of formats, including CSV. If you'd like to read about them, check out the documentation for SQLite's command-line shell.
In SQLite, we import data from a file using the .import dot command. To import my NASDAQ data file, we follow these steps:
$ sqlite3 sqlite> .read EndOfDayData-schema.sql sqlite> .schema CREATE TABLE EndOfDayData(...);
sqlite> .mode csv
sqlite> .import NASDAQ_200505.txt EndOfDayData
I like SQLite's "list" mode for displaying data, so I changed my mode back immediately:
sqlite> .mode list
The commands all ran, but do I have data? Let's see...
sqlite> SELECT * FROM EndOfDayData; AABC|20050502|14.4|14.4|14.4|14.4|0 AACB|20050502|23.69|23.71|23.58|23.71|400 ... ZRAN|20050506|10.94|11.1|10.83|10.89|650300 ZVXI|20050506|3.93|3.93|3.4|3.4|3700The "..." elides a lot of data. It seems to be all there. Yay!
Now we are ready to query our data. Our task was to find the closing price of Microsoft (symbol "MSFT") on each date in the database. We constructed this relational algebra query:
πdate, close(σsymbol="MSFT"(EndOfDayData)))
We can translate this relational algebra query directly into SQL. The σ expression turns into a SELECT statement:
sqlite> SELECT * FROM EndOfDayData WHERE symbol="MSFT"; MSFT|20050502|25.23|25.36|24.95|25.23|54438400 MSFT|20050503|25.13|25.4|25.09|25.36|68183200 MSFT|20050504|25.34|25.4|25.11|25.21|87075500 MSFT|20050505|25.2|25.33|25.08|25.23|59389700 MSFT|20050506|25.33|25.48|25.19|25.22|64338800
The next step in the relational algebra query is to project the desired columns using the π operator. In SQL, this will simply be another SELECT statement, with column names in place of the asterisk. What table do we select from? The table produced by our SELECT statement in the preceding paragraph!
It turns out that we can do that in SQL, too, using a new idea: the sub-query. We use the output of a SELECT statement as the value given to another SELECT statement:
sqlite> SELECT date, close FROM ...> (SELECT * FROM EndOfDayData WHERE symbol="MSFT"); 20050502|25.23 20050503|25.36 20050504|25.21 20050505|25.23 20050506|25.22
This is a very cool idea that we will soon put to use in writing more complex SQL queries. In this case, though, we know that we can find our answer using one SELECT statement that combines the selection and the projection:
sqlite> SELECT date, close FROM EndOfDayData WHERE symbol="MSFT"; 20050502|25.23 20050503|25.36 20050504|25.21 20050505|25.23 20050506|25.22
This example gives you merely a taste of what more is possible in SQL. We will begin to explore the language in much more detail next, week including the idea of sub-queries.
Oh, one more thing... Now that we have an SQLite database for my NASDAQ file, let's save it for future use:
sqlite> .save NASDAQ_200505.db sqlite>Check out the size of the NASDAQ_200505.db file. It's about 45 times bigger than the customer-example.db database we have been using in previous examples -- despite the fact that it contains more than 45 times as much data. This tells us that SQLite stores its data efficiently in memory. We'll learn more about that, too, later in the course.
Email me your NASDAQ_200505.db database file by 11:59 PM on Wednesday, October 14.
If you are feeling adventurous, try a few more SQL queries of your own design against the database. Let me know if you learn anything interesting about the data!
As always, if you have any questions, please let me know as soon as you can. This kind of programming can be a lot of fun if you aren't banging into technical difficulties.