CS 3140 Online Session 8

Online Session 8

Importing Data into SQLite

CS 3140
Database Systems

Opening Exercise

I have a database of stock price data that looks like this:

    AMAT|20050506|15.56|15.66|15.33|15.63|20778700

The data is in MetaStock ASCII 7-columns format. The seven columns represent the following:

the stock's symbol
a date string
the opening price of the stock that day
the high price that day
the low price that day
the closing price that day
the number of shares traded that day, called "volume"

The date is a string in "yymmdd" format. The prices are real numbers. The volume is an integer.

Let's practice both relational algebra and SQL.

Write the SQL statement to create a table to store this data.
Call your table "EndOfDayData".

Write a query in relational algebra that finds the closing price of Microsoft (symbol "MSFT") on each date in the database.

(Spoiler alert: the answer comes next...).

Opening Solution

Here is a CREATE TABLE statement to create the EndOfDayData table:

    $ sqlite3
    sqlite> CREATE TABLE EndOfDayData(
       ...>    symbol VARCHAR(5),
       ...>    date CHAR(8),
       ...>    open REAL,
       ...>    high REAL,
       ...>    low REAL,
       ...>    close REAL,
       ...>    volume INTEGER,
       ...>    PRIMARY KEY (symbol, date)
       ...>    );

I used VARCHAR(5) for the symbol's type because I noticed that, while most of the symbols in the file are four characters, a few are five. SQL and SQLite have several different formats for dates, but a string is actually simpler to work with in many cases.

The relational algebra query to extra all of Microsoft's closing prices requires two steps: a selection of the tuples for MSFT and a projection of the date and price attributes:

    π_{date, close}(σ_{symbol="MSFT"}(EndOfDayData)))

In general, the π and σ operators are commutative, since one drops tuple and the other drops columns. In this case, though, we need one of the dropped columns (symbol) to select the desired tuples. So we have to compute σ first and π second.

It would be cool if we had the data in our database, too, so that we could find the actual data for Microsoft. But my datafile contains 16,420 lines and over 680,000 bytes. You've seen me type... We need a better solution than entry by hand...

Importing Data into SQLite Databases

Over the course of the rest of the session, you will create a database using a text datafile named "NASDAQ_200505.txt". Your submission for the exercise is the database you create. Download the datafile using this link now and save it to your computer. Start SQLite in the same directory. If you are doing your SQLite work on student.cs., copy the file to student.cs. and log in before proceeding.

The database I referred to in the exercise does not exist yet. What I have is a comma-separated values (CSV) file named NASDAQ_200505.txt containing end-of-day stock prices for the week of May 2-May 6, 2005. A company that sells stock data services on the web gives away a week's worth of old data as part of its sales pitch. I like free data to play with, so I grabbed a week's worth once!

The data in the file actually looks like this:

    AMAT,20050506,15.56,15.66,15.33,15.63,20778700

CSV is a widely-used format out in the world, so it would be convenient if we could load tables from CSV files. Fortunately, SQLite supports reading and writing data in a number of formats, including CSV. If you'd like to read about them, check out the documentation for SQLite's command-line shell.

In SQLite, we import data from a file using the .import dot command. To import my NASDAQ data file, we follow these steps:

Create the table.

I saved my CREATE TABLE statement from the opening exercise in an SQL file named EndOfDayData-schema.sql so that I can read it in to a new SQLite session:
```
    $ sqlite3
    sqlite> .read EndOfDayData-schema.sql
    sqlite> .schema
    CREATE TABLE EndOfDayData(...);
```

Set the data mode to match the file's format.

We do this using the .mode dot command. My data file is in CSV format, so:
```
    sqlite> .mode csv
```

Import the file's data into the table.

    sqlite> .import NASDAQ_200505.txt EndOfDayData

I like SQLite's "list" mode for displaying data, so I changed my mode back immediately:

    sqlite> .mode list

The commands all ran, but do I have data? Let's see...

    sqlite> SELECT * FROM EndOfDayData;
    AABC|20050502|14.4|14.4|14.4|14.4|0
    AACB|20050502|23.69|23.71|23.58|23.71|400
    ...
    ZRAN|20050506|10.94|11.1|10.83|10.89|650300
    ZVXI|20050506|3.93|3.93|3.4|3.4|3700

The "..." elides a lot of data. It seems to be all there. Yay!

Querying the Data

Now we are ready to query our data. Our task was to find the closing price of Microsoft (symbol "MSFT") on each date in the database. We constructed this relational algebra query:

    π_{date, close}(σ_{symbol="MSFT"}(EndOfDayData)))

We can translate this relational algebra query directly into SQL. The σ expression turns into a SELECT statement:

    sqlite> SELECT * FROM EndOfDayData WHERE symbol="MSFT";
    MSFT|20050502|25.23|25.36|24.95|25.23|54438400
    MSFT|20050503|25.13|25.4|25.09|25.36|68183200
    MSFT|20050504|25.34|25.4|25.11|25.21|87075500
    MSFT|20050505|25.2|25.33|25.08|25.23|59389700
    MSFT|20050506|25.33|25.48|25.19|25.22|64338800

The next step in the relational algebra query is to project the desired columns using the π operator. In SQL, this will simply be another SELECT statement, with column names in place of the asterisk. What table do we select from? The table produced by our SELECT statement in the preceding paragraph!

It turns out that we can do that in SQL, too, using a new idea: the sub-query. We use the output of a SELECT statement as the value given to another SELECT statement:

    sqlite> SELECT date, close FROM
       ...>   (SELECT * FROM EndOfDayData WHERE symbol="MSFT");
    20050502|25.23
    20050503|25.36
    20050504|25.21
    20050505|25.23
    20050506|25.22

This is a very cool idea that we will soon put to use in writing more complex SQL queries. In this case, though, we know that we can find our answer using one SELECT statement that combines the selection and the projection:

    sqlite> SELECT date, close FROM EndOfDayData WHERE symbol="MSFT";
    20050502|25.23
    20050503|25.36
    20050504|25.21
    20050505|25.23
    20050506|25.22

This example gives you merely a taste of what more is possible in SQL. We will begin to explore the language in much more detail next, week including the idea of sub-queries.

Oh, one more thing... Now that we have an SQLite database for my NASDAQ file, let's save it for future use:

    sqlite> .save NASDAQ_200505.db
    sqlite>

Check out the size of the NASDAQ_200505.db file. It's about 45 times bigger than the customer-example.db database we have been using in previous examples -- despite the fact that it contains more than 45 times as much data. This tells us that SQLite stores its data efficiently in memory. We'll learn more about that, too, later in the course.

Submission for This Session

Email me your NASDAQ_200505.db database file by 11:59 PM on Wednesday, October 14.

If you are feeling adventurous, try a few more SQL queries of your own design against the database. Let me know if you learn anything interesting about the data!

As always, if you have any questions, please let me know as soon as you can. This kind of programming can be a lot of fun if you aren't banging into technical difficulties.

Eugene Wallingford ..... wallingf@cs.uni.edu ..... October 9, 2020