Hi! Welcome back for another five minutes in New Zealand with Data Mining with Weka. This is Lesson 1.3, and we're going to look at exploring datasets in this lesson. 

We looked at this data file in the last lesson. It's the weather data, a toy dataset of course. It has 14 days, or instances, and each instance, each day, is described by five attributes, four to do with the weather, and the last attribute, which we call the "class" value -- the thing that we're trying to predict, whether or not to play this unspecified game. This is called a classification problem. We're trying to predict the class value.

Let's open up Weka. It's here on my desktop. I'm going to go into the Explorer. We always use the Explorer. I'm going to open the file. I put the datasets in the "My Documents" folder, so I can see them here. Just open the Weka datasets and the nominal weather data. There's the weather data in Weka. 

As we saw last time, you can see the size of the dataset, the number of instances (14), and the attributes. You can click any of these attributes and get its values up here in this panel. At the bottom you also get a histogram of the attribute values with respect to the different class values: blue for "yes", play, and red for "no", don't play. By default, Weka treats the last attribute as the class value. You can change this here if you like, and choose to predict an attribute other than the last one.
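To make the histogram idea concrete, here is a minimal sketch in Python of the counting behind it: for each value of an attribute, how many instances fall into each class. This is not how Weka itself is implemented. The (outlook, play) pairs are assumed to match the nominal weather data, so treat them as illustrative, and the function name class_counts is hypothetical.

```python
from collections import Counter

# (outlook, play) pairs for 14 days. Assumed to match the nominal
# weather data; treat the values as illustrative.
rows = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]


def class_counts(rows):
    """Count how often each attribute value occurs with each class value."""
    counts = {}
    for value, label in rows:
        counts.setdefault(value, Counter())[label] += 1
    return counts


counts = class_counts(rows)
for value, by_class in counts.items():
    # One line per attribute value, e.g. the split of "yes" and "no" days.
    print(value, dict(by_class))
```

Each bar in the histogram corresponds to one attribute value, and its colored segments correspond to the per-class counts computed here.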

That's the weather dataset, and we've already explored that. As I said, it's a classification problem, sometimes called a supervised learning problem. It's supervised because you get to know the class values of the training instances. We take as input a dataset of classified examples; these are independent examples, each with a class value attached.

The idea is to automatically produce some kind of model that can classify new examples. That's a "classification" problem. Here is what the examples look like. This is an "instance": a fixed set of features, each with its attribute value. Then we add the class to that to get the classified example. 

That's what we have to have in our training dataset. These attributes, or features, can be discrete or continuous. What we looked at in the weather data were discrete; we call them nominal attribute values when they belong to a certain fixed set. Or they can be numeric or continuous values. Also, the class can be discrete or continuous. We're looking at a discrete class, "yes" or "no", in the case of the weather data. Another kind of machine learning problem would involve continuous classes, where you're trying to predict a number. That's called a "regression" problem in the trade.

I'm going to have a look at a similar dataset to the weather dataset: the numeric weather dataset. Let me just open that in Weka, weather.numeric.arff. Here it is. It's very similar, almost identical in fact, with 14 instances, 5 attributes, the same attributes. Maybe I should just look at this dataset in the edit panel. You can see here that two of the attributes -- temperature and humidity -- are numeric attributes, whereas previously they were nominal attributes. So here there are numbers. When we look at the attribute values for outlook, just as before, we have sunny, overcast and rainy. For temperature, though, we can't enumerate the values; there are too many numbers. Instead we have the minimum and maximum value, mean, and standard deviation. That's what Weka gives you for numeric values.
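As a rough illustration of those four summary figures, here is a small Python sketch. The temperature readings are assumed to match the numeric weather data, so treat them as illustrative. The helper name summarize is hypothetical, and using the sample standard deviation (statistics.stdev) is an assumption about which formula is wanted.

```python
import statistics

# 14 temperature readings. Assumed to match the numeric weather data;
# treat the values as illustrative.
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]


def summarize(values):
    """Return the minimum, maximum, mean and sample standard deviation."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample (n - 1) formula
    }


summary = summarize(temperature)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```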

I'm going to look at a different dataset. I'm going to look at the glass dataset, which is a rather more extensive dataset. It's a real world dataset, not a terribly big one. Let's open it. Here we've got 214 instances and 10 attributes. Here are the 10 attributes; it's not clear what they are. Let's look at the class, by default the last attribute shown. There are seven values for the class, and the labels of these values give you some indication of what this dataset is about. We have headlamps, tableware, and containers. Then we have building and vehicle windows, both float and non-float. You may not know this, but there are different ways of making glass, and the "floating" process is a way of making glass. These are seven different kinds of glass. 

What are the attribute values? I don't know what you remember about physics, and I guess it doesn't matter if you don't remember, but RI stands for the refractive index. 

It's always a good idea to check for reasonableness when you're looking at datasets. It's really important to get down and dirty with your data. Here we're looking at the values of the refractive index -- a minimum of 1.511, a maximum of 1.534. It's good to think about whether these are reasonable values for refractive index. If you go to the web and have a look around, you'll find that these are good values for the refractive index.

Na. If you did chemistry, you'll recognize Na as sodium. Here, it looks like these are percentages, the different percentages of sodium, magnesium (Mg), and so on. We would expect silicon (Si) to make up the majority of glass, and it does: it varies between 69.81% and 75.41%. These are percentages of different elements in the glass. 

We can confirm our guesses here by looking at the data file itself. Let me just find the glass data. It's in Weka datasets, and it's glass.arff. This is the ARFF file format. It starts with a bunch of comments about the glass database. Lines beginning with a percent sign (%) are comments. You can read about this, but we don't have time to read it now.

You can read about the attributes, and it does say that the attributes are refractive index, sodium, magnesium, and so on. And the type of glass, just like I said, is about windows, containers, tableware, and so on. We can get down to the end of the comments, and here we have stuff for Weka. This is the ARFF format. The relation has a name; you'll see it printed in the interface when you look. The attributes are defined; they are real-valued attributes, numeric attributes. The "type" attribute is nominal, and the different values of "type" are enumerated here in quotes. 

That defines the relation and the attributes. Then we have an '@data' line, and following that in the ARFF format are simply the instances, one after another, with the attribute values all on one line, ending with the class by default. This is the class value for the first instance. I think there are 214 instances here. There's the last one. That's the ARFF format. It is a very simple textual file format. 
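The layout just described (comments, a relation name, attribute definitions, then an '@data' line followed by one instance per line) is simple enough to read with a few lines of code. Here is a minimal sketch in Python, not Weka's own loader. The sample text is a tiny hypothetical file, not an excerpt from glass.arff, and the name parse_arff is made up for illustration. The sketch deliberately ignores quoted values, missing values and other parts of the format.

```python
# A tiny hypothetical file in the ARFF layout (not taken from glass.arff).
SAMPLE_ARFF = """\
% Lines starting with a percent sign are comments
@relation toy_glass
@attribute RI numeric
@attribute Si numeric
@attribute Type {float, non-float}
@data
1.52,72.5,float
1.51,73.1,non-float
"""


def parse_arff(text):
    """Parse a simplified ARFF text into (relation, attributes, data)."""
    relation, attributes, data = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # Convert numeric attributes to float; keep nominal ones as text.
            row = [float(v) if kind == "numeric" else v
                   for (_, kind), v in zip(attributes, values)]
            data.append(row)
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: the allowed values are enumerated.
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            elif spec.lower() in ("numeric", "real", "integer"):
                kind = "numeric"
            else:
                kind = "string"
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, data


relation, attributes, data = parse_arff(SAMPLE_ARFF)
print(relation, len(attributes), "attributes,", len(data), "instances")
```

The last attribute on each data line is the class here, following the default convention described above.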

Now we've confirmed our guesses about these numbers being percentages of different elements. We can think about this some more. It's important, then, that these numbers are reasonable. If they went negative, for example, that would indicate some kind of corrupted value -- you can't have a negative percentage. We're expecting silicon to be the majority component; we're expecting the refractive index to be in this kind of range. It's always a good idea when you get a dataset to just click around in the Weka interface and make sure things look real. Rather small amounts of aluminum in glass -- I guess that's not surprising; I don't know very much about glass myself. We're just kind of checking for reasonableness here -- a very good thing to do. That's it then. 
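The same kind of reasonableness check can be written down as a simple range test. The following Python sketch is one possible way to do it, under stated assumptions: the rows are made-up values rather than real glass data, the limits are hypothetical bounds chosen for illustration, and find_suspicious is a hypothetical helper name.

```python
def find_suspicious(rows, limits):
    """Return (row index, attribute, value) for values outside their limits."""
    problems = []
    for i, row in enumerate(rows):
        for name, (low, high) in limits.items():
            value = row[name]
            if not (low <= value <= high):
                problems.append((i, name, value))
    return problems


# Made-up instances, not real glass data.
rows = [
    {"RI": 1.52, "Si": 72.6, "Al": 1.3},
    {"RI": 1.51, "Si": -3.0, "Al": 1.1},  # negative percentage: corrupted
    {"RI": 2.90, "Si": 73.0, "Al": 0.9},  # implausible refractive index
]

# Hypothetical bounds: percentages must lie in 0..100, and the refractive
# index is given a loose plausible range.
limits = {"RI": (1.4, 1.7), "Si": (0.0, 100.0), "Al": (0.0, 100.0)}

problems = find_suspicious(rows, limits)
for index, name, value in problems:
    print(f"instance {index}: {name} = {value} looks wrong")
```

A check like this only catches values that are impossible or wildly off; it doesn't replace clicking around and looking at the data.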

In this lesson, we've looked at the classification problem. We've looked at the nominal weather data and the numeric weather data. We've talked about nominal versus numeric attributes, and we've talked about the ARFF file format. We've looked at the glass.arff dataset, and I've talked about sanity checking of attributes, and the importance of getting down and dirty with your data. 

If you'd like some further background on this, you can read Section 11.1 of the text and read about preparing the data and loading the data into the Explorer. Whether or not you do that, please go and look at the activity associated with this lesson. We'll see you soon. Bye!