A Handbook Of Small Data Sets

All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:eee:csdana:v:19:y:1995:i:1:p:101-101. See general information about how to correct material in RePEc.

For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: (Nithya Sathishkumar). General contact details of provider: http://www.elsevier.com/locate/csda .


Much of the practical use of statistics involves modeling and estimating the relationship among two or more variables. We will concentrate here on simple linear regression, which is the case of just two variables, Y and X. We first discuss the statistical model that relates the dependent variable Y to the independent variable X, and then introduce the curve-fitting method called least squares.

A Statistical Model for a Linear Relationship

Suppose that pairs of measurements are made where the mean of one variable is a linear function of the other. For example, suppose that Y is the average minimum January (AMJ) temperature for a city in the US, and X is its latitude. These two variables have been collected at the locations of the 50 largest cities in the US and are in the MATLAB data sets climate.jan and climate.lat. Examining a scatter plot of AMJ temperature versus latitude,

plot(climate.lat,climate.jan,'*')

one can see that these variables are related, and moreover the relationship appears to be linear. (Note that the MATLAB plot function uses the form plot(x,y), with the independent variable first.) A reasonable model for AMJ temperature is that, on average, it is equal to α + βX, where X is the latitude.

One problem with this fairly vague description is that it does not provide enough information to estimate α and β. Here is a more precise statistical model. Let (X₁,Y₁),...,(X₅₀,Y₅₀) denote the 50 pairs of latitude and AMJ temperatures. Then we will assume that

Y_i = α + βX_i + e_i , i = 1, 2, ... ,50

The random components e_i are assumed to be independent random variables with zero mean (E[e_i]=0) and a constant variance σ² (Var[e_i]=σ²). This model explains the variation in the AMJ temperatures as having two components: it is a linear function of the latitude plus a random perturbation. The combination of the two will give a scatterplot that appears roughly along a line but with some random departures from one case to another. This viewpoint can be exploited to estimate the linear relationship in an objective manner.
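To make the two components of the model concrete, here is a minimal simulation sketch. The parameter values (alpha = 100, beta = -2, sigma = 5) and the "latitudes" are invented purely for illustration and are not estimates from the climate data; the errors are drawn from a normal distribution, which is one convenient choice that satisfies the zero-mean, constant-variance assumptions.

```matlab
% Hypothetical parameter values, chosen only for illustration
% (they are NOT estimates from the climate data).
n     = 50;
alpha = 100;      % intercept
beta  = -2;       % slope
sigma = 5;        % standard deviation of the random component

x = linspace(25, 48, n)';        % made-up "latitudes"
e = sigma * randn(n, 1);         % independent errors: mean 0, variance sigma^2
y = alpha + beta * x + e;        % linear function plus random perturbation

plot(x, y, '*')                  % points scatter roughly along a line
hold on
plot(x, alpha + beta * x, '-')   % the underlying mean line alpha + beta*x
hold off
xlabel('x'), ylabel('y')
```

Because no random seed is set, the scatter changes from run to run, while the underlying line stays the same.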

Least Squares Estimates

For any choice of slope and intercept (β and α), one can predict the AMJ temperature from the latitude value. The idea of least squares is to find the choice of slope and intercept that gives the best fit between the observed AMJ temperatures and those predicted from the latitudes. Here is how this works in mathematical notation. Let

SS(α, β) = Σ_{i=1}^{n} (Y_i - (α + βX_i))² .

Then the least squares estimates of α and β are the values that minimize this sum of squares. For simple models and small data sets, the estimates of α and β have a formula that can be evaluated using a calculator. However, our interest is in using MATLAB to do the calculations so that we can look at larger data sets and eventually fit more complicated curves.
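For reference, the calculator formula for the straight-line case is the standard one: the slope estimate is Σ(X_i - X̄)(Y_i - Ȳ) / Σ(X_i - X̄)², and the intercept estimate is Ȳ minus the slope estimate times X̄. The sketch below evaluates these formulas with built-in MATLAB functions on a tiny made-up data set (not the climate data) and cross-checks them against polyfit; the variable names are our own.

```matlab
% Small made-up data set (NOT the climate data).
x = [1 2 3 4 5]';
y = [2.1 3.9 6.2 7.8 10.1]';

% Closed-form least squares estimates.
xbar = mean(x);
ybar = mean(y);
betaHat  = sum((x - xbar) .* (y - ybar)) / sum((x - xbar).^2);
alphaHat = ybar - betaHat * xbar;

% The sum of squares SS as a function of (alpha, beta).
SS = @(a, b) sum((y - (a + b * x)).^2);

% Cross-check with polyfit, which returns [slope intercept] for degree 1.
p = polyfit(x, y, 1);

fprintf('betaHat = %.4f, alphaHat = %.4f\n', betaHat, alphaHat);
fprintf('SS at the estimates  = %.4f\n', SS(alphaHat, betaHat));
fprintf('SS at a nearby point = %.4f\n', SS(alphaHat + 0.1, betaHat));
```

For these five points the estimates work out to a slope of 1.99 and an intercept of 0.05, and moving either value away from its estimate increases SS.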

The MATLAB Function `lm`

Recall from M-Lab 3 that the MATLAB function lm is activated by typing lm(data set). Here are the steps to use lm for the data set climate with a model that has climate.jan as the response (Y) and climate.lat as the independent variable (X), that is, Y_i = α + βX_i + e_i.

First, we type

lm(climate)

We then obtain the following message:

The MATLAB prompt will change to lm>> (but here we suppress that notation since you don't actually type it), and we define the model to fit using the command

model jan = lat

The results of this command are

The above output is similar to that shown in Lab 3 but also has some additional lines with the parameter estimates along with their standard errors (estimated standard deviations). For an explanation of some of the items, click on annotated output. Thus the fitted model is

Here is the MATLAB command to plot the response (Y=jan) versus the regressor (X=lat) and add the least squares line:

pplot by lat lines=1

To plot response versus regressor adding the fitted (or predicted) values, type

pplot by lat

The fitted values are circles in the plot, and the crossed points are the observed values.
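The pplot command belongs to the lab's lm environment. Outside that environment, a roughly comparable picture can be drawn with built-in functions only. The following is a sketch under that assumption, using a small made-up data set rather than the climate data; it is not a description of how pplot itself is implemented.

```matlab
% Small made-up data set (NOT the climate data).
x = [1 2 3 4 5]';
y = [2.1 3.9 6.2 7.8 10.1]';

p    = polyfit(x, y, 1);     % p(1) = slope, p(2) = intercept
yhat = polyval(p, x);        % fitted (predicted) values

% Observed values as crosses, fitted values as circles, plus the fitted line.
plot(x, y, 'x', x, yhat, 'o', x, yhat, '-')
legend('observed', 'fitted', 'least squares line')
xlabel('x'), ylabel('y')
```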

Scrutinizing the Residuals

The predicted values are actually Ŷ_k = α^ + β^X_k , 1 ≤ k ≤ n , where α^ and β^ are the least squares estimates. Another very important set of values is the residuals Y_k - Ŷ_k. Note that the residuals plus the predicted values equal the original Y values. The residuals estimate the random component of the model, the e_i in the model

Y_i = α + βX_i + e_i , i = 1, 2, ... ,50

If the model actually fits the data well, the residuals should appear randomly distributed and not have any patterns. In this case the standard deviation of the residuals estimates σ. (Actually we divide by n-2 instead of n-1 in the standard deviation formula. This estimate of σ is called 'Standard Error' in the output above.)
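The quantities just described can be computed directly. The sketch below uses a tiny made-up data set (not the climate data) and the built-in backslash operator, which returns the least squares solution for an overdetermined system; the variable names are our own choices.

```matlab
% Small made-up data set (NOT the climate data).
x = [1 2 3 4 5]';
y = [2.1 3.9 6.2 7.8 10.1]';
n = numel(x);

X = [ones(n, 1) x];     % design matrix: a column of ones for the intercept
b = X \ y;              % least squares estimates [alphaHat; betaHat]

yhat  = X * b;          % predicted (fitted) values
resid = y - yhat;       % residuals; note that resid + yhat reproduces y

% Estimate of sigma: divide the residual sum of squares by n-2, not n-1.
s = sqrt(sum(resid.^2) / (n - 2));

fprintf('estimate of sigma = %.4f\n', s);
```

The divisor n-2 reflects the two parameters (intercept and slope) that were estimated from the data.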

Usually we plot the residuals to look for

a) outliers (unusual points)

b) patterns which might indicate curvature

c) patterns which suggest that the variance of the errors e_i changes with the x variable

If the residuals do have some pattern or unusual values, then the model is possibly incorrect. In the case of outliers, we might decide to delete them from the data set. If there seems to be curvature, then a straight line fit is not appropriate. Transforming the y value (like log(y)) sometimes helps get rid of the curvature.
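As an illustration of the curvature case, the sketch below uses made-up data that grow exactly exponentially (an idealized assumption, not the climate data). A straight-line fit to y leaves residuals with a clear curved pattern, positive at both ends and negative in the middle, while a straight-line fit to log(y) leaves essentially nothing.

```matlab
% Made-up data with exact exponential growth (for illustration only).
x = (1:10)';
y = exp(0.5 * x);

% Straight-line fit to y: the residuals show a systematic curved pattern.
pRaw     = polyfit(x, y, 1);
residRaw = y - polyval(pRaw, x);

% Straight-line fit to log(y): the relationship is now exactly linear.
pLog     = polyfit(x, log(y), 1);
residLog = log(y) - polyval(pLog, x);

disp([x residRaw residLog])   % columns: x, raw residuals, log-scale residuals
```

Real data will not straighten out this perfectly, but the same comparison of residual patterns before and after the transformation applies.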

Note, though, that the residuals always have a mean of zero due to the least squares fitting process, and therefore we often draw a horizontal reference line at y=0. An example with the climate data is as follows. (If you are no longer in lm, type lm(climate) followed by model jan = lat.) Then type

rplot by lat

Note the two outliers in the upper right portion of the plot. We discussed those two cities in M-Lab 1. Another interesting residual plot shows the residuals versus the predicted values:

rplot by predict

This plot is often used to see if the variation about the center line is greater for larger values of the predicted Y's. If so, then a transformation like log(y) is often suggested. There is no evidence of a need for a transformation in this example.
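The rplot command is specific to the lab's lm environment. With built-in functions only, the two residual plots can be approximated as in the sketch below, which is our own assumption about a comparable display and uses a small made-up data set rather than the climate data. It also confirms numerically that the residuals average to zero.

```matlab
% Small made-up data set (NOT the climate data).
x = [1 2 3 4 5 6 7 8]';
y = [2.1 3.9 6.2 7.8 10.1 11.7 14.2 15.9]';

p     = polyfit(x, y, 1);    % straight-line least squares fit
yhat  = polyval(p, x);       % predicted values
resid = y - yhat;            % residuals

fprintf('mean of residuals = %.2e\n', mean(resid));

% Residuals versus the regressor, with a reference line at zero.
subplot(1, 2, 1)
plot(x, resid, '*', [min(x) max(x)], [0 0], '-')
xlabel('x'), ylabel('residual')

% Residuals versus the predicted values, with a reference line at zero.
subplot(1, 2, 2)
plot(yhat, resid, '*', [min(yhat) max(yhat)], [0 0], '-')
xlabel('predicted value'), ylabel('residual')
```

In a straight-line fit the predicted values are a linear function of x, so the two panels show the same pattern up to a rescaling (and a mirror image when the slope is negative); the second form becomes more informative once a model has several regressors.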

You can place a variable named resid that contains the residuals into the main workspace by typing (please do this)

output r=resid

Similarly you can place a variable named predict that contains the predicted values into the main workspace by typing (please do this)