A Handbook Of Small Data Sets

All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:eee:csdana:v:19:y:1995:i:1:p:101-101. See general information about how to correct material in RePEc.

For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: (Nithya Sathishkumar). General contact details of provider: http://www.elsevier.com/locate/csda .


Much of the practical use of statistics involves modeling and estimating the relationship among two or more variables. We will concentrate here on simple linear regression, which is the case of just two variables, Y and X. We first discuss the statistical model which relates the dependent variable Y to the independent variable X, and then introduce the curve fitting method called least squares.

A Statistical Model for a Linear Relationship

Suppose that pairs of measurements are made where the mean of one variable is a linear function of the other. For example, suppose that Y is the average minimum January (AMJ) temperature for a city in the US, and X is its latitude. These two variables have been collected at the locations of the 50 largest cities in the US and are in the MATLAB data set variables climate.jan and climate.lat. Examining a scatter plot of AMJ temperature versus latitude,

plot(climate.lat,climate.jan,'*')

one can see that these variables are related, and moreover the relationship appears to be linear. (Note that the MATLAB plot function uses the form plot(x,y), with the independent variable first.) A reasonable model for AMJ temperature is that on the average, it is equal to α+βX where X is the latitude.

One problem with this fairly vague description is that it does not provide enough information to estimate α and β. Here is a more precise statistical model. Let (X1,Y1),...,(X50,Y50) denote the 50 pairs of latitude and AMJ temperatures. Then we will assume that

Yi = α + βXi + ei , i = 1, 2, ... ,50


The random components ei are assumed to be independent random variables with zero mean (E[ei]=0) and a constant variance σ² (Var[ei]=σ²). This model explains the variation in the AMJ temperatures as having two components: it is a linear function of the latitude plus a random perturbation. The combination of the two will give a scatterplot that appears roughly along a line but with some random departures from one case to another. This viewpoint can be exploited to estimate the linear relationship in an objective manner.
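To make the two components of the model concrete, here is a minimal simulation sketch in core MATLAB. Every number in it (intercept, slope, error standard deviation, and the evenly spaced x values standing in for latitudes) is a hypothetical choice for illustration and is not taken from the climate data.

```matlab
% Simulate pairs (X_i, Y_i) from the model Y_i = a + b*X_i + e_i.
% All parameter values below are hypothetical, chosen only for illustration.
n     = 50;                  % number of pairs
a     = 100;                 % hypothetical intercept (alpha)
b     = -2;                  % hypothetical slope (beta)
sigma = 5;                   % hypothetical error standard deviation

x = linspace(25, 50, n)';    % evenly spaced stand-in for latitudes
e = sigma * randn(n, 1);     % independent errors with mean 0, variance sigma^2
line_part = a + b * x;       % systematic part: linear function of x
y = line_part + e;           % observed value: line plus random perturbation
```

Typing plot(x,y,'*') after running this should show points scattered roughly along a line, as described above.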


Least Squares Estimates

For any choice of slope and intercept (β and α), one can predict the AMJ temperature from the latitude value. The idea of least squares is to find the choice of slope and intercept that gives the best fit between the observed AMJ temperatures and those predicted from the latitudes. Here is how this works in mathematical notation. Let

SS(α, β) = Σ_{i=1}^{n} (Yi - (α + βXi))^2 .

Then the least squares estimates of α and β are the values that minimize this sum of squares. For simple models and small data sets, the estimates of α and β have a formula that can be evaluated using a calculator. However, our interest is in using MATLAB to do the calculations so that we can look at larger data sets and eventually fit more complicated curves.
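For reference, the calculator formula for a straight line is the standard one: the slope estimate is Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)^2 and the intercept estimate is Ȳ minus the slope estimate times X̄. The sketch below evaluates it in core MATLAB on a small made-up data set (hypothetical values, not the climate data) and cross-checks against the built-in polyfit.

```matlab
% Hypothetical data, made up for illustration (not the climate data).
x = [1 2 3 4 5]';
y = [2.1 3.9 6.2 7.8 10.1]';

xbar = mean(x);
ybar = mean(y);

% Closed-form least squares estimates for a straight line.
Sxy   = sum((x - xbar) .* (y - ybar));
Sxx   = sum((x - xbar) .^ 2);
b_hat = Sxy / Sxx;               % slope estimate
a_hat = ybar - b_hat * xbar;     % intercept estimate

% The sum of squares SS(alpha, beta) as a function of the two parameters.
SS     = @(a, b) sum((y - (a + b * x)) .^ 2);
ss_min = SS(a_hat, b_hat);       % value at the least squares estimates

% Cross-check: polyfit returns [slope intercept] for a degree-1 fit.
c = polyfit(x, y, 1);
```

Evaluating SS at any other pair, for example SS(a_hat + 0.1, b_hat), gives a value larger than ss_min, which is what minimizing the sum of squares means.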


The MATLAB Function lm

Recall from M-Lab 3 that the MATLAB function lm is activated by typing lm(data set). Here are the steps to use lm for the data set climate with a model which includes climate.jan as the response (Y) and climate.lat as an independent variable (X), or Yi = α + βXi + ei.

First, we type

lm(climate)

lm then prints a message and the MATLAB prompt changes to lm>> (here we suppress that notation since you don't actually type it). We define the model to fit using the command

model jan = lat

This command prints the results of the fit. The output is similar to that shown in Lab 3 but also has some additional lines with the parameter estimates along with their standard errors (estimated standard deviations). The fitted model is obtained by substituting the estimates of α and β from this output into the model equation. Here is the MATLAB command to plot the response (Y=jan) versus the regressor (X=lat) and add the least squares line:

pplot by lat lines=1

To plot response versus regressor adding the fitted (or predicted) values, type

pplot by lat

The fitted values are circles in the plot, and the crossed points are the observed values.
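Since lm is specific to these labs, it may help to see the same fit and fitted values computed with core MATLAB only. The sketch below is an assumption-laden stand-in, not a description of what lm does internally: it solves the least squares problem with the backslash operator on a design matrix, using a small made-up data set (hypothetical values, not the climate data).

```matlab
% Hypothetical data, made up for illustration (not the climate data).
x = [1 2 3 4 5]';
y = [2.1 3.9 6.2 7.8 10.1]';

% Design matrix: a column of ones for the intercept, then the regressor.
X = [ones(size(x)) x];

% Backslash returns the least squares solution of X*coef = y.
coef      = X \ y;
intercept = coef(1);
slope     = coef(2);

% Fitted (predicted) values at the observed x values.
fitted = X * coef;
```

The command plot(x,y,'x',x,fitted,'o') then mimics the display described above, with crosses for the observed values and circles for the fitted values.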

Scrutinizing the Residuals

The predicted values are actually Ŷk = α̂ + β̂Xk , 1 ≤ k ≤ n, where α̂ and β̂ are the least squares estimates. Another very important set of values are the residuals Yk - Ŷk. Note that the residuals plus the predicted values equal the original Y values. The residuals estimate the random component of the model, the ei in the model

Yi = α + βXi + ei , i = 1, 2, ... ,50

If the model actually fits the data well, the residuals should appear randomly distributed and not have any patterns. In this case the standard deviation of the residuals estimates σ. (Actually we divide by n-2 instead of n-1 in the standard deviation formula. This estimate of σ is called 'Standard Error' in the output above.)
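The relationships just described can be checked numerically. The following sketch uses core MATLAB functions (polyfit and polyval, as a stand-in for lm) on a small made-up data set; the data values are hypothetical and only illustrate the arithmetic.

```matlab
% Hypothetical data, made up for illustration (not the climate data).
x = [1 2 3 4 5]';
y = [2.1 3.9 6.2 7.8 10.1]';
n = numel(y);

c    = polyfit(x, y, 1);     % [slope intercept] of the least squares line
pred = polyval(c, x);        % predicted values
res  = y - pred;             % residuals

% Residuals plus predicted values reproduce the original Y values.
recon_err = max(abs((res + pred) - y));

% Estimate of sigma: divide the residual sum of squares by n-2, not n-1.
sse   = sum(res .^ 2);
s_hat = sqrt(sse / (n - 2));
```

For these made-up numbers the residuals are 0.06, -0.13, 0.18, -0.21 and 0.10, which sum to zero, and sse is 0.107.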

Usually we plot the residuals to look for

a) outliers (unusual points)

b) patterns which might indicate curvature

c) patterns which suggest that the variance of the errors ei changes with the x variable

If the residuals do have some pattern or unusual values, then the model is possibly incorrect. In the case of outliers, we might decide to delete them from the data set. If there seems to be curvature, then a straight line fit is not appropriate. Transforming the y value (like log(y)) sometimes helps get rid of the curvature.
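As an illustration of how a log transformation can remove curvature, the sketch below builds a made-up response that grows exactly exponentially in x (hypothetical, noise-free values) and compares the residuals from a straight-line fit to y with those from a straight-line fit to log(y). It uses core MATLAB functions rather than lm.

```matlab
% Hypothetical response that grows exponentially with x (no noise),
% made up purely to illustrate curvature in the residuals.
x = (0:9)';
y = 3 * exp(0.5 * x);

% Straight line fitted to the raw y values.
c_raw   = polyfit(x, y, 1);
res_raw = y - polyval(c_raw, x);

% Straight line fitted to log(y) instead.
c_log   = polyfit(x, log(y), 1);
res_log = log(y) - polyval(c_log, x);
```

Here res_raw is positive at both ends and negative in the middle, the kind of pattern that signals curvature, while res_log is essentially zero because log(y) is exactly linear in x for this constructed example. Real data will rarely be this clean.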


Note, though, that the residuals always have a mean of zero due to the least squares fitting process, and therefore we often draw a reference horizontal line at y=0. An example with the climate data is as follows. (If you are not still in lm, then type lm(climate) followed by model jan = lat.) Then type

rplot by lat

Note the two outliers in the upper right portion of the plot. We discussed those two cities in M-Lab 1. Another interesting residual plot is the residuals versus the predicted values:


rplot by predict

This plot is often used to see if the variation about the center line is greater for larger values of the predicted Y's. If so, then a transformation like log(y) is often suggested. There is no evidence of a need for a transformation in this example.

You can place a variable named resid that contains the residuals into the main workspace by typing (please do this)

output r=resid

Similarly you can place a variable named predict that contains the predicted values into the main workspace by typing (please do this)


output p=predict


To get out of lm, just hit return. For more information about lm, type help lm once you are out of lm.

Just for fun, copy over the following commands in one piece to see how you can use the resid and predict vectors outside lm.

  1. The data set ncsu contains the number of degrees awarded by North Carolina State University from 1894 to 1983 (ncsu.degree and ncsu.year). Fit a least squares line to predict the number of degrees as a linear function of the year. From your estimated equation find the estimate of the number of degrees at the years 1900 and 1980. Comment on the appropriateness of this model.

    (Hint: look at the actual degrees granted in those years versus the estimated number.)

  2. In place of the raw number of degrees, consider the natural logarithm of the number (use log(degree) within lm). Carefully describe the functional relationship between degrees and years based on this transformation. Fit a least squares line to the log degrees as a function of the year. Comment on any patterns evident in the residuals and try to explain why they might be there. Use your fitted line to predict the number of degrees that will be awarded in the year 2000. Comment on the validity of your prediction. Suggest a different strategy that might improve your prediction (hint: consider only a subset of the data).
  3. The data set earthq includes the dominant frequency and magnitude of 148 earthquakes (taken from Earthquake Engineering and Structural Dynamics, Vol. 23, p. 583-597, 1994). Fit a least squares line to explain the relationship of freq (the dependent variable Y) to mag (the independent variable X). Superimpose the fitted straight line over the scatter plot of the data (recall, pplot by mag lines=1). Then plot the residuals versus mag (rplot by mag). Do you detect any nonlinearity from either plot?
  4. How much energy do you save by insulating your house? The data set insulate, taken from A Handbook of Small Data Sets, is one person's record of weekly gas consumption (gas, in 1000 cubic feet) and outside temperature (temp, in degrees Celsius), before (insulation=0) and after (insulation=1) insulating a house. The house thermostat was set at 20 degrees Celsius during the 26 weeks before and 30 weeks after insulating. First make a plot of the data with

    z=insulate
    plot(z.temp,z.gas,'*')

    Since there appear to be two distinct groups, let's make two data sets depending on the value of z.insulation:

    z0=substr(insulate,'insulation=0')
    z1=substr(insulate,'insulation=1')

    Now let's plot them and put least squares lines through them using lsline.

    plot(z1.temp,z1.gas,'*')
    lsline
    hold on
    plot(z0.temp,z0.gas,'+')
    lsline
    hold off

    Find the least squares lines for z0 and z1 using lm and compare them. What conclusions can you draw about insulating a house?

  5. How does a child's vocabulary change with age? The data set vocab contains the average oral vocabulary size (words = number of words) for children at different ages (age) (taken from Discovering Psychology by Weiner, 1977). Plot the data and fit a straight line of words versus age. Overlay the fitted line and plot the residuals. What conclusions can you draw? Are there any unusual patterns or data points?
  6. One view of the relation of lung cancer to amount of cigarette smoking can be seen in the data set cancer taken from A Handbook of Small Data Sets by Hand, et al. (1994, p.67). The data consist of the "smoking ratio" (smoke, a standardized measure of smoking amount) and the standardized mortality ratio (SMR) for males in England and Wales in 1970-72 who were working in 25 different broad groups of jobs such as textile workers, miners, etc. Plot the data, fit a straight line, and plot the residuals. What conclusions can you draw?
  7. Temperature should have an effect on how fast one can run a race. The data set marathon lists daily temperature and the winning times at the New York marathon for both men and women for the years 1978-1998. Fit a straight line for the women's times marathon.wtime versus the temperature marathon.temp. Give the fitted equation and tell whether a straight line seems appropriate.
  8. Repeat the previous question using the men's times marathon.mtime in place of the women's times.
  9. In Spring 2000 a team measured ping times of Internet servers at various distances from Raleigh using a software program called NeoTrace. They actually measured ping times at 4 different times of the day, but since there was very little difference over time, we have averaged over the times of day and constructed the data set ping with variables dist and time. Use lm to find the least squares line of ping time versus distance. Within lm, make the two plots. Write down the least squares equation. Does either of the two plots indicate that a straight line is inappropriate or that there are outliers or unusual patterns?
  10. Is the US population getting older? The data set us_age contains the average age for all Americans (all), females (f), and males (m) for the years 1990-1999. Actually, the data from the US Census Bureau are based on the 1990 census and then updated yearly. First plot the data. (Recall that the dots ... allow one to continue to the next line in MATLAB. Also, you can put multiple plots on the same graph using one plot statement.) Find separate least squares fits of average age to year for females and males. State the meaning of the slopes in this situation (e.g., what happens to the mean age if you change the year from one value to another?). Is the average age of one group rising faster than that of the other? To actually test this hypothesis, look at the slope and p-value for a least squares fit of the age difference between females and males to year.
  11. How fast is the US population growing? The data set us_pop contains the US population in millions for the years 1900-1999. First plot the data and notice that the growth rate at the beginning of the century seems to be less than at the end. Then create two data sets

    z=us_pop
    z1=subset(z,'(obs#>70)')
    z2=subset(z,'(obs#<31)')

    Use lm to get least squares lines for the two data sets. Recall that after the model statement, pplot by year lines=1 will show whether a straight line is appropriate. Then explain whether the straight lines are appropriate in these cases and how the two growth rates compare. These data are from the US Census Bureau website.

  12. One of the problems at the end of M-Lab 2 used the data set draft to investigate the randomness of the 1970 Draft Lottery. In a followup study, Sommers (2003, Chance Magazine) looked up deaths by age and birthday on the Vietnam Veterans Memorial (apparently available online at thewall-usa.com). Thus, the data set has deaths by month as well as average lottery number by month for men with birthdays in the years 1944-1950 (those eligible in the 1970 draft). First, plot the deaths by average monthly lottery number, plot(draft.lottnum,draft.deaths,'*') (for fun, add the month names with text(draft.lottnum,draft.deaths,draft.month)). Then use lm to fit a straight line of deaths versus average monthly lottery number. Which month has the largest residual from the straight line fit? It appears that getting a low lottery number was not just unlucky but translated into more deaths, as might be expected since relatively more men were drafted from the low lottery number months.
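Exercise 10 asks about the slope and its p-value. As a rough guide to where such numbers come from, the sketch below computes the standard error of the slope, the t statistic, and a two-sided p-value with core MATLAB functions only. It assumes the usual t-test for a slope with n-2 degrees of freedom; the source does not say how lm computes its p-value, so treat this as an illustration rather than a description of lm. The data are made up (hypothetical values).

```matlab
% Hypothetical data, made up for illustration.
x = [1 2 3 4 5]';
y = [2.1 3.9 6.2 7.8 10.1]';
n = numel(y);

c   = polyfit(x, y, 1);                  % [slope intercept]
res = y - polyval(c, x);                 % residuals
s   = sqrt(sum(res .^ 2) / (n - 2));     % estimate of sigma

% Standard error of the slope and the t statistic for "slope = 0".
Sxx    = sum((x - mean(x)) .^ 2);
se_b   = s / sqrt(Sxx);
t_stat = c(1) / se_b;

% Two-sided p-value from the t distribution with n-2 degrees of freedom,
% written with the incomplete beta function so no toolbox is needed.
df = n - 2;
p  = betainc(df / (df + t_stat ^ 2), df / 2, 0.5);
```

A small p-value suggests the slope differs from zero; in Exercise 10 the response would be the female minus male age difference and x would be the year.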