Stata Fundamentals 3: Linear Regression

More information on how the session is run

How to work together:

Please turn on your microphone and webcam.
One shares the screen and the other requests remote control.
Take turns on who types for each exercise.

What to do when getting stuck:

Ask the trainer if you struggle to find a solution.
Use the help command. To get help with a specific command, type help followed by the command name, for example help regress.
Search online. The statalist.org forum is usually the most useful resource.

In this practical session, you will learn to:

Use familiar Stata commands to explore a dataset prior to running a regression analysis.
Create and customise scatter plots.
Run a simple linear regression and read the analysis output.
Create scatter plots for a set of variables using a scatter matrix.
Run a multiple linear regression and read the analysis output.
Create a correlation matrix.

Linear Regression

Stata allows you to easily perform and visualise simple linear regression.

For this example, we will use the sample dataset nlsw88.dta, which is built into Stata. We can access this dataset with the sysuse command. Add the clear option if a dataset is already loaded in Stata.

. sysuse nlsw88.dta, clear
(NLSW, 1988 extract)

Let us start by inspecting the variables in the dataset. We use the describe command to get an overview of all variables in the dataset and what they represent.

. describe

Contains data from C:\Program Files (x86)\Stata15\ado\base/n/nlsw88.dta
  obs:         2,246                          NLSW, 1988 extract
 vars:            17                          1 May 2016 22:52
 size:        60,642                          (_dta has notes)
--------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
idcode          int     %8.0g                 NLS id
age             byte    %8.0g                 age in current year
race            byte    %8.0g      racelbl    race
married         byte    %8.0g      marlbl     married
never_married   byte    %8.0g                 never married
grade           byte    %8.0g                 current grade completed
collgrad        byte    %16.0g     gradlbl    college graduate
south           byte    %8.0g                 lives in south
smsa            byte    %9.0g      smsalbl    lives in SMSA
c_city          byte    %8.0g                 lives in central city
industry        byte    %23.0g     indlbl     industry
occupation      byte    %22.0g     occlbl     occupation
union           byte    %8.0g      unionlbl   union worker
wage            float   %9.0g                 hourly wage
hours           byte    %8.0g                 usual hours worked
ttl_exp         float   %9.0g                 total work experience
tenure          float   %9.0g                 job tenure (years)
--------------------------------------------------------------------------------
Sorted by: idcode

Although it is possible to include categorical variables in a linear regression model, in this tutorial, we will only consider numerical non-categorical variables.

Examples of categorical variables are sex, race or occupation. These have a limited set of values and there is no inherent order among the different values.

Examples of numerical non-categorical variables are age, wage, hours. These variables might have an unlimited set of values or they can be ordered in a pre-defined way.
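As a rough illustration of this distinction, the sketch below explores one categorical and three numerical variables from nlsw88.dta. The choice of tabulate, label list and summarize is ours and is only one of several ways to inspect these variables.

```stata
* Load the example data (same dataset as in the text)
sysuse nlsw88, clear

* Categorical variable: a small set of labelled codes with no inherent order
tabulate race

* Show the numeric codes behind the value label attached to race
label list racelbl

* Numerical variables: summary statistics such as the mean are meaningful
summarize age wage hours
```

For a categorical variable such as race, a frequency table is informative, whereas a mean of the underlying codes would not be.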

To have a look at the whole dataset and to narrow down potential variables for our analysis, we use the browse command.

. browse

We are interested in the two variables wage and tenure. Computing simple summary statistics with the summarize command (abbreviated to sum) can help you to better understand the distribution of the variables.

. sum tenure wage

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      tenure |      2,231     5.97785    5.510331          0   25.91667
        wage |      2,246    7.766949    5.755523   1.004952   40.74659
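Note that tenure has fewer observations (2,231) than wage (2,246). A minimal sketch of how one might look more closely at both variables, assuming the same dataset is loaded:

```stata
* Load the example data (same dataset as in the text)
sysuse nlsw88, clear

* Detailed summary: adds percentiles, skewness and kurtosis
summarize wage, detail

* tenure has fewer observations than wage; count its missing values
count if missing(tenure)

* Quick look at the shape of the wage distribution
histogram wage
```

The count should correspond to the difference between the two Obs values in the table above, that is 15 observations with missing tenure. These observations are dropped automatically from any regression that includes tenure.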

Scatter Plots

Stata has a very easy-to-use graphical interface to create all plots. However, it is useful to get used to using code instead of the graphical interface as early as possible to be able to work your way up to more sophisticated graphs.

Here’s how you make a very simple scatter plot in Stata using the scatter command. Remember that the first variable is always displayed on the Y axis and the second variable is displayed on the X axis.

. scatter wage tenure
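The scatter command accepts options after a comma to change the look of the plot. The following sketch shows a few common marker and title options. The colours, symbol and titles are arbitrary choices for illustration.

```stata
* Load the example data (same dataset as in the text)
sysuse nlsw88, clear

* Small hollow navy circles, plus a title and axis titles.
* The /// continues a command on the next line (do-files only).
scatter wage tenure, mcolor(navy) msymbol(Oh) msize(small) ///
    title("Hourly wage by job tenure")                     ///
    xtitle("Job tenure (years)") ytitle("Hourly wage")
```

Type help marker_options to see the available marker settings.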

Exercise: Scatter plot

Open the auto dataset, which, like nlsw88.dta, is one of Stata's built-in datasets.
Explore your dataset using the different commands that you have learned so far. In your opinion, on which variables could you perform a linear regression?
Make a scatter plot with the miles per gallon of the cars (mpg) and their weight (weight). What is the relationship between weight and mpg?
Customise your plot to show the markers in a different colour and symbol.
Bonus: Can you show a label with the model of the car in your scatter plot?

Running a linear regression

Now that we have visualised the relationship between tenure and wage on a scatter plot, it is time to find out if there is an actual linear relationship between the two variables. We can find out by using the regress command, which will fit a model to estimate the relationship between the wage and tenure variables.

Remember to type the dependent variable first and the independent variable second.

. regress wage tenure

      Source |       SS           df       MS      Number of obs   =     2,231
-------------+----------------------------------   F(1, 2229)      =     72.66
       Model |  2339.38077         1  2339.38077   Prob > F        =    0.0000
    Residual |  71762.4469     2,229  32.1949066   R-squared       =    0.0316
-------------+----------------------------------   Adj R-squared   =    0.0311
       Total |  74101.8276     2,230  33.2295191   Root MSE        =    5.6741

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      tenure |   .1858747   .0218054     8.52   0.000     .1431138    .2286357
       _cons |   6.681316   .1772615    37.69   0.000     6.333702    7.028931
------------------------------------------------------------------------------

As a result, Stata gives us a range of useful statistics. Perhaps the most interesting is the first line of the lower table (tenure), in which we see that there is a regression coefficient of 0.1858747, alongside a standard error of 0.0218054.

We can interpret this as follows: each additional year of job tenure is associated with an increase of 0.1858747 in hourly wage. The standard error gives us an indication of how certain this estimate is. The lower the standard error, the more certain we can be that the regression coefficient is a good estimate of the relationship between the two variables.

The p-value (P>|t|) shows that the relationship is statistically significant, as it is below the critical level of 0.05. The _cons line shows the estimates for the intercept. These are usually not very interesting for the interpretation of the data. The intercept is the predicted wage for a tenure of 0 years, which is 6.681316.
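The numbers in the regression table can also be retrieved after estimation, which makes it easier to see how they relate to each other. The sketch below uses Stata's stored results _b[], _se[] and e(r2). The tenure value of 10 years is an arbitrary example.

```stata
* Load the example data and re-run the regression from the text
sysuse nlsw88, clear
regress wage tenure

* Coefficients are stored after estimation
display "Slope (tenure):    " _b[tenure]
display "Intercept (_cons): " _b[_cons]

* The t statistic is the coefficient divided by its standard error
display "t for tenure:      " _b[tenure] / _se[tenure]

* Predicted hourly wage for 10 years of tenure: intercept + slope * 10
display "Predicted wage at tenure = 10: " _b[_cons] + _b[tenure] * 10

* Share of the variance in wage explained by the model (R-squared)
display "R-squared:         " e(r2)
```

With the coefficients reported above, the predicted hourly wage at 10 years of tenure is 6.681316 + 0.1858747 * 10, or about 8.54. The R-squared of 0.0316 means that tenure alone explains only about 3% of the variation in wage, even though the coefficient is statistically significant.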

Scatter plot with fitted line

Based on the model that Stata has calculated, we can now create a new variable with predicted values for wage. We can use the predict command to generate this new variable with predicted values. We create a new variable called pred_wage by typing:

. predict pred_wage, xb
(15 missing values generated)
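The message about 15 missing values appears because 15 observations have no value for tenure, so no prediction can be made for them. As a small sketch, predict can also generate residuals, that is the difference between the observed and the predicted wage. The variable name resid_wage is our own choice.

```stata
* Load the example data and re-run the regression from the text
sysuse nlsw88, clear
regress wage tenure

* Fitted values (xb) and residuals (observed minus fitted wage)
predict pred_wage, xb
predict resid_wage, residuals

* Compare observed, fitted and residual values for the first five observations
list wage tenure pred_wage resid_wage in 1/5

* Residuals of a regression with a constant average to zero
summarize resid_wage
```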

Using the fitted values from pred_wage, we can now add a fitted line to our scatter plot.

With the twoway command, we can combine a scatter and a line plot in a single chart. We first plot the scatter and then the line on top. When using twoway, the scatter and line commands each have to be put in their own parentheses.

. twoway (scatter wage tenure, msize(0.5)) (line pred_wage tenure)
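As an alternative sketch, twoway also offers the lfit and lfitci plot types, which fit the linear regression and draw the line without a separate predict step. This is a shortcut and not the approach used in the text above.

```stata
* Load the example data (same dataset as in the text)
sysuse nlsw88, clear

* lfit regresses wage on tenure and draws the fitted line itself
twoway (scatter wage tenure, msize(0.5)) (lfit wage tenure)

* lfitci additionally draws a 95% confidence band around the fitted line
twoway (lfitci wage tenure) (scatter wage tenure, msize(0.5))
```

The predict approach remains useful when you need the fitted values as a variable, for example to inspect or export them.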

Exercise: Simple linear regression

Perform a regression analysis with the variables mpg and weight.
Fit a regression line on your scatter plot.

Multiple Linear Regression

In the next section, we look at an example of multiple linear regression. Here we have multiple predictors, or independent variables. For this, we will be using a life expectancy dataset from the Stata website:

. use http://www.stata-press.com/data/r13/lifeexp, clear
(Life expectancy, 1998)

Scatter Matrix

First, we can do a preliminary screening of the relationships by creating a scatter matrix using the graph matrix command, passing all variables that we would like to explore.

. graph matrix popgrowth lexp gnppc safewater

We can also group our scatter matrix by another variable using the by() option:

. graph matrix popgrowth lexp gnppc safewater, by(region)

Or we can only show half of the graph:

. graph matrix popgrowth lexp gnppc safewater, half

Running a multiple linear regression

Similar to the previous example, we can use the regress command for multiple variables, with the dependent variable going first.

. regress popgrowth lexp gnppc safewater

      Source |       SS           df       MS      Number of obs   =        37
-------------+----------------------------------   F(3, 33)        =      4.28
       Model |  8.66091026         3  2.88697009   Prob > F        =    0.0117
    Residual |  22.2693603        33  .674829098   R-squared       =    0.2800
-------------+----------------------------------   Adj R-squared   =    0.2146
       Total |  30.9302705        36  .859174181   Root MSE        =    .82148

------------------------------------------------------------------------------
   popgrowth |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        lexp |  -.0611253   .0480779    -1.27   0.212    -.1589404    .0366899
       gnppc |   -.000031   .0000196    -1.58   0.124     -.000071    8.96e-06
   safewater |   .0066364    .014125     0.47   0.642    -.0221012     .035374
       _cons |   5.477221   2.803195     1.95   0.059     -.225923    11.18036
------------------------------------------------------------------------------

The resulting table looks very similar to the results of the simple linear regression and you can check the relevant statistics for each variable.

This analysis yields an interesting pattern. The upper right table gives us statistics on the model as a whole. The R-squared value indicates how well the model predicts the dependent variable: here, the predictors explain about 28% of the variance in popgrowth. The p-value (Prob > F) is below the critical value of 0.05, which means that there is sufficient evidence for us to conclude that the predictors (lexp, gnppc, safewater) together predict the dependent variable popgrowth.

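At the same time, we can see in the lower table that none of the regression coefficients is statistically significant. A significant overall model combined with non-significant individual coefficients often indicates that the predictors are highly correlated with one another (multicollinearity).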
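One possible way to examine this further, sketched here as a suggestion and not as part of the original session, is to compute variance inflation factors with estat vif and to test the three coefficients jointly with test.

```stata
* Load the life expectancy data and re-run the regression from the text
use http://www.stata-press.com/data/r13/lifeexp, clear
regress popgrowth lexp gnppc safewater

* Variance inflation factors: how much each standard error is inflated
* by correlation with the other predictors
estat vif

* Joint test that all three coefficients are zero
* (reproduces the model F test in the upper right table)
test lexp gnppc safewater
```

A common rule of thumb treats variance inflation factors above about 10 as a sign of problematic multicollinearity, although smaller values can already widen the confidence intervals noticeably in a sample of only 37 observations.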

Correlation matrix

We can use the pwcorr command to get a correlation matrix for all variables in our analysis.

. pwcorr popgrowth lexp gnppc safewater

             | popgro~h     lexp    gnppc safewa~r
-------------+------------------------------------
   popgrowth |   1.0000 
        lexp |  -0.4360   1.0000 
       gnppc |  -0.3580   0.7182   1.0000 
   safewater |  -0.4280   0.8297   0.7063   1.0000
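The matrix shows that the predictors are strongly correlated with one another, for example 0.8297 between lexp and safewater. As a hedged sketch, the obs and sig options of pwcorr add the number of observations and the p-value for each pair, and correlate restricts the calculation to observations that are complete on all listed variables.

```stata
* Load the life expectancy data (same dataset as in the text)
use http://www.stata-press.com/data/r13/lifeexp, clear

* Pairwise correlations with the number of observations and p-values
pwcorr popgrowth lexp gnppc safewater, obs sig

* Correlations using only observations complete on all four variables,
* which is the same sample that regress uses
correlate popgrowth lexp gnppc safewater
```

Because pwcorr uses all available observations for each pair, its values can differ from those computed on the 37 complete observations used in the regression.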

Exercise: Multiple linear regression

Now, try to perform a multiple linear regression by yourself. We use the auto dataset again:

. sysuse auto.dta, clear
(1978 Automobile Data)

Make a scatter matrix of the variables mpg, price, weight and length. What do you think is the most likely relationship between the variables?
Can you show two different plots for foreign and domestic cars?
Perform a regression analysis with the previously mentioned variables. What do the results mean?

Final task: Please give us your feedback!

Upon completing the survey, you will receive the link to the solution file, so that you can check how your commands compare to the sample solution.

In order to adapt our training to your needs and provide the most valuable learning experience for you, we depend on your feedback.

We would be grateful if you could take 1 min before the end of the workshop to give us your feedback!

Click here to open the survey!