Stata Fundamentals 2: Do-files and computing variables

More information on how the session is run

What to do when getting stuck:

Ask the trainer if you struggle to find a solution.
Use the help command. To get help with a specific command type help "command name"
Search online. The statalist.org forum is usually the most useful resource.

In this practical session, you will learn to:

use do-files to create analysis scripts
create frequency tables
compute new variables
save datasets

Access to Datasets

Please use this link to get access to the datasets that you will be using in the exercises

With do-files you can keep a record of your commands in a text file, which makes it easy to re-run commands for pre-processing, plotting and analysing your data. It also provides a means to keep a record of your analysis and to share it with colleagues or to attach it to your dissertation or make it available with your publications.

Just take the first section in which you changed the working directory and loaded a file into Stata as an example. By using a do-file you won’t have to re-type these commands in order to load the dataset the next time you want to work on your analysis.

Create a new do file

To create a new do-file use the shortcut Ctrl+9 (Windows) or Cmd+9 (Mac). A new do-file editor with an empty do-file script will open. To create additional do-files use the Ctrl+N (Windows) or Cmd+N (Mac) shortcut.

Adding comments

Comments, which are always in green, will be ignored by Stata. That means, Stata will not try to execute it as a command. You can use comments to explain what your script does.

In the example do-file below, we use three differnt types of comments:

double forward slash //
asterisk *
forward slash-asterisk combination /**/

In line 1 the double forward slash is being used to turn the first line into a comment. The // can be used for the title of the do-file or a section title within your do-file.

The other character that can be used to create comments is the asterisk. In line 3-5, the astrisk is used to comment out each line of that three-line paragraph about the dataset.

The forward slash-asterisk combination /**/ can be used within a line to add a comment after a command (see line 7).

Running a do-file

You can use the Ctrl+D (Windows) Cmd+D (Mac) command to run the whole file. To only run a single line or a selection of lines, highlight the lines and then use the Ctrl+D (Windows) or Cmd+D (Mac) shortcut.

Exercise: Do-files

Create a new do-file to reproduce the commands from the previous session on Inspecting a dataset. See the commands below.
Add a comment in the first line to describe what your do-file does.
Copy the commands from the previous exercise into your do-file.
Run the whole do-file.
Run only the first three commands.

Creating frequency tables

With the tabulate command, you can create one- and two-way frequency tables. We’re going to start with a simple frequency table for the hiqual variable. The tabulate command can be abbreviated as tab. In this example, we are again working with the lfs2017 dataset.

. use lfs2017, clear

. tab hiqual

         Highest qualification |
           (detailed grouping) |      Freq.     Percent        Cum.
-------------------------------+-----------------------------------
                     No answer |         11        0.11        0.11
          Degree or equivalent |      3,446       34.13       34.24
              Higher education |        982        9.73       43.97
     GCE A level or equivalent |      2,312       22.90       66.87
GCSE grades A*-C or equivalent |      2,037       20.18       87.04
           Other qualification |        728        7.21       94.26
              No qualification |        503        4.98       99.24
                    Don't know |         77        0.76      100.00
-------------------------------+-----------------------------------
                         Total |     10,096      100.00

Exercise: Creating frequency tables

There is growing evidence that breastfeeding improves cognitive development of infants. The MCS_testscore dataset contains data on the Foundation Stage Profile scores, which is a teacher assessment of children’s development at the end of reception year, and amongst many other data also data on the months a child has been breastfed. We can use a frequency table to explore whether we can find evidence for a relationship between the length of breastfeeding and the foundation stage profile scores on the descriptive level in this dataset.

Open the MCS_testscore dataset, which is in the files folder.
The variables we are interested in are fsp (quartiles for the foundation stage profile score) and mbreastt (lenght of breastfeeding in months). Create a two-way frequency table that lists the months of breastfeeding in the rows and the fsp score quartiles in the columns.
Is the data in line with the notion that breastfeeding improves cognitive development?

Computing variables

Computing a new variable

With the generate command we can compute new variables in our dataset. For instance, we can calculate a new variable for the annual income based on the weekly income which is listed in the grosswk variable.

We can abbreviate the generate command with gen. After gen we specify the name of the new variable for the yearly income as grossyr. After the = sign, we write an expression to calculate the values of the new variable. We multiply the values from the grosswk variable, the weekly income, by 52 to calculate the approximate annual income and assign the calculated values to a new variable called grossyr.

. use lfs2017, clear

. gen grossyr = grosswk*52

We can now use the summarize command to compare the mean and standard deviation of the grosswk and grossyr variables.

. sum grossyr grosswk

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     grossyr |     10,096     27131.9    27189.54        260    1199952
     grosswk |     10,096    521.7672    522.8757          5      23076

Replacing a variable

The replace command enables you to overwrite an existing variable in your dataset. As an illustration, let us replace the values of the newly created grossyr variable, using the same expression we used to create the grossyr variable.

. replace grossyr = grosswk*52
(0 real changes made)

Since we used the same formula, the values in grossyr haven’t effectively changed.

. sum grossyr grosswk

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     grossyr |     10,096     27131.9    27189.54        260    1199952
     grosswk |     10,096    521.7672    522.8757          5      23076

Exercise: Computing variables

The auto dataset lists the turn circle, which is the radius of the smallest circular turn a car can make. To determine in how far the turn circle depends on a car’s length, we want to calculate the ratio of the turn circle and the length.

Open the auto dataset using the **webuse** command
Convert the length variable, which lists the length of a car in inches, to feet (12 inch = 1 ft). Overwrite the length variable.
Calculate a new variable TLrat as the ratio of the turn circle and the length in feet.
Print the mean and standard deviation of the new variable.

Saving data

Saving as new file and replacing files

We have now added a new variable to our dataset, the grossyr variable. With the save command we can save changes we made to our dataset. If you want to save the dataset under a new name, you have to specify the new file name following save. With the added replace option, we instruct Stata to overwrite an existing version of the data.

use lfs2017, clear

save lfs2017b, replace

Exercise: Saving data

Save the lfs2017 dataset under the new name lfs2017_raw.
The grosswk variable lists the weekly income in Pound Sterling. Create a new variable for the weekly income in USD. The conversion is 1 GBP = 1.24 USD.
Save the dataset with the newly calculated variable under lfs2017.

Final task: Please give us your feedback!

Upon completing the survey, you will receive the link to the solution file, to check how your commands compares to the sample solution.

In order to adapt our training to your needs and provide the most valuable learning experience for you, we depend on your feedack.

We would be grateful if you could take 1 min before the end of the workshop to get your feedback!

Click here to open the survey!