3 Basic Descriptive Statistics
When you start performing your own sociological analysis, you’ll use own dataset that you imported in the last chapter. For demonstration purposes, we’ll use an example dataset that comes along with the survey
package. Feel free to follow along - all the code below will also work on your machine as long the survey
package is loaded in your R session. To load the package and the example datasets into your R workspace, run:
You should now have several datasets loaded into you RStudio workspace. These contain data on Academic Performance Index (API), a measure of student performance for California schools for 1999 and 2000. Specifically, we’ll be working with the apistrat
dataset in RStudio. Learn more about this example dataset and its variables in the survey package documentation (p 5-7).
3.1 Calculating Means
If you skimmed through the resource materials we provided at the start of this guide, you probably already know that R makes it extremely easy for you to calculate the mean - or average - of a variable. Let’s say we want the average of the api00
, the API (Academic Performance Index) in 2000. We could run the code below which tells R to compute the mean of the api00
variable in the apipop
dataset.
Sidenote: Another way to calculate the mean is to run the code below:
Sidenote: both of these chunks of code give you the same output. R is very flexible in its use, so there will often be several ways to achieve the same output. This might feel overwhelming in the beginning. For now, you don’t have to keep track of everything or learn every single method. Stick to learning just one method for the tasks you want to accomplish. As you become more comfortable with R syntax, you’ll be able to understand other methods and adapt to different syntax more easily.
Great! We now know that the average API in 2000 was 664.71! When it comes to survey data, however, it is rarely accurate to just calculate the simple mean of your data. Survey data is often based on a sample of the population, which means that there are usually weights to each response (or row) in the dataset. In our case, the apistrat
dataset provides probability weights in the pw
column.
3.2 Adding Survey Weights
The first step in our survey analysis process is to add weights to the data frame. For this, we’re relying on the survey
package described in the previous chapter. The svydesign
function from this package allows us to create a new dataset with the added weights.
Hopefully, this code is straightforward enough for you to decipher. If you’re unsure about any arguments (especially ids
), take a look at the documentation for this package to help yourself figure them out. You can also run ?svydesign
in RStudio to view the help page for the function. Reading function documentation to pull out the information you need - no matter how complex it may all seem - is an important coding skill! Honing it early will be helpful, so make it a habit to skim through help pages for commonly-used functions.
Note that we add the ~
symbol (called a tilde) when specifying weights because that’s the syntax followed by the survey package. This means that this package generally requires the tilde when specifying variable names or R will return an error.
3.3 Using the Survey
Package
Great, now we’re ready to do some real (weighted) analysis!
3.3.1 Weighted Means
The survey package provides specific functions to compute descriptive statistics for weighted, survey data. We’ll use the svymean
function to compute some means.
The svymean
function takes two required arguments: (1) the variable for which to compute the mean, and (2) the weighted dataset. Let’s try.
This returns the mean of the API in 2000 for California schools.
If the dataset you’re working with has missing values, R may return an error or NAs. You can fix this using the additional argument na.rm
for all functions in this package.
This ensures that missing values are omitted while computing the means. If your data has missing values, discuss with Dr. King to decide the best way to deal with them.
Note that the dataset data.w
we provide in this command is the modified, weighted data, not the original unweighted data. You can also leave out the argument names, a.k.a design =
, and simply type data.w after the comma, but it is a good idea to practice including argument names in the beginning.
3.3.2 Working with several variables
When conducting your research, it’s likely that your research question will concern several variables. Taking the example of the API dataset, we might be interested in changes in the Academic Performance Index from 1999 to 2000. To calculate the mean API for both years, we could run:
While this gives us the information we wanted, it can sometimes be annoying to type out similar commands repeatedly. Fortunately, the survey
packages allows us to calculate the means of several variable simultaneously. Simply separate your variables of interest using a plus (+
) sign.
3.3.3 Quantiles
Next, let’s try to compute quantiles and medians for this data using the svyquantile
function. This function takes an extra required argument called “quantiles” to tell R the percentiles you want to compute. In this example, let’s first try computing the median (or 50th percentile) of the API in 2000.
Read the output carefully. It provides us several pieces of key information: quantile, ci.2.5, ci.97.5, and se. Try to read the svyquantile function’s documentation by running ?svyquantile
to understand what these data mean.
You can also compute several quantiles simultaneously. The following line of code computes the 25, 50, and 75th percentiles.
Like in svymean, you can compute the quantiles (and other survey statistics) for more than one variable using the plus sign.
Great work! On your own, try calculating every 10th percentile of the API in the years 1999 and 2000.
3.3.4 Variance
Next, let’s try computing variance. We’ll use the svyvar
function that relies on the same structure as the functions above.
3.3.5 Totals
This next function, svytotal
also uses a similar format to estimate population totals for a variable. Let’s try to figure out the number of students enrolled in each level (E, M, H) of California schools.
## total SE
## stypeE 4421 313.40
## stypeH 755 92.70
## stypeM 1018 124.99
Notice how the unweighted data only has a total of 200 schools, yet our weighted data returns thousands of schools. This is because of our survey weights.
3.3.6 Statistics for subsets of data
For your research question, you are most likely most interested in the relationship between two or more variables. In this data, we might be interested in exploring the API across various school types, i.e., elementary, middle, and high schools. We can do such computations using the `svyby` function.
Let’s estimate the mean API in 2000 across the different levels/types of schools (E. M, H).
## stype api00 se
## E E 674.43 12.49343
## H H 625.82 15.34078
## M M 636.60 16.50239
Here, we are asking R to compute the mean API in 2000 by the type of school. You can calculate different statistics by providing different functions to the FUN
argument. Try calculating the variance of API in 2000 by school type. (Hint: you would replace something in the code above with svyvar
.)
Recall that you can calculate the mean of several variables across several variables using the + sign. The code below returns the API means for 1999 and 2000 categorized by the school level and year-round status.
## stype yr.rnd api99 api00 se.api99 se.api00
## E.No E No 659.5976 695.4024 14.15177 13.23504
## H.No H No 620.0816 628.8571 15.76721 15.34873
## M.No M No 614.5625 641.2708 16.91108 16.68175
## E.Yes E Yes 527.7778 578.8889 22.63897 23.67867
## H.Yes H Yes 484.0000 477.0000 0.00000 0.00000
## M.Yes M Yes 505.5000 524.5000 49.26724 57.77382
The output also reports the standard errors. You can ask to report confidence intervals or other measures of variability by adding the argument vartype
as below:
## stype yr.rnd api99 api00 ci_l.api99 ci_l.api00 ci_u.api99
## E.No E No 659.5976 695.4024 631.8606 669.4622 687.3345
## H.No H No 620.0816 628.8571 589.1785 598.7742 650.9848
## M.No M No 614.5625 641.2708 581.4174 608.5752 647.7076
## E.Yes E Yes 527.7778 578.8889 483.4062 532.4796 572.1493
## H.Yes H Yes 484.0000 477.0000 484.0000 477.0000 484.0000
## M.Yes M Yes 505.5000 524.5000 408.9380 411.2654 602.0620
## ci_u.api00
## E.No 721.3426
## H.No 658.9401
## M.No 673.9665
## E.Yes 625.2982
## H.Yes 477.0000
## M.Yes 637.7346
3.3.7 Saving outputs
While conducting your research and analysis, it’s likely that you’ll use the values you estimated above several times (e.g., to compare with other means, to create figures and charts, to export in a spreadsheet, etc.) Instead of computing these values every time you want to conduct analyses, you can save them as an ‘object’ in R. Do this using the <-
symbol. Use an object name/abbreviation that you can be consistent with and recall easily.
Now, whenever you want to take a look at this table, you can view the means
table and use it to run other analyses. If other functions ever give you trouble when running analyses on the means
object due to its “object type” being “svyby” (you can see this in the right-hand Environment pane in RStudio), you can also convert it to the data frame format by running means <- as.data.frame(means).