Intro to Markdown using Iris
2022-08-25
This is a beginner-level tutorial for basic data analysis in R. We will examine a commonly used dataset called iris, which contains 150 observations of iris flowers. The flowers are subcategorized into three separate species, which we will ultimately try to predict based on the other measurements of the flowers provided in the dataset. The library datasets, which ships with base R, gives us seamless access to a plethora of datasets at varying levels of complexity.
library(datasets)
data(iris)
dim(iris) #you can view the dimensions, which are also shown in the top-right pane
## [1] 150 5
summary(iris) #provides a brief summary of each variable
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
names(iris) #this allows you to view the column names easily
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
str(iris) #this provides a quicker alternative to looking at the data
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We can use head() as well, which is probably the most common way to take a quick look at the "head" of a dataset. If we had put it in the chunk above, though, it wouldn't give the cool output inside your R Markdown document like this will.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Let’s try plotting.
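The chunk for this step isn't reproduced above; presumably it was just the naive one-liner below. (Calling plot on a whole data frame dispatches to a pairwise scatterplot matrix of every variable against every other.)

```r
# Naive attempt: plot the entire data frame at once
# R responds with a scatterplot matrix of all five variables
plot(iris)
```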
Not very useful! It's important that we get more specific when we tell R to plot something. R is a wonderful tool to master, but it doesn't come with the intuitive training wheels that some other programs have. If we just say "plot this dataset", it's going to give us a hodgepodge of plots rather than a thoughtful look at any one relationship in the data. We have to tell it exactly what we want it to do.
Now, I don’t know much about flowers, but I would assume that petal length and width are correlated.
plot(iris$Petal.Width, iris$Petal.Length)
We can clearly see a positive correlation between the two.
Let's see how our variables are distributed. You may note that when you knit a chunk that creates multiple figures, each line of code is printed directly above the figure it creates.
hist(iris$Petal.Width)
hist(iris$Petal.Length)
hist(iris$Sepal.Width)
hist(iris$Sepal.Length)
Let’s try some other ways of visualizing our data. There are three different species in this dataset: “setosa”, “versicolor”, and “virginica”. As mentioned, I know nothing about flowers, but maybe we can try to determine some defining features of these species.
If you're one of my students from ECON482, you may remember what I said about data mining. A simple dataset like this1 is very easy to data mine, but let's think of a theory and a research question.
Here's my theory: "petal and sepal measurements vary by species". Knowing nothing about flowers, I am still sure there must be something to this theory.
Can I really come up with a research question in this case? There aren’t many moving parts to analyze… it’s not as if we’re going to treat the flowers with different types of fertilizer to test their effectiveness.
We’ll come back later to see if we can formulate a research question, but now I at least have a theory of how the world might work when it comes to flower classification. Some exploratory analysis could help us understand these flowers a bit better.
boxplot(Petal.Length ~ Species, data=iris)
boxplot(Petal.Width ~ Species, data=iris)
boxplot(Sepal.Length ~ Species, data=iris)
boxplot(Sepal.Width~ Species, data=iris)
One nice thing about markdown is that, when we create several figures at once, we can click back and forth between them. Let’s go one-by-one. It is probably helpful to know how to read a boxplot, which you can see here.
Petal Length
- Setosa’s distribution is very low, between 1 and 2 centimeters.
- Versicolor’s petals are clearly much longer.
- Virginica seems to have the longest petals. The smallest virginicas (in this dataset) are around the median for versicolor.
Why did I say “in this dataset”? It’s important to note that we only have a sample of 50 of each, and these boxplots are not confidence intervals. Later on in the tutorial, we’ll build some confidence intervals to determine whether we can actually conclude that versicolor and virginica have different petal lengths.
Petal Width
- Basically the same story. Not surprising, considering the strong correlation between length and width.
As we move into sepals, I am reminded once again that I know nothing about flowers. What is a sepal? Well, in the interest of not data mining, maybe we should inform our theories about the world a bit more by learning about flowers. Unlike our computers, which can only see numbers, we can form a more holistic, well-rounded understanding of these numbers if we know something about the data generating process behind them. What is the DGP of these numbers? Well, it has something to do with the anatomy of an iris.
A flower!
This seems congruent with what we've found so far. For irises, sepals are clearly much larger than petals (apparently this is not the case for some other flower types). This visualization can help us in a number of ways; understanding a bit more about the data-generating process behind our data helps us spot errors and develop better theories of causality. If we found an iris in our dataset where a petal was wider than a sepal, that would be pretty odd based on this picture!
iris$Sepal.Width < iris$Petal.Width
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE
Wow! How about that, not a single one. All 150 irises have wider sepals than petals.
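Rather than eyeballing 150 logicals, we can collapse the same check into a single value with any() or sum(); this is a small addition to the chunk above, not part of the original tutorial.

```r
# One-line versions of the sepal-vs-petal width check
any(iris$Sepal.Width < iris$Petal.Width)  # FALSE: no iris violates the pattern
sum(iris$Sepal.Width < iris$Petal.Width)  # 0 violations out of 150
```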
Let’s get back to work.
Sepal Length
- The same trend from the petals is here. Virginica > Versicolor > Setosa.
- That being said, it’s a tighter distribution! There is clearly more overlap in sepal length.
Sepal Width
- Setosa wins out in this category! Comparing the boxplot to the picture, this makes sense to me.
So we've done some exploratory data analysis; combined with our more holistic understanding of what an iris looks like, we are now experts in iris dimensions. We have a problem, though…
What if these 150 irises are special and unlike the millions of other irises out in the wild?
Don't panic; it's not likely. That being said, everything we do in statistics is based on some level of confidence with which we can make claims, given a sample from the population we are studying. There is a "population" of irises out there, numbering at least in the millions. We observe 150 of them.
For example, we can see if we have enough evidence to claim that the mean petal length between versicolor and virginica differs. We observe it in our data, but are still limited by the standard error of our sample means, which is a function of the variance in our observations along with our sample size.
Recall that \(\text{Standard Error} = \frac{\text{Std. Dev.}}{\sqrt{n}}\), so a larger sample size means a smaller standard error.
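As a quick illustration of that formula (not one of the original chunks), we can compute the standard error of the mean petal length within each species by hand; tapply is just one convenient way to do it.

```r
# Standard error of the mean petal length within each species: sd / sqrt(n)
se_by_species <- tapply(iris$Petal.Length, iris$Species,
                        function(x) sd(x) / sqrt(length(x)))
round(se_by_species, 3)
```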
We're going to use the base R function t.test. Rather than defining x and y first, I will use square-bracket subsetting directly inside the call to tell R to first look at petal lengths for versicolors, then for virginicas. You could alternatively do x <- iris$Petal.Length[iris$Species=="versicolor"] and y <- iris$Petal.Length[iris$Species=="virginica"], followed by t.test(x, y).
t.test(iris$Petal.Length[iris$Species=="versicolor"], iris$Petal.Length[iris$Species=="virginica"])
##
## Welch Two Sample t-test
##
## data: iris$Petal.Length[iris$Species == "versicolor"] and iris$Petal.Length[iris$Species == "virginica"]
## t = -12.604, df = 95.57, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.49549 -1.08851
## sample estimates:
## mean of x mean of y
## 4.260 5.552
And there it is: the mean petal lengths differ significantly between the two species. This suggests that petal length might be a good predictor of species.
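As a side note, we can rebuild Welch's confidence interval by hand to see where t.test's numbers come from. This is just a sketch mirroring the call above, not part of the original tutorial.

```r
x <- iris$Petal.Length[iris$Species == "versicolor"]
y <- iris$Petal.Length[iris$Species == "virginica"]
se  <- sqrt(var(x)/length(x) + var(y)/length(y))      # SE of the difference in means
dfw <- se^4 / ((var(x)/length(x))^2 / (length(x)-1) +
               (var(y)/length(y))^2 / (length(y)-1))  # Welch-Satterthwaite df
(mean(x) - mean(y)) + c(-1, 1) * qt(0.975, dfw) * se  # ~ (-1.495, -1.089)
```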
We can also run some basic regressions. These won't tell us much more than correlation does, but they could still be interesting. I will load the package lfe to use the felm() command. You could also use the package fixest, which has slightly different syntax.
First, we’ll do something a bit uncouth. We’ll regress petal length on the categorical variable, species. Luckily, R will automatically convert a categorical variable to separate dummy variables for analysis. This is not a very robust model, but it will tell us something unsurprising.
library(lfe)
lm1 <- felm(Petal.Length ~ Species, data=iris)
summary(lm1)
##
## Call:
## felm(formula = Petal.Length ~ Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.260 -0.258 0.038 0.240 1.348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.46200 0.06086 24.02 <2e-16 ***
## Speciesversicolor 2.79800 0.08607 32.51 <2e-16 ***
## Speciesvirginica 4.09000 0.08607 47.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4303 on 147 degrees of freedom
## Multiple R-squared(full model): 0.9414 Adjusted R-squared: 0.9406
## Multiple R-squared(proj model): 0.9414 Adjusted R-squared: 0.9406
## F-statistic(full model): 1180 on 2 and 147 DF, p-value: < 2.2e-16
## F-statistic(proj model): 1180 on 2 and 147 DF, p-value: < 2.2e-16
Where is setosa? This shines some light on the concept of a categorical variable split into "factors". If you look at iris in your global environment, you'll notice that Species is defined as a "Factor w/ 3 levels". You can see those levels with levels(iris$Species) or unique(iris$Species). Each level of the factor is a different value the variable can take on. If we tried to force our model to include dummies for all three levels, it would have nothing to compare to! Everything we do is in comparison to something else.
Our coefficients are defined in relation to a lack thereof. Imagine a dataset that contains 100 people’s blood pressures after being given a blood pressure medication. That’s it. No before, no control group, just those 100 people after being given the medication. How could I possibly measure the effect of the medication? It’s impossible. The output above is defined in relation to the setosa species. It tells us something we should expect based on our previous analysis: Virginica petal lengths are about 4 units larger and Versicolor petal lengths are about 2.8 units larger.
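If you want to see the dummy encoding R builds behind the scenes, model.matrix shows it directly. This is a quick check, not one of the original chunks.

```r
# setosa is the dropped reference level, absorbed into the intercept
mm <- model.matrix(~ Species, data = iris)
colnames(mm)  # "(Intercept)" "Speciesversicolor" "Speciesvirginica"
head(mm)      # each row: intercept plus 0/1 dummies for the non-reference levels
```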
Let's perform some more interesting regressions. I used felm above, yet I could have just as easily used base R's lm(), since there were no fixed effects.
lm2 <- felm(Petal.Length ~ Petal.Width, data=iris) #Petal length regressed on width with no FE
lm3 <- felm(Petal.Length ~ Petal.Width | Species, data=iris) #regressed on width WITH FE
lm4 <- felm(Petal.Length ~ Petal.Width + Species, data=iris)
#let's throw it into stargazer to get a nice neat table out of it
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(lm2,lm3,lm4, type="text")
##
## ======================================================================
## Dependent variable:
## --------------------------------------------------
## Petal.Length
## (1) (2) (3)
## ----------------------------------------------------------------------
## Petal.Width 2.230*** 1.019*** 1.019***
## (0.051) (0.152) (0.152)
##
## Speciesversicolor 1.698***
## (0.181)
##
## Speciesvirginica 2.277***
## (0.281)
##
## Constant 1.084*** 1.211***
## (0.073) (0.065)
##
## ----------------------------------------------------------------------
## Observations 150 150 150
## R2 0.927 0.955 0.955
## Adjusted R2 0.927 0.954 0.954
## Residual Std. Error 0.478 (df = 148) 0.378 (df = 146) 0.378 (df = 146)
## ======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We have three outputs here, but which is best, and why are (2) and (3) identical? We're examining how width affects length; this is clearly more of a correlation than anything causal, considering that the overall size of the plant is probably determined by a plethora of confounders. That being said, the coefficients tell a different story once fixed effects are involved.
When you include a categorical variable in R, automatically split into factor dummies, you are essentially adding fixed effects to your model. That's why the coefficients of interest in (2) and (3) are identical. The extra coefficients for species are the intercept adjustments from the fixed effects; these are simply hidden in (2). When we use fixed effects, the interpretation of our model is in terms of within-group variation, whereas in (1) we are measuring variation between groups. Which is better? Fixed effects. Consider that width and length are codetermined by the species of the iris. In that case, species is a confounder, and a species fixed effect controls for this confounder, closing a backdoor path. Upon removing this omitted variable bias, we see that a more accurate measure of the relationship is about one to one: within a species of iris, 1 cm more width is associated with roughly 1 cm more length.
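One way to convince yourself of the within-group interpretation (a sketch, not part of the original tutorial): demean both variables by species and rerun the regression. The slope matches columns (2) and (3) exactly.

```r
# The FE estimator is the "within" estimator: demean by species, then regress
pl_dm <- iris$Petal.Length - ave(iris$Petal.Length, iris$Species)  # ave() = group means
pw_dm <- iris$Petal.Width  - ave(iris$Petal.Width,  iris$Species)
coef(lm(pl_dm ~ pw_dm))[["pw_dm"]]  # ~1.019, the same slope as in (2) and (3)
```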
It should be noted that this is an exercise for the sake of practice; the simultaneity between width and length is very clear, which is why we would generally not run this exact model in the real world.
Some call this dataset the "Hello World" of statistical analysis because it is so commonly used as an introduction for students. When you learn computer science, the first thing you're taught is how to make a little pop-up box that says "Hello World".↩︎