In the last post I complained that there is no way to apply a formula to rows of a data frame so that the formula will consume the rows as lists or vectors. Turns out I was wrong, and there is a very easy way to do it!

Not only that, but I was actually really close to it, and if I was just reading the relevant documentation bit more carefully…

As I wrote in that post, this works:

```
selections %>%
mutate(chisq = pmap_dbl(., ~sum((c(..1, ..2, ..3, ..4) -
expected)^2/expected))) %>%
head()
```

I was suggesting that perhaps something like

`pmap_dbl(selections, ~sum((.row - expected)^2/expected))`

would be nice to have. While experimenting, among other things I tries this:

`pmap_dbl(selections, ~sum((... - expected)^2/expected))`

which of course does not work, and I am not sure why I thought it would. The correct way, of course, is

`pmap_dbl(selections, ~sum((c(...) - expected)^2/expected))`

The `...`

is, in this case, equivalent to `..1, ..2, ..3, ..4`

.
Since I need `c(..1, ..2, ..3, ..4)`

, I have to use `c(...)`

.

## The simulation again

Anyway, let’s do the whole simulation from the previous post again:

```
library(tidyverse)
library(mosaic)
```

First enter the given information:

```
labels <- fct_inorder(c("White", "Black", "Hispanic", "Other"), ordered=TRUE)
percentages <- c(72, 7, 12, 9)
names(percentages) <- labels
observed <- c(205, 26, 25, 19)
names(observed) <- labels
```

Calculate the expected frequencies and the observed \(\chi^2\) score:

```
expected <- 275*percentages/100
observed_chi_square <- sum((observed - expected)^2/expected)
observed_chi_square
```

`## [1] 5.88961`

Repeatedly sample, with replacement, 275 pieces of paper from a bag representing the population, tally each sample, and record the frequencies in a data set:

```
bag <- rep(labels, percentages)
do(1000) * tally(sample(bag, 275, replace=TRUE)) -> selections
glimpse(selections)
```

```
## Observations: 1,000
## Variables: 4
## $ White <int> 200, 187, 198, 196, 197, 200, 193, 187, 200, 196, 208, …
## $ Black <int> 20, 19, 23, 23, 17, 20, 27, 17, 17, 20, 16, 25, 11, 19,…
## $ Hispanic <int> 31, 36, 32, 34, 35, 30, 35, 38, 35, 36, 29, 39, 44, 30,…
## $ Other <int> 24, 33, 22, 22, 26, 25, 20, 33, 23, 23, 22, 24, 24, 20,…
```

Add a new column to the data set, with the \(\chi^2\) scores of the simulated samples:

```
selections %>%
mutate(chisq = pmap_dbl(., ~sum((c(...) - expected)^2/expected))) ->
selections_with_chisq
```

This is what we have now:

`head(selections_with_chisq)`

```
## White Black Hispanic Other chisq
## 1 200 20 31 24 0.1933622
## 2 187 19 36 33 3.6370851
## 3 198 23 32 22 1.0663781
## 4 196 23 34 22 1.0865801
## 5 197 17 35 26 0.4523810
## 6 200 20 30 25 0.3246753
```

Plot the \(\chi^2\) scores of all the simulated samples, and mark the observed \(\chi^2\) on the plot with a vertical line.

```
gf_histogram(~chisq, data = selections_with_chisq) %>%
gf_vline(xintercept = observed_chi_square, color="red")
```

How many of the 1000 simulated samples had a \(\chi^2\) score greater than or equal to the observed \(\chi^2\) score?

`count(~(chisq >= observed_chi_square), data = selections_with_chisq)`

```
## n_TRUE
## 112
```

## Possible simplification

I am pretty happy with this. The only thing that I think can still confuse R
beginners is the use of `.`

as the first argument of `pmap_dbl`

. I usually do
few examples of *tidyverse* pipelines at the beginning of the semester, and
usually one or two of those use this somewhere, but I don’t think most students
will remember those at this point. Since we do not actually need to preserve
the original columns (the only reason I preserved them was to make the data set
look just like a worksheet they filled in while figuring out the whole idea of
\(\chi^2\) scores), we can do this instead:

```
selections %>%
pmap_df(~list(chisq = sum((c(...) - expected)^2/expected))) -> chisquares
```

and then replace `selections_with_chisq`

by `chisquares`

when making the
histogram and calculating the p-value. I am not sure if this really is that
much simpler, with the `list`

, and `pmap_df`

not being one of the basic *dplyr*
verbs we learned at the beginning of the semester…