Binary file added .RData
Binary file not shown.
512 changes: 512 additions & 0 deletions .Rhistory

Large diffs are not rendered by default.

182 changes: 182 additions & 0 deletions Ceballos_ggplot2_exercise.Rmd
@@ -0,0 +1,182 @@
---
title: 'ggplot2 intro exercise'
author: "Cristina Ceballos"
date: "November 13, 2019"
output: html_document
---

This is an in-class exercise on using `ggplot2`.

Note that this example is from the_grammar.R on [http://had.co.nz/ggplot2](http://had.co.nz/ggplot2). I've adapted it for Psych 251 purposes.

First, install and load the package. It's part of the "core tidyverse".

```{r}
library(tidyverse)
```

# Exploring ggplot2 using `qplot`

We'll start by using `qplot`. `qplot` is the easy interface, meant to replace `plot`. You can give it simple `qplot(x,y)` examples, or slightly more complex examples like `qplot(x, y, col=grp, data=d)`.

We're going to be using the `diamonds` dataset. This is a set of measurements of diamonds, along with their price etc.

```{r}
head(diamonds)
qplot(diamonds$carat, diamonds$price)
qplot(diamonds$price)
colnames(diamonds)
```

Scatter plots are trivial, and easy to add features to. Modify this plot so that it uses the dataframe rather than working from variables in the general namespace (good to get away from retyping `diamonds$` every time you reference a variable).

```{r}

qplot(data = diamonds, carat, price)


# Piping does not work with qplot because its first argument is x, not data.
# ggplot takes data as its first argument, so you can pipe into it, but the
# syntax differs: you map variables through aes().

```

Try adding clarity and cut, using shape and color as your visual variables.

```{r}


qplot(data = diamonds,
x = carat,
y = price,
shape = clarity,
color = cut)

```


# More complex with `ggplot`

`ggplot` is just a way of building `qplot` calls up more systematically. It's
sometimes easier to use and sometimes a bit more complicated. What I want to show off here is the functionality of being able to build up complex plots with multiple elements. You can actually do this using `qplot` pretty easily, but there are a few things that are hard to do.

`ggplot` is the basic call, where you specify A) a dataframe and B) an aesthetic (`aes`) mapping from variables in the plot space to variables in the dataset.

```{r}
d <- ggplot(diamonds, aes(x = carat, y = price)) # first you set the aesthetic and dataset
d + geom_point() # then you add geoms
# d + geom_point(aes(colour = carat)) # and you can keep doing this to add layers to the plot
```

Try writing this as a single set of additions (e.g. one line of R code, though you can put in linebreaks). This is the most common workflow for me.


```{r}
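# One possible answer, as a sketch: build the whole plot as a single chained expression.
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()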

```


# Facets

Let's try out another version of the `qplot` example above. Rewrite the last qplot example with ggplot.

```{r}

#qplot(data = diamonds,
# x = carat,
# y = price,
# shape = clarity,
# color = cut)

ggplot(diamonds,
aes(x = carat, y = price,
shape = clarity,
color = cut)) +
geom_point()

# d + geom_point()

# ggplot(data = diamonds,
# mapping = aes(x = carat, y = price)
# + geom_point()
# )

```

One of the primary benefits of `ggplot2` is the use of facets - also known as small multiples in the Tufte vocabulary. That last plot was probably hard to read. Facets could make it better. Try adding `facet_grid(x ~ y)`. `x ~ y` means row facets are by `x`, column facets by `y`.

```{r}

ggplot(diamonds,
aes(x = carat, y = price,
color = cut)) +
geom_point() +
facet_grid(clarity ~ cut)

# The facet statement has two variables: clarity and cut. In facet_grid(clarity ~ cut),
# clarity defines the rows and cut (Fair, Good, Very Good, Premium, Ideal) defines the
# columns, giving a grid of mini-plots.

# Alternatively, you can switch to facet_wrap(~ clarity), which wraps the panels into
# the most pleasing grid of subplots.

# Faceting is especially useful when you have discrete variables and want to look at
# their effect across a continuous variable space.

# To export these kinds of plots, you can use ggsave("~/Desktop/Diamonds.pdf", plot = p),
# where p is a saved plot object. You can also plot to the plotting window and use
# Export, but that is less reproducible.


```

But facets can also get overwhelming. Try to strike a good balance between color, shape, and faceting.

HINT: `facet_grid(. ~ x)` puts x on the columns, but `facet_wrap(~ x)` (no dot) *wraps* the facets.

```{r}
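# A sketch of one possible balance (not the only reasonable choice): color encodes cut,
# clarity is wrapped into facets rather than crossed with cut, and transparency reduces
# overplotting.
ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ clarity)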

```


# Geoms

As you've seen above, the basic unit of a ggplot plot is a "geom" - a mapping between data (via an "aesthetic") and a particular geometric configuration on coordinate axes.

Let's try adding some geoms and manipulating their parameters. One combo I really like is a scatterplot with a smoothing layer (`geom_smooth`). Try adding one onto this plot.

```{r}
ggplot(diamonds, aes(x=carat, y=price)) +
geom_point(shape = ".") +
facet_grid(cut ~ clarity) +
geom_smooth()
```


CHALLENGE: You could also try starting with a histogram and adding densities (`geom_density`). Try [this link](https://stackoverflow.com/questions/5688082/ggplot2-overlay-histogram-with-density-curve).

```{r}

ggplot(diamonds, aes(x = carat)) +
  # Following the linked approach: put the histogram on the density scale so the density
  # curve overlays it on comparable units. A binwidth of 5 would lump nearly all carats
  # (max ~5) into a single bar, so use a finer binwidth instead.
  geom_histogram(aes(y = ..density..), binwidth = 0.1) +
  geom_density() +
  facet_grid(cut ~ clarity)


```

# Themes and plot cleanup

I like a slightly cleaner look to my plots. Luckily, ggplot allows you to add "themes" to your plots. Try doing the same plot but adding `+ theme_bw()` or `+ theme_classic()`. Different themes work better for different applications, in my experience. My favorite right now is `ggthemes::theme_few`.

You can also try different color scales. Experiment with `scale_color_...` - try writing that and hitting TAB for autocomplete options. Check out the help on this.

You can also try transparency/different point sizes to clean up scatterplots. Try `alpha = .1` or `pch = "."` to make transparent or small points.

Finally, don't forget to "fix the axis labels"!

Here's a somewhat ugly plot - see if you can make it look awesome.

```{r}
ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
geom_point() +
facet_wrap(~clarity)
```
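
Here's one possible cleanup, as a sketch: the particular theme (`theme_bw`), point transparency and size, and axis labels below are illustrative choices using only core `ggplot2`.

```{r}
ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = .1, size = .5) +   # small, transparent points reduce overplotting
  facet_wrap(~clarity) +
  theme_bw() +                          # cleaner theme
  labs(x = "Carat", y = "Price (USD)", col = "Cut")  # fixed axis labels
```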

192 changes: 192 additions & 0 deletions Ceballos_ps4.Rmd
@@ -0,0 +1,192 @@
---
title: 'Psych 251 PS4: Simulation'
author: "Cristina Ceballos"
date: "11/11/2019"
output:
html_document:
toc: true
---

This is problem set #4, in which we want you to integrate your knowledge of data wrangling with some basic simulation skills. It's a short problem set to help you get your feet wet in testing statistical concepts through "making up data" rather than consulting a textbook or doing math.

For ease of reading, please separate your answers from our text by marking our text with the `>` character (indicating quotes).

```{r, warning=F, message=F}
library(tidyverse)
```

Let's start by convincing ourselves that t-tests have the appropriate false positive rate. Run 10,000 t-tests with standard, normally-distributed data from a made up 30-person, single-measurement experiment (the command for sampling from a normal distribution is `rnorm`).

The goal of these t-tests is to determine, based on 30 observations, whether the underlying distribution (in this case a normal distribution with mean 0 and standard deviation 1) has a mean that is different from 0. In reality, the mean is not different from 0 (we sampled it using `rnorm`), but sometimes the 30 observations we get in our experiment will suggest that the mean is higher or lower. In this case, we'll get a "significant" result and incorrectly reject the null hypothesis of mean 0.

What's the proportion of "significant" results ($p < .05$) that you see?

First do this using a `for` loop.

```{r}

result <- numeric(10000)

for (i in 1:10000) {
X <- rnorm(30, 0, 1)
result[i] <- t.test(X)$p.value
}

sig_results <- result < 0.05

sum(sig_results)/10000


```

Next, do this using the `replicate` function:

```{r}

pvals <- replicate(10000, t.test(rnorm(30))$p.value)

sig_pvals <- pvals < 0.05

sum(sig_pvals)/10000

```

How does this compare to the intended false-positive rate of $\alpha=0.05$?

> This answer makes sense. I'm getting a false positive rate of about 5 percent, or slightly less. That's what I would expect given that I am testing for significance at $p < .05$.

Ok, that was a bit boring. Let's try something more interesting - let's implement a p-value sniffing simulation, in the style of Simmons, Nelson, & Simonsohn (2011).

Consider this scenario: you have done an experiment, again with 30 participants (one observation each, just for simplicity). The question is whether the true mean is different from 0. You aren't going to check the p-value every trial, but let's say you run 30 - then if the p-value is within the range p < .25 and p > .05, you optionally run 30 more and add those data, then test again. But if the original p value is < .05, you call it a day, and if the original is > .25, you also stop.

First, write a function that implements this sampling regime.

```{r}
double.sample <- function() {

  first_sample <- rnorm(30)
  pvals_2 <- t.test(first_sample)$p.value

  if (pvals_2 < 0.05) {           # significant: stop and report
    return(pvals_2)
  } else if (pvals_2 > 0.25) {    # clearly non-significant: stop and report
    return(pvals_2)
  } else {                        # "close": run 30 more participants and re-test
    second_sample <- c(first_sample, rnorm(30))
    return(t.test(second_sample)$p.value)
  }

}


```

Now call this function 10k times and find out what happens.

```{r}

tenk_results <- replicate(10000, double.sample())

sum(tenk_results < 0.05)/10000


```

Is there an inflation of false positives? How bad is it?

> Yes, there's an inflation of false positives. I'm getting about 7 percent false positives using the double-sampling technique. This is what Mike talked about in class with p-hacking: while I may falsely believe that I am keeping my false positive rate under 0.05, the double-sampling technique is actually giving me a somewhat higher rate.

Now modify this code so that you can investigate this "double the sample" rule in a bit more depth. In the previous question, the researcher doubles the sample only when they think they got "close" to a significant result, i.e. when their not-significant p is less than 0.25. What if the researcher was more optimistic? See what happens in these 3 other scenarios:

* The researcher doubles the sample whenever their p-value is not significant, but it's less than 0.5.
* The researcher doubles the sample whenever their p-value is not significant, but it's less than 0.75.
* The researcher doubles their sample whenever they get ANY p-value that is not significant.

How do these choices affect the false positive rate?

HINT: Try to do this by making the function `double.sample` take the upper p value as an argument, so that you can pass this through dplyr.

HINT 2: You may need more samples. Find out by looking at how the results change from run to run.

```{r}

# The researcher doubles the sample whenever their pvalue is not significant, but it's less than 0.5.

double.sample0.5 <- function() {

  first_sample_0.5 <- rnorm(30)

  pvals_2_0.5 <- t.test(first_sample_0.5)$p.value

  if (pvals_2_0.5 < 0.05) {
    return(pvals_2_0.5)
  } else if (pvals_2_0.5 > 0.5) {
    return(pvals_2_0.5)
  } else {
    second_sample_0.5 <- c(first_sample_0.5, rnorm(30))
    return(t.test(second_sample_0.5)$p.value)
  }

}

tenk_results_0.5 <- replicate(10000, double.sample0.5())

sum(tenk_results_0.5 < 0.05)/10000



# The researcher doubles the sample whenever their pvalue is not significant, but it's less than 0.75.

double.sample0.75 <- function () {

first_sample_0.75 <- rnorm(30)

pvals_2_0.75 <- (t.test(first_sample_0.75)$p.value)


if(pvals_2_0.75<0.05) {
return(pvals_2_0.75)
} else if (pvals_2_0.75>0.75) {
return(pvals_2_0.75)
} else {
second_sample_0.75 <- c(first_sample_0.75, rnorm(30))
return(t.test(second_sample_0.75)$p.value)
}

}

tenk_results_0.75 <- replicate(10000, double.sample0.75())

sum(tenk_results_0.75 < 0.05)/10000



# The researcher doubles the sample whenever their pvalue is not significant.

double.sample_whenever <- function () {

first_sample_whenever <- rnorm(30)
pvals_2_whenever <- (t.test(first_sample_whenever)$p.value)

if(pvals_2_whenever<0.05) {
return(pvals_2_whenever)
} else {
second_sample_whenever <- c(first_sample_whenever, rnorm(30))
return(t.test(second_sample_whenever)$p.value)
}

}

tenk_results_whenever <- replicate(10000, double.sample_whenever())

sum(tenk_results_whenever < 0.05)/10000




```
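
Following the hint, here is a sketch of a parameterized version; the function name `double.sample.upper`, the `upper` argument, and the use of `sapply` (rather than dplyr) are illustrative choices, not part of the assignment.

```{r}
# Sketch: the upper bound of the "close enough to double the sample" range is an argument.
double.sample.upper <- function(upper) {
  first_sample <- rnorm(30)
  p <- t.test(first_sample)$p.value
  if (p < 0.05 | p > upper) {
    return(p)                                   # stop: significant, or not close enough
  }
  t.test(c(first_sample, rnorm(30)))$p.value    # otherwise double the sample and re-test
}

# False positive rate under each stopping rule (upper = 1 means "double whenever not significant")
sapply(c(0.25, 0.5, 0.75, 1), function(u) {
  mean(replicate(10000, double.sample.upper(u)) < 0.05)
})
```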

What do you conclude on the basis of this simulation? How bad is this kind of data-dependent policy?

> I see an inflation of false positives. The more liberal I am with my double-sampling policy, the higher the rate of false positives. For example, when I doubled the sample whenever the result was not significant, I got a false positive rate of about 8 percent. This data-dependent double-sampling policy leads to many more false positives, even though my nominal significance threshold is set at 0.05.
557 changes: 557 additions & 0 deletions Ceballos_ps4.html

Large diffs are not rendered by default.
