30. Homework 10

from datascience import *
import numpy as np

%matplotlib inline

import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from scipy import stats

An undergraduate statistics research project several years ago studied humor styles and personality. The data set from that study is called personality.csv. We will pull different subsets of that data frame for the work below. Here’s a description of the different variables you will see.

Sex

M/F response to question about biological sex

G21

Y/N response to “are you 21 years old or older?”

Greek

Y/N response to “are you involved in a social Greek fraternity or sorority?”

AccDate

Y/N response to question: “At a time in your life when you are not involved with anyone, someone asks you out. This person has a great personality, but you do not find them physically attractive. Do you accept the date?”

SitClass

Front/middle/back response to “where do you prefer to sit in class?”

Friends

Same/opposite/either response to “which sex do you find it easiest to make friends with?”

Stress1, Stress2

Pre-post measure of stress in the 2nd week (Stress1) and 7th week (Stress2) of the semester.

TxRel

Toxic relationships beliefs, higher scores indicate more toxicity.

Opt

Optimism, higher scores indicate more optimism.

SE

Self-esteem, higher scores indicate higher levels of self-esteem.

Neuro

Neuroticism, higher scores indicate higher levels of neuroticism.

Perf

Perfectionism, higher scores indicate higher levels of perfectionism.

Narc

Narcissism, higher scores indicate higher levels of narcissism.

You will likely recognize several data sets that we used in class examples and labs, too.

Data for examples

neuroanx = Table.read_table('http://faculty.ung.edu/rsinn/neuroanx.csv')
perfnarc = Table.read_table('http://faculty.ung.edu/rsinn/perfnarc.csv')
nba = Table.read_table('http://faculty.ung.edu/rsinn/nba_salaries.csv')
assault = Table.read_table('http://faculty.ung.edu/rsinn/crime_rates.csv').select(0,1,2,3,4,7)

Task 1

Using the perfnarc table, conduct an exploratory data analysis of the Stress1 values. Be sure to find the mean, median, sample size and standard deviation, and to display a histogram of the variable.

perfnarc.show(5)
Sex | G21 | Greek | AccDate | Stress1 | Stress2 | Perf | Narc
F   | N   | N     | N       | 9       | 7       | 99   | 3
F   | Y   | N     | Y       | 11      | 13      | 86   | 2
F   | N   | Y     | N       | 15      | 14      | 118  | 4
F   | N   | N     | Y       | 16      | 15      | 113  | 2
F   | Y   | N     | Y       | 17      | 17      | 107  | 8

... (143 rows omitted)

Remember, you may use the descriptive statistics tools from notebook 26.
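
As a starting point, the summary statistics named in Task 1 can be sketched with plain NumPy. This is only an illustration under stated assumptions: the array holds just the five Stress1 values displayed above (in the notebook the full column would come from `perfnarc.column('Stress1')`), the helper name `describe` is hypothetical, and the choice of the sample standard deviation (`ddof=1`) is an assumption that may differ from the convention used in notebook 26.

```python
import numpy as np

# Only the five Stress1 values displayed above; in the notebook,
# replace this with the full column: perfnarc.column('Stress1').
stress1 = np.array([9, 11, 15, 16, 17], dtype=float)

def describe(values):
    """Return sample size, mean, median and standard deviation (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return {
        'n': values.size,
        'mean': np.mean(values),
        'median': np.median(values),
        # ddof=1 gives the sample standard deviation (assumption; use
        # ddof=0 if the course convention is the population version).
        'sd': np.std(values, ddof=1),
    }

summary = describe(stress1)

# Histogram counts without plotting; in the notebook, perfnarc.hist('Stress1')
# draws the corresponding figure.
counts, bin_edges = np.histogram(stress1, bins=5)

print(summary)
print(counts, bin_edges)
```
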

Task 2

Using data from the perfnarc table, conduct an A/B test on Stress1 values using the grouping variable Greek. The research question is whether students involved in Greek life are more stressed during the 2nd week of the semester. Many social Greek organizations have meetings, socials and philanthropy events early in the semester, so perhaps their members experience higher levels of stress.

perfnarc.show(5)
Sex | G21 | Greek | AccDate | Stress1 | Stress2 | Perf | Narc
F   | N   | N     | N       | 9       | 7       | 99   | 3
F   | Y   | N     | Y       | 11      | 13      | 86   | 2
F   | N   | Y     | N       | 15      | 14      | 118  | 4
F   | N   | N     | Y       | 16      | 15      | 113  | 2
F   | Y   | N     | Y       | 17      | 17      | 107  | 8

... (143 rows omitted)

Be sure to include your null hypothesis and a for loop that simulates the null hypothesis test statistic. After displaying the simulated distribution and calculating your \(p\)-value, write a sentence or two about the real world conclusions you can draw based on your investigation.

Remember, you may use the A/B testing tools from notebook 20 and notebook 21.
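
The shape of such a simulation can be sketched as a permutation test: shuffle the group labels, recompute the difference in group means, and repeat in a for loop. The sketch below is a minimal illustration, not the method from notebooks 20 and 21. It uses only the five rows displayed above (so the resulting p-value is not meaningful), the helper name `mean_difference` is hypothetical, and the one-sided comparison is an assumption that follows from the research question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Only the five rows displayed above; in the notebook, use
# perfnarc.column('Stress1') and perfnarc.column('Greek') instead.
stress = np.array([9, 11, 15, 16, 17], dtype=float)
greek = np.array(['N', 'N', 'Y', 'N', 'N'])

def mean_difference(values, labels):
    """Mean of the Greek ('Y') group minus mean of the non-Greek ('N') group."""
    return values[labels == 'Y'].mean() - values[labels == 'N'].mean()

observed = mean_difference(stress, greek)

# Under the null hypothesis the labels are exchangeable, so shuffle
# them and recompute the statistic many times.
reps = 2000
simulated = np.empty(reps)
for i in range(reps):
    shuffled_labels = rng.permutation(greek)
    simulated[i] = mean_difference(stress, shuffled_labels)

# One-sided p-value: how often a shuffled difference is at least as
# large as the observed one.
p_value = np.mean(simulated >= observed)

print(observed, p_value)
```
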

Task 3

Using the nba table (read from nba_salaries.csv), conduct an A/B test to determine whether power forwards (PF) are paid more than shooting guards (SG).

nba.show(5)
PLAYER         | POSITION | TEAM          | '15-'16 SALARY
Paul Millsap   | PF       | Atlanta Hawks | 18.6717
Al Horford     | C        | Atlanta Hawks | 12
Tiago Splitter | C        | Atlanta Hawks | 9.75625
Jeff Teague    | PG       | Atlanta Hawks | 8
Kyle Korver    | SG       | Atlanta Hawks | 5.74648

... (412 rows omitted)

Be sure to include your null hypothesis and a for loop that simulates the null hypothesis test statistic. After displaying the simulated distribution and calculating your \(p\)-value, write a sentence or two about the real world conclusions you can draw based on your investigation.

Remember, you may use the tools from notebook 20 and notebook 21.

Task 4

Using the violent crime data set crime_rates (loaded above as the assault table), conduct an exploratory data analysis as well as a bootstrapping confidence interval estimate of the mean Aggravated Assault Rate in Georgia between 1960 and 1990.

assault.show(5)
State  | Year | Population | Violent Crime Rate | Murder Rate | Aggraveted Assault Rate
Alaska | 1960 | 226167     | 104.3              | 10.2        | 45.1
Alaska | 1961 | 234000     | 88.9               | 11.5        | 51.7
Alaska | 1962 | 246000     | 91.5               | 4.5         | 54.5
Alaska | 1963 | 248000     | 109.7              | 6.5         | 66.1
Alaska | 1964 | 250000     | 150                | 10.4        | 96

... (2195 rows omitted)

Discuss your findings. Remember, you may use the bootstrapping tools from notebook 24 and notebook 25 along with the descriptive statistics tools in notebook 26.
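
A percentile bootstrap for a mean can be sketched as follows. This is a minimal illustration under stated assumptions: the array holds the five Alaska rates displayed above purely as stand-in values (the task itself calls for the Georgia rows from 1960 to 1990), the resample size equals the sample size, and the 95% level and 1,000 repetitions are illustrative choices rather than requirements of this task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in values only: the five Alaska rates displayed above.
# For the task, use the Georgia rows from 1960 to 1990 instead.
rates = np.array([45.1, 51.7, 54.5, 66.1, 96.0])

reps = 1000
boot_means = np.empty(reps)
for i in range(reps):
    # Resample with replacement, same size as the original sample.
    resample = rng.choice(rates, size=rates.size, replace=True)
    boot_means[i] = resample.mean()

# Middle 95% of the bootstrap means gives the percentile interval.
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(rates.mean(), lower, upper)
```
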

Task 5

Using the violent crime data set crime_rates (loaded above as the assault table), conduct a comparison of the aggravated assault rates in Georgia and Alabama from 1960 to 1990. Construct a 95% bootstrap confidence interval for each state's mean, using a resample size of 30 and 1,000 repetitions of your for loop.

With the null hypothesis that the mean aggravated assault rate is the same in GA and AL, we can use the two intervals as a hypothesis test. If the confidence intervals do not overlap, we have evidence of a difference in aggravated assault rates between these two states. If the confidence intervals do overlap, this comparison does not provide evidence of a difference in means.

assault.show(5)
State  | Year | Population | Violent Crime Rate | Murder Rate | Aggraveted Assault Rate
Alaska | 1960 | 226167     | 104.3              | 10.2        | 45.1
Alaska | 1961 | 234000     | 88.9               | 11.5        | 51.7
Alaska | 1962 | 246000     | 91.5               | 4.5         | 54.5
Alaska | 1963 | 248000     | 109.7              | 6.5         | 66.1
Alaska | 1964 | 250000     | 150                | 10.4        | 96

... (2195 rows omitted)

Compare and contrast your two bootstrap distributions, and write a sentence or two about the real world conclusions you can draw based on your investigation.

Remember, you may use the tools from notebook 24 and notebook 25.
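
The interval comparison described in this task can be sketched as two bootstrap intervals plus an overlap check. Everything named here is an assumption for illustration: the helpers `bootstrap_mean_ci` and `intervals_overlap` are hypothetical, and the two arrays are made-up placeholder numbers, not values from the crime_rates data. Only the resample size of 30, the 1,000 repetitions, and the 95% level come from the task.

```python
import numpy as np

def bootstrap_mean_ci(values, resample_size=30, reps=1000, level=95, seed=0):
    """Percentile bootstrap interval for the mean (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    means = np.empty(reps)
    for i in range(reps):
        # Draw resample_size values with replacement and record the mean.
        means[i] = rng.choice(values, size=resample_size, replace=True).mean()
    tail = (100 - level) / 2
    lower, upper = np.percentile(means, [tail, 100 - tail])
    return lower, upper

def intervals_overlap(a, b):
    """True when the closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Made-up placeholder rates, NOT real data; replace with the Georgia
# and Alabama columns from the assault table.
ga_rates = np.linspace(100.0, 400.0, 31)
al_rates = np.linspace(90.0, 420.0, 31)

ga_ci = bootstrap_mean_ci(ga_rates, seed=1)
al_ci = bootstrap_mean_ci(al_rates, seed=2)

print(ga_ci, al_ci, intervals_overlap(ga_ci, al_ci))
```

Keeping the interval construction in one function means both states are treated identically, so any difference between the two intervals reflects the data rather than the procedure.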

Task 6

Using the personality data in the neuroanx table, test for a significant correlation between Anxiety (Anx) and Optimism (Opt). Be sure to display descriptive statistics for regression, a scatter plot, and a simulation of the null hypothesis test statistic calculated using a for loop with 2,000 to 5,000 repetitions.

neuroanx.show(5)
Sex | G21 | SitClass | Friends | TxRel | Anx | Opt | SE | Neuro
M   | N   | F        | O       | 26    | 23  | 20  | 70 | 10
F   | N   | M        | S       | 21    | 24  | 22  | 68 | 11
M   | Y   | F        | E       | 25    | 27  | 29  | 65 | 11
M   | Y   | B        | E       | 22    | 30  | 28  | 61 | 15
M   | N   | M        | E       | 23    | 40  | 26  | 64 | 16

... (137 rows omitted)

Be sure to include your null hypothesis and a for loop that simulates the null hypothesis test statistic. After displaying the simulated distribution and calculating your \(p\)-value, write a sentence or two about the real world conclusions you can draw based on your investigation.

Remember, you may use the correlation and regression tools from notebook 26 and notebook 27.
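
One way to simulate the null hypothesis of no correlation is to shuffle one variable, recompute the correlation coefficient, and repeat. The sketch below is an illustration only, not the method from notebooks 26 and 27. It uses just the five Anx and Opt pairs displayed above (so the p-value is not meaningful), the helper name `correlation` is hypothetical, and the two-sided comparison is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def correlation(x, y):
    """Pearson r computed as the mean product of standard units."""
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()
    return np.mean(x_su * y_su)

# Only the five rows displayed above; in the notebook, use
# neuroanx.column('Anx') and neuroanx.column('Opt') instead.
anx = np.array([23, 24, 27, 30, 40], dtype=float)
opt = np.array([20, 22, 29, 28, 26], dtype=float)

observed_r = correlation(anx, opt)

# Shuffling one variable breaks any pairing between the two, which is
# what the null hypothesis of no association describes.
reps = 2000
simulated_r = np.empty(reps)
for i in range(reps):
    simulated_r[i] = correlation(anx, rng.permutation(opt))

# Two-sided p-value: shuffled correlations at least as extreme as observed.
p_value = np.mean(np.abs(simulated_r) >= abs(observed_r))

print(observed_r, p_value)
```
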

Task 7

Conduct an investigation that interests you using one of the included data sets. Describe your null hypothesis and how you plan to test it. Have fun and be creative, but try to keep your investigation very similar to one of the example investigations shown. You should include a hypothesis test and a discussion of your calculated \(p\)-value. An exploratory data analysis of a single numeric variable is not sufficient for this task but would be an excellent additional component to one of the hypothesis tests.