How to split dataframe in R to conduct statistical tests? [closed]
I have 3 variables, A1, A2 and A3:
A1 is temperature
A2 is month
A3 is location
A2 has 2 months - March and May. A3 has 2 cities - Chennai and Dubai.
But when I do a correlation between A1 and A3:
cor(A1, A3, method = "pearson")
'y' must be numeric
How can I fix this, please?
Many Thanks,
Ishack
r dataframe split correlation
closed as off-topic by jogo, Andre Elrico, Rui Barradas, Sven Hohenstein, phiver Nov 19 '18 at 17:34
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: How to create a Minimal, Complete, and Verifiable example." – jogo, Sven Hohenstein, phiver
If this question can be reworded to fit the rules in the help center, please edit the question.
1
Welcome to SO! Please read How to Ask and give a Minimal, Complete, and Verifiable example in your question!
– jogo
Nov 19 '18 at 12:47
2
You cannot correlate nominal/dichotomous data with continuous. You could do a ?t.test or ?wilcox.test for A1 vs A2 and A1 vs A3. Please read and watch youtube videos on "correlation" and the two tests I mentioned.
– Andre Elrico
Nov 19 '18 at 12:50
Maybe the accepted answer to Correlations with unordered categorical variables will help.
– Rui Barradas
Nov 19 '18 at 12:56
Can I use t.test instead of the correlation?
– Ishack Marshook
Nov 19 '18 at 14:01
A t-test would test the null hypothesis that the average temperatures between the two cities are equal. A paired t-test would test the null hypothesis that pairs of temperature readings taken at the same time are equal. A correlation would explain the degree to which increases in temperature in Chennai are associated with increases in temperature in Dubai. Which hypothesis do you wish to test?
– Len Greski
Nov 19 '18 at 17:15
edited Nov 19 '18 at 13:26 by Ned
asked Nov 19 '18 at 12:32 by Ishack Marshook
1 Answer
There are many ways to split the data, but the first question to answer is "what hypothesis do I wish to test?"
Here is example code using average daily high temperatures in Chennai and Dubai from timeanddate.com:
# average high temperatures, based on data collected from 2005 - 2015
# https://www.timeanddate.com/weather/india/chennai/climate
# https://www.timeanddate.com/weather/united-arab-emirates/dubai/climate
rawData <- "
temperature,month,city
75,Jan,Dubai
78,Feb,Dubai
83,Mar,Dubai
92,Apr,Dubai
100,May,Dubai
103,Jun,Dubai
106,Jul,Dubai
107,Aug,Dubai
102,Sep,Dubai
96,Oct,Dubai
87,Nov,Dubai
79,Dec,Dubai
86,Jan,Chennai
89,Feb,Chennai
93,Mar,Chennai
97,Apr,Chennai
102,May,Chennai
100,Jun,Chennai
97,Jul,Chennai
95,Aug,Chennai
95,Sep,Chennai
92,Oct,Chennai
87,Nov,Chennai
86,Dec,Chennai"
tempData <- read.csv(text=rawData)
# t-test for average temperatures
t.test(tempData[tempData$city == "Dubai", "temperature"],
       tempData[tempData$city == "Chennai", "temperature"],
       paired = FALSE)
# paired t-test
t.test(tempData[tempData$city == "Dubai", "temperature"],
       tempData[tempData$city == "Chennai", "temperature"],
       paired = TRUE)
# correlation
cor(tempData[tempData$city == "Dubai", "temperature"],
    tempData[tempData$city == "Chennai", "temperature"])
Two Sample t-test
The two-sample t-test tests the null hypothesis that the two group means are equal, ignoring any pairing between observations in the two groups. Pairing is often based on time (as with the monthly temperature data here), but it may be based on other characteristics (e.g. twins in a study where the members of each pair are randomly assigned to the test and control groups). By default, t.test() runs the Welch version of the test, which does not assume equal variances in the two groups.
> t.test(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city == "Chennai","temperature"],
+ paired=FALSE)
Welch Two Sample t-test
data: tempData[tempData$city == "Dubai", "temperature"] and tempData[tempData$city == "Chennai", "temperature"]
t = -0.24817, df = 15.546, p-value = 0.8073
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.765568 6.932235
sample estimates:
mean of x mean of y
92.33333 93.25000
Since 0 is within the 95% confidence interval, we fail to reject the null hypothesis that there is no difference in the monthly average high temperatures between Chennai and Dubai.
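The same Welch test can also be written with the formula interface of t.test(), which takes the grouping column directly instead of two hand-made subsets. This is only a minimal sketch: it rebuilds the example data with data.frame() so that it runs on its own, and the object name welch is illustrative.

```r
# Rebuild the example data from the answer (monthly average highs)
tempData <- data.frame(
  temperature = c(75, 78, 83, 92, 100, 103, 106, 107, 102, 96, 87, 79,
                  86, 89, 93, 97, 102, 100, 97, 95, 95, 92, 87, 86),
  month = rep(month.abb, times = 2),
  city = rep(c("Dubai", "Chennai"), each = 12)
)

# Formula interface: numeric response ~ two-level grouping variable.
# Groups are ordered alphabetically (Chennai, then Dubai), so the sign of
# the t statistic flips relative to the Dubai-first call; the p-value is
# unchanged.
welch <- t.test(temperature ~ city, data = tempData)
welch$statistic
welch$p.value
```

Because the formula form needs a grouping variable with exactly two levels, it fits the city column here but would need a filter first if more cities were added.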
Paired t-test
The paired t-test calculates the difference between the pairs of observations and tests the null hypothesis that the average difference is 0.
> # paired t-test
> t.test(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city == "Chennai","temperature"],
+ paired=TRUE)
Paired t-test
data: tempData[tempData$city == "Dubai", "temperature"] and tempData[tempData$city == "Chennai", "temperature"]
t = -0.39555, df = 11, p-value = 0.7
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.017343 4.184009
sample estimates:
mean of the differences
-0.9166667
Since 0 is within the 95% confidence interval, we fail to reject the null hypothesis that there is no difference in the monthly average high temperatures between Chennai and Dubai, when the test is conducted on the differences in pairs of monthly average high temperature values.
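To make the "difference between the pairs" concrete, the sketch below computes the month-by-month differences by hand and checks that a one-sample t-test on them gives the same statistic as the paired test. The vector names dubai, chennai and d are illustrative; the values are the ones from the example data.

```r
# Monthly average highs, January to December, taken from the example data
dubai   <- c(75, 78, 83, 92, 100, 103, 106, 107, 102, 96, 87, 79)
chennai <- c(86, 89, 93, 97, 102, 100, 97, 95, 95, 92, 87, 86)

# One difference per month; this is what the paired test works on
d <- dubai - chennai
mean(d)

# Paired test on the two vectors vs. one-sample test on the differences
paired    <- t.test(dubai, chennai, paired = TRUE)
oneSample <- t.test(d, mu = 0)

# Both calls yield the same t statistic and p-value
all.equal(unname(paired$statistic), unname(oneSample$statistic))
all.equal(paired$p.value, oneSample$p.value)
```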
Correlation
The Pearson correlation measures the strength and direction of the linear relationship between two variables: -1 = perfect negative correlation, 0 = no linear correlation, and 1 = perfect positive correlation.
> cor(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city =="Chennai","temperature"])
[1] 0.7929018
>
A correlation of 0.79 indicates a strong positive linear relationship between monthly average high temperatures in Dubai and Chennai.
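cor() returns only the coefficient. If a significance test and confidence interval for the correlation are also wanted, cor.test() from the stats package provides them. A minimal sketch, with illustrative vector names and the values from the example data:

```r
# Monthly average highs, January to December, taken from the example data
dubai   <- c(75, 78, 83, 92, 100, 103, 106, 107, 102, 96, 87, 79)
chennai <- c(86, 89, 93, 97, 102, 100, 97, 95, 95, 92, 87, 86)

# Tests the null hypothesis that the true correlation is 0
ct <- cor.test(dubai, chennai, method = "pearson")

ct$estimate   # Pearson r, same value as cor(dubai, chennai)
ct$conf.int   # 95% confidence interval for r
ct$p.value    # p-value for H0: r = 0
```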
Technique used to split the data
Since I created the raw data as text and loaded it into R with read.csv(), I used the [ form of the extract operator to extract rows based on the value of the city column. I also entered the raw data in monthly order for each city, so the order of values in each subset matches by month, enabling a straightforward use of the paired t-test.
# extract temperature values for Dubai
tempData[tempData$city =="Dubai","temperature"]
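The paired test and the correlation are only meaningful when both subsets are in the same month order. When row order cannot be guaranteed, one option is to join the two subsets on the month column before testing. The sketch below shuffles the rows on purpose to mimic unsorted data; the names shuffled, wide and pairedTest are illustrative.

```r
# Rebuild the example data from the answer (monthly average highs)
tempData <- data.frame(
  temperature = c(75, 78, 83, 92, 100, 103, 106, 107, 102, 96, 87, 79,
                  86, 89, 93, 97, 102, 100, 97, 95, 95, 92, 87, 86),
  month = rep(month.abb, times = 2),
  city = rep(c("Dubai", "Chennai"), each = 12)
)

# Shuffle the rows to mimic data that is not sorted by month
set.seed(1)
shuffled <- tempData[sample(nrow(tempData)), ]

# Subset by city, then join on month so each row holds one matched pair
dubai   <- shuffled[shuffled$city == "Dubai", c("month", "temperature")]
chennai <- shuffled[shuffled$city == "Chennai", c("month", "temperature")]
wide <- merge(dubai, chennai, by = "month",
              suffixes = c(".dubai", ".chennai"))

# The pairs are now aligned by month regardless of the original row order
pairedTest <- t.test(wide$temperature.dubai, wide$temperature.chennai,
                     paired = TRUE)
pairedTest$estimate
cor(wide$temperature.dubai, wide$temperature.chennai)
```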
A wide variety of techniques can be used to subset data from an R data frame, such as the which() function and the sqldf() function from the sqldf package.
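As a rough illustration of a few base R alternatives, the sketch below extracts the Dubai temperatures with split(), subset() and which(). It stays within base R, so sqldf() is not shown; the object names byCity, viaSplit, viaSubset and viaWhich are illustrative.

```r
# Rebuild the example data from the answer (monthly average highs)
tempData <- data.frame(
  temperature = c(75, 78, 83, 92, 100, 103, 106, 107, 102, 96, 87, 79,
                  86, 89, 93, 97, 102, 100, 97, 95, 95, 92, 87, 86),
  month = rep(month.abb, times = 2),
  city = rep(c("Dubai", "Chennai"), each = 12)
)

# split() returns a named list with one data frame per city
byCity <- split(tempData, tempData$city)
viaSplit <- byCity$Dubai$temperature

# subset() filters rows and selects columns in one call (returns a data frame)
viaSubset <- subset(tempData, city == "Dubai", select = temperature)

# which() turns the logical test into row indices
viaWhich <- tempData[which(tempData$city == "Dubai"), "temperature"]

# All three approaches return the same twelve values
identical(viaSplit, viaWhich)
identical(viaSubset$temperature, viaWhich)
```

split() is the closest match to "splitting" a data frame, since it produces every group at once and the resulting list can be passed to lapply() or sapply().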
Thanks a lot Len.
– Ishack Marshook
Nov 20 '18 at 13:25
@IshackMarshook please accept the answer if you found it to be helpful.
– Len Greski
Nov 20 '18 at 13:26
done, Len. Can i ask a few more questions please?
– Ishack Marshook
Nov 20 '18 at 13:28
@IshackMarshook - if your followup questions are directly related to this one, yes. Otherwise, please post a new question and the SO community will answer it. Part of SO etiquette is that a post focuses on a single question.
– Len Greski
Nov 20 '18 at 15:42
There are many ways to split the data, but the first question to answer is "what hypothesis do I wish to test?"
Here is example code using average daily high temperatures in Chennai and Dubai from timeanddate.com
# data collected from average high temperatures collected from 2005 - 2015
# https://www.timeanddate.com/weather/india/chennai/climate
# https://www.timeanddate.com/weather/united-arab-emirates/dubai/climate
rawData <- "
temperature,month,city
75,Jan,Dubai
78,Feb,Dubai
83,Mar,Dubai
92,Apr,Dubai
100,May,Dubai
103,Jun,Dubai
106,Jul,Dubai
107,Aug,Dubai
102,Sep,Dubai
96,Oct,Dubai
87,Nov,Dubai
79,Dec,Dubai
86,Jan,Chennai
89,Feb,Chennai
93,Mar,Chennai
97,Apr,Chennai
102,May,Chennai
100,Jun,Chennai
97,Jul,Chennai
95,Aug,Chennai
95,Sep,Chennai
92,Oct,Chennai
87,Nov,Chennai
86,Dec,Chennai"
tempData <- read.csv(text=rawData)
# t-test for average temperatures
t.test(tempData[tempData$city =="Dubai","temperature"],
tempData[tempData$city == "Chennai","temperature"],
paired=FALSE)
# paired t-test
t.test(tempData[tempData$city =="Dubai","temperature"],
tempData[tempData$city == "Chennai","temperature"],
paired=TRUE)
# correlation
cor(tempData[tempData$city =="Dubai","temperature"],
tempData[tempData$city =="Chennai","temperature"])
Two Sample t-test
The two sample t-test tests the null hypothesis that the two means are equal, irrespective of the association between data collected between the two groups in the test. Sometimes the association between two groups may be based on time (as in the case of the temperature data), but the pairing may be based on other characteristics (e.g. twins in a study that has test and control groups where each pair of twins is randomly assigned to test and control groups).
> t.test(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city == "Chennai","temperature"],
+ paired=FALSE)
Welch Two Sample t-test
data: tempData[tempData$city == "Dubai", "temperature"] and tempData[tempData$city == "Chennai", "temperature"]
t = -0.24817, df = 15.546, p-value = 0.8073
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.765568 6.932235
sample estimates:
mean of x mean of y
92.33333 93.25000
Since 0 is within the 95% confidence interval, we accept the null hypothesis that there is no difference in the monthly average high temperatures between Chennai and Dubai.
Paired t-test
The paired t-test calculates the difference between the pairs of observations and tests the null hypothesis that the average difference is 0.
> # paired t-test
> t.test(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city == "Chennai","temperature"],
+ paired=TRUE)
Paired t-test
data: tempData[tempData$city == "Dubai", "temperature"] and tempData[tempData$city == "Chennai", "temperature"]
t = -0.39555, df = 11, p-value = 0.7
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.017343 4.184009
sample estimates:
mean of the differences
-0.9166667
Since 0 is within the 95% confidence interval, we accept the null hypothesis that there is no difference in the monthly average high temperatures between Chennai and Dubai, when the test is conducted on the differences in pairs of monthly average high temperature values.
Correlation
The Pearson correlation measures the strength of the linear relationship between two variables, -1.0 = perfect negative correlation, 0 = no linear correlation, and 1 = perfect positive correlation.
> cor(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city =="Chennai","temperature"])
[1] 0.7929018
A correlation of 0.79 indicates a strong positive linear relationship between monthly average high temperatures in Dubai and Chennai.
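For context, the error in the question comes from passing a non-numeric column to cor(). The sketch below reproduces that error and then computes the correlation between the two cities' temperatures after splitting by city. The variable names A1 and A3 follow the question; the values are made up for illustration.

```r
# Hypothetical data: A1 = temperature, A3 = location; values are illustrative
A1 <- c(75, 77, 82, 91, 84, 87, 91, 95)
A3 <- rep(c("Dubai", "Chennai"), each = 4)

# cor() needs numeric input, so passing the city names raises an error;
# tryCatch() captures the message instead of stopping the script
err_msg <- tryCatch(cor(A1, A3, method = "pearson"),
                    error = function(e) conditionMessage(e))
err_msg

# Split the temperatures by city, then correlate the two numeric vectors.
# This assumes both cities are listed in the same month order.
r <- cor(A1[A3 == "Dubai"], A1[A3 == "Chennai"], method = "pearson")
r
```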
Technique used to split the data
Since I created a raw data file and loaded it into R with read.csv(), I used the [ form of the extract operator to extract rows based on the value of the city column. I also created the raw data file in monthly order for each city, so the order of values in each subset matches by month, enabling a straightforward use of the paired t-test.
# extract temperature values for Dubai
tempData[tempData$city =="Dubai","temperature"]
A wide variety of techniques can be used to subset data from an R data frame, such as the which() function and the sqldf() function.
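As a rough sketch of some base R alternatives, which(), subset() and split() all return the same Dubai temperatures as the [ form. The data frame below is made up for illustration. sqldf() is left out here because it needs the separate sqldf package.

```r
# Hypothetical data frame; illustrative values only
tempData <- data.frame(
  city = rep(c("Dubai", "Chennai"), each = 4),
  month = rep(c("Jan", "Feb", "Mar", "Apr"), times = 2),
  temperature = c(75, 77, 82, 91, 84, 87, 91, 95),
  stringsAsFactors = FALSE
)

# Logical indexing with the [ form of the extract operator
by_bracket <- tempData[tempData$city == "Dubai", "temperature"]

# which() converts the logical test into row numbers first
by_which <- tempData[which(tempData$city == "Dubai"), "temperature"]

# subset() evaluates the condition inside the data frame
by_subset <- subset(tempData, city == "Dubai")$temperature

# split() returns a named list with one temperature vector per city
by_city <- split(tempData$temperature, tempData$city)
by_city$Dubai
```

split() is convenient when every group is needed at once, for example to pass by_city$Dubai and by_city$Chennai straight to t.test() or cor().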
edited Nov 20 '18 at 0:58
answered Nov 19 '18 at 17:28
Len Greski
Thanks a lot Len.
– Ishack Marshook
Nov 20 '18 at 13:25
@IshackMarshook please accept the answer if you found it to be helpful.
– Len Greski
Nov 20 '18 at 13:26
done, Len. Can i ask a few more questions please?
– Ishack Marshook
Nov 20 '18 at 13:28
@IshackMarshook - if your followup questions are directly related to this one, yes. Otherwise, please post a new question and the SO community will answer it. Part of SO etiquette is that a post focuses on a single question.
– Len Greski
Nov 20 '18 at 15:42
Welcome to SO! Please read How to Ask and give a Minimal, Complete, and Verifiable example in your question!
– jogo
Nov 19 '18 at 12:47
You cannot correlate nominal/dichotomous data with continuous. You could do a ?t.test or ?wilcox.test for A1 vs A2 and A1 vs A3. Please read and watch YouTube videos on "correlation" and the two tests I mentioned.
– Andre Elrico
Nov 19 '18 at 12:50
Maybe the accepted answer to Correlations with unordered categorical variables will help.
– Rui Barradas
Nov 19 '18 at 12:56
Can I use t.test instead of the correlation?
– Ishack Marshook
Nov 19 '18 at 14:01
A t-test would test the null hypothesis that the average temperatures between the two cities are equal. A paired t-test would test the null hypothesis that pairs of temperature readings taken at the same time are equal. A correlation would explain the degree to which increases in temperature in Chennai are associated with increases in temperature in Dubai. Which hypothesis do you wish to test?
– Len Greski
Nov 19 '18 at 17:15