How to split a data frame in R to conduct statistical tests? [closed]



























I have 3 variables, A1, A2 and A3:





  • A1 is temperature


  • A2 is month


  • A3 is location


A2 has 2 months, March and May. A3 has 2 cities, Chennai and Dubai.

But when I compute a correlation between A1 and A3, I get an error:




cor(A1,A3, method = "pearson")
'y' must be numeric



How can I fix this, please?



Many Thanks,
Ishack

























closed as off-topic by jogo, Andre Elrico, Rui Barradas, Sven Hohenstein, phiver Nov 19 '18 at 17:34


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: How to create a Minimal, Complete, and Verifiable example." – jogo, Sven Hohenstein, phiver

If this question can be reworded to fit the rules in the help center, please edit the question.









  • Welcome to SO! Please read How to Ask and give a Minimal, Complete, and Verifiable example in your question!

    – jogo
    Nov 19 '18 at 12:47






  • You cannot correlate nominal/dichotomous data with continuous data. You could do a ?t.test or ?wilcox.test for A1 vs A2 and A1 vs A3. Please read and watch YouTube videos on "correlation" and the two tests I mentioned.

    – Andre Elrico
    Nov 19 '18 at 12:50











  • Maybe the accepted answer to Correlations with unordered categorical variables will help.

    – Rui Barradas
    Nov 19 '18 at 12:56













  • Can I use t.test instead of the correlation?

    – Ishack Marshook
    Nov 19 '18 at 14:01











  • A t-test would test the null hypothesis that the average temperatures between the two cities are equal. A paired t-test would test the null hypothesis that pairs of temperature readings taken at the same time are equal. A correlation would explain the degree to which increases in temperature in Chennai are associated with increases in temperature in Dubai. Which hypothesis do you wish to test?

    – Len Greski
    Nov 19 '18 at 17:15
















r dataframe split correlation














edited Nov 19 '18 at 13:26









Ned











asked Nov 19 '18 at 12:32









Ishack Marshook

















1 Answer














There are many ways to split the data, but the first question to answer is "what hypothesis do I wish to test?"



Here is example code using monthly average daily high temperatures in Chennai and Dubai from timeanddate.com.



# monthly average high temperatures, 2005 - 2015, collected from:
# https://www.timeanddate.com/weather/india/chennai/climate
# https://www.timeanddate.com/weather/united-arab-emirates/dubai/climate
rawData <- "
temperature,month,city
75,Jan,Dubai
78,Feb,Dubai
83,Mar,Dubai
92,Apr,Dubai
100,May,Dubai
103,Jun,Dubai
106,Jul,Dubai
107,Aug,Dubai
102,Sep,Dubai
96,Oct,Dubai
87,Nov,Dubai
79,Dec,Dubai
86,Jan,Chennai
89,Feb,Chennai
93,Mar,Chennai
97,Apr,Chennai
102,May,Chennai
100,Jun,Chennai
97,Jul,Chennai
95,Aug,Chennai
95,Sep,Chennai
92,Oct,Chennai
87,Nov,Chennai
86,Dec,Chennai"

tempData <- read.csv(text = rawData)

# t-test for average temperatures
t.test(tempData[tempData$city == "Dubai", "temperature"],
       tempData[tempData$city == "Chennai", "temperature"],
       paired = FALSE)

# paired t-test
t.test(tempData[tempData$city == "Dubai", "temperature"],
       tempData[tempData$city == "Chennai", "temperature"],
       paired = TRUE)

# correlation
cor(tempData[tempData$city == "Dubai", "temperature"],
    tempData[tempData$city == "Chennai", "temperature"])


Two Sample t-test



The two-sample t-test tests the null hypothesis that the two group means are equal, treating the observations in the two groups as independent and ignoring any pairing between them. In the temperature data the pairing is by time (month), but pairing may be based on other characteristics, e.g. twins in a study where one member of each pair is randomly assigned to the test group and the other to the control group. Because var.equal defaults to FALSE, R runs the Welch version of the test, which does not assume equal variances.



> t.test(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city == "Chennai","temperature"],
+ paired=FALSE)

Welch Two Sample t-test

data: tempData[tempData$city == "Dubai", "temperature"] and tempData[tempData$city == "Chennai", "temperature"]
t = -0.24817, df = 15.546, p-value = 0.8073
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.765568 6.932235
sample estimates:
mean of x mean of y
92.33333 93.25000


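Since 0 is within the 95% confidence interval (and the p-value of 0.81 is well above 0.05), we fail to reject the null hypothesis that there is no difference in the monthly average high temperatures between Chennai and Dubai. This does not prove that the means are equal; it only says that these data provide no evidence of a difference.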
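When the data stay in a single data frame, the same Welch test can also be requested through the formula interface of t.test(), which splits the temperature column by city internally. The following is only a sketch: it assumes the temperature and city column names used above, and the names temps, byFormula and byHand are made up for this illustration.

```r
# Rebuild the example data so that this block runs on its own
temps <- data.frame(
  temperature = c(75, 78, 83, 92, 100, 103, 106, 107, 102, 96, 87, 79,
                  86, 89, 93, 97, 102, 100, 97, 95, 95, 92, 87, 86),
  month = rep(month.abb, times = 2),
  city = rep(c("Dubai", "Chennai"), each = 12)
)

# Formula interface: numeric response on the left,
# two-level grouping variable on the right
byFormula <- t.test(temperature ~ city, data = temps)

# Same test with the two groups extracted by hand, as in the answer
byHand <- t.test(temps[temps$city == "Dubai", "temperature"],
                 temps[temps$city == "Chennai", "temperature"])

# The formula version orders the groups alphabetically (Chennai first),
# so the sign of t flips, but the p-value is the same
byFormula$p.value
byHand$p.value
```

The formula form avoids repeating the subsetting expression, at the cost of letting R decide which city is treated as the first group.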



Paired t-test



The paired t-test calculates the difference between the pairs of observations and tests the null hypothesis that the average difference is 0.



> # paired t-test
> t.test(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city == "Chennai","temperature"],
+ paired=TRUE)

Paired t-test

data: tempData[tempData$city == "Dubai", "temperature"] and tempData[tempData$city == "Chennai", "temperature"]
t = -0.39555, df = 11, p-value = 0.7
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.017343 4.184009
sample estimates:
mean of the differences
-0.9166667


Since 0 is within the 95% confidence interval, we fail to reject the null hypothesis that the mean of the month-by-month differences in average high temperature between Dubai and Chennai is 0.
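To make the "difference between the pairs" concrete, the paired test can be reproduced as a one-sample t-test on the month-by-month differences. This is a sketch using the same 24 values as above; the vector names dubai, chennai and monthlyDiff are illustrative, not part of the original code.

```r
# Monthly average highs, January through December, for each city
dubai   <- c(75, 78, 83, 92, 100, 103, 106, 107, 102, 96, 87, 79)
chennai <- c(86, 89, 93, 97, 102, 100, 97, 95, 95, 92, 87, 86)

# Month-by-month differences (Dubai minus Chennai)
monthlyDiff <- dubai - chennai

# One-sample t-test of the null hypothesis that the mean difference is 0
oneSample <- t.test(monthlyDiff, mu = 0)

# Paired t-test on the two original vectors
paired <- t.test(dubai, chennai, paired = TRUE)

mean(monthlyDiff)      # the "mean of the differences" reported by the paired test
oneSample$statistic    # same t statistic as paired$statistic
oneSample$p.value      # same p-value as paired$p.value
```

This only works because both vectors are in the same month order; if the rows were shuffled, the differences would pair up the wrong months.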



Correlation



The Pearson correlation measures the strength of the linear relationship between two numeric variables. It ranges from -1 to 1, where -1 is a perfect negative correlation, 0 is no linear correlation, and 1 is a perfect positive correlation.



> cor(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city =="Chennai","temperature"])
[1] 0.7929018
>


A correlation of 0.79 indicates a strong positive linear relationship between monthly average high temperatures in Dubai and Chennai.
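The error in the question ('y' must be numeric) arises because cor() only accepts numeric input, so a city label cannot be passed to it directly. One possible workaround, not used in the code above, is to code the two cities as 0 and 1; the Pearson correlation with such a 0/1 variable is usually called a point-biserial correlation, and its significance test is equivalent to the equal-variance two-sample t-test. A minimal sketch, with temps, isDubai and the other object names chosen purely for illustration:

```r
# Rebuild the example data so that this block runs on its own
temps <- data.frame(
  temperature = c(75, 78, 83, 92, 100, 103, 106, 107, 102, 96, 87, 79,
                  86, 89, 93, 97, 102, 100, 97, 95, 95, 92, 87, 86),
  city = rep(c("Dubai", "Chennai"), each = 12)
)

# Code the two-level city variable as 1 (Dubai) / 0 (Chennai)
isDubai <- as.numeric(temps$city == "Dubai")

# Point-biserial correlation: Pearson correlation with the 0/1 indicator
r <- cor(temps$temperature, isDubai)

# Its significance test matches the pooled (equal-variance) t-test
corTest <- cor.test(temps$temperature, isDubai)
pooled  <- t.test(temperature ~ city, data = temps, var.equal = TRUE)

r
corTest$p.value
pooled$p.value
```

Note that this answers a different question from the city-to-city correlation above: it describes how strongly temperature differs between the two cities, not whether the two cities warm and cool together.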



Technique used to split the data



Since I stored the raw data in a character string and loaded it into R with read.csv(text = ...), I used the [ form of the extract operator to extract rows based on the value of the city column. I also entered the raw data in monthly order for each city, so the order of values in each subset matches by month, enabling a straightforward use of the paired t-test.



# extract temperature values for Dubai
tempData[tempData$city =="Dubai","temperature"]


A wide variety of techniques can be used to subset data from an R data frame, such as the base R functions which(), subset() and split(), or the sqldf() function from the sqldf package.
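As an illustration of some base R alternatives, the sketch below splits the temperature column by city with split() and checks that subset() and which() return the same values as the [ operator. The object names (temps, tempByCity, viaSubset, viaWhich) are made up for this example.

```r
# Rebuild the example data so that this block runs on its own
temps <- data.frame(
  temperature = c(75, 78, 83, 92, 100, 103, 106, 107, 102, 96, 87, 79,
                  86, 89, 93, 97, 102, 100, 97, 95, 95, 92, 87, 86),
  month = rep(month.abb, times = 2),
  city = rep(c("Dubai", "Chennai"), each = 12)
)

# split() returns a named list with one numeric vector per city,
# keeping the original row order within each city
tempByCity <- split(temps$temperature, temps$city)
names(tempByCity)

# The list elements can be passed straight to a test
t.test(tempByCity$Dubai, tempByCity$Chennai, paired = TRUE)

# Two other base R ways to extract the Dubai temperatures
viaSubset <- subset(temps, city == "Dubai")$temperature
viaWhich  <- temps$temperature[which(temps$city == "Dubai")]

identical(tempByCity$Dubai, viaSubset)
identical(viaSubset, viaWhich)
```

split() is convenient when there are more than two groups, since every group is extracted in a single call.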
































  • Thanks a lot Len.

    – Ishack Marshook
    Nov 20 '18 at 13:25











  • @IshackMarshook please accept the answer if you found it to be helpful.

    – Len Greski
    Nov 20 '18 at 13:26











  • Done, Len. Can I ask a few more questions, please?

    – Ishack Marshook
    Nov 20 '18 at 13:28











  • @IshackMarshook - if your followup questions are directly related to this one, yes. Otherwise, please post a new question and the SO community will answer it. Part of SO etiquette is that a post focuses on a single question.

    – Len Greski
    Nov 20 '18 at 15:42


















1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









0














There are many ways to split the data, but the first question to answer is "what hypothesis do I wish to test?"



Here is example code using average daily high temperatures in Chennai and Dubai from timeanddate.com



# data collected from average high temperatures collected from 2005 - 2015
# https://www.timeanddate.com/weather/india/chennai/climate
# https://www.timeanddate.com/weather/united-arab-emirates/dubai/climate
rawData <- "
temperature,month,city
75,Jan,Dubai
78,Feb,Dubai
83,Mar,Dubai
92,Apr,Dubai
100,May,Dubai
103,Jun,Dubai
106,Jul,Dubai
107,Aug,Dubai
102,Sep,Dubai
96,Oct,Dubai
87,Nov,Dubai
79,Dec,Dubai
86,Jan,Chennai
89,Feb,Chennai
93,Mar,Chennai
97,Apr,Chennai
102,May,Chennai
100,Jun,Chennai
97,Jul,Chennai
95,Aug,Chennai
95,Sep,Chennai
92,Oct,Chennai
87,Nov,Chennai
86,Dec,Chennai"

tempData <- read.csv(text=rawData)

# t-test for average temperatures
t.test(tempData[tempData$city =="Dubai","temperature"],
tempData[tempData$city == "Chennai","temperature"],
paired=FALSE)

# paired t-test
t.test(tempData[tempData$city =="Dubai","temperature"],
tempData[tempData$city == "Chennai","temperature"],
paired=TRUE)
# correlation
cor(tempData[tempData$city =="Dubai","temperature"],
tempData[tempData$city =="Chennai","temperature"])


Two Sample t-test



The two sample t-test tests the null hypothesis that the two means are equal, irrespective of the association between data collected between the two groups in the test. Sometimes the association between two groups may be based on time (as in the case of the temperature data), but the pairing may be based on other characteristics (e.g. twins in a study that has test and control groups where each pair of twins is randomly assigned to test and control groups).



> t.test(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city == "Chennai","temperature"],
+ paired=FALSE)

Welch Two Sample t-test

data: tempData[tempData$city == "Dubai", "temperature"] and tempData[tempData$city == "Chennai", "temperature"]
t = -0.24817, df = 15.546, p-value = 0.8073
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.765568 6.932235
sample estimates:
mean of x mean of y
92.33333 93.25000


Since 0 is within the 95% confidence interval, we accept the null hypothesis that there is no difference in the monthly average high temperatures between Chennai and Dubai.



Paired t-test



The paired t-test calculates the difference between the pairs of observations and tests the null hypothesis that the average difference is 0.



> # paired t-test
> t.test(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city == "Chennai","temperature"],
+ paired=TRUE)

Paired t-test

data: tempData[tempData$city == "Dubai", "temperature"] and tempData[tempData$city == "Chennai", "temperature"]
t = -0.39555, df = 11, p-value = 0.7
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.017343 4.184009
sample estimates:
mean of the differences
-0.9166667


Since 0 is within the 95% confidence interval, we accept the null hypothesis that there is no difference in the monthly average high temperatures between Chennai and Dubai, when the test is conducted on the differences in pairs of monthly average high temperature values.



Correlation



The Pearson correlation measures the strength of the linear relationship between two variables, -1.0 = perfect negative correlation, 0 = no linear correlation, and 1 = perfect positive correlation.



> cor(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city =="Chennai","temperature"])
[1] 0.7929018
>


A correlation of 0.79 indicates a strong positive linear relationship between monthly average high temperatures in Dubai and Chennai.



Technique used to split the data



Since I created a raw data file and loaded it into R with read.csv(), I used the [ form of the extract operator to extract rows based on the value of the city column. I also created the raw data file in monthly order for each city, so the order of values in each subset matches by month, enabling a straightforward use of the pairwise t-test.



# extract temperature values for Dubai
tempData[tempData$city =="Dubai","temperature"]


A wide variety of techniques can be used to subset data from an R data frame, such as the which() function and the sqldf() function.






share|improve this answer


























  • Thanks a lot Len.

    – Ishack Marshook
    Nov 20 '18 at 13:25











  • @IshackMarshook please accept the answer if you found it to be helpful.

    – Len Greski
    Nov 20 '18 at 13:26











  • done, Len. Can i ask a few more questions please?

    – Ishack Marshook
    Nov 20 '18 at 13:28











  • @IshackMarshook - if your followup questions are directly related to this one, yes. Otherwise, please post a new question and the SO community will answer it. Part of SO etiquette is that a post focuses on a single question.

    – Len Greski
    Nov 20 '18 at 15:42
















0














There are many ways to split the data, but the first question to answer is "what hypothesis do I wish to test?"



Here is example code using average daily high temperatures in Chennai and Dubai from timeanddate.com



# data collected from average high temperatures collected from 2005 - 2015
# https://www.timeanddate.com/weather/india/chennai/climate
# https://www.timeanddate.com/weather/united-arab-emirates/dubai/climate
rawData <- "
temperature,month,city
75,Jan,Dubai
78,Feb,Dubai
83,Mar,Dubai
92,Apr,Dubai
100,May,Dubai
103,Jun,Dubai
106,Jul,Dubai
107,Aug,Dubai
102,Sep,Dubai
96,Oct,Dubai
87,Nov,Dubai
79,Dec,Dubai
86,Jan,Chennai
89,Feb,Chennai
93,Mar,Chennai
97,Apr,Chennai
102,May,Chennai
100,Jun,Chennai
97,Jul,Chennai
95,Aug,Chennai
95,Sep,Chennai
92,Oct,Chennai
87,Nov,Chennai
86,Dec,Chennai"

tempData <- read.csv(text=rawData)

# t-test for average temperatures
t.test(tempData[tempData$city =="Dubai","temperature"],
tempData[tempData$city == "Chennai","temperature"],
paired=FALSE)

# paired t-test
t.test(tempData[tempData$city =="Dubai","temperature"],
tempData[tempData$city == "Chennai","temperature"],
paired=TRUE)
# correlation
cor(tempData[tempData$city =="Dubai","temperature"],
tempData[tempData$city =="Chennai","temperature"])


Two Sample t-test



The two sample t-test tests the null hypothesis that the two means are equal, irrespective of the association between data collected between the two groups in the test. Sometimes the association between two groups may be based on time (as in the case of the temperature data), but the pairing may be based on other characteristics (e.g. twins in a study that has test and control groups where each pair of twins is randomly assigned to test and control groups).



> t.test(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city == "Chennai","temperature"],
+ paired=FALSE)

Welch Two Sample t-test

data: tempData[tempData$city == "Dubai", "temperature"] and tempData[tempData$city == "Chennai", "temperature"]
t = -0.24817, df = 15.546, p-value = 0.8073
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.765568 6.932235
sample estimates:
mean of x mean of y
92.33333 93.25000


Since 0 is within the 95% confidence interval, we accept the null hypothesis that there is no difference in the monthly average high temperatures between Chennai and Dubai.



Paired t-test



The paired t-test calculates the difference between the pairs of observations and tests the null hypothesis that the average difference is 0.



> # paired t-test
> t.test(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city == "Chennai","temperature"],
+ paired=TRUE)

Paired t-test

data: tempData[tempData$city == "Dubai", "temperature"] and tempData[tempData$city == "Chennai", "temperature"]
t = -0.39555, df = 11, p-value = 0.7
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.017343 4.184009
sample estimates:
mean of the differences
-0.9166667


Since 0 is within the 95% confidence interval, we accept the null hypothesis that there is no difference in the monthly average high temperatures between Chennai and Dubai, when the test is conducted on the differences in pairs of monthly average high temperature values.



Correlation



The Pearson correlation measures the strength of the linear relationship between two variables, -1.0 = perfect negative correlation, 0 = no linear correlation, and 1 = perfect positive correlation.



> cor(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city =="Chennai","temperature"])
[1] 0.7929018
>


A correlation of 0.79 indicates a strong positive linear relationship between monthly average high temperatures in Dubai and Chennai.



Technique used to split the data



Since I created a raw data file and loaded it into R with read.csv(), I used the [ form of the extract operator to extract rows based on the value of the city column. I also created the raw data file in monthly order for each city, so the order of values in each subset matches by month, enabling a straightforward use of the pairwise t-test.



# extract temperature values for Dubai
tempData[tempData$city =="Dubai","temperature"]


A wide variety of techniques can be used to subset data from an R data frame, such as the which() function and the sqldf() function.






share|improve this answer


























  • Thanks a lot Len.

    – Ishack Marshook
    Nov 20 '18 at 13:25











  • @IshackMarshook please accept the answer if you found it to be helpful.

    – Len Greski
    Nov 20 '18 at 13:26











  • done, Len. Can i ask a few more questions please?

    – Ishack Marshook
    Nov 20 '18 at 13:28











  • @IshackMarshook - if your followup questions are directly related to this one, yes. Otherwise, please post a new question and the SO community will answer it. Part of SO etiquette is that a post focuses on a single question.

    – Len Greski
    Nov 20 '18 at 15:42














0












0








0







There are many ways to split the data, but the first question to answer is "what hypothesis do I wish to test?"



Here is example code using average daily high temperatures in Chennai and Dubai from timeanddate.com



# data collected from average high temperatures collected from 2005 - 2015
# https://www.timeanddate.com/weather/india/chennai/climate
# https://www.timeanddate.com/weather/united-arab-emirates/dubai/climate
rawData <- "
temperature,month,city
75,Jan,Dubai
78,Feb,Dubai
83,Mar,Dubai
92,Apr,Dubai
100,May,Dubai
103,Jun,Dubai
106,Jul,Dubai
107,Aug,Dubai
102,Sep,Dubai
96,Oct,Dubai
87,Nov,Dubai
79,Dec,Dubai
86,Jan,Chennai
89,Feb,Chennai
93,Mar,Chennai
97,Apr,Chennai
102,May,Chennai
100,Jun,Chennai
97,Jul,Chennai
95,Aug,Chennai
95,Sep,Chennai
92,Oct,Chennai
87,Nov,Chennai
86,Dec,Chennai"

tempData <- read.csv(text=rawData)

# t-test for average temperatures
t.test(tempData[tempData$city =="Dubai","temperature"],
tempData[tempData$city == "Chennai","temperature"],
paired=FALSE)

# paired t-test
t.test(tempData[tempData$city =="Dubai","temperature"],
tempData[tempData$city == "Chennai","temperature"],
paired=TRUE)
# correlation
cor(tempData[tempData$city =="Dubai","temperature"],
tempData[tempData$city =="Chennai","temperature"])


Two Sample t-test



The two sample t-test tests the null hypothesis that the two means are equal, irrespective of the association between data collected between the two groups in the test. Sometimes the association between two groups may be based on time (as in the case of the temperature data), but the pairing may be based on other characteristics (e.g. twins in a study that has test and control groups where each pair of twins is randomly assigned to test and control groups).



> t.test(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city == "Chennai","temperature"],
+ paired=FALSE)

Welch Two Sample t-test

data: tempData[tempData$city == "Dubai", "temperature"] and tempData[tempData$city == "Chennai", "temperature"]
t = -0.24817, df = 15.546, p-value = 0.8073
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.765568 6.932235
sample estimates:
mean of x mean of y
92.33333 93.25000


Since 0 is within the 95% confidence interval, we accept the null hypothesis that there is no difference in the monthly average high temperatures between Chennai and Dubai.



Paired t-test



The paired t-test calculates the difference between the pairs of observations and tests the null hypothesis that the average difference is 0.



> # paired t-test
> t.test(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city == "Chennai","temperature"],
+ paired=TRUE)

Paired t-test

data: tempData[tempData$city == "Dubai", "temperature"] and tempData[tempData$city == "Chennai", "temperature"]
t = -0.39555, df = 11, p-value = 0.7
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.017343 4.184009
sample estimates:
mean of the differences
-0.9166667


Since 0 is within the 95% confidence interval, we accept the null hypothesis that there is no difference in the monthly average high temperatures between Chennai and Dubai, when the test is conducted on the differences in pairs of monthly average high temperature values.



Correlation



The Pearson correlation measures the strength of the linear relationship between two variables, -1.0 = perfect negative correlation, 0 = no linear correlation, and 1 = perfect positive correlation.



> cor(tempData[tempData$city =="Dubai","temperature"],
+ tempData[tempData$city =="Chennai","temperature"])
[1] 0.7929018
>


A correlation of 0.79 indicates a strong positive linear relationship between monthly average high temperatures in Dubai and Chennai.



Technique used to split the data



Since I created a raw data file and loaded it into R with read.csv(), I used the [ form of the extract operator to extract rows based on the value of the city column. I also created the raw data file in monthly order for each city, so the order of values in each subset matches by month, enabling a straightforward use of the pairwise t-test.



# extract temperature values for Dubai
tempData[tempData$city =="Dubai","temperature"]


A wide variety of techniques can be used to subset data from an R data frame, such as the which() function and the sqldf() function.






share|improve this answer






























edited Nov 20 '18 at 0:58

























answered Nov 19 '18 at 17:28









Len Greski













  • Thanks a lot Len.

    – Ishack Marshook
    Nov 20 '18 at 13:25











  • @IshackMarshook please accept the answer if you found it to be helpful.

    – Len Greski
    Nov 20 '18 at 13:26











  • Done, Len. Can I ask a few more questions, please?

    – Ishack Marshook
    Nov 20 '18 at 13:28











  • @IshackMarshook - if your followup questions are directly related to this one, yes. Otherwise, please post a new question and the SO community will answer it. Part of SO etiquette is that a post focuses on a single question.

    – Len Greski
    Nov 20 '18 at 15:42





















