Can you run correlations with categorical variables?

Since this dataset has a good mix of continuous and categorical variables, having something like the x2y metric that can work for any type of variable pair is convenient. We will convert the blanks to NAs so that all the missing values are treated consistently. Also, the rightmost three columns are free-text fields, so we will remove them from the dataframe.
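As a rough sketch of that cleanup, assuming the data has been read into a dataframe called df (the name, and the assumption that the free-text fields are the last three columns, are placeholders rather than details from the original post):

```r
library(dplyr)

# Convert blank strings in character columns to NA so missing values are consistent.
df <- df %>%
  mutate(across(where(is.character), ~ na_if(trimws(.x), "")))

# Drop the rightmost three columns (assumed here to be the free-text fields).
df <- df[, 1:(ncol(df) - 3)]
```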

If we were working with the correlation coefficient, we could easily calculate a confidence interval for it and gauge whether what we are seeing is real. Can we do the same thing for the x2y metric? We can, by using bootstrapping. With these numbers, we can construct a confidence interval easily; this is available as an optional confidence argument in the R functions we have been using (please see the appendix).
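A minimal sketch of the bootstrapping idea, assuming a function x2y(x, y) that returns the metric for a pair of vectors (the actual script exposes this through its optional confidence argument instead):

```r
bootstrap_x2y_ci <- function(x, y, n_boot = 1000, level = 0.95) {
  n <- length(x)
  boot_vals <- replicate(n_boot, {
    idx <- sample(n, replace = TRUE)  # resample rows with replacement
    x2y(x[idx], y[idx])               # recompute the metric on the resample
  })
  alpha <- 1 - level
  quantile(boot_vals, probs = c(alpha / 2, 1 - alpha / 2), na.rm = TRUE)
}
```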

Value: a dataframe in which each row contains the output of running x2y(u, v, confidence) for a pair of variables u and v chosen from the dataframe.

Since this is just a standard R dataframe, it can be sliced, sorted, filtered, plotted, etc.

Update (April 16): I learned from a commenter that a similar approach was proposed earlier, and that the R package ppsr, which implements that approach, is now available on CRAN.
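For example, assuming the pairwise results come from a function dx2y() and carry columns named x, y and x2y (these names are assumptions about the script's interface, so adjust them to match the actual script):

```r
library(dplyr)

results <- dx2y(df)        # all pairwise x2y values for the dataframe df

results %>%
  filter(x2y > 50) %>%     # keep only the stronger associations
  arrange(desc(x2y)) %>%   # strongest first
  head(10)
```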

Two Caveats

Before we demonstrate the x2y metric on a couple of datasets, I want to highlight two aspects of the x2y approach.

[Table: pairwise x2y values for the iris variables Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species.]

Conclusion

Using an insight from Information Theory, we devised a new metric - the x2y metric - that quantifies the strength of the association between pairs of variables.

The x2y metric has several advantages:
- It works for all types of variable pairs: continuous-continuous, continuous-categorical, categorical-continuous and categorical-categorical.
- It captures both linear and non-linear relationships.
- Perhaps best of all, it is easy to understand and use.

I hope you give it a try in your work.

Acknowledgements

Thanks to Amr Farahat for helpful feedback on an earlier draft.

Appendix: How to use the R script

The R script depends on two R packages - rpart and dplyr - so please ensure that they are installed in your environment.
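A minimal setup sketch (the file name x2y.R below is a placeholder for wherever you saved the script):

```r
# Install the two dependencies if they are not already present.
for (pkg in c("rpart", "dplyr")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}

# Load the x2y functions from the script.
source("x2y.R")
```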

How to get the correlation between two categorical variables, and between a categorical variable and a continuous variable?

Please answer the questions below:

1. Which correlation coefficient works best for the above cases?
2. VIF calculation only works for continuous data, so what is the alternative?
3. What assumptions do I need to check before using the correlation coefficient you suggest?

SE is a better place for questions about more theoretical statistics like this. If not, I'd say that the answer to your questions depends on context.

Sometimes it makes sense to flatten multiple levels into dummy variables; other times it is worth modeling your data with a multinomial distribution, etc.
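For instance, flattening a factor into dummy variables can be done with base R's model.matrix() (the variable below is purely illustrative):

```r
# A toy factor with three levels.
d <- data.frame(city = factor(c("Blois", "Tours", "Blois", "Paris")))

# One 0/1 column per non-reference level (the first level becomes the reference).
model.matrix(~ city, data = d)

# Or one column per level, with no intercept.
model.matrix(~ city - 1, data = d)
```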

If yes, this can influence the type of correlation you want to look for.

Small hiccough: the smaller the p-value, the better the "fit" between the two variables, not the other way around.

There is also Cramér's V, a measure of association that follows from this test.

Example: Suppose we have two variables, gender (male and female) and city (Blois and Tours). We observed the following data. Are gender and city independent?

So our expected values are the following. We then run the chi-squared test, and the resulting p-value can be seen as a measure of the association between these two variables.
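Since the observed and expected counts did not survive in this copy of the answer, here is a sketch with made-up placeholder counts showing how the chi-squared test and Cramér's V would be computed in R:

```r
# Placeholder counts (illustrative only, not the original answer's data):
obs <- matrix(c(20, 30,
                25, 25),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("male", "female"),
                              city   = c("Blois", "Tours")))

test <- chisq.test(obs, correct = FALSE)  # no continuity correction, to match the plain formula
test$expected   # expected counts under independence
test$p.value    # small values suggest gender and city are associated

# Cramér's V derived from the chi-squared statistic.
n <- sum(obs)
k <- min(nrow(obs), ncol(obs))
sqrt(unname(test$statistic) / (n * (k - 1)))
```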

Example: We want to study the relationship between the fat absorbed by donuts and the type of fat used to produce the donuts (the example is taken from here). Is there any dependence between the variables?

Based on more research, I found out about polyserial and polychoric correlation. How is your approach better than these?

They are techniques for estimating the correlation between two latent variables from two observed variables.
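Returning to the donut example above: a common way to test for dependence between a continuous variable and a categorical one is a one-way ANOVA. The numbers below are made-up placeholders, not the original experiment's values:

```r
# Grams of fat absorbed per batch, for three types of fat (placeholder data).
donuts <- data.frame(
  fat_type = rep(c("A", "B", "C"), each = 4),
  absorbed = c(64, 72, 68, 77,  78, 91, 97, 82,  75, 93, 78, 71)
)

# One-way ANOVA: does mean absorption differ across fat types?
fit <- aov(absorbed ~ fat_type, data = donuts)
summary(fit)  # a small p-value for fat_type indicates dependence between the variables
```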



