# Thread: How do I apply multivariate regression to my dataset to formulate an equation?

1. ## How do I apply multivariate regression to my dataset to formulate an equation?

I have the following data points (for example)

x, y, z, a, b

Let's say those are each a column in a database (or Excel) and I have 1000 rows of data (so 5000 total values). I need to create an equation that solves for "b". How do I go about doing this? Excel will do it based on a collection of just x and y values, but I don't know whether Excel goes higher than two variables. Does it? If so, how do I do that? DO NOT ASSUME THAT THIS IS A LINEAR EQUATION.

2. ## How do I apply multivariate regression to my dataset to formulate an equation

Say I have the following arbitrary dataset. I want to create an equation that solves for 'b'

| x | y | z | a | b |
|---|---|---|---|---|
| 5 | 1.2 | 20 | 13 | 5.6 |
| 6.2 | 1.6 | 16 | 15 | 4.5 |
| 6 | 1.3 | 18 | 12 | 4.9 |
| 5.8 | 1.7 | 21 | 12 | 6.3 |

How do I go about doing this? Excel will do it based on a collection of just x and y values (using linear or non-linear regression), but I don't know whether Excel goes higher than two variables. Does it? If so, how do I do that?

Are there any known calculators where you give it a dataset and it spits out an equation?

DO NOT ASSUME THAT THIS IS A LINEAR EQUATION.

3. Excel's Data Analysis Tool Pack will do Multiple LINEAR Regression for you.

Excel is sufficiently flexible that you can design any sort of thing you like.
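For reference, here is the model a multiple linear regression fits, written out for the five columns in this thread. The coefficient names $\beta_0,\dots,\beta_4$, the row index $i$, and the error term $\varepsilon_i$ are notation assumed for this sketch, not something from the original posts.

```latex
b_i = \beta_0 + \beta_1 x_i + \beta_2 y_i + \beta_3 z_i + \beta_4 a_i + \varepsilon_i,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}
  \left( b_i - \beta_0 - \beta_1 x_i - \beta_2 y_i - \beta_3 z_i - \beta_4 a_i \right)^2
```

"Linear" here means linear in the coefficients, so a column such as $x^2$ or $\log z$ can be added as an extra predictor without leaving this framework.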

We're really about Math Students, here. The least you can do is show us YOUR work, rather than expecting us to be free consultants.

4. There are infinitely many equations that will reproduce your data. No finite number of values will fully define "an equation" without ambiguity.
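To make the non-uniqueness concrete, here is a minimal sketch using the four sample rows posted above. It assumes a model of the form b = c0 + c1*x + c2*y + c3*z + c4*a (one possible choice among many), fixes the intercept c0 at an arbitrary value, and solves for the remaining four coefficients so that all four rows are reproduced exactly. The helper names `solve`, `exact_fit`, and `predict` are made up for this illustration.

```python
# Four sample rows from the thread: columns x, y, z, a and the target b.
ROWS = [
    (5.0, 1.2, 20.0, 13.0),
    (6.2, 1.6, 16.0, 15.0),
    (6.0, 1.3, 18.0, 12.0),
    (5.8, 1.7, 21.0, 12.0),
]
B = [5.6, 4.5, 4.9, 6.3]


def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def exact_fit(intercept):
    """Coefficients c such that intercept + c . row == b for every sample row."""
    return solve(ROWS, [b - intercept for b in B])


def predict(intercept, coeffs, row):
    """Evaluate the linear model on one row."""
    return intercept + sum(c * v for c, v in zip(coeffs, row))


if __name__ == "__main__":
    # Two different intercepts give two different equations, both exact.
    for intercept in (0.0, 1.0):
        coeffs = exact_fit(intercept)
        worst = max(abs(predict(intercept, coeffs, r) - b) for r, b in zip(ROWS, B))
        print(intercept, [round(c, 4) for c in coeffs], worst)
```

Every choice of intercept yields another equation that passes through all four rows, which is why a finite sample cannot pin down a single equation without a model chosen in advance.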

5. Originally Posted by tkhunny
> There are infinitely many equations that will reproduce your data. No finite number of values will fully define "an equation" without ambiguity.

I understand this. So I need an equation that "best fits" my data within some small margin of error. I'm not expecting it to be perfect.

6. 1) "solves for" means you DO want it to be exact. More precise language is always helpful.
2) You must formulate some sort of model. Where do the data come from? What sort of model might be appropriate? Are there ANY known relationships? How sure are you that all your variables are independent? Are there linear relationships or are they ALL nonlinear? A scatter plot of 'a' vs 'b' might suggest a linear relationship.
3) Are there any dependencies inside a single variable?
4) Excel does just fine with Multiple-LINEAR Regression. It's in the Data Analysis Tool Pack. p-values and everything.
5) Sometimes, you can make non-linear relationships linear by applying a logarithm or square root transformation. There are other sorts of transformations. Can you see any opportunities for transformations?
6) What sort of nonlinear relationships are you anticipating or imagining? Just polynomials of order higher than one (1)? Reciprocals? Rational Functions? Trig Functions? Logistic curves?
7) Ever hear of "R"? You can pursue any sort of GLM with enough R experience: https://www.r-project.org/about.html. To be honest, you'll probably need more data to get very far.
8) Is this a student problem, a personal exploration, or maybe a business requirement? Purpose and audience are just as important as methods.
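As an illustration of point 5, here is a minimal sketch of linearizing with a logarithm. It assumes a power-law relationship y = scale * x**exponent, which becomes a straight line after taking logs of both sides: log(y) = log(scale) + exponent * log(x). The data are hypothetical and noise-free, generated only for this example, and the function names are made up.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_power_law(xs, ys):
    """Fit y = scale * x**exponent by regressing log(y) on log(x)."""
    log_xs = [math.log(x) for x in xs]
    log_ys = [math.log(y) for y in ys]
    intercept, slope = fit_line(log_xs, log_ys)
    # The intercept of the log-log line is log(scale); the slope is the exponent.
    return math.exp(intercept), slope


# Hypothetical, noise-free data generated from y = 2 * x**1.5 (not from the thread).
XS = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
YS = [2.0 * x ** 1.5 for x in XS]

if __name__ == "__main__":
    scale, exponent = fit_power_law(XS, YS)
    print(scale, exponent)
```

The transformation only works when every x and y is positive, and with noisy data the fit minimizes error in log(y) rather than in y, so it is a starting point rather than a final answer.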

We're happy to help, but you have to show your work and give us something to go on. We all have opinions on how one might proceed, but if we are going to get YOUR views that lead to YOUR personal solution, you will have to offer up quite a bit more information.

7. Just noticed that you might find something with f(x+y+z+a) = b. It's not a GREAT linear relationship between the sum and the 'b', but it might be some place to start.

It actually gets a little better if you Square the Sums, f((x+y+z+a)^2).

The sum of the squares is even better! f(x^2 + y^2 + z^2 + a^2).

Plenty of nonlinear interactions in all of that.

Really, though, you're not going to get very far with this VERY TINY dataset.
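The three candidate features mentioned in this post can be compared with a quick sketch like the one below. It uses the four sample rows from the thread and assumes the Pearson correlation with 'b' as the measure of "how linear" each relationship is; the post does not say which measure was actually used, and the names `pearson` and `feature_correlations` are made up for this example.

```python
import math

# Four sample rows from the thread: columns x, y, z, a and the target b.
ROWS = [
    (5.0, 1.2, 20.0, 13.0),
    (6.2, 1.6, 16.0, 15.0),
    (6.0, 1.3, 18.0, 12.0),
    (5.8, 1.7, 21.0, 12.0),
]
B = [5.6, 4.5, 4.9, 6.3]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Each feature collapses one row (x, y, z, a) into a single number.
FEATURES = {
    "sum": lambda row: sum(row),
    "squared sum": lambda row: sum(row) ** 2,
    "sum of squares": lambda row: sum(v * v for v in row),
}


def feature_correlations():
    """Correlation of each candidate feature with b over the sample rows."""
    return {
        name: pearson([feature(row) for row in ROWS], B)
        for name, feature in FEATURES.items()
    }


if __name__ == "__main__":
    for name, r in feature_correlations().items():
        print(f"{name}: r = {r:.3f}")
```

On these four rows the ordering matches the post: the sum correlates least with 'b', the squared sum slightly more, and the sum of squares most. With only four points, none of these correlations is strong evidence of anything.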

8. My dataset is actually about 4.6 million rows of data.

Recording these data points turned out not to be an accurate way to solve my ultimate problem, because after digging deeper I found I had even more variables.

Instead, I decided to go the physics route. It was complex physics, but it ended up solving my problem.