
Basic Model Fitting in Python

In previous posts we looked at cleaning, summarizing, and subsetting data, with some minor calculations along the way, but we haven't yet analyzed our data.

Python is a very powerful tool for data analysis. As in R, we can fit linear models and plot the results. First, we will walk through some basic model-fitting steps in Python.

Let’s re-load in our Gapminder data:

import pandas as pd
url = 'https://raw.githubusercontent.com/jstaf/gapminder/master/gapminder/gapminder.csv'
df = pd.read_csv(url)

Let’s say we want to fit a linear model to see if there is a relationship between population and time. First, let’s do this with the entire dataset with all countries.

Note that OLS stands for ordinary least squares.
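Concretely, OLS chooses the intercept and slope that minimize the sum of squared residuals. For a single predictor there is a closed-form solution, which we can sketch with a few illustrative numbers (not Gapminder data) to see what `sm.OLS` is computing under the hood:

```python
import numpy as np

# Closed-form OLS for one predictor:
#   slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # slope ≈ 1.94, intercept ≈ 0.15
```

This is exactly the line that `sm.OLS(y, x).fit()` returns for one predictor; statsmodels generalizes the same idea to multiple predictors and adds the inference statistics shown in the summary table.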

import statsmodels.api as sm
# define our x and y variables for clarity
y = df['pop']
x = df['year']
x = sm.add_constant(x)  # add a constant column so the model fits an intercept
model = sm.OLS(y, x).fit()
model.summary()
<table class="simpletable">
<caption>OLS Regression Results</caption>
<tr>
  <th>Dep. Variable:</th>           <td>pop</td>       <th>  R-squared:         </th> <td>   0.007</td> 
</tr>
<tr>
  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.006</td> 
</tr>
<tr>
  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   11.61</td> 
</tr>
<tr>
  <th>Date:</th>             <td>Thu, 13 Jul 2023</td> <th>  Prob (F-statistic):</th> <td>0.000672</td> 
</tr>
<tr>
  <th>Time:</th>                 <td>16:33:32</td>     <th>  Log-Likelihood:    </th> <td> -33902.</td> 
</tr>
<tr>
  <th>No. Observations:</th>      <td>  1704</td>      <th>  AIC:               </th> <td>6.781e+04</td>
</tr>
<tr>
  <th>Df Residuals:</th>          <td>  1702</td>      <th>  BIC:               </th> <td>6.782e+04</td>
</tr>
<tr>
  <th>Df Model:</th>              <td>     1</td>      <th>                     </th>     <td> </td>    
</tr>
<tr>
  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>    
</tr>
</table>
<table class="simpletable">
<tr>
    <td></td>       <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  
</tr>
<tr>
  <th>const</th> <td>-9.722e+08</td> <td> 2.94e+08</td> <td>   -3.306</td> <td> 0.001</td> <td>-1.55e+09</td> <td>-3.95e+08</td>
</tr>
<tr>
  <th>year</th>  <td> 5.061e+05</td> <td> 1.49e+05</td> <td>    3.407</td> <td> 0.001</td> <td> 2.15e+05</td> <td> 7.97e+05</td>
</tr>
</table>
<table class="simpletable">
<tr>
  <th>Omnibus:</th>       <td>2403.823</td> <th>  Durbin-Watson:     </th>  <td>   0.187</td> 
</tr>
<tr>
  <th>Prob(Omnibus):</th>  <td> 0.000</td>  <th>  Jarque-Bera (JB):  </th> <td>438354.240</td>
</tr>
<tr>
  <th>Skew:</th>           <td> 8.286</td>  <th>  Prob(JB):          </th>  <td>    0.00</td> 
</tr>
<tr>
  <th>Kurtosis:</th>       <td>79.807</td>  <th>  Cond. No.          </th>  <td>2.27e+05</td> 
</tr>
</table><br/><br/>Notes:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 2.27e+05. This might indicate that there are<br/>strong multicollinearity or other numerical problems.

Let’s try fitting another linear model, but using only data from Africa.

df_AF = df[df['continent'] == 'Africa']
df_AF.head()
##     country continent  year  lifeExp       pop    gdpPercap
## 24  Algeria    Africa  1952   43.077   9279525  2449.008185
## 25  Algeria    Africa  1957   45.685  10270856  3013.976023
## 26  Algeria    Africa  1962   48.303  11000948  2550.816880
## 27  Algeria    Africa  1967   51.407  12760499  3246.991771
## 28  Algeria    Africa  1972   54.518  14760787  4182.663766
y = df_AF['pop']
x = df_AF['year']
x = sm.add_constant(x)  # add an intercept term, as before
model1 = sm.OLS(y, x).fit()
model1.summary()
<table class="simpletable">
<caption>OLS Regression Results</caption>
<tr>
  <th>Dep. Variable:</th>           <td>pop</td>       <th>  R-squared:         </th> <td>   0.074</td> 
</tr>
<tr>
  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.072</td> 
</tr>
<tr>
  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   49.66</td> 
</tr>
<tr>
  <th>Date:</th>             <td>Thu, 13 Jul 2023</td> <th>  Prob (F-statistic):</th> <td>4.87e-12</td> 
</tr>
<tr>
  <th>Time:</th>                 <td>16:33:33</td>     <th>  Log-Likelihood:    </th> <td> -11192.</td> 
</tr>
<tr>
  <th>No. Observations:</th>      <td>   624</td>      <th>  AIC:               </th> <td>2.239e+04</td>
</tr>
<tr>
  <th>Df Residuals:</th>          <td>   622</td>      <th>  BIC:               </th> <td>2.240e+04</td>
</tr>
<tr>
  <th>Df Model:</th>              <td>     1</td>      <th>                     </th>     <td> </td>    
</tr>
<tr>
  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>    
</tr>
</table>
<table class="simpletable">
<tr>
    <td></td>       <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  
</tr>
<tr>
  <th>const</th> <td>-4.728e+08</td> <td> 6.85e+07</td> <td>   -6.902</td> <td> 0.000</td> <td>-6.07e+08</td> <td>-3.38e+08</td>
</tr>
<tr>
  <th>year</th>  <td> 2.438e+05</td> <td> 3.46e+04</td> <td>    7.047</td> <td> 0.000</td> <td> 1.76e+05</td> <td> 3.12e+05</td>
</tr>
</table>
<table class="simpletable">
<tr>
  <th>Omnibus:</th>       <td>477.344</td> <th>  Durbin-Watson:     </th> <td>   0.284</td>
</tr>
<tr>
  <th>Prob(Omnibus):</th> <td> 0.000</td>  <th>  Jarque-Bera (JB):  </th> <td>7771.738</td>
</tr>
<tr>
  <th>Skew:</th>          <td> 3.336</td>  <th>  Prob(JB):          </th> <td>    0.00</td>
</tr>
<tr>
  <th>Kurtosis:</th>      <td>18.950</td>  <th>  Cond. No.          </th> <td>2.27e+05</td>
</tr>
</table><br/><br/>Notes:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 2.27e+05. This might indicate that there are<br/>strong multicollinearity or other numerical problems.
This post is licensed under CC BY 4.0 by the author.