Home Viewing Data in Python
Post
Cancel

Viewing Data in Python

The Pandas package is the go-to package in Python for data frames and data set analysis.

Let’s load in a dataset from a URL and then explore the data. We can enter the URL to the data set and save it as a variable so that it is easily accessible. Since the data set is a .csv file, we can use the pandas function read_csv() to save the data into a data frame!

1
2
3
import pandas as pd # we nickname it pd to save time when calling it later
url = 'https://raw.githubusercontent.com/jstaf/gapminder/master/gapminder/gapminder.csv'
df = pd.read_csv(url)

We name the data set df which is short for data frame. This is just a common name used for naming data, but any name could be used depending on preference.

The head() function is often used to show the first few rows of a data frame. The default is the first five rows, but you can enter a different value inside the brackets to get the first n number of rows.

1
df.head()
1
2
3
4
5
6
##        country continent  year  lifeExp       pop   gdpPercap
## 0  Afghanistan      Asia  1952   28.801   8425333  779.445314
## 1  Afghanistan      Asia  1957   30.332   9240934  820.853030
## 2  Afghanistan      Asia  1962   31.997  10267083  853.100710
## 3  Afghanistan      Asia  1967   34.020  11537966  836.197138
## 4  Afghanistan      Asia  1972   36.088  13079460  739.981106

But if we only wanted to see the first 3 rows…

1
df.head(3)
1
2
3
4
##        country continent  year  lifeExp       pop   gdpPercap
## 0  Afghanistan      Asia  1952   28.801   8425333  779.445314
## 1  Afghanistan      Asia  1957   30.332   9240934  820.853030
## 2  Afghanistan      Asia  1962   31.997  10267083  853.100710

Remember, Python starts at 0 and not 1, so the first row of data is technically zero! Also, notice that pandas noticed that the first row of the csv file was header names, so it automatically created column titles.

We can also do this for the last few rows of the data set, know as the tail.

1
df.tail()
1
2
3
4
5
6
##        country continent  year  lifeExp       pop   gdpPercap
## 1699  Zimbabwe    Africa  1987   62.351   9216418  706.157306
## 1700  Zimbabwe    Africa  1992   60.377  10704340  693.420786
## 1701  Zimbabwe    Africa  1997   46.809  11404948  792.449960
## 1702  Zimbabwe    Africa  2002   39.989  11926563  672.038623
## 1703  Zimbabwe    Africa  2007   43.487  12311143  469.709298

We can now explore the data a bit more, looking at the data types and data structures.

We can use the .info() function to get a summary of data frame information, such as the count of null values, data types, etc. Take a look below.

1
df.info()
1
2
3
4
5
6
7
8
9
10
11
12
13
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1704 entries, 0 to 1703
## Data columns (total 6 columns):
##  #   Column     Non-Null Count  Dtype  
## ---  ------     --------------  -----  
##  0   country    1704 non-null   object 
##  1   continent  1704 non-null   object 
##  2   year       1704 non-null   int64  
##  3   lifeExp    1704 non-null   float64
##  4   pop        1704 non-null   int64  
##  5   gdpPercap  1704 non-null   float64
## dtypes: float64(2), int64(2), object(2)
## memory usage: 80.0+ KB

We can also quickly see summary statistics using the .describe() function.

1
df.describe()
1
2
3
4
5
6
7
8
9
##              year      lifeExp           pop      gdpPercap
## count  1704.00000  1704.000000  1.704000e+03    1704.000000
## mean   1979.50000    59.474439  2.960121e+07    7215.327081
## std      17.26533    12.917107  1.061579e+08    9857.454543
## min    1952.00000    23.599000  6.001100e+04     241.165876
## 25%    1965.75000    48.198000  2.793664e+06    1202.060309
## 50%    1979.50000    60.712500  7.023596e+06    3531.846988
## 75%    1993.25000    70.845500  1.958522e+07    9325.462346
## max    2007.00000    82.603000  1.318683e+09  113523.132900

If we wanted to know the number of unique values of a specific column, we could use the value_counts() function.

1
df.country.value_counts()
1
2
3
4
5
6
7
8
9
10
11
12
## Afghanistan          12
## Pakistan             12
## New Zealand          12
## Nicaragua            12
## Niger                12
##                      ..
## Eritrea              12
## Equatorial Guinea    12
## El Salvador          12
## Egypt                12
## Zimbabwe             12
## Name: country, Length: 142, dtype: int64

This shows that there are 12 occurrences of each unique country value.

If we wanted to know the number of countries recorded for each year, how would we find that information?

We can do so like this:

1
df.year.value_counts()
1
2
3
4
5
6
7
8
9
10
11
12
13
## 1952    142
## 1957    142
## 1962    142
## 1967    142
## 1972    142
## 1977    142
## 1982    142
## 1987    142
## 1992    142
## 1997    142
## 2002    142
## 2007    142
## Name: year, dtype: int64

If we wanted to find specific values, let’s say the maximum values, we could use the function .max().

1
df.max()
1
2
3
4
5
6
7
## country         Zimbabwe
## continent        Oceania
## year                2007
## lifeExp           82.603
## pop           1318683096
## gdpPercap    113523.1329
## dtype: object

Other similar functions include .min(), .mean(), and .median().

If we wanted to group by specific data, let’s say country, and then apply an aggregate function, we could do it like this:

1
df.groupby('country').max()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
##                    continent  year  lifeExp       pop     gdpPercap
## country                                                            
## Afghanistan             Asia  2007   43.828  31889923    978.011439
## Albania               Europe  2007   76.423   3600523   5937.029526
## Algeria               Africa  2007   72.301  33333216   6223.367465
## Angola                Africa  2007   42.731  12420476   5522.776375
## Argentina           Americas  2007   75.320  40301927  12779.379640
## ...                      ...   ...      ...       ...           ...
## Vietnam                 Asia  2007   74.249  85262356   2441.576404
## West Bank and Gaza      Asia  2007   73.422   4018332   7110.667619
## Yemen, Rep.             Asia  2007   62.698  22211743   2280.769906
## Zambia                Africa  2007   51.821  11746035   1777.077318
## Zimbabwe              Africa  2007   62.351  12311143    799.362176
## 
## [142 rows x 5 columns]

We can also group by multiple columns.

1
df.groupby(['country', 'year']).max()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
##                  continent  lifeExp       pop   gdpPercap
## country     year                                         
## Afghanistan 1952      Asia   28.801   8425333  779.445314
##             1957      Asia   30.332   9240934  820.853030
##             1962      Asia   31.997  10267083  853.100710
##             1967      Asia   34.020  11537966  836.197138
##             1972      Asia   36.088  13079460  739.981106
## ...                    ...      ...       ...         ...
## Zimbabwe    1987    Africa   62.351   9216418  706.157306
##             1992    Africa   60.377  10704340  693.420786
##             1997    Africa   46.809  11404948  792.449960
##             2002    Africa   39.989  11926563  672.038623
##             2007    Africa   43.487  12311143  469.709298
## 
## [1704 rows x 4 columns]

If we wanted to know the number of countries in each continent recorded for each year, how would we find that information?

1
df.groupby(['year']).continent.value_counts()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
## year  continent
## 1952  Africa       52
##       Asia         33
##       Europe       30
##       Americas     25
##       Oceania       2
## 1957  Africa       52
##       Asia         33
##       Europe       30
##       Americas     25
##       Oceania       2
## 1962  Africa       52
##       Asia         33
##       Europe       30
##       Americas     25
##       Oceania       2
## 1967  Africa       52
##       Asia         33
##       Europe       30
##       Americas     25
##       Oceania       2
## 1972  Africa       52
##       Asia         33
##       Europe       30
##       Americas     25
##       Oceania       2
## 1977  Africa       52
##       Asia         33
##       Europe       30
##       Americas     25
##       Oceania       2
## 1982  Africa       52
##       Asia         33
##       Europe       30
##       Americas     25
##       Oceania       2
## 1987  Africa       52
##       Asia         33
##       Europe       30
##       Americas     25
##       Oceania       2
## 1992  Africa       52
##       Asia         33
##       Europe       30
##       Americas     25
##       Oceania       2
## 1997  Africa       52
##       Asia         33
##       Europe       30
##       Americas     25
##       Oceania       2
## 2002  Africa       52
##       Asia         33
##       Europe       30
##       Americas     25
##       Oceania       2
## 2007  Africa       52
##       Asia         33
##       Europe       30
##       Americas     25
##       Oceania       2
## Name: continent, dtype: int64

Notice that we did not change anything in our original data that we loaded in all of the code blocks above. We just simply viewed the data in different ways. If you wanted to clean the data or create a permanent copy of the grouped values, you would have to save it as a variable by typing a name on the left hand side, the equals sign, and then the action you want saved on the right hand side. For example:

1
df2 = df.groupby(['year']).continent.value_counts()

Here we just did the same thing as earlier, but saved it to a new variable called df2.

This post is licensed under CC BY 4.0 by the author.