Weather analysis
In this tutorial we use the Python Pandas and Matplotlib packages to analyse and visualise weather data. Time series graphs, scatter plots, histograms and box-and-whisker plots are created using matplotlib functions. Pandas functions are used to read the data file, display summary information and rename columns.
The data used in this tutorial comes from the Australian Bureau of Meteorology. We focus on the minimum and maximum daily temperatures for Adelaide, Australia, in the first three months of 2022.
Note that the code in this tutorial is taken from a Jupyter notebook, where commands are processed in cells and the results displayed. However, the code can easily be adapted to run from a Python IDE such as IDLE or PyCharm. The main changes required are to add a print statement to display table results and a plt.show() command to display graphs.
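As a rough sketch of that adaptation, the script below prints a table explicitly and finishes with plt.show(). The three-row dataframe is typed in by hand purely to illustrate the layout; the name sample is an assumption and does not appear elsewhere in the tutorial.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hand-typed three-day sample, used only to illustrate the script layout.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
    "Minimum": [22.5, 19.3, 14.1],
})

# A notebook displays the value of the last expression in a cell.
# A script does not, so the table has to be printed explicitly.
print(sample.head())

fig, ax = plt.subplots()
ax.plot(sample["Date"], sample["Minimum"])
ax.set_ylabel("Temperature (°C)")

# In a script, the figure window is only opened by plt.show().
plt.show()
```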
Importing packages
This tutorial uses the Pandas package to read the data from the source file into a dataframe. Graphical representations of the data, including histograms, box plots and time series graphs are created using functions from the Matplotlib package.
import pandas as pd
import matplotlib.pyplot as plt
Reading the data
We begin by reading the weather data for January 2022. The following options are used:
- the dayfirst option lets the reader know that the dates are given in Australian/European format where the days are given first (by default the reader uses the US format where the month is written first).
- the parse_dates option indicates which columns should be converted into dates.
df1 = pd.read_csv("data/IDCJDW5081.202201.csv", dayfirst=True, parse_dates=['Date'])
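To see what the dayfirst option changes, the sketch below parses the same text twice, once with dayfirst=True and once with the default. The CSV text and its temperature values are hypothetical, chosen so that every date is ambiguous between the two conventions.

```python
import io
import pandas as pd

# Hypothetical CSV text: each date could be read day-first or month-first.
csv_text = "Date,Minimum\n01/02/2022,15.0\n02/02/2022,16.0\n03/02/2022,17.0\n"

day_first = pd.read_csv(io.StringIO(csv_text), dayfirst=True, parse_dates=["Date"])
month_first = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# With dayfirst=True, 01/02/2022 is read as 1 February 2022.
print(day_first["Date"].dt.month.tolist())

# With the default, the same text is read as 2 January 2022.
print(month_first["Date"].dt.month.tolist())
```

Neither call raises an error, so a wrong setting can go unnoticed. Checking the parsed months against the expected range is a cheap safeguard.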
Next we print out the first five rows of the data, restricting the view to the first three columns, which contain the date, minimum temperature and maximum temperature.
df1[0:5][df1.columns[0:3]]
 | Date | Minimum temperature (°C) | Maximum temperature (°C) |
---|---|---|---|
0 | 2022-01-01 | 22.5 | 33.6 |
1 | 2022-01-02 | 19.3 | 30.6 |
2 | 2022-01-03 | 14.1 | 25.9 |
3 | 2022-01-04 | 14.2 | 24.4 |
4 | 2022-01-05 | 14.4 | 21.5 |
Renaming columns
To make it easier to refer to the minimum and maximum temperature columns, we give these two columns shorter labels. Calling the info function then prints a summary of the data stored in the dataframe. Notice that the minimum and maximum temperature columns have been renamed.
df1.rename(columns={df1.columns[1]: "Minimum", df1.columns[2]: "Maximum"}, inplace=True)
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 31 non-null datetime64[ns]
1 Minimum 31 non-null float64
2 Maximum 31 non-null float64
3 Rainfall (mm) 31 non-null float64
4 Evaporation (mm) 0 non-null float64
5 Sunshine (hours) 0 non-null float64
6 Direction of maximum wind gust 31 non-null object
7 Speed of maximum wind gust (km/h) 31 non-null int64
8 Time of maximum wind gust 31 non-null object
9 9am Temperature (°C) 31 non-null float64
10 9am relative humidity (%) 31 non-null int64
11 9am cloud amount (oktas) 0 non-null float64
12 9am wind direction 31 non-null object
13 9am wind speed (km/h) 31 non-null object
14 9am MSL pressure (hPa) 31 non-null float64
15 3pm Temperature (°C) 31 non-null float64
16 3pm relative humidity (%) 31 non-null int64
17 3pm cloud amount (oktas) 0 non-null float64
18 3pm wind direction 31 non-null object
19 3pm wind speed (km/h) 31 non-null int64
20 3pm MSL pressure (hPa) 31 non-null float64
dtypes: datetime64[ns](1), float64(11), int64(4), object(5)
memory usage: 5.2+ KB
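The rename call above modifies df1 directly because inplace=True is set. One alternative, sketched here on a small hand-typed dataframe, is to leave inplace unset so that rename returns a new dataframe and the original labels stay available. The names raw, short_names and renamed are illustrative only.

```python
import pandas as pd

# Small stand-in frame that uses the original column labels from the data file.
raw = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01", "2022-01-02"]),
    "Minimum temperature (°C)": [22.5, 19.3],
    "Maximum temperature (°C)": [33.6, 30.6],
})

# Map the labels at positions 1 and 2 to shorter names.
short_names = {raw.columns[1]: "Minimum", raw.columns[2]: "Maximum"}

# Without inplace=True, rename returns a new dataframe and leaves raw untouched.
renamed = raw.rename(columns=short_names)

print(list(raw.columns))
print(list(renamed.columns))
```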
Time series graphs
The first graphs we create are time series graphs, which display the change in minimum and maximum temperatures over time. They are created using the plot_date function, which takes two sets of values: the first corresponds to the dates in the Date column, and the second to the values in the minimum temperature column.
The autofmt_xdate function ensures that the dates are displayed in an appropriate manner.
fig, ax = plt.subplots()
ax.set_title("Minimum temperatures for January 2022")
ax.set_xlabel("Date")
ax.set_ylabel("Temperature (°C)")
fig.autofmt_xdate()
ax.plot_date(df1[["Date"]], df1[["Minimum"]], linestyle="solid", markersize=0)
[<matplotlib.lines.Line2D at 0x1da7f8a6a30>]
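Newer Matplotlib releases steer users away from plot_date towards the ordinary plot function, which accepts datetime values directly. The following is a minimal sketch of the same kind of graph using plot, not the method used in this tutorial. It uses the first five January rows, typed in from the table shown earlier.

```python
import pandas as pd
import matplotlib.pyplot as plt

# First five January rows, copied by hand from the table shown earlier.
jan = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03",
                            "2022-01-04", "2022-01-05"]),
    "Minimum": [22.5, 19.3, 14.1, 14.2, 14.4],
})

fig, ax = plt.subplots()
ax.set_title("Minimum temperatures, 1 to 5 January 2022")
ax.set_xlabel("Date")
ax.set_ylabel("Temperature (°C)")

# plot handles datetime x-values and draws a solid line with no markers by default.
lines = ax.plot(jan["Date"], jan["Minimum"])

# Rotate and right-align the date labels so they do not overlap.
fig.autofmt_xdate()
```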
The same process is used to display the maximum temperatures over time.
fig, ax = plt.subplots()
ax.set_title("Maximum temperatures for January 2022")
ax.set_xlabel("Date")
ax.set_ylabel("Temperature (°C)")
fig.autofmt_xdate()
ax.plot_date(df1[["Date"]], df1[["Maximum"]], linestyle="solid", markersize=0, color="orange")
[<matplotlib.lines.Line2D at 0x1da7db9be50>]
Scatter plots
Scatter plots can be used to compare two sets of data values. In this case we create a scatter plot to compare the daily minimum and maximum temperatures.
fig, ax = plt.subplots()
ax.set_title("Comparison of maximum and minimum temperatures")
ax.set_xlabel("Minimum temperature (°C)")
ax.set_ylabel("Maximum temperature (°C)")
ax.scatter(df1[["Minimum"]], df1[["Maximum"]], marker="x")
<matplotlib.collections.PathCollection at 0x1da7dc8c160>
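The plotting calls in this tutorial select columns with double brackets, which returns a one-column dataframe. Single brackets return a Series instead. The sketch below, using the five January rows shown earlier, illustrates the difference and passes Series to scatter. The name temps is illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# First five January rows, copied by hand from the table shown earlier.
temps = pd.DataFrame({
    "Minimum": [22.5, 19.3, 14.1, 14.2, 14.4],
    "Maximum": [33.6, 30.6, 25.9, 24.4, 21.5],
})

as_frame = temps[["Minimum"]]   # double brackets: DataFrame with one column
as_series = temps["Minimum"]    # single brackets: one-dimensional Series
print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)

fig, ax = plt.subplots()
ax.set_xlabel("Minimum temperature (°C)")
ax.set_ylabel("Maximum temperature (°C)")

# scatter pairs the values by position: one marker per day.
points = ax.scatter(temps["Minimum"], temps["Maximum"], marker="x")
```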
Combining data
Data can be read from multiple data sources and then combined into a single dataframe. For this example we combine the weather data from January 2022 with data from February 2022 and March 2022.
df2 = pd.read_csv("data/IDCJDW5081.202202.csv", dayfirst=True, parse_dates=['Date'])
df3 = pd.read_csv("data/IDCJDW5081.202203.csv", dayfirst=True, parse_dates=['Date'])
df2.rename(columns={df2.columns[1]: "Minimum", df2.columns[2]: "Maximum"}, inplace=True)
df3.rename(columns={df3.columns[1]: "Minimum", df3.columns[2]: "Maximum"}, inplace=True)
df = pd.concat([df1, df2, df3])
df[df.columns[0:3]]
 | Date | Minimum | Maximum |
---|---|---|---|
0 | 2022-01-01 | 22.5 | 33.6 |
1 | 2022-01-02 | 19.3 | 30.6 |
2 | 2022-01-03 | 14.1 | 25.9 |
3 | 2022-01-04 | 14.2 | 24.4 |
4 | 2022-01-05 | 14.4 | 21.5 |
… | … | … | … |
26 | 2022-03-27 | 16.9 | 31.8 |
27 | 2022-03-28 | 16.4 | 27.4 |
28 | 2022-03-29 | 12.5 | 24.0 |
29 | 2022-03-30 | 13.6 | 22.9 |
30 | 2022-03-31 | 14.6 | 21.0 |
90 rows × 3 columns
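The row labels in the combined table run from 0 to 30 for January and then start again, because concat keeps each month's original index. If a single running index is preferred, one option is the ignore_index argument of concat. The sketch below uses two tiny stand-in frames built from rows shown in the table above. The names jan and mar are illustrative.

```python
import pandas as pd

# Two-row stand-ins for the monthly dataframes, using values from the table above.
jan = pd.DataFrame({"Date": pd.to_datetime(["2022-01-01", "2022-01-02"]),
                    "Minimum": [22.5, 19.3]})
mar = pd.DataFrame({"Date": pd.to_datetime(["2022-03-30", "2022-03-31"]),
                    "Minimum": [13.6, 14.6]})

# Default behaviour: each frame keeps its own row labels, so labels repeat.
kept = pd.concat([jan, mar])
print(kept.index.tolist())

# ignore_index=True discards the old labels and numbers the rows from 0.
renumbered = pd.concat([jan, mar], ignore_index=True)
print(renumbered.index.tolist())
```

Repeated labels do not affect the plots in this tutorial, but they matter when selecting rows by label, since one label can then match several rows.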
df[["Minimum", "Maximum"]].describe()
 | Minimum | Maximum |
---|---|---|
count | 90.000000 | 90.000000 |
mean | 17.063333 | 28.116667 |
std | 3.524344 | 4.399240 |
min | 11.500000 | 21.000000 |
25% | 14.625000 | 24.400000 |
50% | 16.600000 | 27.550000 |
75% | 19.150000 | 31.775000 |
max | 27.200000 | 40.300000 |
Histograms
Histograms are used to show the distribution of continuous data. In this section we create histograms to display the distribution of minimum and maximum temperatures.
We begin by creating a histogram to display the minimum temperatures.
plt.hist(df[["Minimum"]], edgecolor="k", alpha=0.4)
plt.xlabel("Temperature (°C)")
plt.ylabel("Frequency")
plt.title("Distribution of minimum temperatures")
Text(0.5, 1.0, 'Distribution of minimum temperatures')
Whilst this graph shows the distribution of temperatures quite clearly, the automatic selection of bins (the lower and upper limits of each of the columns in the histogram) is not ideal. In particular it is difficult to see what the exact limits of the bins are. We can improve this by setting these values.
In the code below we set the bins for the minimum and maximum temperature histograms. This is done using a list comprehension.
minbins = [2*x for x in range(5, 15)]
maxbins = [2*x for x in range(10, 21)]
minbins
[10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
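As a small sketch of what these lists represent, the code below rebuilds the same edges with the step argument of range and pairs consecutive edges into bins. It also compares the largest maximum temperature from the summary table with the last edge, since hist does not count values that fall outside the outermost edges. The names min_intervals and overall_max are illustrative.

```python
minbins = [2 * x for x in range(5, 15)]
maxbins = [2 * x for x in range(10, 21)]

# range with a step of 2 produces the same edges without the multiplication.
print(minbins == list(range(10, 30, 2)))
print(maxbins == list(range(20, 42, 2)))

# Ten edges define nine bins: each bin runs from one edge to the next.
min_intervals = list(zip(minbins[:-1], minbins[1:]))
print(len(min_intervals), min_intervals[0], min_intervals[-1])

# Largest value in the "Maximum" column, copied from the summary table.
overall_max = 40.3

# False here means this value lies above the last edge and would not be counted.
print(overall_max <= maxbins[-1])
```

If the 40.3 °C day should appear in the histogram, one option is to extend the list by one edge, for example with range(10, 22), which adds a bin from 40 to 42.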
plt.hist(df[["Minimum"]], bins=minbins, edgecolor="k", alpha=0.4)
plt.xlabel("Temperature (°C)")
plt.ylabel("Frequency")
plt.title("Distribution of minimum temperatures")
Text(0.5, 1.0, 'Distribution of minimum temperatures')
Combining graphs
Multiple graphs can be displayed using subplots.
In the example below we display histograms for minimum and maximum temperatures, showing the graphs side by side.
- The first argument of the subplots function defines the number of rows of graphs.
- The second argument of the subplots function defines the number of columns of graphs.
- The figsize option defines the size of the resulting figure containing the graphs. In this case the resultant figure will be 8 inches across, by 6 inches high.
- The sharey option indicates that the two graphs will share the same scale for the y-axis.
The suptitle command sets a title for all graphs within the subplots.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 6), sharey=True)
ax1.hist(df[["Minimum"]], bins=minbins, edgecolor='k', alpha=0.4)
ax1.set_xlabel("Minimum temperature")
ax2.hist(df[["Maximum"]], bins=maxbins, edgecolor='k', alpha=0.4, color="red")
ax2.set_xlabel("Maximum temperature")
ax1.set_ylabel("Frequency")
plt.suptitle("Distribution of minimum and maximum temperatures")
Text(0.5, 0.98, 'Distribution of minimum and maximum temperatures')
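The effect of the sharey option can be checked directly. In the sketch below the two panels hold hypothetical values with different peak frequencies, yet both report the same y-axis limits. The data are made up for illustration and are not taken from the weather file.

```python
import matplotlib.pyplot as plt

# One row, two columns; the figure is 8 inches wide and 6 inches high.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 6), sharey=True)

# Hypothetical values, chosen so the tallest bar differs between the panels.
ax1.hist([11, 13, 13, 15], bins=[10, 12, 14, 16], edgecolor="k", alpha=0.4)
ax2.hist([21, 21, 21, 25], bins=[20, 22, 24, 26], edgecolor="k", alpha=0.4)

# With sharey=True both panels use one y-axis scale, so their limits agree.
print(ax1.get_ylim() == ax2.get_ylim())

ax1.set_ylabel("Frequency")
fig.suptitle("Shared y-axis example")
```

Sharing the scale makes bar heights directly comparable between the panels. Without it, each panel would scale to its own tallest bar.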
Box and whisker plots
Box and whisker plots are created using the boxplot function. In the following example we create a box plot showing the distribution of minimum temperatures. The following options are used:
- vert determines whether or not to display the box plots vertically. In this case we set the option to False, meaning the box plots will be displayed horizontally.
- labels takes a list of strings. These strings are used for the boxplot labels.
Outliers are displayed as a circle beyond the whiskers. In this case there is one outlier corresponding to the minimum temperature of 27.2°C.
fig, ax = plt.subplots()
ax.boxplot(df["Minimum"], vert=False, labels=["Minimum"])
ax.set_title("Minimum temperature distribution")
ax.set_xlabel("Temperature")
Text(0.5, 0, 'Temperature')
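By default, boxplot draws a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box (the whis option, which defaults to 1.5). The sketch below applies that rule to the quartiles of the minimum temperatures reported in the summary table earlier, to show why 27.2°C is flagged. The variable names are illustrative.

```python
# Quartiles and extremes of the "Minimum" column, copied from the summary table.
q1, q3 = 14.625, 19.15
lowest, highest = 11.5, 27.2

iqr = q3 - q1                   # interquartile range: the width of the box
lower_fence = q1 - 1.5 * iqr    # values below this are drawn as outliers
upper_fence = q3 + 1.5 * iqr    # values above this are drawn as outliers

print(round(lower_fence, 4), round(upper_fence, 4))

# The warmest minimum lies beyond the upper fence, so it is drawn as an outlier.
print(highest > upper_fence)

# The coolest minimum lies inside the lower fence, so it is not.
print(lowest < lower_fence)
```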
Parallel boxplots
Parallel boxplots are created by passing multiple lists of values to the first input of the boxplot function.
fig, ax = plt.subplots()
ax.boxplot(df[["Minimum", "Maximum"]], vert=False, labels=["Minimum", "Maximum"])
ax.set_title("Comparison of temperature distributions")
ax.set_xlabel("Temperature")
Text(0.5, 0, 'Temperature')
Hiding outliers
Outliers can be hidden in boxplots by setting the showfliers option to False.
fig, ax = plt.subplots()
ax.boxplot(df[["Minimum", "Maximum"]], vert=False, labels=["Minimum", "Maximum"], showfliers=False)
ax.set_title("Comparison of temperature distributions")
ax.set_xlabel("Temperature")
Text(0.5, 0, 'Temperature')