import os, json
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
To extract the data from the Endomondo export, I first create a list of all workout files.
path_to_json = 'workouts/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
Then, from each file, I extract the data I need and, if the workout is a walk, append it to a list, so every list item is a dict describing one workout.
data = []
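for f in json_files:
    with open(os.path.join(path_to_json, f)) as json_file:
        json_data = json.load(json_file)
    # each file holds a list of small dicts; merge them into one flat dict
    d = {}
    for j in json_data:
        d.update(dict(j))
    # keep only walking workouts
    if d['sport'] == 'WALKING':
        data.append(d)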
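To make the merge step concrete, here is a minimal sketch. It assumes, as the loop above implies, that each workout file holds a list of small dicts that together describe one workout. The sample values and the helper name `flatten_workout` are made up for illustration and are not part of the export or of any library.

```python
def flatten_workout(records):
    """Merge a list of small dicts into one flat dict (later keys win)."""
    flat = {}
    for record in records:
        flat.update(record)
    return flat


# Hypothetical sample, not taken from a real export
sample = [
    {"sport": "WALKING"},
    {"start_time": "2020-02-03 07:45:12.0"},
    {"distance_km": 2.1},
    {"duration_s": 1260},
]

workout = flatten_workout(sample)
print(workout["sport"], workout["distance_km"])
```

Because `dict.update` overwrites existing keys, a field that appears twice in one file would keep only its last value.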
Then I can use the list of dicts to create a data frame.
df = pd.DataFrame(data)
df = df[['start_time', 'distance_km', 'duration_s', 'speed_avg_kmh']]
Next, I fix some data types and units.
df['start_time'] = pd.to_datetime(df['start_time'], format='%Y/%m/%d').dt.normalize()
df['duration_min'] = df['duration_s'] / 60
df.head()
Since I only need data from February and March 2020, I filter out everything before 1 February 2020.
df = df[df['start_time']>='2020-02-01'].sort_values(by='start_time')
df.reset_index(inplace=True, drop=True)
print(f'number of datapoints: {len(df)}')
Since I'm mainly interested in average walking speed, let's see its descriptive statistics.
df[['speed_avg_kmh']].describe()
Then, let's try to visualise the relationship between distance and duration in a scatter plot. We should see a nice linear correlation.
In order to draw the regression line, we need to calculate its formula.
from sklearn.linear_model import LinearRegression
x = df[['distance_km']]   # scikit-learn expects a 2-D array of predictors
y = df[['duration_min']]  # kept 2-D so that intercept_[0] and coef_[0][0] index correctly below
model = LinearRegression().fit(x, y)
Now we can extract the y-intercept and the slope of the line.
intercept = model.intercept_[0]
slope = model.coef_[0][0]
print('intercept:', intercept)
print('slope:', slope)
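Since duration is regressed on distance, the slope is a pace in minutes per kilometre, and 60 divided by the slope gives the corresponding speed in km/h. The sketch below illustrates this on made-up numbers (not the values from the export) and uses `np.polyfit` as a stand-in for the scikit-learn fit above.

```python
import numpy as np

# Hypothetical walks at exactly 6 km/h plus a fixed 1-minute overhead
distance_km = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
duration_min = 1.0 + 10.0 * distance_km

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope, intercept = np.polyfit(distance_km, duration_min, deg=1)

pace_min_per_km = slope            # minutes needed per extra kilometre
implied_speed_kmh = 60.0 / slope   # the same quantity expressed as a speed

print(f"pace: {pace_min_per_km:.2f} min/km, speed: {implied_speed_kmh:.2f} km/h")
```

On real data the speed implied by the slope need not match the mean of `speed_avg_kmh` exactly, because a non-zero intercept absorbs any fixed time per walk.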
And now finally to the visualisation.
fig,ax = plt.subplots(figsize = (7, 7))
# plot the dots for scatter
X = df['distance_km']
Y = df['duration_min']
plt.scatter(X, Y, s=100, alpha=.5, zorder=3)
# plot the regression line over a slightly wider range than the data
# (separate names so the x and y used for fitting are not overwritten)
x_line = np.linspace(X.min() - 1, X.max() + 1, 1000)
y_line = intercept + slope * x_line
plt.plot(x_line, y_line, linestyle='--', label='Regression Line', zorder=2)
# tweaks
ax.set_xlabel('distance km')
ax.set_ylabel('duration min')
ax.set_xlim(0, 3.5)
ax.set_ylim(0, 35)
plt.grid(which='major', zorder=0, color='#E8E8E8')
# plt.savefig('scatter.png', bbox_inches='tight', dpi=200)
plt.show()
We saw earlier that my average walking speed was 6.03 km/h. According to Google, the average walking speed is about 5.2 km/h. So how do I compare? Am I significantly faster than an average person? Let's do a one-sample t-test.
import scipy.stats
pop_mean = 5.2
scipy.stats.ttest_1samp(df['speed_avg_kmh'], pop_mean)
The p-value is very small, so the difference from 5.2 km/h is statistically significant. Since my mean speed lies above that value, we can conclude that I am faster than an average person.
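By default, `ttest_1samp` runs a two-sided test, while the question "am I faster?" is one-sided. As a rough sketch, the one-sided version can be requested with the `alternative` argument, which assumes SciPy 1.6 or newer. The speeds below are made-up illustration values, not the data from the export.

```python
import numpy as np
from scipy import stats

# Hypothetical speeds in km/h, NOT the values from the export
speeds = np.array([5.9, 6.1, 6.0, 6.2, 5.8, 6.3, 6.0, 5.9])
pop_mean = 5.2

# Two-sided: is the mean different from pop_mean?
two_sided = stats.ttest_1samp(speeds, pop_mean)
# One-sided: is the mean greater than pop_mean? (requires SciPy >= 1.6)
one_sided = stats.ttest_1samp(speeds, pop_mean, alternative="greater")

print("two-sided p:", two_sided.pvalue)
print("one-sided p:", one_sided.pvalue)
```

When the t statistic is positive, the one-sided p-value is half the two-sided one, so a very small two-sided result leads to the same conclusion.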