Outlaw Ironman 2016 Analysis¶
I have previously created a CSV file after webscraping ironman finish times from a website (see my code here: https://jimhardy-datascience.com/Files/Juypter/OutlawScraping.html)
This was a traithlon I completed in 2016. Anyway, now it was time to do some analysis.
Import Libraries¶
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Load Data¶
In [2]:
df = pd.read_csv('Outlaw2016Results.csv')
Inspect Head and Tail¶
In [3]:
df.head()
Out[3]:
Position | Bib Number | Participant | Category | Club/Company/Sponsor | Finish time | Swim Time | Transition1 | Bike Time | Transition2 | Run Time | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 82 | KGF Dream TeamKGF Dream Team | RLY | KGF Dream Team | 7:47:34 | 48:46 | 5:22 | 4:24:58 | 0:31 | 2:27:57 |
1 | 2 | 37 | RAF Team One | RLY | NaN | 8:33:15 | 50:00 | 1:41 | 5:02:33 | 0:22 | 2:38:39 |
2 | 3 | 78 | KGF UnderdogsKGF Undergods | RLY | KGF Undergods | 8:57:56 | 50:34 | 3:03 | 4:59:54 | 0:24 | 3:04:01 |
3 | 4 | 292 | Chris WalkerKit Walker Triathlon | (M) 25-29 | Kit Walker Triathlon | 8:58:52 | 52:50 | 2:30 | 4:44:26 | 2:18 | 3:16:48 |
4 | 5 | 253 | Joe BeechOnit Cycles | (M) 25-29 | Onit Cycles | 9:13:56 | 1:00:19 | 3:06 | 4:52:24 | 1:47 | 3:16:20 |
In [4]:
df.tail(10)
Out[4]:
Position | Bib Number | Participant | Category | Club/Company/Sponsor | Finish time | Swim Time | Transition1 | Bike Time | Transition2 | Run Time | |
---|---|---|---|---|---|---|---|---|---|---|---|
1042 | DNF | 445 | Razwan IlyasBarracuda Triathlon Club | (M) 35-39 | Barracuda Triathlon Club | - | 2:12:30 | - | - | - | - |
1043 | DNF | 617 | Jonathan Howard | (M) 40-44 | NaN | - | - | - | - | - | - |
1044 | DNF | 74 | Witham Running Club | RLY | NaN | - | - | - | - | - | - |
1045 | DNF | 859 | Tony JefferysPsof | (M) 45-49 | Psof | - | - | - | - | - | - |
1046 | DSQ | 744 | Craig WardWakefield Triathlon Club | (M) 40-44 | Wakefield Triathlon Club | 13:53:41 | 1:21:15 | 5:30 | 6:07:53 | 6:29 | 6:12:34 |
1047 | DSQ | 857 | Paul Jackson | (M) 45-49 | NaN | 16:04:27 | 1:16:09 | 8:27 | 7:39:23 | 6:17 | 6:54:11 |
1048 | QUERY | 360 | James PontingDorking & Mole Valley AC | (M) 30-34 | Dorking & Mole Valley AC | - | 1:10:50 | 6:21 | 5:41:03 | 5:08 | - |
1049 | QUERY | 77 | KGF Den AgainKGF Den Again | RLY | KGF Den Again | - | 1:24:39 | 2:21 | 5:18:23 | 0:38 | - |
1050 | QUERY | 827 | Neil Goodrich | (M) 45-49 | NaN | - | 1:29:08 | 6:38 | 7:30:22 | 7:14 | - |
1051 | QUERY | 563 | Jason Cole | (M) 40-44 | NaN | - | 1:28:01 | 8:41 | 7:34:07 | 9:26 | - |
In [5]:
df = df[df['Position'] != 'DSQ']
# There were 2 people who got disqualified. Remove these rows
In [6]:
df = df[df['Position'] != 'QUERY']
# There were 4 people whose finish was queried. Remove these rows
Move the Relay Teams to Their Own Dataframe¶
In [7]:
df_RelayTeams = df[df['Category'].str.contains('RLY')]
In [8]:
print("The Number of Relay Teams: " , df_RelayTeams.shape[0])
The Number of Relay Teams: 89
In [9]:
df = df[df['Category'] != 'RLY']
# Remove these rows
Split Category Column in Gender and AgeBracekt¶
In [10]:
df[['Gender', 'AgeBracket']] = df['Category'].str.extract(r'\((M|F)\) (\d{2}-\d{2})')
In [11]:
df.drop(columns=['Category'])
Out[11]:
Position | Bib Number | Participant | Club/Company/Sponsor | Finish time | Swim Time | Transition1 | Bike Time | Transition2 | Run Time | Gender | AgeBracket | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 4 | 292 | Chris WalkerKit Walker Triathlon | Kit Walker Triathlon | 8:58:52 | 52:50 | 2:30 | 4:44:26 | 2:18 | 3:16:48 | M | 25-29 |
4 | 5 | 253 | Joe BeechOnit Cycles | Onit Cycles | 9:13:56 | 1:00:19 | 3:06 | 4:52:24 | 1:47 | 3:16:20 | M | 25-29 |
5 | 6 | 261 | Dean EdwardsBridgtown Cycles / carreg.com | Bridgtown Cycles / carreg.com | 9:16:12 | 1:00:15 | 2:15 | 5:00:02 | 1:52 | 3:11:48 | M | 25-29 |
7 | 8 | 272 | Matt LeemanBenfleet Running Club | Benfleet Running Club | 9:31:09 | 51:22 | 1:55 | 5:08:49 | 2:14 | 3:26:49 | M | 25-29 |
10 | 11 | 656 | Marcus McdonaldLichfield Tri | Lichfield Tri | 9:37:16 | 1:02:35 | 3:27 | 5:09:58 | 2:58 | 3:18:18 | M | 40-44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1039 | DNF | 102 | Emily CheshireOxford Tri | Oxford Tri | - | 1:33:01 | 8:00 | - | - | - | F | 25-29 |
1040 | DNF | 219 | Jacqueline Harrison | NaN | - | 1:54:29 | 8:56 | - | - | - | F | 50-54 |
1042 | DNF | 445 | Razwan IlyasBarracuda Triathlon Club | Barracuda Triathlon Club | - | 2:12:30 | - | - | - | - | M | 35-39 |
1043 | DNF | 617 | Jonathan Howard | NaN | - | - | - | - | - | - | M | 40-44 |
1045 | DNF | 859 | Tony JefferysPsof | Psof | - | - | - | - | - | - | M | 45-49 |
957 rows × 12 columns
How Many People 'Did Not Finish', and How Many Quit At Each Discipline¶
In [12]:
DNFSwim = (df['Swim Time'] == '-').sum()
In [13]:
DNFT1 = (df['Transition1'] == '-').sum() - DNFSwim
In [14]:
DNFBike = (df['Bike Time'] == '-').sum() - (DNFSwim + DNFT1)
In [15]:
DNFT2 = (df['Transition2'] == '-').sum() - (DNFSwim + DNFT1 + DNFBike)
In [16]:
DNFRun = (df['Run Time'] == '-').sum() - (DNFSwim + DNFT1 + DNFBike + DNFT2)
In [17]:
data = {
'Discipline': ['Swim', 'Transition 1', 'Bike', 'Transition 2', 'Run'],
'Non-Finishers During This Discipline': [DNFSwim, DNFT1, DNFBike, DNFT2, DNFRun]
}
In [18]:
NonFinishers = pd.DataFrame(data)
In [19]:
print(NonFinishers)
Discipline Non-Finishers During This Discipline 0 Swim 3 1 Transition 1 1 2 Bike 16 3 Transition 2 12 4 Run 22
In [20]:
df = df[df['Position'] != 'DNF']
From the Finishers: Distribution of Men, Women and Age Groups¶
In [21]:
df = df.sort_values(by='Category', ascending=True)
In [22]:
sns.countplot(x = "AgeBracket", hue = "Gender", edgecolor='black', data = df)
plt.ylabel('Number of Triathlets')
plt.title('Triathletes In Each Age Category')
plt.show()
From The Finishers: Distribution of Finish Times¶
In [23]:
df['Seconds_Finish'] = pd.to_timedelta(df['Finish time']).dt.total_seconds()
In [24]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='AgeBracket', y='Seconds_Finish', hue='Gender', data= df)
plt.title('Finish Times (Seconds) by Gender and Age Group')
plt.xlabel('Age Group Category')
plt.ylabel('Seconds')
plt.show()
How Did I Do In Each Discipline?¶
In [25]:
def findPosition(df, participant_name, discipline):
sorted_df = df.sort_values(by=discipline, ascending=True).reset_index(drop=True)
position = sorted_df[sorted_df['Participant'] == participant_name].index[0] + 1
return position
In [26]:
mySwimPosition = findPosition(df, 'Jim Hardy', 'Swim Time')
In [27]:
myBikePosition = findPosition(df, 'Jim Hardy', 'Bike Time')
In [28]:
myRunPosition = findPosition(df, 'Jim Hardy', 'Run Time')
In [29]:
print(f"Swim Ranking: {mySwimPosition}")
Swim Ranking: 260
In [30]:
print(f"Bike Ranking: {myBikePosition}")
Bike Ranking: 292
In [31]:
print(f"Run Ranking: {myRunPosition}")
Run Ranking: 181
In [32]:
myFinishPosition = findPosition(df, 'Jim Hardy', 'Finish time')
In [33]:
print(f"My Overall Ranking: {myFinishPosition}")
My Overall Ranking: 166
Export DataFrames¶
Exporting these dataframes (rather than the origin) for use in Tableau will make things easier.
In [34]:
df.to_csv('IndividualFinishers.csv', index=False)
In [35]:
df_RelayTeams.to_csv('RelayFinishers.csv', index=False)