Outlaw Ironman 2016 Analysis¶

I have previously created a CSV file after webscraping ironman finish times from a website (see my code here: https://jimhardy-datascience.com/Files/Juypter/OutlawScraping.html)

This was a traithlon I completed in 2016. Anyway, now it was time to do some analysis.

Import Libraries¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load Data¶

In [2]:
df = pd.read_csv('Outlaw2016Results.csv')

Inspect Head and Tail¶

In [3]:
df.head()
Out[3]:
Position Bib Number Participant Category Club/Company/Sponsor Finish time Swim Time Transition1 Bike Time Transition2 Run Time
0 1 82 KGF Dream TeamKGF Dream Team RLY KGF Dream Team 7:47:34 48:46 5:22 4:24:58 0:31 2:27:57
1 2 37 RAF Team One RLY NaN 8:33:15 50:00 1:41 5:02:33 0:22 2:38:39
2 3 78 KGF UnderdogsKGF Undergods RLY KGF Undergods 8:57:56 50:34 3:03 4:59:54 0:24 3:04:01
3 4 292 Chris WalkerKit Walker Triathlon (M) 25-29 Kit Walker Triathlon 8:58:52 52:50 2:30 4:44:26 2:18 3:16:48
4 5 253 Joe BeechOnit Cycles (M) 25-29 Onit Cycles 9:13:56 1:00:19 3:06 4:52:24 1:47 3:16:20
In [4]:
df.tail(10)
Out[4]:
Position Bib Number Participant Category Club/Company/Sponsor Finish time Swim Time Transition1 Bike Time Transition2 Run Time
1042 DNF 445 Razwan IlyasBarracuda Triathlon Club (M) 35-39 Barracuda Triathlon Club - 2:12:30 - - - -
1043 DNF 617 Jonathan Howard (M) 40-44 NaN - - - - - -
1044 DNF 74 Witham Running Club RLY NaN - - - - - -
1045 DNF 859 Tony JefferysPsof (M) 45-49 Psof - - - - - -
1046 DSQ 744 Craig WardWakefield Triathlon Club (M) 40-44 Wakefield Triathlon Club 13:53:41 1:21:15 5:30 6:07:53 6:29 6:12:34
1047 DSQ 857 Paul Jackson (M) 45-49 NaN 16:04:27 1:16:09 8:27 7:39:23 6:17 6:54:11
1048 QUERY 360 James PontingDorking & Mole Valley AC (M) 30-34 Dorking & Mole Valley AC - 1:10:50 6:21 5:41:03 5:08 -
1049 QUERY 77 KGF Den AgainKGF Den Again RLY KGF Den Again - 1:24:39 2:21 5:18:23 0:38 -
1050 QUERY 827 Neil Goodrich (M) 45-49 NaN - 1:29:08 6:38 7:30:22 7:14 -
1051 QUERY 563 Jason Cole (M) 40-44 NaN - 1:28:01 8:41 7:34:07 9:26 -
In [5]:
df = df[df['Position'] != 'DSQ']

# There were 2 people who got disqualified. Remove these rows
In [6]:
df = df[df['Position'] != 'QUERY']

# There were 4 people whose finish was queried. Remove these rows

Move the Relay Teams to Their Own Dataframe¶

In [7]:
df_RelayTeams = df[df['Category'].str.contains('RLY')]
In [8]:
print("The Number of Relay Teams: " , df_RelayTeams.shape[0])
The Number of Relay Teams:  89
In [9]:
df = df[df['Category'] != 'RLY']


# Remove these rows

Split Category Column in Gender and AgeBracekt¶

In [10]:
df[['Gender', 'AgeBracket']] = df['Category'].str.extract(r'\((M|F)\) (\d{2}-\d{2})')
In [11]:
df.drop(columns=['Category'])
Out[11]:
Position Bib Number Participant Club/Company/Sponsor Finish time Swim Time Transition1 Bike Time Transition2 Run Time Gender AgeBracket
3 4 292 Chris WalkerKit Walker Triathlon Kit Walker Triathlon 8:58:52 52:50 2:30 4:44:26 2:18 3:16:48 M 25-29
4 5 253 Joe BeechOnit Cycles Onit Cycles 9:13:56 1:00:19 3:06 4:52:24 1:47 3:16:20 M 25-29
5 6 261 Dean EdwardsBridgtown Cycles / carreg.com Bridgtown Cycles / carreg.com 9:16:12 1:00:15 2:15 5:00:02 1:52 3:11:48 M 25-29
7 8 272 Matt LeemanBenfleet Running Club Benfleet Running Club 9:31:09 51:22 1:55 5:08:49 2:14 3:26:49 M 25-29
10 11 656 Marcus McdonaldLichfield Tri Lichfield Tri 9:37:16 1:02:35 3:27 5:09:58 2:58 3:18:18 M 40-44
... ... ... ... ... ... ... ... ... ... ... ... ...
1039 DNF 102 Emily CheshireOxford Tri Oxford Tri - 1:33:01 8:00 - - - F 25-29
1040 DNF 219 Jacqueline Harrison NaN - 1:54:29 8:56 - - - F 50-54
1042 DNF 445 Razwan IlyasBarracuda Triathlon Club Barracuda Triathlon Club - 2:12:30 - - - - M 35-39
1043 DNF 617 Jonathan Howard NaN - - - - - - M 40-44
1045 DNF 859 Tony JefferysPsof Psof - - - - - - M 45-49

957 rows × 12 columns

How Many People 'Did Not Finish', and How Many Quit At Each Discipline¶

In [12]:
DNFSwim = (df['Swim Time'] == '-').sum()
In [13]:
DNFT1 = (df['Transition1'] == '-').sum() - DNFSwim
In [14]:
DNFBike = (df['Bike Time'] == '-').sum() - (DNFSwim + DNFT1)
In [15]:
DNFT2 = (df['Transition2'] == '-').sum() - (DNFSwim + DNFT1 + DNFBike)
In [16]:
DNFRun = (df['Run Time'] == '-').sum() - (DNFSwim + DNFT1 + DNFBike + DNFT2)
In [17]:
data = {
    'Discipline': ['Swim', 'Transition 1', 'Bike', 'Transition 2', 'Run'],
    'Non-Finishers During This Discipline': [DNFSwim, DNFT1, DNFBike, DNFT2, DNFRun]
}
In [18]:
NonFinishers = pd.DataFrame(data)
In [19]:
print(NonFinishers)
     Discipline  Non-Finishers During This Discipline
0          Swim                                     3
1  Transition 1                                     1
2          Bike                                    16
3  Transition 2                                    12
4           Run                                    22
In [20]:
df = df[df['Position'] != 'DNF']

From the Finishers: Distribution of Men, Women and Age Groups¶

In [21]:
df = df.sort_values(by='Category', ascending=True)
In [22]:
sns.countplot(x = "AgeBracket", hue = "Gender", edgecolor='black', data = df)
plt.ylabel('Number of Triathlets')
plt.title('Triathletes In Each Age Category')
plt.show()
No description has been provided for this image

From The Finishers: Distribution of Finish Times¶

In [23]:
df['Seconds_Finish'] = pd.to_timedelta(df['Finish time']).dt.total_seconds()
In [24]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='AgeBracket', y='Seconds_Finish', hue='Gender', data= df)
plt.title('Finish Times (Seconds) by Gender and Age Group')
plt.xlabel('Age Group Category')
plt.ylabel('Seconds')
plt.show()
No description has been provided for this image

How Did I Do In Each Discipline?¶

In [25]:
def findPosition(df, participant_name, discipline):
    sorted_df = df.sort_values(by=discipline, ascending=True).reset_index(drop=True)
    position = sorted_df[sorted_df['Participant'] == participant_name].index[0] + 1
    return position
In [26]:
mySwimPosition = findPosition(df, 'Jim Hardy', 'Swim Time')
In [27]:
myBikePosition = findPosition(df, 'Jim Hardy', 'Bike Time')
In [28]:
myRunPosition = findPosition(df, 'Jim Hardy', 'Run Time')
In [29]:
print(f"Swim Ranking: {mySwimPosition}")
Swim Ranking: 260
In [30]:
print(f"Bike Ranking: {myBikePosition}")
Bike Ranking: 292
In [31]:
print(f"Run Ranking: {myRunPosition}")
Run Ranking: 181
In [32]:
myFinishPosition = findPosition(df, 'Jim Hardy', 'Finish time')
In [33]:
print(f"My Overall Ranking: {myFinishPosition}")
My Overall Ranking: 166

Export DataFrames¶

Exporting these dataframes (rather than the origin) for use in Tableau will make things easier.

In [34]:
df.to_csv('IndividualFinishers.csv', index=False)
In [35]:
df_RelayTeams.to_csv('RelayFinishers.csv', index=False)