The ATP Tour

Davin Liu
16 min read · May 3, 2021

How have matches evolved in the Open Era? Who is the best player ever? Can we predict the winner of tennis matches?

Roger Federer winning Australian Open 2017 — my wallpaper from 2017–2020 until I broke my laptop.

Introduction

I love tennis — I first held a racquet in my hands when I was five and I have never let it go since. Whether it is watching the sport live or playing at the local club, it’s something that will always be a part of me.

Today, I want to dive into the world of men’s professional tennis. Whether you know nothing about tennis or you have been following the ATP Tour for your whole life, there will be insights here that may be of interest to you.

A Brief Crash Course on Men’s Tennis

The ATP Tour is the worldwide tennis tour organized by the Association of Tennis Professionals. It comprises top-tier tennis tournaments. There are a few important points of distinction between tournaments.

Tournament Levels

Tournaments are ranked in tiers, where higher-tiered tournaments offer more ranking points, more prize money, more public attention, and stronger fields of players. Below is a summary of the levels of ATP tournaments.

Court Surface

Another distinction is the type of court surface. Modern tennis courts are one of three surfaces: hard, clay, or grass.

Court surface affects the ball’s bounce and, in turn, playing styles. That is why certain players excel on one surface and struggle on another.

Open Era

The Open Era is the current era of professional tennis. It began in 1968, when Grand Slam tournaments first allowed professional players to compete alongside amateurs. Before then, professionals were barred from these events because taking money for playing was considered “ungentlemanly.”

It was in the Open Era that the Association of Tennis Professionals began to keep match and ranking-related records.

This analysis will focus on the Open Era because there is sparse data on matches before then.

1. Tennis, Visualized

I gathered data on every match played from the beginning of the Open Era in 1968 through the start of 2019 to produce the following analysis. Visualizations are accurate as of the beginning of 2019.

Dominant Hand

Are there more left-handed players in professional tennis than in society? It turns out no. The above figures are in line with the common estimate that 10% of people are left-handers.

Indoor or Outdoor?

The majority of courts are outdoors, likely due to the exorbitant cost of constructing a roof over an indoor stadium.

Dominant Countries in Tennis

While tennis is mostly an individual sport, nationalities do play a role. In tournaments like the Davis Cup, ATP Cup, and Olympics, players represent their country. But which nationalities dominate ATP Tour matches?

When most people think about dominant tennis countries, their first instinct is Great Britain, France, Australia, and the U.S — the countries that host the grand slams.

It turns out that the USA has had outsized dominance over the sport in the Open Era, with more wins than the other three countries combined. Great Britain, despite its storied tennis history and being host to Wimbledon, seems to perform relatively poorly.

One interesting observation is that the top countries host a majority of top-level tournaments (Grand Slams and Masters 1000s). In fact, these top four countries host 9 out of 13 top-level tournaments each year.

  • United States — U.S Open, Indian Wells, Miami Open, Cincinnati Masters
  • Spain — Madrid Open
  • Australia — Australian Open
  • France — French Open, Monte-Carlo Masters, Paris Masters

What’s more mind-blowing is that the U.S and France alone host 7 out of the 13 top-level tournaments! Now that’s an example of true dominance.

Surface

How have different surfaces evolved over the years?

A few observations:

  1. The number of matches played annually skyrocketed from 1200 to 4000 between 1968 and 1980, the start of the Open Era.
  2. What are carpet courts? I’m not being sly. The reason I didn’t introduce carpet courts earlier is that I didn’t know about them either. Carpet turns out to be an artificial surface that the ATP outlawed in 2010 because of frequent injuries and high ball speed. More on that here.
  3. Hardcourt is rising while clay court is in decline.
  4. Grass court makes up a small percentage of total matches played but is holding on to its presence on the ATP tour.

Prime Age for Tennis

Conventional wisdom holds that tennis players reach their prime at 24–25. Does this belief have merit?

I tried to find out by looking at the distribution of the ages of the winners of ATP matches since the beginning of the Open Era.

The distribution has a slight right skew, with a mean of 26.51. The peak of the bell curve lies between 23.5 and 27, so this age range has the highest occurrence of match winners on the ATP Tour. Conventional wisdom is not wrong.
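As a rough illustration of how a mean and a right skew can be read off a list of ages, here is a minimal Python sketch. The ages are hypothetical placeholders, not the match data behind the chart, and `describe_ages` is a name invented for this example.

```python
import statistics

def describe_ages(ages):
    """Return the mean, median, and moment-based skewness of a list of ages."""
    mean = statistics.fmean(ages)
    median = statistics.median(ages)
    sd = statistics.pstdev(ages)
    # Population skewness: the average cubed z-score.
    # A positive value indicates a longer tail on the right (older) side.
    skew = sum(((a - mean) / sd) ** 3 for a in ages) / len(ages)
    return mean, median, skew

# Hypothetical winner ages, NOT the article's dataset.
sample_ages = [21.0, 23.5, 24.0, 25.0, 25.5, 26.0, 27.0, 29.0, 33.0, 36.0]

mean_age, median_age, skew = describe_ages(sample_ages)
print(f"mean={mean_age:.2f} median={median_age:.2f} skew={skew:.2f}")
```

A mean that sits above the median, together with a positive skewness, is the usual signature of a right-skewed distribution: a few older winners pull the average up.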

Age Distribution of Match Winners by Tournament Type

There’s often this belief that older and more experienced players play at the more advanced tournaments while younger players have to earn their stripes on the lower-level circuits. But that turns out to be wrong.

The peak age for winning at every tournament type is still 24–27. Above the age of 32, there are still a handful of match wins in ATP 250 and 500 tournaments, but almost none in Grand Slams, Masters 1000s, and the Davis Cup. I therefore consider 32 the unofficial retirement age of professional male tennis players.

Age of Match Winners by Tournament Round

We constantly hear about legends like Roger Federer, Novak Djokovic, and Pete Sampras competing in their thirties and winning tournaments of all levels. Federer won the 2017 Australian Open at 35. Do older, more experienced players tend to advance further into tournaments?

Only select rounds are shown for illustrative purposes. F=Final, QF=Quarterfinal, R16=Round of 16, R32=Round of 32.

It turns out that no matter the round of the tournament, the peak age for tennis stays the same. So why doesn’t the data seem to reflect the wins of so many legends after the age of 30?

I attribute it to recency bias. We tend to remember these legends’ impressive wins toward the end of their careers more than their wins at the beginning. Federer, Nadal, and Djokovic won their first Grand Slams at 21, 19, and 20 respectively, but those early titles often get lost in our memories.

Ranking Distribution of ATP Match Winners

What rank do you have to be to have a good shot at winning an ATP match?

Answer: better than about 59. The average ranking of the winners of ATP matches is 59.2.

Ranking by Tournament Type

Masters Cup, International Gold, and ATP500 Series were filtered out of this graphic for better display.

International and ATP 250 tournaments are most welcoming to lower-ranked players. Surprisingly, Grand Slams grant entry to more lower-ranked players than Masters 1000 tournaments, despite being more prestigious. This is because Grand Slams typically have elaborate wild-card and qualifying systems that give lower-ranked players a chance. This is how Marcus Willis, ranked 772nd, qualified for Wimbledon and played Roger Federer in his fairytale run.

Rankings Distribution by Round

In line with intuition, the ranking distribution narrows toward higher-ranked players as tournaments advance into later rounds. There are notably fewer high-ranked players in the first round because they are often granted byes.

2. Greatest Players in the Open Era

All-time Wins and Losses Leaderboards

When I analyzed the match data for all-time win leaders, I was confronted with a list of all-time greats. Connors, Federer, Nadal, Agassi, Djokovic, Edberg. Out of the greats, Jimmy Connors and Roger Federer lead everyone else in match wins — by far.

But a chart we rarely see is the inverse — the all-time losers leaderboard.

No. These players are not bad — if they were, they would not qualify for 400+ ATP matches, let alone lose 400. In fact, there is no player on this list who has not ranked inside the top 20.

Career-high rankings:

  • 1–5: David Ferrer (3), Jonas Bjorkman (4), Tommy Robredo (5)
  • 6–10: Fernando Verdasco (7), Mikhail Youzhny (8), John Alexander (8)
  • 11–20: Feliciano Lopez (12), Philipp Kohlschreiber (16), Fabrice Santoro (17), Andreas Seppi (18)

In fact, David Ferrer is widely considered to be one of the best players never to have won a Grand Slam tournament. His losses, like those of the other players on the list, are largely the result of a long career.

Win-Loss Ratios

Another measure of greatness is the win/loss ratio, which shows a player’s skill and efficiency. That is: how many matches does one win for each loss?

This paints a far different picture than before. While older players like Jimmy Connors and Ivan Lendl led in match wins, Novak Djokovic, Rafael Nadal, and Roger Federer lead in win/loss ratios, which means they lost fewer matches for every win.

To put this into perspective, here is the distribution of the win/loss ratios for all players on the ATP Tour with over 50 career matches.

The 50+ career matches requirement filters out outlier ratios from players with few matches.

We see that even among ATP Tour veterans with over 50 matches, the median win/loss ratio is 0.77, meaning they win only around 3 matches for every 4 losses. Novak Djokovic’s ratio of 4.84 is about 6.3 times that figure. In terms of win probability, a ratio of 4.84 corresponds to winning roughly 83% of matches, versus roughly 44% for the median player. This chart provides a perspective on the different “leagues” in the sport, even among the top players.
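A win/loss ratio is a form of odds, so it helps to convert it into a win probability before comparing players. The short Python sketch below does that conversion for the two ratios quoted above. The function name `win_probability` is made up for this example.

```python
def win_probability(wl_ratio):
    """Convert a win/loss ratio (wins per loss) into a share of matches won."""
    return wl_ratio / (1.0 + wl_ratio)

djokovic_wl = 4.84  # ratio quoted in the article
median_wl = 0.77    # median ratio quoted in the article

odds_multiple = djokovic_wl / median_wl      # how many times larger the ratio is
p_djokovic = win_probability(djokovic_wl)    # share of matches Djokovic wins
p_median = win_probability(median_wl)        # share of matches the median player wins

print(f"ratio multiple: {odds_multiple:.2f}")
print(f"Djokovic win probability: {p_djokovic:.3f}")
print(f"Median player win probability: {p_median:.3f}")
print(f"probability multiple: {p_djokovic / p_median:.2f}")
```

The ratio multiple and the probability multiple differ because odds grow much faster than probabilities as a player approaches winning every match.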

Finals Appearances

A discussion about the greatest tennis players ever would be seriously lacking without insights into finals appearances.

First, how do these all-time greats perform in finals?

Pretty well. Most of the greats on average win around 2 titles out of 3 finals appearances, far above a 50/50 ratio. These ratios would be even higher if they didn’t have to face other greats in finals.

You often hear that Roger Federer is the greatest of all time because he has 20 Grand Slams. Grand Slam titles are the most common metric tennis fans use to evaluate greatness. Let’s explore the previous chart, except this time colored by Grand Slam finals appearances.

I’ll come clean: I had never heard of Ilie Nastase and Guillermo Vilas, despite knowing a lot about everyone else. And now, I know why.

Nastase (5) and Vilas (8) have the fewest Grand Slam finals appearances of any player on the list, far fewer even than players with fewer overall finals appearances like Andre Agassi (15) and Pete Sampras (18).

Let’s take a look at the results of these Grand Slam finals appearances.

  • Pete Sampras has the best Grand Slam finals record by far, with 14 wins out of 18 appearances.
  • Andy Murray has the worst Grand Slam finals record, with 3 wins and 8 losses. This is likely why he was dropped from the Big Four (now the Big Three).
  • The Big Three (Federer, Nadal, Djokovic) have the most finals appearances and are neck and neck in terms of performance.

2: Understanding The Big Three

Throughout our analysis of the greatest Open Era players, three names kept popping up: Roger Federer, Novak Djokovic, and Rafael Nadal. They have the most Grand Slam finals appearances, the most Grand Slam wins, and the highest win/loss ratios. They are on virtually every all-time great tennis leaderboard, by any criterion.

Here’s an incredible statistic: From the 2003 Wimbledon Championships up to the 2021 Australian Open, the trio has won 58 of the 70 (83%) Grand Slam titles. The trio has been an instrumental part of what has been labeled a new “Golden Era” in tennis. Andy Murray was once counted alongside them as part of the Big Four, but as you may recall, his poor Grand Slam finals record demoted him from this grouping.
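The 70-tournament denominator can be checked by counting the Grand Slams that fall inside that window. The sketch below assumes four Grand Slams per year, with three exceptions: 2003 contributes only Wimbledon and the U.S. Open, 2020 contributes three because Wimbledon was cancelled, and 2021 contributes only the Australian Open.

```python
# Grand Slams inside the window, by year (see assumptions in the text above).
slams_per_year = {year: 4 for year in range(2004, 2020)}
slams_per_year[2003] = 2  # Wimbledon and the U.S. Open
slams_per_year[2020] = 3  # Wimbledon was cancelled
slams_per_year[2021] = 1  # Australian Open only

total_slams = sum(slams_per_year.values())
big_three_titles = 58  # figure quoted in the article
share = big_three_titles / total_slams

print(f"{big_three_titles} of {total_slams} = {share:.0%}")
```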

Since I already looked at their big titles statistics in the previous section, I will recap it with an ATP Tour statistic (as of October 2018).

Title count as of October 2018. Djokovic has 14 Grand Slams here and 15 in the data because the dataset extends to the beginning of 2019, when he won the 2019 Australian Open.

Introductions through Wordclouds

I scraped each of their Wikipedia pages and extensively cleaned the text to output unique Wordclouds for each player. I will use select keywords in each cloud to introduce the players.

Roger Federer

  • The oldest of the Trio, he was most dominant during the 2000s (2004, 2005, 2006, 2008, and 2009 are his best years, with a brief comeback in 2017).
  • Wimbledon and the Australian Open are his specialties: he has 8 Wimbledon titles and 6 Australian Open titles.
  • He repeatedly faced players of the 2000s and early 2010s, including Del Potro, Berdych, and Roddick.
  • He’s Swiss, as is his Davis Cup teammate Stan Wawrinka.

Rafael Nadal

  • Federer is Nadal’s main rival. He said so himself.
  • Nadal, the second-oldest of the Trio, was extremely dominant during 2010–2014 (2010, 2011, 2012, 2013, and 2014 are his best years).
  • Nadal is the king of the clay court. That’s why he specializes in European clay-court tournaments like Monte-Carlo, Madrid, and the French Open.
  • He repeatedly faced players of the 2005–2015 period, including Del Potro and Berdych, but not Roddick, who retired earlier.
  • He’s Spanish, along with compatriot David Ferrer.

Novak Djokovic

  • Federer and Nadal are both Djokovic’s rivals, since they were established players when he broke through on the Tour in 2007 and 2008.
  • The Australian Open and Wimbledon are Djokovic’s specialties, with 9 and 5 titles respectively.
  • Djokovic is Serbian and was born in Belgrade.
  • As the youngest member of the Trio, he has been dominant from 2012 to now (2012, 2013, 2014, 2016, and 2019 are his best years).
  • Djokovic is the Masters 1000 king, with a record 36 titles. He is the only player to have won all 9 Masters 1000 tournaments, and he did so twice. His top Masters 1000 tournaments are Paris, Madrid, Rome, Indian Wells, Shanghai, Cincinnati, and Miami.

Popularity

I measured popularity through a combination of two measures: public interest and public sentiment. I measured public interest by scraping Google Search data using the gtrendsR package, and public sentiment by scraping Twitter mentions using twitteR and syuzhet.

Public Interest

All-Time Interest in the Big Three

  • Interest is “seasonal”: it spikes during major events like Grand Slams.
  • Roger Federer leads in search interest, which is consistent with his #1 spot on Forbes’ Highest-Paid Athletes list, a reflection of his high marketability.
  • Novak Djokovic trails in search interest. This isn’t surprising because Novak Djokovic is one of the most disliked players on Tour.
  • Search interest in Federer and Djokovic is growing, while interest in Nadal has stayed roughly flat since 2005.

Year-to-Date Interest

The chart of the last year tells a different story.

As Roger Federer nears retirement, fan interest in him has declined drastically, while Djokovic and Nadal have been leading given their latest successes (Nadal winning the 2020 French Open and Djokovic winning the 2021 Australian Open). Could Nadal and Djokovic surpass Federer in public interest in the future? Only time will tell.

Twitter Sentiment Analysis

Scraping

I scraped as many Tweets about each player as I could under Twitter’s API limits. (Twitter limits scraping volume and only allows standard API users to scrape Tweets from the past week.) I ended up with more than enough Tweets to get a good snapshot of the popularity of the players.

  • Federer — 7,566 Tweets
  • Nadal — 20,000 Tweets
  • Djokovic — 8,842 Tweets

Sentiment Analysis

I compiled the Tweets and used the syuzhet package to analyze the sentiment of each one. The package provides sentiment dictionaries as well as a way to access the sentiment extraction tools developed by the NLP group at Stanford. Below are the distributions of the sentiments.

Given the large sample sizes, I expected the distributions to look similar at first glance. However, a closer look reveals some telling differences in public sentiment toward these players.

Federer has the highest mean of the three players, 0.43, and the most right-skewed distribution. He has the highest density of strongly positive Tweets (sentiment of 1 or more). This is especially impressive given that he is nearing retirement and had an early-round exit in his only appearance in 2021.

Nadal is a close second to Federer with a sentiment mean of 0.39. In comparison to Federer and Djokovic, public opinion on Nadal isn’t as strongly positive or negative.

Novak Djokovic trails the Trio with a sentiment mean of 0.28. Whereas both Federer’s and Nadal’s distributions are right-skewed, Djokovic’s has close to no skew. Djokovic has the highest density of mildly positive sentiment (0.1–0.5), but the long left tail (-2.5–0) drags down his overall mean sentiment.
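To make the idea of a dictionary-based sentiment score concrete, here is a minimal Python sketch. It is not the syuzhet package and does not reproduce its dictionaries: the tiny `LEXICON`, the example Tweets, and the function names are all hypothetical, chosen only to show how per-Tweet scores roll up into a mean like the ones compared above.

```python
import re
import statistics

# Hypothetical toy lexicon. Real sentiment dictionaries hold thousands of words.
LEXICON = {
    "great": 1.0, "love": 0.75, "elegant": 0.75, "win": 0.5,
    "arrogant": -1.0, "hate": -1.0, "boring": -0.5, "lose": -0.5,
}

def score_tweet(text):
    """Sum the lexicon scores of the words in a Tweet (unknown words score 0)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0.0) for word in words)

def mean_sentiment(tweets):
    """Average the per-Tweet scores for one player."""
    return statistics.fmean([score_tweet(t) for t in tweets])

# Made-up Tweets for illustration only.
example_tweets = ["Great win, love it", "So arrogant and boring"]
example_scores = [score_tweet(t) for t in example_tweets]
example_mean = mean_sentiment(example_tweets)
print(example_scores, example_mean)
```

A long negative tail, as described for Djokovic, would show up here as a handful of strongly negative scores pulling the mean down even when most Tweets are mildly positive.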

Djokovic’s lower sentiment score is especially telling given his impressive recent performance, winning the 2021 Australian Open three months ago. Much of it can be attributed to perceptions of arrogance, his promotion of pseudo-science, and his on-court behavior.

Popularity Conclusion

Roger Federer leads Nadal and Djokovic in both all-time fan interest and sentiment. As Federer nears retirement, however, Nadal and Djokovic are toe-to-toe in public interest, with Nadal leading in public sentiment. Interest in Djokovic is on the rise due to his impressive performance, but he carries a greater share of negative public sentiment due to his list of controversies and on-court behavior.

Who are my favorite players? Federer for his elegance and class and Djokovic for his technique and mental strength.

3. Prediction Model of Match Wins

As an experiment, I used another dataset containing matches from 2000 to 2018, with player ranks, to attempt a model that predicts match wins. However, I faced extensive difficulties in this undertaking.

Lacking Prediction Variables

First, apart from player rankings, the dataset was lacking predictor variables. While I would have liked to add external data like player height, seed, entry type, ranking points, and more, I could not find them online in an accessible format. The only other player variables I could find were the Elo scores and Elo probabilities assigned to both players by betting agencies. I used the date, winner name, loser name, and location to join the two datasets.

Many variables were also unusable. First, I did not want to use any first-set match data because I wanted to simulate pre-match betting. Second, some variables (like average betting odds for each match) were missing too many values to be useful.
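A join on those four fields can be sketched in a few lines of Python. Everything in this sketch is illustrative: the column names, the sample rows, and the `join_matches` helper are assumptions rather than the author’s actual code or data.

```python
KEY_FIELDS = ("date", "winner", "loser", "location")

def join_matches(matches, elo_rows):
    """Inner-join two lists of dict rows on (date, winner, loser, location)."""
    elo_by_key = {tuple(row[f] for f in KEY_FIELDS): row for row in elo_rows}
    joined = []
    for match in matches:
        key = tuple(match[f] for f in KEY_FIELDS)
        if key in elo_by_key:
            # Merge the Elo columns into the match row.
            joined.append({**match, **elo_by_key[key]})
    return joined

# Hypothetical rows, not taken from either dataset.
matches = [
    {"date": "2018-01-15", "winner": "Player A", "loser": "Player B",
     "location": "Melbourne", "winner_rank": 2, "loser_rank": 30},
    {"date": "2018-01-16", "winner": "Player C", "loser": "Player D",
     "location": "Melbourne", "winner_rank": 8, "loser_rank": 51},
]
elo_rows = [
    {"date": "2018-01-15", "winner": "Player A", "loser": "Player B",
     "location": "Melbourne", "elo_winner": 2100, "elo_loser": 1850},
]

joined = join_matches(matches, elo_rows)
print(len(joined), joined[0]["elo_winner"])
```

Rows without a counterpart in the other dataset are dropped, which is one reason a join on names and dates can quietly shrink the sample if spellings differ between sources.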

Winner-Loser Format to Random Format

The more significant problem, however, was that the data was in winner-loser format: the winner’s name and statistics are in one set of columns and the loser’s are in another. If we create a column where the value 1 denotes that Player 1 won, every row in that column contains the value 1. Running a glm on that data shows the following error: “glm model does not converge”. With no 0 values to learn from, the model cannot distinguish between the two outcomes, so it trivially predicts a Player 1 win every time and appears to be 100% correct.

After some deliberation, I found a creative way around this. I randomly sampled 50% of the original dataset and renamed all the P1 columns to P2 and vice versa. Because the winner and loser are flipped, the P1win column in this subset contains the value 0 in every row. I combined the two data frames with rbind and then sampled the row numbers to randomize the order. The data is now in random format.
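The flipping step can be sketched as follows. This is one reading of the procedure, written in Python rather than R, in which a random half of the rows is flipped and the other half is left as-is. The column names and sample matches are hypothetical.

```python
import random

def to_random_format(rows, seed=0):
    """Flip a random half of winner-loser rows so Player 1 is not always the winner."""
    rng = random.Random(seed)
    flip = set(rng.sample(range(len(rows)), k=len(rows) // 2))
    out = []
    for i, r in enumerate(rows):
        if i in flip:
            # The loser becomes Player 1, so the P1-win label is 0.
            out.append({"p1": r["loser"], "p1_rank": r["loser_rank"],
                        "p2": r["winner"], "p2_rank": r["winner_rank"],
                        "p1_win": 0})
        else:
            # The winner stays Player 1, so the P1-win label is 1.
            out.append({"p1": r["winner"], "p1_rank": r["winner_rank"],
                        "p2": r["loser"], "p2_rank": r["loser_rank"],
                        "p1_win": 1})
    rng.shuffle(out)  # randomize the row order
    return out

# Hypothetical matches, not rows from the article's dataset.
sample_matches = [
    {"winner": "A", "winner_rank": 1, "loser": "B", "loser_rank": 20},
    {"winner": "C", "winner_rank": 15, "loser": "D", "loser_rank": 4},
    {"winner": "E", "winner_rank": 7, "loser": "F", "loser_rank": 33},
    {"winner": "G", "winner_rank": 60, "loser": "H", "loser_rank": 12},
]

random_rows = to_random_format(sample_matches)
print([r["p1_win"] for r in random_rows])
```

After this step the label column contains both 0s and 1s in roughly equal numbers, which is what a binary classifier needs to learn anything.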

Generalized Linear Model

Despite the lack of player-specific data, I ran a glm to predict the chance of P1 winning. Unsurprisingly, the only statistically significant predictors were Player 1 ranking, Player 2 ranking, and the Elo probability calculated from the Elo ratings of the two players.
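For readers unfamiliar with Elo, the sketch below shows the standard Elo expected-score formula, which turns two ratings into a win probability. Whether the dataset’s Elo probabilities were computed exactly this way is an assumption. The scale of 400 is the conventional default, and the ratings used are made up.

```python
def elo_win_probability(rating_p1, rating_p2, scale=400.0):
    """Standard Elo expected score: the probability that Player 1 beats Player 2."""
    return 1.0 / (1.0 + 10 ** ((rating_p2 - rating_p1) / scale))

# Hypothetical ratings for illustration.
p_even = elo_win_probability(2000, 2000)       # equally rated players
p_favorite = elo_win_probability(2200, 1800)   # 400-point favorite
p_underdog = elo_win_probability(1800, 2200)   # the same match from the other side

print(p_even, round(p_favorite, 3), round(p_underdog, 3))
```

The two sides of a match always sum to 1, so a single probability column is enough to describe both players.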

There were two benchmarks in this test:

Benchmark One: 50/50 tossing a coin to guess the winner. Since I randomized the winner-loser format of my dataset down the middle, this will result in a 50% accuracy rate.

Benchmark Two: Predicting that the higher-ranked player wins. This benchmark will be very difficult to beat because (A) ranking is one of my only player-specific predictor variables apart from Elo rating, and (B) tennis is a very individualistic sport, which means there is less randomness and upsets (when a lower-ranked player beats a higher-ranked player) are comparatively rare.
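Both benchmarks are simple enough to sketch directly. The Python below assumes rows in random format with hypothetical column names `p1_rank`, `p2_rank`, and `p1_win`, and a convention where a smaller rank number means a better-ranked player. The four sample rows are invented.

```python
import random

def coin_flip_accuracy(rows, seed=0):
    """Benchmark One: guess the Player 1 outcome at random."""
    rng = random.Random(seed)
    hits = sum(rng.randint(0, 1) == r["p1_win"] for r in rows)
    return hits / len(rows)

def higher_rank_accuracy(rows):
    """Benchmark Two: predict that the better-ranked (lower-numbered) player wins."""
    hits = 0
    for r in rows:
        predicted_p1_win = 1 if r["p1_rank"] < r["p2_rank"] else 0
        hits += predicted_p1_win == r["p1_win"]
    return hits / len(rows)

# Hypothetical random-format rows, not the article's test data.
benchmark_rows = [
    {"p1_rank": 1, "p2_rank": 20, "p1_win": 1},   # favorite wins
    {"p1_rank": 50, "p2_rank": 3, "p1_win": 0},   # favorite wins
    {"p1_rank": 10, "p2_rank": 80, "p1_win": 0},  # upset
    {"p1_rank": 40, "p2_rank": 7, "p1_win": 0},   # favorite wins
]

rank_acc = higher_rank_accuracy(benchmark_rows)
coin_acc = coin_flip_accuracy(benchmark_rows)
print(rank_acc, coin_acc)
```

On a large random-format dataset the coin flip should land near 50%, while the ranking rule’s accuracy equals one minus the upset rate.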

Model Prediction Results on Test Data

On the final test data, the generalized linear model beat Benchmark One by 13.2% and Benchmark Two by 1.64%. Despite the limited number of predictor variables, a 1.64% increase in prediction accuracy is a meaningful improvement, especially considering that the existing strategy is an extremely strong one in an individualistic sport like tennis.

Further Analysis

Going forward, further research should focus on adding more variables to the regression analysis to increase prediction accuracy and simulate betting profitability using betting odds.

Data Sources:

About Davin

Davin Liu is a junior at the Wharton School of the University of Pennsylvania studying Finance and Business Analytics. Davin has spent more time in front of televised tennis matches and on the tennis courts than he cares to remember. In his spare time, Davin enjoys embarking on road trips, shredding the slopes of Whistler, and immersing himself in psychological thriller films.

This data project was conducted for Prasanna Tambe’s course: Analytics & The Digital Economy.
