What you will (hopefully) read here
This article is on climbing route quality (I will be discussing sport and trad routes only, though a direct analogy can be made to boulders): how we can measure it, how we can use it once measured, and some suggestions for future measurement and use. First, I detail how route quality is currently measured. Next, I suggest improvements that can be made to the current system. I then show how route quality can be used in route search applications, and present an app I built that highlights sectors around the USA with a lot of quality routes. Finally, I present some guidelines on how I think routes should be rated. Keep in mind that any rating system founded on opinions will be controversial. A major assumption underlying this article is that we gain some sort of objectivity by averaging a "sufficiently" large number of opinions. I don't know what number is sufficient, but it is probably more than 10. Further, everything said in this article is my perspective on the matter at hand. If you think I missed a key point, made an error, or don't know what I am talking about, tough (just kidding, please add a comment).
Some background on rating routes
Metrics of route quality were first used by guidebook authors in the 1960s who wanted to direct their readers to “good” climbs. I just made that up, but it seems reasonable. Regardless of how we first started assessing route quality, we have arrived at a system of ordinal ranking. “Ordinal ranking” is just jargon for the assignment of numerical ranks to routes where the variation between ranks is not clearly established. These ranks are usually represented by some arbitrary symbol (stars, thumbs-up, smiling-poop emojis) for reasons I don't understand. Take, for example, the current star system for route quality, in which a route is assigned 0 (the worst), 1, 2, 3, or 4 (the best) stars. A route with 4 stars is not necessarily 1.33 times better than a route with 3 stars, which is why “the variation between ranks is not clearly established.” Generally, the star system is used as follows (all examples taken from Clear Creek Canyon in Golden, Colorado):
0 Star: A truly terrible route, extremely chossy, covered in plant life, a blight on humanity, etc. Example: Reds
1 Star: Not particularly enjoyable, somewhat chossy, needs some cleaning, short, too near another line, etc. Example: Low Priest
2 Star: Good and worth doing, quality rock, clean, maybe not consistently fun movement. Example: Chaos
3 Star: Area classic, one of the best routes in the vicinity (note that this does not apply if all the routes in the vicinity are terrible, use your judgement). Excellent rock, clean, great movement. Example: Reefer Madness
4 Star: Classic, one of the best routes anywhere, pristine, beautiful, striking movement, various superlatives, guide-book-cover nonsense. Example: Sonic Youth
Note that the above descriptions are just examples. The reality of route rating is complex and difficult to explain precisely, somewhat like route grades. A chossy route can be 2-star if the climbing is great. The crux of a route may have 4-star movement while the rest of the climb is uninteresting, yielding a 3-star route. Each climber rates a route (according to the above) using their personal experience climbing it. However, given that there is a consensus on what each star value means, the ratings given by individual climbers should be somewhat consistent.
Aggregating the opinions of multiple climbers
The advent of online guidebooks has made it possible to average the ratings provided by many individual climbers, which increases our certainty of the rating for most routes, especially when a lot of opinions are included. This is in contrast to guidebook ratings, which are often just the opinion of the author (and maybe some of their friends). Another advantage of average ratings is that they have higher resolution than the star system. That is, average ratings can have intermediate values between ranks. For example, a rating of 3.5 stars indicates that opinions are split on whether the route is an area classic or an absolute classic. Currently, Mountain Project presents overall route ratings as the mean (arithmetic average) of the distribution of user ratings. For example, a route may have a rating distribution of {2, 2, 3, 3, 4} from five climbers; the mean of this distribution is then:
(2 + 2 + 3 + 3 + 4)/5 = 2.8
which is given as the aggregate route rating. There are benefits to using the mean as the average rating (here I use average as a synonym of "central tendency", or the typical value of a distribution). However, the median is usually a better metric when dealing with ordinal data. It is robust against outliers and is suitable for skewed distributions, both of which are frequently observed in route rating data. One caveat to using the median is that each rating can take only one of five integer values. To see why this is a problem, take Route A with ratings {0, 3, 3, 3, 3} and Route B with ratings {0, 3, 3, 4, 4}. Both Route A and Route B have a median rating of 3, but the consensus is clearly that Route B is better. This issue can be solved by using the interpolated median (IM), which estimates what the median would be if the ratings were continuous rather than restricted to integer values (see the Wikipedia page on the median for more details). The IMs for Route A and B are 2.875 and 3.25, respectively (which seems reasonable). By contrast, the means for Route A and B are 2.4 and 2.8, much lower than the IMs due to the lone 0-star vote, an outlier. I would be remiss if I didn't mention the "other" average metric here, namely the mode. I mention it now, for posterity.
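To make these numbers concrete, here is a minimal Python sketch (not code from Mountain Project or OpenBeta) that reproduces the mean, median, and interpolated median values quoted above for Route A and Route B. It assumes integer ratings treated as classes of width 1, and that the ordinary median lands on one of the integer values.

```python
from statistics import mean, median

def interpolated_median(ratings, width=1.0):
    """Interpolated median for integer-valued ordinal data.

    Each integer rating k is treated as a class spanning [k - 0.5, k + 0.5];
    we interpolate within the class containing the ordinary median. Assumes
    the ordinary median falls on one of the integer rating values.
    """
    m = median(ratings)
    n = len(ratings)
    below = sum(1 for r in ratings if r < m)    # count below the median class
    within = sum(1 for r in ratings if r == m)  # count inside the median class
    lower = m - width / 2                       # lower boundary of the median class
    return lower + (n / 2 - below) / within * width

route_a = [0, 3, 3, 3, 3]
route_b = [0, 3, 3, 4, 4]
for name, votes in [("Route A", route_a), ("Route B", route_b)]:
    print(name, round(mean(votes), 2), median(votes), interpolated_median(votes))
# Route A 2.4 3 2.875
# Route B 2.8 3 3.25
```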
So which to use? The mean or the median? The non-answer of "both" is probably best since they measure different things. One advantage of using the mean is that all opinions are considered equally, even the whack ones. So it is quite democratic (the word, not the party). The opinions on the fringes are technically used when calculating the median but do not contribute directly to the final value, so the median could be considered a bit republican (the word, not the party, I am not making political statements here). Personally, I favor the median, since I think that outliers in rating distributions are usually the result of a poor attempt at objectivity. For example, the first ascensionist (FA) giving their route 4 stars, or someone giving a 1-star vote because they hate the FA. More discussion on this topic is provided in the last section.
My philosophy regarding the “classic” designation
The goal of this article is not to discuss subjective classics, or "classics for you"; that topic is excellently covered in a previous OpenBeta contribution by Colin Brochard. Rather, I want to identify objective, or "true", classics. The only way this can be achieved is by adopting a practical definition of classic. The definition "just a really good route" doesn't cut it if you want to be precise. I adhere to the philosophy that a classic must be proven. That is, a route can only achieve classic status if enough people agree that it deserves it (i.e., rate it highly). Highly rated routes without enough votes can really only be considered suspicious classics, or more optimistically, future classics. What, then, is a high enough average rating with enough votes? This is not a straightforward question to answer. A route with 25 votes, all 4-star (mean/median of 4 stars), is probably a classic. However, a route with 50 votes, half of which are 3-star and the other half 4-star (mean/median of 3.5 stars), may not be. To suggest a reasonable answer to this question, in the next section I propose a route quality metric that can be used to precisely (i.e. numerically) define a classic route, or an area classic, or a good route, etc.
Including the number of opinions in route quality
Intuition tells us that an average user rating that includes many opinions is more reliable than one that includes only a few. Generally, this pattern of reasoning is a heuristic: we have a sense that more individual ratings produce a more reliable result, but do not necessarily know exactly how or why. The same holds for route ratings. For example, if you want a classic project for your first 12d you will probably pick Anarchitect (3.9 stars from 137 votes) over the neighboring Mayhem (3.7 stars from 3 votes). I used this heuristic, and some other ideas from my background in chemistry, to define a metric I nominally refer to as the Route Quality Index or RQI.
This section is about to get fairly technical, so I wrote a summary at the end that outlines what you need to know to understand the rest of this article without all the following nonsense. You can skip there now if you want. For interested readers, RQI is inspired by metrics used to rank materials for gas adsorption (see Krishna, 2017). It is defined as:
RQI = S(1 - 1/N)
where S is the average stars and N is the number of votes. So, as the number of votes increases for a given route, the term 1 - 1/N approaches 1, and the RQI approaches S. However, for a small number of votes, 1 - 1/N reduces the RQI drastically. Some examples:
S = 4.0, N = 1 gives RQI = 0.00
S = 4.0, N = 2 gives RQI = 2.00
S = 4.0, N = 10 gives RQI = 3.60
S = 4.0, N = 100 gives RQI = 3.96
S = 4.0, N = 1000 gives RQI = 4.00
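Since the formula only needs the average stars and the vote count, it is trivial to compute; this short Python sketch reproduces the example values above.

```python
def rqi(avg_stars: float, num_votes: int) -> float:
    """Route Quality Index: average stars discounted by the 1 - 1/N vote factor."""
    return avg_stars * (1 - 1 / num_votes)

for n in (1, 2, 10, 100, 1000):
    print(f"S = 4.0, N = {n:>4} -> RQI = {rqi(4.0, n):.2f}")
# 0.00, 2.00, 3.60, 3.96, 4.00
```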
Several other metrics incorporating both average stars and the number of votes could be devised, e.g. multiplying or adding the two together. However, I am not the IFSC. Also, multiplication, addition, and several other options I thought of either require additional data or are not easily interpretable as individual numbers. RQI can be easily calculated on a per-route basis and it ranges from 0 to 4, allowing it to be interpreted (essentially) the same as the average stars.
Since RQI is a combined rating/popularity metric, it is a natural choice for defining the term “classic”. I propose the following:
Classic: RQI of 3.5 or greater
Area Classic: RQI of 2.5 to 3.5
Good: RQI of 1.5 to 2.5
Bad: RQI of 0.5 to 1.5
Bomb: RQI less than 0.5
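One way to encode these designations in code is a simple threshold function. In the sketch below, treating each lower bound as inclusive is my own assumption, since the ranges above share their endpoints.

```python
def classify(rqi_value: float) -> str:
    """Map an RQI value to the designations proposed above (lower bounds inclusive)."""
    if rqi_value >= 3.5:
        return "Classic"
    if rqi_value >= 2.5:
        return "Area Classic"
    if rqi_value >= 1.5:
        return "Good"
    if rqi_value >= 0.5:
        return "Bad"
    return "Bomb"

print(classify(3.96))  # Classic (e.g. S = 4.0, N = 100)
print(classify(3.15))  # Area Classic (e.g. S = 3.5, N = 10)
```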
I think an RQI of 3.5 makes sense as a lower bound for classic status given the typical round-half-up method: when rounding to the nearest integer, 3.5 is usually rounded to 4. The same applies to the other ranges given above. To better show how RQI and the above classifications can be used, I plotted the median stars vs. the number of votes for each route (around 96,000) in the OpenBeta database:
Two key things are shown in this figure: (i) to achieve the classic (or another) designation with a relatively low number of votes, the average rating must be near the top of the range, and (ii) about 3.4% of routes in the USA can be called classics according to my criteria, which seems reasonable. We don't want too many classics or the designation would start to lose meaning.
So, RQI is a wondrous metric that can be used to precisely define the term "classic". However, there is a problem with RQI: hard routes get fewer ascents and thus fewer votes. Consequently, it tends to be more difficult for hard routes to achieve classic status according to RQI. For example, 8 votes is the minimum for a route to get an RQI of 3.5, since 4(1 - 1/8) = 3.5, and it is rare for a 14a (or harder) to get 8 votes. One could reason that this should be the case: enough people must agree that a route is a classic for it to be one, and this number of people does not decrease with increased route difficulty. Personally, I think this is a good argument: the classic designation is essentially a popularity contest, and sending 5.14 is just not very popular. However, I recognize that not everyone will agree with this view. More importantly (I think), when using RQI to rank routes based on quality within a large range of grades (e.g. 12a-13a), the easier routes will tend to have higher RQIs. This is a problem for search applications. Thus, I defined another metric closely related to RQI, which I call the Adjusted RQI or ARQI:
ARQI = S(1 - 1/Nw)
where Nw is the number of weighted (or adjusted) votes, in which votes count for more on harder routes. The weights can be calculated from the votes-per-route for each grade. Votes-per-route can be estimated by summing all the votes and dividing by the total number of routes for each grade and type (i.e. sport and trad) in a large database of route ratings. Then, votes are weighted relative to the maximum votes-per-route observed for either sport or trad routes, whichever is relevant. For example, 13a sport routes have 8.3 votes-per-route, about a quarter of the maximum of 33.2 for 5.9+ sport routes, as estimated from the OpenBeta database. So, each vote for a 13a sport route is multiplied by 4.0 (the adjustment factor in the following plot) when calculating ARQI:
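The sketch below shows my reading of this weighting scheme in Python. The votes-per-route table here contains only the two example figures quoted above (8.3 and 33.2); a real table would be built for every grade and route type from the full database.

```python
# Votes-per-route by (route type, grade); only the example values from the text.
VOTES_PER_ROUTE = {
    ("sport", "5.13a"): 8.3,
    ("sport", "5.9+"): 33.2,  # the maximum observed for sport routes
}
MAX_VOTES_PER_ROUTE = {"sport": 33.2}

def arqi(avg_stars, num_votes, grade, route_type="sport"):
    """Adjusted RQI: votes at sparsely voted (harder) grades count for more."""
    weight = MAX_VOTES_PER_ROUTE[route_type] / VOTES_PER_ROUTE[(route_type, grade)]
    weighted_votes = num_votes * weight  # Nw
    return avg_stars * (1 - 1 / weighted_votes)

# A hypothetical 5.13a sport route with 4 votes averaging 3.8 stars:
print(round(arqi(3.8, 4, "5.13a"), 2))  # each vote counts ~4x, so Nw = 16 and ARQI ~ 3.56
```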
RQI and ARQI are essentially the same in the 5.2 to 5.12a range since these routes usually see enough ascents for the RQI not to change much with a few more or fewer votes; RQI changes less and less as the number of votes increases, assuming the votes tend to fall around the same rating (which is almost always the case). However, for 12a and harder routes, I recommend using ARQI to rank routes by quality, especially when considering a grade range larger than a letter grade or two. Either RQI or ARQI can be used to designate classics, but I prefer RQI for the reasons discussed above, and because of its simplicity.
Summary of this section: I defined two route quality metrics, the Route Quality Index (RQI) and the Adjusted RQI (ARQI), which can be calculated from the average star rating and number of votes for any given route. These metrics can be used to categorize routes into classic, area classic, etc. designations or rank routes according to quality. ARQI corrects for the bias of RQI towards easier routes, especially for grades harder than 12a. I suggest that RQI is best used when determining whether a route is a classic or not. Either RQI or ARQI can be used when ranking routes in the 5.2 to 5.12a range. However, ARQI is better for ranking routes harder than 5.12a (unless all the routes are of the same grade, then it doesn't matter).
Ranking routes for search applications
I thought it would be useful to compare the Mountain Project (MP) route finder feature to route rankings given by RQI. The MP route finder ranks climbs within a range of grades and above a minimum number of stars according to popularity (it can also sort by difficulty or by area, but those sorting methods do not select for classics). I think the following comparison demonstrates that RQI rankings can be more useful than the MP route finder when looking for high-quality routes. I ran a route search considering US locations on Mountain Project for sport routes with 3+ stars (the highest threshold available) in the 11a to 11b range, with the ranking based on popularity (number of ticks). I then ran searches on the OpenBeta database in the same grade range with the ranking based on RQI. Here are the top-10 results for each search:
Many of the same routes are deservedly present in both top-10 lists, like Amarillo Sunset, Legacy, Wild at Heart, and Flying Hawaiian. However, the MP ranking does not include Dead Dog. In fact, Dead Dog is way down the MP list because it doesn't get ticked enough, yet it has 66 4-star votes out of 73, giving it one of the highest average ratings of any route in the USA. Almost all of the MP top-10 are high in the RQI ranking, just not all in the top-10. The only exception is Route Name Redacted, which does not have a very high average star rating (3.4 mean/3.5 median). Many other similar examples can be extracted by comparing other MP route finder results with rankings based on RQI. MP puts the number of ticks first, missing out on gems like Dead Dog. Additionally, many dubious classics (with 3.0 to 3.5 stars, like Route Name Redacted) will make the list since 3+ stars is the highest available threshold. Finally, MP only uses the mean as their average metric, which is debatably not the best, as discussed previously.
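For anyone who wants to reproduce this kind of ranking, the core of it is a one-line RQI column and a sort. The table below is entirely made up (route names, numbers, and column names are placeholders, not the actual OpenBeta schema or real route statistics).

```python
import pandas as pd

# Placeholder data for illustration only.
routes = pd.DataFrame({
    "route":     ["Route A", "Route B", "Route C"],
    "avg_stars": [3.9, 3.7, 2.1],
    "num_votes": [73, 3, 40],
})

routes["rqi"] = routes["avg_stars"] * (1 - 1 / routes["num_votes"])
ranked = routes.sort_values("rqi", ascending=False)
print(ranked[["route", "rqi"]].to_string(index=False))
```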
A sector map including route quality
Instead of just writing some rambling blog post with too much information (this one), I thought I would try to make some of the points outlined above actually useful, to demonstrate their practicality. To this end, I wrote an app that highlights sectors around the USA with a relatively large number of high-quality routes under grade, type, and USA state filters. Route quality can be assessed with average stars, RQI, or ARQI. A demo of the app is available here and all the code/data used to produce it is available on GitHub here. It is written in Python using Plotly (with Mapbox) for the mapping and Dash for the web app. Check out the tutorial page for directions. A snapshot from an example search is shown below. The app can be used, for example, to narrow down promising areas for a trip, find a crag with multiple potential projects in your area, or just explore American rock climbing with a new perspective.
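To give a feel for how such an app fits together, here is a stripped-down sketch of the general pattern (a Plotly map embedded in a Dash app). The data and column names are placeholders; this is not the app's actual code, which lives in the GitHub repo linked above.

```python
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

# Placeholder sector data; the real app derives its sectors from the OpenBeta database.
sectors = pd.DataFrame({
    "sector":       ["Sector A", "Sector B"],
    "lat":          [39.74, 38.57],
    "lon":          [-105.22, -109.55],
    "num_classics": [12, 4],
})

# Marker size scales with the number of high-quality routes in the sector.
fig = px.scatter_mapbox(
    sectors, lat="lat", lon="lon", size="num_classics",
    hover_name="sector", zoom=4, mapbox_style="open-street-map",
)

app = Dash(__name__)
app.layout = html.Div([dcc.Graph(figure=fig)])

if __name__ == "__main__":
    app.run(debug=True)
```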
Guidelines for consistent route rating
In this section, I list and describe some guidelines that I think are key to maintaining reasonable route ratings. These guidelines, should they be followed, will help to make future online guidebook aggregate ratings more convincing and useful. Hopefully, many climbers already follow some (or all) of these without explicitly laying them out. I think you will find they are similar in many, but not all, cases to the typical guidelines used for grading a route. Here they are, in no particular order:
1. Strive for a certain amount of objectivity: Ratings are personal, but they should not be influenced by external factors, such as your mood, the weather, the strength of your fingers, etc. FAs should rate their routes very carefully, if at all. Take, for example, Spark Seeker, an unfortunate contribution to an excellent crag in Wild Iris. The two FAs gave 4 and 3 stars; however, half of the subsequent ratings are 0-star. Clearly, it is not one of the best routes in the country, Wild Iris, or even Lower Remuda, as the FAs seem to think. Also, if you only like filthy, moist offwidths and think that striking, sustained face climbs on bullet limestone are contemptible, probably avoid rating certain types of routes.
2. You must send the route before rating it: Mostly, we are talking about free climbs. So, even if you took a 1-second hang, you did an aid version of the climb, which is not the route you intended to rate. Obviously, this doesn't apply if you are doing an aid climb. I would guess that another common source of non-send ratings is TR ascents. If the route has lead bolts, then the FA intended it to be led, so a clean TR burn is not a send.
3. Safety does not matter for trad, safety should be considered (with caveats) for sport climbs: In a way, when you rate routes you are assessing the work done by the FA (or equipper, I will assume they are the same person). For trad routes, the FA can't alter natural gear placements. I guess they can, but if they do, please rate the route 0 stars. So, there is no point in considering safety unless you want to argue with geological forces. On the other hand, for sport climbs, safe bolting is key to the climbing experience. If a route is rap-bolted there is no good reason to have unsafe runouts, and this should be factored into the rating. A 4-star climb could be ruined by having an unnecessary runout with ground-fall potential or above a ledge. However, I think an objectively safe runout that may feel scary should not count against the climb. If a route is bolted on lead, the criteria are different; dangerous runouts often can't be avoided and this should be taken into consideration.
4. Only consider the climb itself: A heinous hike through snake-infested cacti does not make the route worse. A loose, 6-inch belay ledge with snake-infested cacti behind it is your belayer's problem. Road noise is not the fault of the FA; it is the fault of the engineers who decided to preemptively ruin the crag by putting a road next to it. However, for sport routes (related to item 3), bolting is part of the climb. Even if the bolting is safe, strange bolt placements can make a climb worse. For example, having a difficult clip in the middle of the crux that cannot be safely skipped, and that could have been safely located after the crux, is a mistake by the FA.
5. Consider your experience: Have you only climbed at one local crag? If so, you cannot know what a 4-star route is (by definition). Try to climb routes of various ratings in various locations before you start rating climbs. This will also help with item 1.
6. Don't be afraid to disagree, but only if you follow the above guidelines: Fundamentally, people will have different opinions. In the end, try to follow the above guidelines and just be honest.