
*[We would like to thank Paul Jakus (@paulj) of the Dept. of Applied Economics at Utah State University for this summary of research presented at the 2024 Phish Studies Conference. -Ed.]*

This is the fourth and final blog post regarding the current rating system. Previous posts can be found here, here, and here.

Post #2 showed how two metrics—average deviation and entropy—have been used by product marketers to identify anomalous raters; Post #3 showed how anomalous users may increase bias in the show rating. Many Phish.Net users have intuitively known that anomalous raters increase rating bias, and have suggested using a rating system similar to that used by rateyourmusic.com (RYM). RYM is an album rating aggregation website where registered users have provided nearly 137 million ratings of 6.2 million albums recorded by nearly 1.8 million artists (as of August 2024).
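The exact formulas from the earlier posts are not reproduced here, but the two metrics can be sketched in a few lines. This is a minimal, illustrative implementation assuming entropy is the Shannon entropy of a rater's distribution over the five-star scale and average deviation is the mean absolute distance from each show's community average; the original posts may define them slightly differently.

```python
from collections import Counter
import math

def rating_entropy(ratings):
    """Shannon entropy (bits) of a rater's distribution over the 1-5 scale.
    Zero entropy means the rater always gives the same score, so their
    ratings carry no information that distinguishes one show from another."""
    counts = Counter(ratings)
    n = len(ratings)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def average_deviation(ratings, show_means):
    """Mean absolute deviation of a rater's scores from each rated show's
    community-average rating; large values flag anomalous raters."""
    return sum(abs(r - m) for r, m in zip(ratings, show_means)) / len(ratings)

# A rater who gives every show a 5 has zero entropy:
print(rating_entropy([5, 5, 5, 5]))  # 0.0
```

A rater who splits evenly between two scores has an entropy of exactly one bit, while one whose ratings track the community averages closely will have a small average deviation.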

Similar to Phish.Net, RYM uses a five-point star rating scale but, unlike .Net, an album’s rating is not a simple average of all user ratings. Instead, RYM calculates a weighted average, where the most credible raters are given greater weight than less credible raters. Weights differ across raters on the basis of the number of albums they have rated and/or reviewed, the length of time since their last review, whether or not the reviewer provides only extreme ratings (lowest and/or highest scores), and how often they log onto the site, among other measures. These measures identify credible reviewers and separate them from what the site describes as possible “trolls”. Weights are not made public, and the exact details of the weighting system are left deliberately opaque so as to avoid strategic rating behavior.

So, if assigning different weights to raters works for RYM, will it work for Phish.Net?

Following RYM, each of the 16,452 raters in the Phish.Net database was assigned a weight, ranging between zero and one, based on cutoff values for average deviation, entropy, and the number of shows rated. Instead of calculating a simple average show rating, where all raters have equal weight, my alternative show ratings are weighted averages. Differential weights ensure that users believed to provide greater informational content and less statistical bias contribute more to the show rating than those believed to have less content and more bias (anomalous raters).

The first alternative system is called the “Modified RYM” (MRYM) because it represents my best effort to match the RYM system. MRYM is also the “harshest” of the weighting alternatives because it assigned the smallest weight (0.0) to those with zero entropy (where the expected average value of information is zero) and to those with exceptionally high deviation scores (top 2.5%). Low entropy users were assigned a weight of 0.25, and those who had rated fewer than 50 shows were given a weight of 0.5. (If a rater fell into more than one category they were assigned the smallest weight.) All others were assigned full weight (1.0). Four other weighting systems (“Alternative 1” through “Alternative 4”) gradually relaxed these constraints, giving raters increasingly larger weights relative to the MRYM, but less than the equal weights of the current system.
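The MRYM rule described above can be sketched directly. The cutoffs and weights below follow the text (weight 0.0 for zero entropy or top-2.5% deviation, 0.25 for low entropy, 0.5 for fewer than 50 shows rated, smallest weight winning), but the low-entropy cutoff value is an assumption, since the post does not publish it:

```python
LOW_ENTROPY_CUTOFF = 0.5  # assumed threshold; the actual study value is not published

def mrym_weight(entropy, deviation_percentile, shows_rated):
    """Modified RYM weight for a single rater. A rater falling into more
    than one category receives the smallest applicable weight."""
    candidates = [1.0]
    if entropy == 0.0 or deviation_percentile > 97.5:
        candidates.append(0.0)   # no information, or top 2.5% deviation
    if 0.0 < entropy < LOW_ENTROPY_CUTOFF:
        candidates.append(0.25)  # low but nonzero entropy
    if shows_rated < 50:
        candidates.append(0.5)   # thin rating history
    return min(candidates)

def weighted_show_rating(ratings, weights):
    """Weighted-average rating for one show; returns None if every
    rater of the show received zero weight."""
    total = sum(weights)
    if total == 0:
        return None
    return sum(r * w for r, w in zip(ratings, weights)) / total
```

The "Alternative 1" through "Alternative 4" systems would use the same structure with progressively less severe cutoffs and larger minimum weights.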

To summarize, we now have six different rating estimates for each show: the current (simple mean) rating with equal weights, and show ratings from five different weighted alternatives. The true value of the show rating remains unobservable, so how do we know which system is best?

My previous research ( https://phish.net/blog/1539388704/setlists-and-show-ratings.html ) demonstrated that show ratings (with equal rater weights) are significantly correlated with setlist elements such as the amount of jamming in a show, the number and type of segues, the relative rarity of the setlist, narrative songs, and other factors. If the deviation and entropy measures have successfully identified the raters who contribute the most information and least bias, then weighted show ratings should be even more correlated with show elements relative to the current system. Technically, this analytical approach is a test of convergent validity.

Using setlist data from 524 Modern Era shows (2009-2022), regression analysis demonstrates that all of the weighted show ratings exhibit better overall statistical properties relative to the current system (see the chart below). Each of the weighted show rating systems has better overall statistical significance (F-statistic) and correlates more strongly with show elements (adjusted R-square) than the current system. Further, the prediction error (root mean square error) is marginally smaller.
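The three statistics used in this comparison can be computed from an ordinary least squares fit. Below is a self-contained sketch, not the actual regression specification from the study, that fits show ratings on a matrix of setlist features and reports adjusted R-squared, the F-statistic, and root mean square error; one would run it once per rating system on the same feature matrix and compare:

```python
import numpy as np

def ols_fit_stats(y, X):
    """OLS of y (show ratings under one weighting system) on X (setlist
    features, one column per element). Returns the comparison statistics:
    adjusted R-squared, F-statistic, and root mean square error."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])          # add intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # least-squares coefficients
    resid = y - Xd @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    rmse = float(np.sqrt(ss_res / n))
    return adj_r2, f_stat, rmse
```

The weighting system whose ratings yield the highest adjusted R-squared and F-statistic (and lowest RMSE) against the same setlist features is the one most strongly validated by the show elements.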

The weighted ratings that perform best are the MRYM and its closest alternative, “Alt 1”, both of which give relatively low weights to anomalous raters and achieve the largest improvements over the current rating system.

Further work is still needed. For example, the cutoff values for the deviation and entropy metrics were rather arbitrary, as were the values for the weights. Testing cutoff and weight assumptions is not difficult, and the conclusion is likely to remain: weighted show ratings are statistically superior to the current system.

Will weighted show ratings come to Phish.Net in the future? I hope so, but the coding effort needed to get everything automated may prove challenging. For now, assessing the performance and stability of .Net’s new coding architecture remains the top priority, after which we can ask the coders to think about weighted show ratings.
