
Resorting Media Ratings

Command-line tool providing interactive statistical pairwise ranking and sorting of items

User-created datasets using ordinal scales (such as media ratings) tend to drift or ‘clump’ towards the extremes and fail to be as informative as possible, falling prey to ceiling effects and making it difficult to distinguish between the mediocre & excellent.

This can be counteracted by rerating the dataset to create a uniform (and hence, informative) distribution of ratings—but such manual rerating is difficult.

I provide an anytime CLI program, resorter, written in R (it should be cross-platform, but has only been tested on Linux), which keeps track of comparisons, infers underlying ratings assuming that they are noisy in the Elo-like Bradley-Terry model, and interactively & intelligently queries the user with comparisons of the media with the most uncertain current ratings, until the user ends the session and a fully rescaled set of ratings is output.

Resorter reads in a CSV file of names (with an optional second column of ratings) and then, governed by a number of options, asks the user about specific comparisons until the user decides that the rankings are probably accurate enough (unlike standard comparison sorts, the comparisons are allowed to be noisy!) and quits the interactive session, at which point the underlying rankings are inferred and a uniformly-distributed re-ranked set of ratings is printed or written out.

Use

Available features:

$ ./resorter --help
# usage: resorter [--help] [--verbose] [--no-scale] [--no-progress] [--opts OPTS] [--input INPUT] [--output OUTPUT] [--queries QUERIES]
#                 [--levels LEVELS] [--quantiles QUANTILES]
#
# sort a list using comparative rankings under the Bradley-Terry statistical model; see https://gwern.net/resorter
#
#
# flags:
#   -h, --help            show this help message and exit
#   -v, --verbose         whether to print out intermediate statistics
#   --no-scale           Do not discretize/bucket the final estimated latent ratings into 1-l levels/ratings; print out inferred latent scores.
#   --no-progress         Do not print out mean standard error of items
#
# optional arguments:
#   -x, --opts OPTS           RDS file containing argument values
#   -i, --input INPUT         input file: a CSV file of items to sort: one per line, with up to two columns. (eg. both 'Akira\n' and 'Akira, 10\n' are valid).
#   -o, --output OUTPUT           output file: a file to write the final results to. Default: printing to stdout.
#   -n, --queries QUERIES         Maximum number of questions to ask the user; if already rated, 𝒪(items) is a good max, but the more items and more levels in
#                                 the scale, the more comparisons are needed. [default: 2147483647]
#   -l, --levels LEVELS            The highest level; rated items will be discretized into 1-l levels, so l=5 means items are bucketed into 5 levels: [1,2,3,4,5],
#                                  etc [default: 5]
#   -q, --quantiles QUANTILES         What fraction to allocate to each level; space-separated; overrides `--levels`. This allows making one level of ratings
#                                     narrower (and more precise) than the others, at their expense; for example, one could make 3-star ratings rarer with quantiles
#                                      like `--quantiles '0 0.25 0.8 1'`. Default: uniform distribution (1--5 → '0.0 0.2 0.4 0.6 0.8 1.0').

Here is an example demonstrating the use of resorter to have the user make 10 comparisons, rescale the scores into 1–3 ratings (where 3 is rare), and write the new rankings to a file:

$ cat anime.txt
# "Cowboy Bebop", 10
# "Monster", 10
# "Neon Genesis Evangelion: The End of Evangelion", 10
# "Gankutsuou", 10
# "Serial Experiments Lain", 10
# "Perfect Blue", 10
# "Jin-Rou", 10
# "Death Note", 10
# "Last Exile", 9
# "Fullmetal Alchemist", 9
# "Gunslinger Girl", 9
# "RahXephon", 9
# "Trigun", 9
# "Fruits Basket", 9
# "FLCL", 9
# "Witch Hunter Robin", 7
# ".hack//Sign", 7
# "Chobits", 7
# "Full Metal Panic!", 7
# "Mobile Suit Gundam Wing", 7
# "El Hazard: The Wanderers", 7
# "Mai-HiME", 6
# "Kimi ga Nozomu Eien", 6
$ ./resorter.R --input anime.txt --output new-ratings.txt --queries 10 --quantiles '0 0.33 0.9 1'
# Comparison commands: 1=yes, 2=tied, 3=second is better, p=print estimates, s=skip, q=quit
# Mean stderr: 70182  | Do you like 'RahXephon' better than 'Perfect Blue'? 2
# Mean stderr: 21607  | Do you like 'Monster' better than 'Death Note'? 1
# Mean stderr: 13106  | Do you like 'Kimi ga Nozomu Eien' better than 'Gunslinger Girl'? 3
# Mean stderr: 13324  | Do you like 'Kimi ga Nozomu Eien' better than 'Gankutsuou'? p
#                                           Media         Estimate              SE
#                                        Mai-HiME −2.059917500e+01 18026.307910587
#                             Kimi ga Nozomu Eien −2.059917500e+01 18026.308021536
#                              Witch Hunter Robin −1.884110950e-15     2.828427125
#                                     .hack//Sign −7.973125767e-16     2.000000000
#                                         Chobits  0.000000000e+00     0.000000000
#                               Full Metal Panic!  2.873961741e-15     2.000000000
#                         Mobile Suit Gundam Wing  5.846054261e-15     2.828427125
#                        El Hazard: The Wanderers  6.694628513e-15     3.464101615
#                                      Last Exile  1.911401531e+01 18026.308784841
#                             Fullmetal Alchemist  1.960906858e+01 18026.308743986
#                                 Gunslinger Girl  2.010412184e+01 18026.308672318
#                                            FLCL  2.059917511e+01 18026.308236990
#                                   Fruits Basket  2.059917511e+01 18026.308347939
#                                       RahXephon  2.059917511e+01 18026.308569837
#                                          Trigun  2.059917511e+01 18026.308458888
#                                         Jin-Rou  2.109422837e+01 18026.308740073
#                                      Death Note  2.109422837e+01 18026.308765943
#                                    Perfect Blue  2.109422837e+01 18026.308672318
#                         Serial Experiments Lain  2.158928163e+01 18026.308767629
#                                      Gankutsuou  2.208433490e+01 18026.308832127
#  Neon Genesis Evangelion: The End of Evangelion  2.257938816e+01 18026.308865814
#                                    Cowboy Bebop  2.307444143e+01 18026.308979636
#                                         Monster  2.307444143e+01 18026.308868687
# Mean stderr: 13324  | Do you like 'Monster' better than 'Jin-Rou'? s
# Mean stderr: 13324  | Do you like 'Last Exile' better than 'Death Note'? 3
# Mean stderr: 13362  | Do you like 'Mai-HiME' better than 'Serial Experiments Lain'? 3
# Mean stderr: 16653  | Do you like 'Trigun' better than 'Kimi ga Nozomu Eien'? 1
# Mean stderr: 14309  | Do you like 'Death Note' better than 'Kimi ga Nozomu Eien'? 1
# Mean stderr: 12644  | Do you like 'Trigun' better than 'Monster'? 3
$ cat new-ratings.txt
# "Media","Quantile"
# "Cowboy Bebop","3"
# "Monster","3"
# "Neon Genesis Evangelion: The End of Evangelion","3"
# "Death Note","2"
# "FLCL","2"
# "Fruits Basket","2"
# "Fullmetal Alchemist","2"
# "Gankutsuou","2"
# "Gunslinger Girl","2"
# "Jin-Rou","2"
# "Last Exile","2"
# "Perfect Blue","2"
# "RahXephon","2"
# "Serial Experiments Lain","2"
# "Trigun","2"
# "Chobits","1"
# "El Hazard: The Wanderers","1"
# "Full Metal Panic!","1"
# ".hack//Sign","1"
# "Kimi ga Nozomu Eien","1"
# "Mai-HiME","1"
# "Mobile Suit Gundam Wing","1"
# "Witch Hunter Robin","1"

Source Code

This R script requires the BradleyTerry2 & argparser libraries to be available or installable. (Andrew Quinn’s GitHub version may be easier to install.)

Dependency Compile Errors: install packages

As with any user-installed language, you may run into compile errors while installing dependencies.

Look for a missing C library to install system-wide as root, or see if your package manager provides one of the dependencies already as a precompiled binary, so that you can apt-get install r-cran-bradleyterry2 or apt-get install r-cran-lme4 to bypass any compile errors.

Save it as a script somewhere in your $PATH named resorter, and chmod +x resorter to make it executable.

#!/usr/bin/Rscript
 
# attempt to load a library implementing the Bradley-Terry model for inferring rankings based on
# comparisons; if it doesn't load, try to install it through R's in-language package management;
# otherwise, abort and warn the user
# https://www.jstatsoft.org/index.php/jss/article/download/v048i09/601
loaded <- library(BradleyTerry2, quietly=TRUE, logical.return=TRUE)
if (!loaded) { write("warning: R library 'BradleyTerry2' unavailable; attempting to install locally...", stderr())
              install.packages("BradleyTerry2")
              loadedAfterInstall <- library(BradleyTerry2, quietly=TRUE, logical.return=TRUE)
              if(!loadedAfterInstall) { write("error: 'BradleyTerry2' unavailable and cannot be installed. Aborting.", stderr()); quit() }
}
# similarly, but for the library to parse command line arguments:
loaded <- library(argparser, quietly=TRUE, logical.return=TRUE)
if (!loaded) { write("warning: R library 'argparser' unavailable; attempting to install locally...", stderr())
              install.packages("argparser")
              loadedAfterInstall <- library(argparser, quietly=TRUE, logical.return=TRUE)
              if(!loadedAfterInstall) { write("error: 'argparser' unavailable and cannot be installed. Aborting.", stderr()); quit() }
}
p <- arg_parser("sort a list using comparative rankings under the Bradley-Terry statistical model; see https://gwern.net/resorter", name="resorter")
p <- add_argument(p, "--input", short="-i",
                  "input file: a CSV file of items to sort: one per line, with up to two columns. (eg. both 'Akira\\n' and 'Akira, 10\\n' are valid).", type="character")
p <- add_argument(p, "--output", "output file: a file to write the final results to. Default: printing to stdout.")
p <- add_argument(p, "--verbose", "whether to print out intermediate statistics", flag=TRUE)
p <- add_argument(p, "--queries", short="-n", default=NA,
                  "Maximum number of questions to ask the user; defaults to N*log(N) comparisons. If already rated, 𝒪(n) is a good max, but the more items and more levels in the scale and more accuracy desired, the more comparisons are needed.")
p <- add_argument(p, "--levels", short="-l", default=5, "The highest level; rated items will be discretized into 1–l levels, so l=5 means items are bucketed into 5 levels: [1,2,3,4,5], etc. Maps onto quantiles; valid values: 2–100.")
p <- add_argument(p, "--quantiles", short="-q", "What fraction to allocate to each level; space-separated; overrides `--levels`. This allows making one level of ratings narrower (and more precise) than the others, at their expense; for example, one could make 3-star ratings rarer with quantiles like `--quantiles '0 0.25 0.8 1'`. Default: uniform distribution (1--5 → '0.0 0.2 0.4 0.6 0.8 1.0').")
p <- add_argument(p, "--no-scale", flag=TRUE, "Do not discretize/bucket the final estimated latent ratings into 1-l levels/ratings; print out inferred latent scores.")
p <- add_argument(p, "--progress", flag=TRUE, "Print out mean standard error of items")
argv <- parse_args(p)
 
# read in the data from either the specified file or stdin:
if(!is.na(argv$input)) { ranking <- read.csv(file=argv$input, stringsAsFactors=TRUE, header=FALSE); } else {
    ranking <- read.csv(file=file('stdin'), stringsAsFactors=TRUE, header=FALSE); }
 
# turns out noisy sorting is fairly doable in 𝒪(n * log(n)), so do that plus 1 to round up:
if (is.na(argv$queries)) { n <- nrow(ranking)
                           argv$queries <- round(n * log(n) + 1) }
 
# if user did not specify a second column of initial ratings, then put in a default of '1':
if(ncol(ranking)==1) { ranking$Rating <- 1; }
colnames(ranking) <- c("Media", "Rating")
# A set of ratings like 'foo,1\nbar,2' is not comparisons, though. We *could* throw out everything except the 'Media' column
# but we would like to accelerate the interactive querying process by exploiting the valuable data the user has given us.
# So we 'seed' the comparison dataset based on input data: higher rating means +1, lower means −1, same rating == tie (0.5 to both)
comparisons <- NULL
for (i in 1:(nrow(ranking)-1)) {
 rating1 <- as.numeric(ranking[i,]$Rating) ## HACK: coerce from factor to number; otherwise resorter crashes on reading its own output with "'>' not meaningful for factors"
 media1 <- ranking[i,]$Media
 rating2 <- as.numeric(ranking[i+1,]$Rating)
 media2 <- ranking[i+1,]$Media
 if (rating1 == rating2)
  {
     comparisons <- rbind(comparisons, data.frame("Media.1"=media1, "Media.2"=media2, "win1"=0.5, "win2"=0.5))
  } else { if (rating1 > rating2)
           {
            comparisons <- rbind(comparisons, data.frame("Media.1"=media1, "Media.2"=media2, "win1"=1, "win2"=0))
            } else {
              comparisons <- rbind(comparisons, data.frame("Media.1"=media1, "Media.2"=media2, "win1"=0, "win2"=1))
                   } } }
# the use of '0.5' is recommended by the BT2 paper, despite causing quasi-spurious warnings:
# > In several of the data examples (eg. `?CEMS`, `?springall`, `?sound.fields`), ties are handled by the crude but
# > simple device of adding half of a 'win' to the tally for each player involved; in each of the examples where this
# > has been done it is found that the result is similar, after a simple re-scaling, to the more sophisticated
# > analyses that have appeared in the literature. Note that this device when used with `BTm` typically gives rise to
# > warnings produced by the back-end glm function, about non-integer 'binomial' counts; such warnings are of no
# > consequence and can be safely ignored. It is likely that a future version of `BradleyTerry2` will have a more
# > general method for handling ties.
suppressWarnings(priorRankings <- BTm(cbind(win1, win2), Media.1, Media.2, data = comparisons))
 
if(argv$verbose) {
  print("higher=better:")
  print(summary(priorRankings))
  print(sort(BTabilities(priorRankings)[,1]))
}
 
set.seed(2015-09-10)
cat("Comparison commands: 1=yes, 2=tied, 3=second is better, p=print estimates, s=skip, q=quit\n")
for (i in 1:argv$queries) {
 
 # with the current data, calculate and extract the new estimates:
 suppressWarnings(updatedRankings <- BTm(cbind(win1, win2), Media.1, Media.2, br=TRUE, data = comparisons))
 coefficients <- BTabilities(updatedRankings)
 # sort by latent variable 'ability':
 coefficients <- coefficients[order(coefficients[,1]),]
 
 if(argv$verbose) { print(i); print(coefficients); }
 
 # select two media to compare: pick the media with the highest standard error and the media above or below it with the highest standard error:
 # which is a heuristic for the most informative pairwise comparison. BT2 appears to get caught in some sort of a fixed point with greedy selection,
 # so every few rounds pick a random starting point:
 media1N <- if (i %% 3 == 0) { sample.int(nrow(coefficients), 1) } else { which.max(coefficients[,2]) }
 media2N <- if (media1N == nrow(coefficients)) { nrow(coefficients)-1; } else { # if at the top & 1st place, must compare to 2nd place
   if (media1N == 1) { 2; } else { # if at the bottom/last place, must compare to 2nd-to-last
    # if neither at bottom nor top, then there are two choices, above & below, and we want the one with highest SE; if equal, arbitrarily choose the better:
    if ( (coefficients[,2][media1N+1]) > (coefficients[,2][media1N-1])  ) { media1N+1 } else { media1N-1 } } }
 
 targets <- row.names(coefficients)
 media1 <- targets[media1N]
 media2 <- targets[media2N]
 
 if (argv$`progress`) { cat(paste0("Mean stderr: ", round(mean(coefficients[,2]))), " | "); }
 cat(paste0("Is '", as.character(media1), "' greater than '", as.character(media2), "'? "))
 rating <- scan("stdin", character(), n=1, quiet=TRUE)
 
 switch(rating,
        "1" = { comparisons <- rbind(comparisons, data.frame("Media.1"=media1, "Media.2"=media2, "win1"=1, "win2"=0)) },
        "3" = { comparisons <- rbind(comparisons, data.frame("Media.1"=media1, "Media.2"=media2, "win1"=0, "win2"=1)) },
        "2" = { comparisons <- rbind(comparisons, data.frame("Media.1"=media1, "Media.2"=media2, "win1"=0.5, "win2"=0.5))},
        "p" = { estimates <- data.frame(Media=row.names(coefficients), Estimate=coefficients[,1], SE=coefficients[,2]);
                print(comparisons)
                print(warnings())
                print(summary(updatedRankings))
                print(estimates[order(estimates$Estimate),], row.names=FALSE) },
        "s" = {},
        "q" = { break; }
        )
}
 
# results of all the questioning:
if(argv$verbose) { print(comparisons); }
 
suppressWarnings(updatedRankings <- BTm(cbind(win1, win2), Media.1, Media.2, ~ Media, id = "Media", data = comparisons))
coefficients <- BTabilities(updatedRankings)
if(argv$verbose) { print(rownames(coefficients)[which.max(coefficients[,2])]);
                 print(summary(updatedRankings))
                 print(sort(coefficients[,1])) }
 
ranking2 <- as.data.frame(BTabilities(updatedRankings))
ranking2$Media <- rownames(ranking2)
rownames(ranking2) <- NULL
 
if(!(argv$`no_scale`)) {
 
    # if the user specified a bunch of buckets using `--quantiles`, parse it and use it,
    # otherwise, take `--levels` and make a uniform distribution
    levels <- max(2,min(100,argv$levels+1)) ## needs to be 2–100
    quantiles <- if (!is.na(argv$quantiles)) { (sapply(strsplit(argv$quantiles, " "), as.numeric))[,1]; } else {
      seq(0, 1, length.out=levels); }
    ranking2$Quantile <- with(ranking2, cut(ability,
                                    breaks=quantile(ability, probs=quantiles),
                                    labels=1:(length(quantiles)-1),
                                    include.lowest=TRUE))
    df <- subset(ranking2[order(ranking2$Quantile, decreasing=TRUE),], select=c("Media", "Quantile"));
    if (!is.na(argv$output)) { write.csv(df, file=argv$output, row.names=FALSE) } else { print(df); }
} else { # return just the latent continuous scores:
         df <- data.frame(Media=rownames(coefficients), Estimate=coefficients[,1]);
         if (!is.na(argv$output)) { write.csv(df[order(df$Estimate, decreasing=TRUE),], file=argv$output, row.names=FALSE); } else {
             print(df[order(df$Estimate, decreasing=TRUE),], row.names=FALSE); }
       }
 
cat("\nResorting complete")

Background

Rating Inflation

In rating hundreds of media on a review site like GoodReads, Amazon, MyAnimeList, Uber/Lyft, freelancing markets etc, the distributions tend to become ‘lumpy’ and concentrate on the top few possible ratings: if it’s a 10-point scale, you won’t see many below 7, usually, or if it’s 5 stars, then anything below 4 stars indicates profound hatred, leading to a J-shaped distribution and the Internet’s version of grade inflation. After enough time and inflation, the ratings have degenerated into a noninformative binary rating scale, and some sites, recognizing this, abandon the pretense, like YouTube or Netflix switching from 5 stars to like/dislike. The self-selection and other issues in existing ratings, like publication bias, have some perverse consequences: for example, winning an award can lower a book’s GoodReads average rating, because the award leads a wider, less receptive, audience to read the book (Kovács & Sharkey 2014).

This is unfortunate if you want to provide ratings & reviews to other people and indicate your true preferences; when I look on MALgraph and see that over half my anime ratings are in the range 8–10, then my ratings having degenerated into a roughly 1–3 scale (crap/good/great) makes it harder to see which ones are truly worth watching & also which ones I might want to go back and rewatch. So the ratings carry much less information than one might have guessed from the scale (a 10-point scale has 3.32 bits of information in each rating, but if it has de facto degenerated into a 3-point scale, then the information has halved to 1.58 bits). If instead my ratings were uniformly distributed over the 1–10 scale, such that 10% were rated 1, 10% were rated 2, etc, then my ratings would be much more informative, and my opinion clearer. (First-world problems, perhaps, but still annoying.)
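Those information figures are just the entropy of a uniform discrete distribution, log2(n) bits for an n-level scale; a quick check (a plain-Python sketch, not part of resorter):

```python
import math

def scale_entropy(n_levels: int) -> float:
    """Entropy in bits of a rating drawn uniformly from an n-level scale."""
    return math.log2(n_levels)

# a 10-point scale used uniformly carries log2(10) bits per rating;
# ratings degenerated into a de facto 3-point scale carry about half that:
print(round(scale_entropy(10), 2))  # → 3.32
print(round(scale_entropy(3), 2))   # → 1.58
```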

What Distribution Should Ratings Be?

If the J-shaped distribution is no good, what distribution ought one’s ratings to be?

A normal distribution has a certain esthetic appeal, and is common, but that’s no justification. First, there’s not much reason to think that media ought to be normally-distributed in the first place. The universality of the normal distribution comes from the CLT, and how many variables being added will tend to yield a normal; however, are works of art only the sum of their parts? I would say no: they are something more like the multiple of their parts, and multiplying out many random variables tends to give something like a log-normal distribution instead. If art were really normally distributed, why do we so often watch an anime and say, “that had really great art and music and voice-acting but… the plot was so stupid I couldn’t enjoy it at all and I have to give it a low rating and will never watch it again”? After all, if they only summed, all the other variables would be more than good enough to make the overall score great too. Meanwhile, a work which does well on all aspects may achieve greatness simply through lacking any flaws—it all comes together into more than the sum of its parts. Second, we are not watching random samples drawn from the overall distribution; we are generally trying to draw a highly biased sample from the right tail of quality (ie. watch only good ones). Our sample of ratings would, if we are doing a good job, look nothing like the overall distribution, in much the same way that a graph of heights in the NBA would look nothing like a graph of heights in the US population.

A uniform distribution has a similar problem. Its advantage, mentioned above, is that it contains more information: a normal distribution clumps ratings towards the middle, with most falling around 5 and only a few reaching 1 or 10, while a uniform distribution maximizes the information—the worst 10% get rated 1, and so on up to the best 10% getting rated 10. If you learn something is a 5 from a uniform distribution, you learn a lot more (it’s in the 50–60% range) than if it were a normal distribution (it’s 50±??%).

But I think I would rather look at a critic’s ratings if they used a normal distribution instead of a uniform distribution. Why? Because, as I said, we usually try to watch good stuff, and if I see something rated a 10, I know it’s in the top percentile or two, while the uniform would be 5–10× less selective and a 10 there only tells me that it’s somewhere in the top decile, which is not too selective. If a critic thought something was in the top percentile, I will pay attention; but top decile is not too useful in helping me find things I may want.

So I think our target distribution ought to maximize usefulness: it is just a summary of an unknowable underlying distribution, which we summarize pragmatically as ratings, and so statistics like ratings can only be right or wrong as defined by their intended use. On a rating website, we are not interested in making fine distinctions among the mediocre or the trash. We are looking for interesting new candidates to consider; we are looking for the best. A skew maximizes the information provided to the reader in the region of interest of likely recommendations. So our distribution ought to throw most of our ratings into an uninformative ‘meh’ bucket, and spend more time on the right-tail extreme: do we think a given work an all-time masterpiece, or exceptional, or merely good?

What we want is a reverse J-shaped distribution: one in which most ratings are the lowest, and only a few are the highest. (For smaller sets of ratings, a less extreme skew would be useful: it’s no good if most of the rating levels are almost empty. One might also struggle, even using pairwise comparisons, to express such precise percentiles, and fall back to a minimum bucket size of, say, 5%.)

In terms of percentiles, with my ~458 anime watched/dropped, I might bucket it like this:

  1. 0–50% (n = 231)

  2. 51–75% (n = 112)

  3. 76%–89% (n = 62)

  4. 90–93% (n = 16)

  5. 94–95% (n = 11)

  6. 96% (n = 7)

  7. 97% (n = 6)

  8. 98% (n = 6)

  9. 99% (n = 5)

  10. >99.5% (n = 2)

That would correspond to using the option --quantiles '0 0.5 0.75 0.89 0.93 0.95 0.96 0.97 0.98 0.995 1' (the boundaries must run from 0 to 1, or the extreme items fall outside the cut points).
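To see roughly how a quantile vector translates into bucket sizes, here is a plain-Python sketch (not part of resorter; the actual counts from the R `cut` call can differ slightly when estimated abilities are tied), using the '0 0.33 0.9 1' vector from the example session over its 23 titles:

```python
def bucket_sizes(n_items, quantiles):
    """Approximate number of items per level, given cumulative quantile
    boundaries (first 0.0, last 1.0) as in resorter's --quantiles option."""
    cuts = [round(q * n_items) for q in quantiles]
    return [hi - lo for lo, hi in zip(cuts, cuts[1:])]

# three levels: bottom third, a wide middle, and a rare top ~10%:
print(bucket_sizes(23, [0, 0.33, 0.9, 1]))  # → [8, 13, 2]
```

The session above ends up with an [8, 12, 3] split instead, because the cut points fall on tied ability estimates rather than exact item counts.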

Rescaling

Interactive Ranking Through Sorting

For only a few ratings, like 10 or 20, it’s easy to manually review and rescale ratings and resolve the ambiguities (which ‘8’ is worse than the other ‘8’s and should be knocked down to a 7?), but past that, it starts to become tedious, and because judgment is so fragile and subjective and ‘choice fatigue’ begins to set in, I find myself having to repeatedly scan the list and ask myself “is X really better than Y…? hm…”. So, unsurprisingly, for a large corpus like my 408 anime or 2,059 books, I’ve never bothered to try to do this—much less do it occasionally as the ratings drift.

If I had some sort of program which could query me repeatedly for new ratings, store the results, and then spit out a consolidated list of exactly how to change ratings, then I might be able to, once in a great while, correct my ratings. This would help us re-rate our existing media corpuses, but it could also help us sort other things: for example, we could try to prioritize future books or movies by taking all the ones we marked ‘to read’ and then doing comparisons to rank each item by how excited we are about it or how important it is or how much we want to read it soon. (If we have way too many things to do, we could also sort our overall To-Do lists this way, but probably it’s easy to sort them by hand.)

But it can’t ask me for absolute ratings, because if I were able to easily give uniformly-distributed ratings, I wouldn’t have this problem in the first place! So instead it should calculate rankings, and then take the final ranking and distribute it across however many buckets there are in the scale: if I rate 100 anime for MAL’s 1–10 ranking, then it will put the bottom 10 in the ‘1’ bucket, the 10–20th into the ‘2’ bucket, etc. This can be done automatically.

How to get rankings: if there are 1,000 media, it’s impossible for me to explicitly rank a book ‘#952’, or ‘#501’. Nobody has that firm a grip. Perhaps it would be better for it to ask me to compare pairs of media? Comparison is much more natural, less fatiguing, and helps my judgment by reminding me of what other media there are and how I think of them—when a terrible movie gets mentioned in the same breath as a great movie, it reminds you why one was terrible and the other great. Comparison also immediately suggests an implementation as the classic comparison-sort algorithms like Quicksort or Mergesort, where the comparison argument is an IO function which simply calls out to the user; yielding a reasonable 𝒪(n · log(n)) number of queries (and possibly much fewer, 𝒪(n), if we have the pre-existing ratings and can treat it as an adaptive sort). So we could solve this with a simple script in any decent programming language like Python, and this sort of re-sorting is provided by websites such as Flickchart.
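The idea of reusing a classic comparison sort with the user as the comparator can be sketched in a few lines (illustrative plain Python; here the ‘user’ is stubbed out as a callback whose true preference happens to be alphabetical):

```python
import functools

def interactive_sort(items, ask):
    """Sort `items` using `ask(a, b)`, a callback returning True if the
    user says `a` ranks above `b`; O(n log n) questions in the worst case."""
    return sorted(items, key=functools.cmp_to_key(
        lambda a, b: -1 if ask(a, b) else 1))

# stub 'user' whose true preference is alphabetical order:
ranked = interactive_sort(["Monster", "Akira", "Trigun"],
                          lambda a, b: a < b)
print(ranked)  # → ['Akira', 'Monster', 'Trigun']
```

In a real session, `ask` would print the pair and read the answer from stdin, exactly as resorter’s R loop does.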

Noisy Sorting

The comparison-sort algorithms make an assumption that is outrageously unrealistic in this context: they assume that the comparison is 100% accurate.

That is, say you have two sorted lists of 1,000 elements each, and you compare the lowest element of one with the highest element of the other and the first is higher: then all 1,000 items in the first are taken to be higher than all 1,000 items in the second—of the (1,000 choose 2) = 499,500 possible pairwise comparisons, the sorting algorithms assume that not a single item is out of place, not a single pairwise comparison would be incorrect. This assumption is fine in programming, since the CPU is good at comparing bits for equality & may go trillions of operations without making a single mistake; but it’s absurd to expect this level of reliability of a human for any task, much less one in which we are clumsily feeling for the invisible movements of our souls in response to great artists.

Our comparisons of movies, or books, or music are error-prone, and so we need some sort of statistical sorting algorithm. How much does this hurt? Do we get much worse sample-efficiency?

No. It turns out that the “noisy sorting” setting is not that different from the comparison-sort setting: the optimal asymptotic performance there too is 𝒪(n · log(n)). The penalty from the comparison error rate p just gets folded into the constants (since as p approaches 0, it gets easier and turns into regular comparison sort).
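One intuition for why noise only costs a constant factor: a comparator that errs with probability p < 0.5 can be boosted to any desired reliability by repeating each query an odd number of times and taking a majority vote, which multiplies the query count by a constant independent of n. A toy plain-Python illustration (the 20% error rate and 5 repeats are arbitrary choices, not resorter parameters):

```python
import random

def majority_compare(a, b, noisy_cmp, repeats=5):
    """Reduce a noisy comparator's error rate by majority vote."""
    votes = sum(noisy_cmp(a, b) for _ in range(repeats))
    return votes * 2 > repeats

random.seed(0)
# true order is numeric, but the comparator lies 20% of the time:
noisy = lambda a, b: (a < b) != (random.random() < 0.2)
agree = sum(majority_compare(3, 7, noisy) for _ in range(1000))
# a single query is right ~80% of the time; the 5-way vote, ~94%:
print(agree)
```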

Some ref­er­ences:

Implementation

So we’d like a command-line tool which consumes a list of pairs of media & ratings, then queries the user repeatedly with pairs of media to get the user’s rating of which one is better, somehow modeling underlying scores while allowing for the user to be wrong in multiple comparisons, ideally picking whatever is the ‘most informative’ next pair to ask about so as to converge on accurate rankings as quickly as possible, and, after enough questions, doing a final inference of the full ranking of media and mapping it onto a uniform distribution over a particular rating scale.

The natural way to see the problem is to treat every competitor as having an unobserved latent variable, a ‘quality’ or ‘ability’ or ‘skill’, on a cardinal scale which is measured with error by comparisons, and then to weaken transitivity so as to chain our comparisons: if A beats B and B beats C, then probably A beats C, weighted by how many times we’ve observed each beating and how precise our estimates of A’s & C’s latent variables are.

Paired or comparison-based data comes up a lot in competitive contexts, such as the famous Elo rating system or the Bayesian TrueSkill. A general model for dealing with it is the Bradley-Terry model.
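In the Bradley-Terry model, each item i has a latent ability a_i, and the probability that i beats j is exp(a_i) / (exp(a_i) + exp(a_j)), the same logistic form as Elo. A minimal plain-Python sketch of the likelihood and a two-item maximum-likelihood fit by gradient ascent (illustrative only; BradleyTerry2 fits via its glm back-end, not like this):

```python
import math

def p_win(a_i, a_j):
    """Bradley-Terry probability that item i beats item j."""
    return math.exp(a_i) / (math.exp(a_i) + math.exp(a_j))

def fit_two(wins_i, wins_j, steps=2000, lr=0.1):
    """Fit one ability from head-to-head win counts against a reference
    item pinned at 0 (only ability *differences* are identified)."""
    a = 0.0
    for _ in range(steps):
        p = p_win(a, 0.0)
        # gradient of the binomial log-likelihood with respect to a:
        a += lr * (wins_i - (wins_i + wins_j) * p)
    return a

a = fit_two(3, 1)  # item i won 3 of 4 comparisons
print(round(p_win(a, 0.0), 2))  # → 0.75 (MLE matches the observed win rate)
```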

There are at least two packages in R for dealing with Bradley-Terry models, BradleyTerry2 and prefmod. The latter supports more sophisticated analyses than the former, but it wants its data in an inconvenient format, while BradleyTerry2 is more easily incrementally updated. (I want to supply my data to the library as a long list of triplets of media/media/rating, which is then easily updated with user input by simply adding another triplet to the bottom; but prefmod wants a column for each possible media, which is awkward to start with and gets worse if there are 1,000+ media to compare.)

We start with two vectors: the list of media, and the list of original ratings. The original ratings, while lumpy and in need of improving, are still useful information which should not be ignored; if BT2 were a Bayesian library, we could use them as informative priors, but it’s not, so instead I adopt a hack in which the program runs through the list, comparing each anime with the next anime: if equal, it’s treated as a tie, and otherwise they are recorded as a win/loss—so even from the start we’ve made a lot of progress towards inferring their latent scores.
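The seeding hack is easy to state outside R too: walk the list once and, for each adjacent pair, emit a win, loss, or half-win tie according to the original ratings. A plain-Python sketch of the same transformation that resorter’s R loop performs:

```python
def seed_comparisons(ratings):
    """Convert an ordered list of (title, rating) pairs into
    (a, b, win_a, win_b) comparison rows over each adjacent pair;
    ties score half a win to each side, as recommended by the BT2 paper."""
    rows = []
    for (m1, r1), (m2, r2) in zip(ratings, ratings[1:]):
        if r1 == r2:
            rows.append((m1, m2, 0.5, 0.5))
        elif r1 > r2:
            rows.append((m1, m2, 1, 0))
        else:
            rows.append((m1, m2, 0, 1))
    return rows

print(seed_comparisons([("Monster", 10), ("Trigun", 9), ("Chobits", 9)]))
# → [('Monster', 'Trigun', 1, 0), ('Trigun', 'Chobits', 0.5, 0.5)]
```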

Then a loop asks the user n comparisons; we want to ask about the media whose estimates are most uncertain, and one way of estimating that is to look at which media have the biggest standard errors. You might think that asking about only the two media with the current biggest standard errors would be the best exploration strategy (since this corresponds to reinforcement-learning and multi-armed-bandit approaches where you start off by exploring the most uncertain things), but surprisingly, this leads to asking about the same media again and again, even as the standard error plummets. I’m not sure why this happens, but I think it has to do with the prior information creating an ordering or hierarchy on all the media, and then the maximum-likelihood estimation leads to repeating the same question again and again to shift the ordering up or down the cardinal scale (even though our discretized ratings will be invariant no matter how the estimates are translated up or down). So instead, we alternate between asking about the media with the largest standard error (choosing the nearest neighbor, up or down, with the larger standard error), and occasionally picking at random, so it usually focuses on the most uncertain ones but will occasionally ask about other media as well. (This is somewhat analogous to fixed-epsilon exploration in RL.)
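That alternating heuristic could be sketched as follows (Python for illustration; the function name and the epsilon value are hypothetical, not resorter’s actual code):

```python
import random

def next_question(ranking, std_errors, epsilon=0.2, rng=random):
    """Pick the next pair to compare. ranking: items in current estimated
    order; std_errors: item -> standard error of its estimated score.
    Usually exploit (most-uncertain item plus its more-uncertain rank
    neighbor); with probability epsilon, explore a random pair."""
    if rng.random() < epsilon:
        return tuple(rng.sample(ranking, 2))
    focus = max(ranking, key=lambda m: std_errors[m])
    i = ranking.index(focus)
    neighbors = [ranking[j] for j in (i - 1, i + 1) if 0 <= j < len(ranking)]
    partner = max(neighbors, key=lambda m: std_errors[m])
    return (focus, partner)

# B is most uncertain; its neighbors are A and C, and C is more uncertain:
print(next_question(["A", "B", "C"], {"A": 0.1, "B": 0.9, "C": 0.4},
                    epsilon=0.0))  # ('B', 'C')
```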

After all the ques­tions are asked, a final es­ti­ma­tion is done, the media are sorted by rank, and mapped back onto the user-specified rat­ing scale.
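The final rescaling amounts to quantile-bucketing the sorted ranking; a sketch (Python for illustration; `rescale` is a hypothetical helper, with higher numbers meaning better):

```python
def rescale(ranked_media, levels=10):
    """Map a best-first ranking onto a uniform rating scale: split the
    list into `levels` equal quantile buckets, highest rating first."""
    n = len(ranked_media)
    return {m: levels - (i * levels) // n for i, m in enumerate(ranked_media)}

print(rescale(["A", "B", "C", "D"], levels=4))
# {'A': 4, 'B': 3, 'C': 2, 'D': 1}
```

Because each bucket receives an equal share of the ranking, the output ratings are uniformly distributed by construction.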

Why Not Bayes?

Bayesian Improvements

There are two prob­lems with this ap­proach:

  1. it does not come with any prin­ci­pled in­di­ca­tion of un­cer­tainty,

  2. and as a con­se­quence, it may not be ask­ing the op­ti­mal ques­tions.

The first is the bigger problem, because we have no way of knowing when to stop. Perhaps after 10 questions, the real uncertainty is still high and our final ratings will still be missorted; or perhaps diminishing returns have set in and it would take so many more questions to stamp out the error that we would prefer just to correct the few erroneous ratings manually in a once-over. And we don’t want to crank up the question count to something burdensome like 200 just on the off-chance that the sorting is insufficient. Instead, we’d like some comprehensible probability that each media is assigned to its correct bucket, and perhaps a bound on overall error: for example, I would be satisfied if I could specify something like ‘90% probability that all media are at least in their correct bucket’.⁠⁠1⁠

Since BT-2 is fundamentally a frequentist library, it will never give us this kind of answer; all it has to offer in this vein are p-values—which are answers to questions I’ve never asked—and standard errors, which are indicators of uncertainty & better than nothing, but still imperfect. Between our interest in a sequential-trial approach and our interest in producing meaningful probabilities of errors (and my own preference for Bayesian approaches in general), this motivates looking at Bayesian implementations.

The Bradley-Terry model can be fit easily in JAGS/Stan; two examples, less & more elaborate respectively, are Jim Albert’s implementation (& Shawn E. Hallinan’s), and btstan. Although it doesn’t implement the extended B-T with ties, Albert’s example is easily adapted, and we could try to infer rankings like so:

comparisons2 <- comparisons[!(comparisons$win1==0.5),]
 
teamH = as.numeric(comparisons2$Media.1)
teamA = as.numeric(comparisons2$Media.2)
y = comparisons2$win1
n = comparisons2$win1 + comparisons2$win2
N = length(y)
J = length(levels(comparisons$Media.1))
data = list(y = y, n = n, N = N, J = J, teamA = teamA, teamH = teamH)
data ## reusing the baseball data for this example:
# $y
#  [1] 4 4 4 6 4 6 3 4 4 6 6 4 2 4 2 4 4 6 3 5 2 4 4 6 5 2 3 4 5 6 2 3 3 4 4 2 2 1 1 2 1 3
#
# $n
#  [1] 7 6 7 7 6 6 6 6 7 6 7 7 7 7 6 7 6 6 6 6 7 7 6 7 6 7 6 6 7 6 7 6 7 7 6 6 7 6 7 6 7 7
#
# $N
# [1] 42
#
# $J
# [1] 7
#
# $teamA
#  [1] 4 7 6 2 3 1 5 7 6 2 3 1 5 4 6 2 3 1 5 4 7 2 3 1 5 4 7 6 3 1 5 4 7 6 2 1 5 4 7 6 2 3
#
# $teamH
#  [1] 5 5 5 5 5 5 4 4 4 4 4 4 7 7 7 7 7 7 6 6 6 6 6 6 2 2 2 2 2 2 3 3 3 3 3 3 1 1 1 1 1 1
 
model1 <- "model {
    for (i in 1:N){
        logit(p[i]) <- a[teamH[i]] - a[teamA[i]]
        y[i] ~ dbin (p[i], n[i])
    }
    for (j in 1:J){
        a[j] ~ dnorm(0, tau)
    }
    tau <- pow(sigma, -2)
    sigma ~ dunif (0, 100)
}"
library(runjags)
j1 <- autorun.jags(model=model1, monitor=c("a", "y"), data=data); j1

(This reuses the converting-priors-to-data trick from before.) This yields sensible ratings, but MCMC unavoidably has overhead compared to a quick iterative maximum-likelihood algorithm which only estimates the parameters—it takes about 1s to run on the example data. Much of that overhead is from JAGS’s setup & interpretation of the model, and could probably be halved by using Stan instead (since it caches compiled models); but regardless, 0.5s is too long for particularly pleasant interactive use (total time from response to next question should be <0.1s for the best UX), though it might be worth paying if we can get much better questions. There might be speedups available: this is about estimating latent normal/Gaussian variables, which frequently have fast implementations. For example, it would not surprise me if it turned out that the Bayesian inference could be done outside of MCMC through an analytic solution or, more likely, a Laplacian approximation, such as implemented in INLA, which supports latent Gaussians. More recently, Stan supports gradient-descent optimization-based Bayesian inference (variational inference), which delivers accurate-enough posteriors while being fast enough for interactive use.

Optimal Exploration

So, moving on: with an MCMC implementation, we can look at how to optimally sample data. In this case, since we’re collecting data and the user can stop anytime they feel like it, we aren’t interested in loss functions so much as in maximizing information gain—figuring out which pair of media yields the highest “expected information gain”.

One possible approach would be to repeat our heuristic: estimate the entropy of each media, pick the one with the highest entropy/widest posterior distribution, and pick a random comparison for it. A better one might be to vary this: pick the two media which most overlap (more straightforward with posterior samples of the latent variables than with mere confidence intervals, since confidence intervals, after all, do not mean “the true variable is in this range with 95% probability”). That seems much better but still probably not optimal, as it ignores any downstream effects—we might reduce uncertainty more by sampling in the middle of a big cluster of media rather than sampling two outliers near each other.
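Picking the two most-overlapping media from posterior samples might be sketched like so (Python for illustration; measuring overlap as how close the empirical P(A > B) is to 50% is one reasonable choice, not the only one):

```python
from itertools import combinations

def most_overlapping_pair(posterior_samples):
    """Return the pair of items whose posterior score samples overlap
    the most, i.e. whose empirical P(first > second) is closest to 50%.
    posterior_samples maps item -> list of sampled latent scores."""
    def p_first_beats_second(a, b):
        return sum(x > y for x, y in zip(a, b)) / len(a)
    return min(combinations(posterior_samples, 2),
               key=lambda pair: abs(p_first_beats_second(
                   posterior_samples[pair[0]],
                   posterior_samples[pair[1]]) - 0.5))

# A and B overlap heavily; C is far above both:
samples = {"A": [1.0, 1.2, 0.9], "B": [1.1, 1.0, 1.3], "C": [3.0, 2.8, 3.1]}
print(most_overlapping_pair(samples))  # ('A', 'B')
```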

Formal EI (expected information gain) is not too well documented at an implementation level, but the most general form of the algorithm seems to go: for each possible action (in this case, each possible pair of media):

  1. draw a sample from the posterior as a hypothetical outcome (from the MCMC object, draw a score estimate for both media and compare which is higher);

  2. rerun & update the analysis (add the result of the comparison to the dataset & run the MCMC again on it to yield a new MCMC object);

  3. on these two old & new MCMC objects, calculate the entropy of the distributions (in this case, perhaps the sum of the logs of the probability, for each possible pair of media, of the higher one being higher), and subtract new from old;

  4. repeat ~10 times for each action so the estimated change in entropy is reasonably stable.

Then find the possible action with the largest expected change in entropy, and execute that.
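The EI loop can be sketched as follows (Python for illustration; `simulate_outcome`, `refit`, and `entropy` are hypothetical stand-ins for the MCMC machinery):

```python
def expected_information_gain(pairs, simulate_outcome, refit, entropy,
                              data, n_draws=10):
    """Rank candidate questions by average entropy reduction.
    simulate_outcome draws a hypothetical comparison result from the
    current posterior; refit reruns the analysis on augmented data;
    entropy scores a fitted posterior's uncertainty."""
    base = entropy(refit(data))
    gains = {}
    for pair in pairs:
        deltas = []
        for _ in range(n_draws):
            outcome = simulate_outcome(pair)        # hypothetical result
            new_fit = refit(data + [outcome])       # rerun the analysis
            deltas.append(base - entropy(new_fit))  # entropy reduction
        gains[pair] = sum(deltas) / n_draws         # average over draws
    return max(gains, key=gains.get)                # most informative question
```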

Computational Requirement

Without even implementing it, we can see this is infeasible from a computational perspective: there are (media choose 2) possible actions; each MCMC run takes ~1s; there must be ~10 MCMC runs per action; and the entropy calculations require a bit of time. With just 20 media, that’d require ((20 choose 2) × 10 × 1s) / 60 = (190 × 10) / 60 ≈ 31 minutes.
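That arithmetic, spelled out (Python for illustration):

```python
from math import comb

# Back-of-the-envelope cost of exhaustive EI: one candidate action per
# pair of media, ~10 MCMC refits per candidate, ~1s per refit.
n_media, refits_per_pair, seconds_per_refit = 20, 10, 1
total_minutes = comb(n_media, 2) * refits_per_pair * seconds_per_refit / 60
print(comb(n_media, 2), round(total_minutes, 1))  # 190 31.7
```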

We could run these in par­al­lel⁠⁠2⁠, could try to cut down the MCMC it­er­a­tions even more and risk the oc­ca­sional non-convergence, could try cal­cu­lat­ing EI for only a lim­ited num­ber of choices (per­haps 20 ran­domly cho­sen op­tions), or look for some closed-form an­a­lytic so­lu­tion (per­haps as­sum­ing nor­mal dis­tri­b­u­tions?)—but it’s hard to see how to get total run­time down to 0.5s—much less ~0.1s. And on the pack­ag­ing side of things, re­quir­ing users to have JAGS (much less Stan) in­stalled at all makes resorter harder to in­stall and much more likely that there will be opaque run­time er­rors.

If asking the user to compare two things were rare and expensive data, if we absolutely had to maximize the informational value of every single question, if we were doing hardcore science like astronomical observations where every minute of telescope time might be measured in thousands of dollars, then a runtime of 1 hour to optimize question choice would be fine. But we’re not. If the user needs to be asked 5 additional questions because the standard-error exploration heuristic is suboptimal, that costs them a couple of seconds. Not a big deal. And if we’re not exploiting the additional power of Bayesian methods, why bother switching from BT-2 to JAGS?



  1.  

    I think this could be done by sampling the posterior: draw a random sample of the estimated score from the posterior for each media and discretize them all down; do this, say, 100 times; compare the 100 discretized rankings and see how often any media changes categories. If every media is in the same bucket in, say, 95 of those samples—The Godfather is in the 5-star bucket in 95 of the 100 samples and in the 4-star bucket in the other 5—then the remaining uncertainty in the estimated scores must be small.
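    A sketch of this check (Python for illustration; names hypothetical):

```python
import random
from collections import Counter

def bucket_stability(posterior_samples, levels=5, n_draws=100):
    """How often does each item land in its most common rating bucket?
    posterior_samples maps item -> list of sampled latent scores."""
    items = list(posterior_samples)
    counts = {m: Counter() for m in items}
    for _ in range(n_draws):
        # one joint draw: sample a score per item, rank, then discretize
        draw = {m: random.choice(posterior_samples[m]) for m in items}
        ranked = sorted(items, key=draw.get, reverse=True)
        for i, m in enumerate(ranked):
            counts[m][(i * levels) // len(items)] += 1
    return {m: max(c.values()) / n_draws for m, c in counts.items()}

# With well-separated posteriors, every item always stays in its bucket:
print(bucket_stability({"A": [3.0], "B": [2.0], "C": [1.0]},
                       levels=3, n_draws=10))
# {'A': 1.0, 'B': 1.0, 'C': 1.0}
```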

  2.  

    We could also spec­u­la­tively ex­e­cute while wait­ing for the user to make the cur­rent rat­ing, as there are only 3 pos­si­ble choices (bet­ter/worse/tie) and we have to com­pute 1 of them even­tu­ally any­way; if we com­pute the op­ti­mal next choice for all 3 hy­po­thet­i­cal choices so we al­ready know what next to ask, then as far as the user is con­cerned, it may ap­pear in­stan­ta­neous with zero la­tency.
