Resorting Media Ratings
Command-line tool providing interactive statistical pairwise ranking and sorting of items
User-created datasets using ordinal scales (such as media ratings) tend to drift or ‘clump’ towards the extremes and fail to be as informative as possible, falling prey to ceiling effects and making it difficult to distinguish between the mediocre & excellent. This can be counteracted by rerating the dataset to create a uniform (and hence, informative) distribution of ratings—but such manual rerating is difficult.
I provide an anytime CLI program, resorter, written in R (it should be cross-platform but has been tested only on Linux), which keeps track of comparisons, infers underlying ratings on the assumption that they are noisy under the Elo-like Bradley-Terry model, and interactively & intelligently queries the user with comparisons of the media with the most uncertain current ratings, until the user ends the session and a fully rescaled set of ratings is output.
Use
Available features:
$ ./resorter --help
# usage: resorter [--help] [--verbose] [--no-scale] [--no-progress] [--opts OPTS] [--input INPUT] [--output OUTPUT] [--queries QUERIES]
# [--levels LEVELS] [--quantiles QUANTILES]
#
# sort a list using comparative rankings under the Bradley-Terry statistical model; see https://gwern.net/resorter
#
#
# flags:
# -h, --help show this help message and exit
# -v, --verbose whether to print out intermediate statistics
# --no-scale Do not discretize/bucket the final estimated latent ratings into 1-l levels/ratings; print out inferred latent scores.
# --no-progress Do not print out mean standard error of items
#
# optional arguments:
# -x, --opts OPTS RDS file containing argument values
# -i, --input INPUT input file: a CSV file of items to sort: one per line, with up to two columns. (eg. both 'Akira\n' and 'Akira, 10\n' are valid).
# -o, --output OUTPUT output file: a file to write the final results to. Default: printing to stdout.
# -n, --queries QUERIES Maximum number of questions to ask the user; if already rated, 𝒪(items) is a good max, but the more items and more levels in
# the scale, the more comparisons are needed. [default: 2147483647]
# -l, --levels LEVELS The highest level; rated items will be discretized into 1-l levels, so l=5 means items are bucketed into 5 levels: [1,2,3,4,5],
# etc [default: 5]
# -q, --quantiles QUANTILES What fraction to allocate to each level; space-separated; overrides `--levels`. This allows making one level of ratings
# narrower (and more precise) than the others, at their expense; for example, one could make 3-star ratings rarer with quantiles
# like `--quantiles '0 0.25 0.8 1'`. Default: uniform distribution (1--5 → '0.0 0.2 0.4 0.6 0.8 1.0').

Here is an example demonstrating the use of resorter to have the user make 10 comparisons, rescale the scores into 1–3 ratings (where 3 is rare), and write the new rankings to a file:
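A plausible invocation (the filenames here are hypothetical; the flags are those documented in the help text above, with the quantiles chosen so the top rating covers only the best 10%):

```shell
./resorter --input ratings.csv --queries 10 \
    --quantiles '0 0.4 0.9 1' --output newratings.csv
```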
Source Code
This R script requires the BradleyTerry2 & argparser libraries to be available or installable. (Andrew Quinn’s Github version may be easier to install.)
Dependency Compile Errors: install packages
As with any user-compiled R packages, installation can fail with compile errors due to missing system dependencies. Look for a missing C library to install system-wide, or install the distribution’s prebuilt packages (apt-get install r-cran-bradleyterry2 or apt-get install r-cran-lme4) to bypass any compile errors.
Save it as a script somewhere in your $PATH named resorter, and chmod +x resorter to make it executable.
#!/usr/bin/Rscript
# attempt to load a library implementing the Bradley-Terry model for inferring rankings based on
# comparisons; if it doesn't load, try to install it through R's in-language package management;
# otherwise, abort and warn the user
# https://www.jstatsoft.org/index.php/jss/article/download/v048i09/601
loaded <- library(BradleyTerry2, quietly=TRUE, logical.return=TRUE)
if (!loaded) { write("warning: R library 'BradleyTerry2' unavailable; attempting to install locally...", stderr())
install.packages("BradleyTerry2")
loadedAfterInstall <- library(BradleyTerry2, quietly=TRUE, logical.return=TRUE)
if(!loadedAfterInstall) { write("error: 'BradleyTerry2' unavailable and cannot be installed. Aborting.", stderr()); quit() }
}
# similarly, but for the library to parse command line arguments:
loaded <- library(argparser, quietly=TRUE, logical.return=TRUE)
if (!loaded) { write("warning: R library 'argparser' unavailable; attempting to install locally...", stderr())
install.packages("argparser")
loadedAfterInstall <- library(argparser, quietly=TRUE, logical.return=TRUE)
if(!loadedAfterInstall) { write("error: 'argparser' unavailable and cannot be installed. Aborting.", stderr()); quit() }
}
p <- arg_parser("sort a list using comparative rankings under the Bradley-Terry statistical model; see https://gwern.net/resorter", name="resorter")
p <- add_argument(p, "--input", short="-i",
"input file: a CSV file of items to sort: one per line, with up to two columns. (eg. both 'Akira\\n' and 'Akira, 10\\n' are valid).", type="character")
p <- add_argument(p, "--output", "output file: a file to write the final results to. Default: printing to stdout.")
p <- add_argument(p, "--verbose", "whether to print out intermediate statistics", flag=TRUE)
p <- add_argument(p, "--queries", short="-n", default=NA,
"Maximum number of questions to ask the user; defaults to N*log(N) comparisons. If already rated, 𝒪(n) is a good max, but the more items and more levels in the scale and more accuracy desired, the more comparisons are needed.")
p <- add_argument(p, "--levels", short="-l", default=5, "The highest level; rated items will be discretized into 1–l levels, so l=5 means items are bucketed into 5 levels: [1,2,3,4,5], etc. Maps onto quantiles; valid values: 2–100.")
p <- add_argument(p, "--quantiles", short="-q", "What fraction to allocate to each level; space-separated; overrides `--levels`. This allows making one level of ratings narrower (and more precise) than the others, at their expense; for example, one could make 3-star ratings rarer with quantiles like `--quantiles '0 0.25 0.8 1'`. Default: uniform distribution (1--5 → '0.0 0.2 0.4 0.6 0.8 1.0').")
p <- add_argument(p, "--no-scale", flag=TRUE, "Do not discretize/bucket the final estimated latent ratings into 1-l levels/ratings; print out inferred latent scores.")
p <- add_argument(p, "--progress", flag=TRUE, "Print out mean standard error of items")
argv <- parse_args(p)
# read in the data from either the specified file or stdin:
if(!is.na(argv$input)) { ranking <- read.csv(file=argv$input, stringsAsFactors=TRUE, header=FALSE); } else {
ranking <- read.csv(file=file('stdin'), stringsAsFactors=TRUE, header=FALSE); }
# turns out noisy sorting is fairly doable in 𝒪(n * log(n)), so do that plus 1 to round up:
if (is.na(argv$queries)) { n <- nrow(ranking)
argv$queries <- round(n * log(n) + 1) }
# if user did not specify a second column of initial ratings, then put in a default of '1':
if(ncol(ranking)==1) { ranking$Rating <- 1; }
colnames(ranking) <- c("Media", "Rating")
# A set of ratings like 'foo,1\nbar,2' is not comparisons, though. We *could* throw out everything except the 'Media' column
# but we would like to accelerate the interactive querying process by exploiting the valuable data the user has given us.
# So we 'seed' the comparison dataset based on input data: higher rating means +1, lower means −1, same rating == tie (0.5 to both)
comparisons <- NULL
for (i in 1:(nrow(ranking)-1)) {
rating1 <- as.numeric(ranking[i,]$Rating) ## HACK: crashes on reading its own output due to "'>' not meaningful for factors" if we do not coerce from factor to number
media1 <- ranking[i,]$Media
rating2 <- as.numeric(ranking[i+1,]$Rating)
media2 <- ranking[i+1,]$Media
if (rating1 == rating2)
{
comparisons <- rbind(comparisons, data.frame("Media.1"=media1, "Media.2"=media2, "win1"=0.5, "win2"=0.5))
} else { if (rating1 > rating2)
{
comparisons <- rbind(comparisons, data.frame("Media.1"=media1, "Media.2"=media2, "win1"=1, "win2"=0))
} else {
comparisons <- rbind(comparisons, data.frame("Media.1"=media1, "Media.2"=media2, "win1"=0, "win2"=1))
} } }
# the use of '0.5' is recommended by the BT2 paper, despite causing quasi-spurious warnings:
# > In several of the data examples (eg. `?CEMS`, `?springall`, `?sound.fields`), ties are handled by the crude but
# > simple device of adding half of a 'win' to the tally for each player involved; in each of the examples where this
# > has been done it is found that the result is similar, after a simple re-scaling, to the more sophisticated
# > analyses that have appeared in the literature. Note that this device when used with `BTm` typically gives rise to
# > warnings produced by the back-end glm function, about non-integer 'binomial' counts; such warnings are of no
# > consequence and can be safely ignored. It is likely that a future version of `BradleyTerry2` will have a more
# > general method for handling ties.
suppressWarnings(priorRankings <- BTm(cbind(win1, win2), Media.1, Media.2, data = comparisons))
if(argv$verbose) {
print("higher=better:")
print(summary(priorRankings))
print(sort(BTabilities(priorRankings)[,1]))
}
set.seed(2015-09-10)
cat("Comparison commands: 1=yes, 2=tied, 3=second is better, p=print estimates, s=skip, q=quit\n")
for (i in 1:argv$queries) {
# with the current data, calculate and extract the new estimates:
suppressWarnings(updatedRankings <- BTm(cbind(win1, win2), Media.1, Media.2, br=TRUE, data = comparisons))
coefficients <- BTabilities(updatedRankings)
# sort by latent variable 'ability':
coefficients <- coefficients[order(coefficients[,1]),]
if(argv$verbose) { print(i); print(coefficients); }
# select two media to compare: pick the media with the highest standard error and the media above or below it with the highest standard error:
# which is a heuristic for the most informative pairwise comparison. BT2 appears to get caught in some sort of a fixed point with greedy selection,
# so every few rounds pick a random starting point:
media1N <- if (i %% 3 == 0) { sample.int(nrow(coefficients), 1) } else { which.max(coefficients[,2]) }
media2N <- if (media1N == nrow(coefficients)) { nrow(coefficients)-1; } else { # if at the top & 1st place, must compare to 2nd place
if (media1N == 1) { 2; } else { # if at the bottom/last place, must compare to 2nd-to-last
# if neither at bottom nor top, then there are two choices, above & below, and we want the one with highest SE; if equal, arbitrarily choose the better:
if ( (coefficients[,2][media1N+1]) > (coefficients[,2][media1N-1]) ) { media1N+1 } else { media1N-1 } } }
targets <- row.names(coefficients)
media1 <- targets[media1N]
media2 <- targets[media2N]
if (argv$`progress`) { cat(paste0("Mean stderr: ", round(mean(coefficients[,2]), digits=3)), " | "); }
cat(paste0("Is '", as.character(media1), "' greater than '", as.character(media2), "'? "))
rating <- scan("stdin", character(), n=1, quiet=TRUE)
switch(rating,
"1" = { comparisons <- rbind(comparisons, data.frame("Media.1"=media1, "Media.2"=media2, "win1"=1, "win2"=0)) },
"3" = { comparisons <- rbind(comparisons, data.frame("Media.1"=media1, "Media.2"=media2, "win1"=0, "win2"=1)) },
"2" = { comparisons <- rbind(comparisons, data.frame("Media.1"=media1, "Media.2"=media2, "win1"=0.5, "win2"=0.5))},
"p" = { estimates <- data.frame(Media=row.names(coefficients), Estimate=coefficients[,1], SE=coefficients[,2]);
print(comparisons)
print(warnings())
print(summary(updatedRankings))
print(estimates[order(estimates$Estimate),], row.names=FALSE) },
"s" = {},
"q" = { break; }
)
}
# results of all the questioning:
if(argv$verbose) { print(comparisons); }
suppressWarnings(updatedRankings <- BTm(cbind(win1, win2), Media.1, Media.2, ~ Media, id = "Media", data = comparisons))
coefficients <- BTabilities(updatedRankings)
if(argv$verbose) { print(rownames(coefficients)[which.max(coefficients[,1])]);
print(summary(updatedRankings))
print(sort(coefficients[,1])) }
ranking2 <- as.data.frame(BTabilities(updatedRankings))
ranking2$Media <- rownames(ranking2)
rownames(ranking2) <- NULL
if(!(argv$`no_scale`)) {
# if the user specified a bunch of buckets using `--quantiles`, parse it and use it,
# otherwise, take `--levels` and make a uniform distribution
levels <- max(2, min(100, argv$levels)) + 1 ## clamp `--levels` to the valid 2–100 range; +1 because n buckets need n+1 quantile breakpoints
quantiles <- if (!is.na(argv$quantiles)) { (sapply(strsplit(argv$quantiles, " "), as.numeric))[,1]; } else {
seq(0, 1, length.out=levels); }
ranking2$Quantile <- with(ranking2, cut(ability,
breaks=quantile(ability, probs=quantiles),
labels=1:(length(quantiles)-1),
include.lowest=TRUE))
df <- subset(ranking2[order(ranking2$Quantile, decreasing=TRUE),], select=c("Media", "Quantile"));
if (!is.na(argv$output)) { write.csv(df, file=argv$output, row.names=FALSE) } else { print(df); }
} else { # return just the latent continuous scores:
df <- data.frame(Media=rownames(coefficients), Estimate=coefficients[,1]);
if (!is.na(argv$output)) { write.csv(df[order(df$Estimate, decreasing=TRUE),], file=argv$output, row.names=FALSE); } else {
    print(df[order(df$Estimate, decreasing=TRUE),], row.names=FALSE); }
}
cat("\nResorting complete")

Background
Rating Inflation
In rating hundreds of media on a review site like GoodReads, Amazon, MyAnimeList, or Uber/Lyft, ratings tend to inflate: they drift upward over time and clump near the top of the scale.
This is unfortunate if you want to provide ratings & reviews to other people and indicate your true preferences; when I look at MALgraph and see that over half my anime ratings are in the range 8–10, then, my ratings having degenerated into a roughly 1–3 scale (crap/OK/great), I am conveying far less information than a 10-point scale could.
What Distribution Should Ratings Be?
If the J-shaped distribution that ratings naturally decay into is uninformative, what distribution should we rerate towards instead?
A normal distribution has a certain esthetic appeal, and is common, but that’s no justification. First, there’s not much reason to think that media ought to be normally-distributed in quality: we don’t consume media at random but select the ones we expect to be good, so what we actually watch or read is a heavily selected, skewed sample.
A uniform distribution has a similar problem. Its advantage, mentioned above, is that it contains more information: a normal distribution clumps ratings towards the middle with most falling near 5, with only a few reaching 1 or 10, while a uniform distribution maximizes the information—the worst 10% get rated 1, and so on up to the best 10% getting rated 10. If you learn something is a 5 from a uniform distribution, you learn a lot more (it’s in the 50–60% range) than if it were a normal distribution (it’s 50±??%).
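To put rough numbers on that intuition (my illustration, not from the original text): the Shannon entropy of a rating distribution measures how much a single rating tells you. A uniform allocation over 10 levels achieves the maximum log₂(10) ≈ 3.32 bits, while a bell-shaped clump carries noticeably less:

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete rating distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.1] * 10                       # every rating equally common
clumped = [0.01, 0.02, 0.05, 0.12, 0.30,   # a rough bell-shaped clump
           0.30, 0.12, 0.05, 0.02, 0.01]   # (illustrative numbers)

print(round(entropy_bits(uniform), 2))  # 3.32
print(round(entropy_bits(clumped), 2))  # 2.57: each rating conveys less
```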
But, I think I would rather look at a critic’s ratings if they used a normal distribution instead of a uniform distribution. Why? Because, as I said, we usually try to watch good stuff, and if I see something rated a 10, I know it’s in the top percentile or two, while the uniform would be 5–10× less selective and a 10 there only tells me that it’s somewhere in the top decile, which is not too selective. If a critic thought something was in the top percentile, I will pay attention; but top decile is not too useful in helping me find things I may want.
So, I think our target distribution ought to maximize usefulness: it is just a summary of an unknowable underlying distribution, which we summarize pragmatically as ratings, and so statistics like ratings can only be right or wrong as defined by their intended use. On a rating website, we are not interested in making fine distinctions among the mediocre or the trash. We are looking for interesting new candidates to consider; we are looking for the best. A skew maximizes the information provided to the reader in the region of interest, that of likely recommendations. So our distribution ought to throw most of our ratings into an uninformative ‘meh’ bucket, and spend most of its resolution on the right-hand tail where the likely recommendations live.
What we want is a reverse J-curve: most media dumped into the bottom ratings, with each successive rating reserved for a smaller and smaller fraction of ever-better media.
In terms of percentiles, with my ~458 anime watched/rated, the buckets for a 1–10 scale might look like this:
0–50% (n = 231)
51–75% (n = 112)
76%–89% (n = 62)
90–93% (n = 16)
94–95% (n = 11)
96% (n = 7)
97% (n = 6)
98% (n = 6)
99% (n = 5)
>99.5% (n = 2)
That would correspond to using the option --quantiles '0 0.5 0.75 0.89 0.93 0.95 0.96 0.97 0.98 0.99 0.995'.
Rescaling
Interactive Ranking Through Sorting
For only a few ratings like 10 or 20, it’s easy to manually review and rescale ratings and resolve the ambiguities (which ‘8’ is worse than the other ’8’s and should be knocked down to 7?), but past that, it starts to become tedious and because judgment is so fragile and subjective and ‘choice fatigue’ begins to set in, I find myself having to repeatedly scan the list and ask myself “is X really better than Y…? hm…”. So unsurprisingly, for a large corpus like my 408 anime or 2059 books, I’ve never bothered to try to do this—much less do it occasionally as the ratings drift.
If I had some sort of program which could query me repeatedly for new ratings, store the results, and then spit out a consolidated list of exactly how to change ratings, then I might be able to, once in a great while, correct my ratings. This would help us re-rate the whole corpus without the fatigue of repeated unaided review.
But it can’t ask me for absolute ratings, because if I was able to easily give uniformly-
How to get rankings: if there are 1,000 media, it’s impossible for me to explicitly rank a book ‘#952’, or ‘#501’. Nobody has that firm a grip. Perhaps it would be better for it to ask me to compare pairs of media? Comparison is much more natural, less fatiguing, and helps my judgment by reminding me of what other media there are and how I think of them—when a terrible movie gets mentioned in the same breath as a great movie, it reminds you why one was terrible and the other great. Comparison also immediately suggests an implementation as the classic comparison sort algorithms like Quicksort or Mergesort, where the comparison argument is an IO function which simply calls out to the user; yielding a reasonable 𝒪(n · log(n)) number of queries (and possibly much less, 𝒪(n), if we have the pre-existing ratings to exploit as a starting point).
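The idea of a comparison sort whose comparator is a query to the user can be sketched as follows (an illustrative Python sketch, not resorter’s actual mechanism; the lambda stub stands in for reading the user’s answer from stdin):

```python
def merge_sort(items, better):
    """Mergesort where `better(a, b)` is an oracle (eg. a question put to
    the user) returning True if a should rank above b; O(n log n) queries."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = merge_sort(items[:mid], better), merge_sort(items[mid:], better)
    merged = []
    while left and right:
        # each of these comparisons would be one interactive question
        if better(left[0], right[0]):
            merged.append(left.pop(0))
        else:
            merged.append(right.pop(0))
    return merged + left + right

# stub standing in for interactive queries: compare by a known score
scores = {"Akira": 9, "Clannad": 7, "Pokemon": 4}
ranked = merge_sort(list(scores), lambda a, b: scores[a] > scores[b])
print(ranked)  # ['Akira', 'Clannad', 'Pokemon']
```

Each item participates in 𝒪(log n) merges, so a patient user answering every query yields the full ranking in 𝒪(n · log(n)) questions.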
Noisy Sorting
The comparison-sort algorithms, however, assume that comparisons are deterministic and error-free, and they exploit that assumption ruthlessly.
That is, say you have two sorted lists of 1,000 elements each, and you compare the lowest element of one with the highest element of the other; if the first is higher, then all 1,000 items in the first are taken to be higher than all 1,000 items in the second: of the (1,000 choose 2) = 499,500 possible pairwise comparisons, the sorting algorithms assume that not a single item is out of place, not a single pairwise comparison would be incorrect. This assumption is fine in programming, since the CPU is good at comparing bits for equality & may go trillions of operations without making a single mistake; but it’s absurd to expect this level of reliability of a human for any task, much less one in which we are clumsily feeling for the invisible movements of our souls in response to great artists.
Our comparisons of movies, or books, or music, are error-prone; we will sometimes give the wrong answer to a comparison we are asked. Does that noise doom the sorting approach?
No. It turns out that the “noisy sorting” setting is not that different from the comparison sort setting: the optimal asymptotic performance there too is 𝒪(n · log(n)). The penalty from the comparison error rate p just gets folded into the constants (since as p approaches 0, it gets easier and turns into regular comparison sort).
Some references:
Slater 1961, “Inconsistencies in a schedule of paired comparisons”
Adler et al 1994, “Selection in the presence of noise: The design of playoff systems”
Feige et al 1994, “Computing with noisy information”
Glickman 1999, “Parameter estimation in large dynamic paired comparison experiments”
Pelc 2002, “Searching games with errors—fifty years of coping with liars”
Radlinski & Joachims 2007, “Active exploration for learning rankings from clickthrough data”
Kenyon-Mathieu & Schudy 2007, “How to rank with few errors”
Ailon et al 2008, “Aggregating inconsistent information: ranking and clustering”
Braverman & Mossel 2008, “Noisy sorting without resampling”
Braverman & Mossel 2009, “Sorting from noisy information”
Yue & Joachims 2011, “Beat the Mean Bandit”
Houlsby et al 2011, “Bayesian Active Learning for Classification and Preference Learning”
Yue et al 2012, “The K-armed dueling bandits problem”
Busa-Fekete et al 2014, “Preference-based rank elicitation using statistical models: The case of Mallows”
Szörényi et al 2015, “Online Rank Elicitation for Plackett-Luce: A Dueling Bandits Approach”
Maystre & Grossglauser 2015, “Just Sort It! A Simple and Effective Approach to Active Preference Learning”
Christiano et al 2017, “Deep reinforcement learning from human preferences”
Henderson et al 2017, “OptionGAN: Learning Joint Reward-Policy Options using Generative Adversarial Inverse Reinforcement Learning”
et al 2017, “Analogical-based Bayesian Optimization”
Chen et al 2017, “Spectral Method and Regularized MLE Are Both Optimal for Top-K Ranking”
Kazemi et al 2018, “Comparison Based Learning from Weak Oracles”
et al 2018, “Model-based learning from preference data”
Implementation
So we’d like a command-
The natural way to see the problem is to treat every competitor as having an unobserved latent variable ‘quality’ or ‘ability’ or ‘skill’ on a cardinal scale which is measured with error by comparisons, and then weaken transitivity to chain our comparisons: if A beats B and B beats C, then probably A beats C, weighted by how many wins we’ve observed and how precise our estimates of A’s & C’s latent variables are.
Paired or comparison-based data like this is a classic statistical setting, and the standard approach is the Bradley-Terry model (closely related to chess’s Elo rating system).
There are at least two packages in R for dealing with Bradley-Terry models: BradleyTerry2 and prefmod. The latter supports more sophisticated analyses than the former, but it wants its data in an inconvenient format, while BradleyTerry2 is more easily incrementally updated. (I want to supply my data to the library as a long list of triplets of media/media/outcome; prefmod wants a column for each possible media, which is awkward to start with and gets worse if there are 1,000+ media to compare.)
We start with two vectors: the list of media, and the list of original ratings. The original ratings, while lumpy and in need of improving, are still useful information which should not be ignored; if BT-2 were a Bayesian library, we could use them as informative priors, but it’s not, so instead I adopt a hack in which the program runs through the list, comparing each anime with the next anime, and if equal, it’s treated as a tie, and otherwise they are recorded as a win/loss for the higher-rated one, seeding the comparison data with the information already in the old ratings.
Then a loop asks the user n comparisons; we want to ask about the media whose estimates are most uncertain, and one way of estimating that is looking at which media have the biggest standard error. You might think that asking about only the two media with the current biggest standard errors would be the best exploration strategy (since this corresponds to reinforcement learning and multi-armed bandit-style uncertainty sampling), but in practice the greedy strategy seems to get stuck in a fixed point, so every few rounds a random medium is picked as the starting point instead.
After all the questions are asked, a final estimation is done, the media are sorted by rank, mapped back onto the user-specified scale (either the 1–l levels or the explicit `--quantiles` buckets), and printed to stdout or written to the output file.
Why Not Bayes?
Bayesian Improvements
There are two problems with this approach:
it does not come with any principled indication of uncertainty,
and as a consequence, it may not be asking the optimal questions.
The first is the biggest problem because we have no way of knowing when to stop. Perhaps after 10 questions, the real uncertainty is still high and our final ratings will still be missorted; or perhaps diminishing returns have set in and it would have taken so many more questions to stamp out the error that we would prefer just to correct the few erroneous ratings manually in a once-over at the end.
Since BT-2 is fundamentally a frequentist library, it will never give us this kind of answer; all it has to offer in this vein are p-values and standard errors, which do not directly answer the question of how likely the remaining ratings are to be misordered.
The Bradley-Terry model is easy to express in a Bayesian framework, and has been implemented in JAGS & Stan, in Jim Albert’s example code & btstan respectively. Although it doesn’t implement the extended B-T with ties, Albert’s example is easily adapted, and we could try to infer rankings like thus:
comparisons2 <- comparisons[!(comparisons$win1==0.5),]
teamH = as.numeric(comparisons2$Media.1)
teamA = as.numeric(comparisons2$Media.2)
y = comparisons2$win1
n = comparisons2$win1 + comparisons2$win2
N = length(y)
J = length(levels(comparisons$Media.1))
data = list(y = y, n = n, N = N, J = J, teamA = teamA, teamH = teamH)
data ## reusing the baseball data for this example:
# $y
# [1] 4 4 4 6 4 6 3 4 4 6 6 4 2 4 2 4 4 6 3 5 2 4 4 6 5 2 3 4 5 6 2 3 3 4 4 2 2 1 1 2 1 3
#
# $n
# [1] 7 6 7 7 6 6 6 6 7 6 7 7 7 7 6 7 6 6 6 6 7 7 6 7 6 7 6 6 7 6 7 6 7 7 6 6 7 6 7 6 7 7
#
# $N
# [1] 42
#
# $J
# [1] 7
#
# $teamA
# [1] 4 7 6 2 3 1 5 7 6 2 3 1 5 4 6 2 3 1 5 4 7 2 3 1 5 4 7 6 3 1 5 4 7 6 2 1 5 4 7 6 2 3
#
# $teamH
# [1] 5 5 5 5 5 5 4 4 4 4 4 4 7 7 7 7 7 7 6 6 6 6 6 6 2 2 2 2 2 2 3 3 3 3 3 3 1 1 1 1 1 1
model1 <- "model {
for (i in 1:N){
logit(p[i]) <- a[teamH[i]] - a[teamA[i]]
y[i] ~ dbin (p[i], n[i])
}
for (j in 1:J){
a[j] ~ dnorm(0, tau)
}
tau <- pow(sigma, -2)
sigma ~ dunif (0, 100)
}"
library(runjags)
j1 <- autorun.jags(model=model1, monitor=c("a", "y"), data=data); j1

(This reuses the converting-factors-to-integers trick from earlier; ties are simply dropped, since this plain binomial model has no way to represent them.)
Optimal Exploration
So moving on: with an MCMC implementation, we can look at how to optimally sample data. In this case, since we’re collecting data and the user can stop anytime they feel like, we aren’t interested in loss functions so much as maximizing information gain—figuring out which pair of media yields the highest “expected information gain”.
One possible approach would be to repeat our heuristic: estimate the uncertainty of each media’s score from the posterior, pick the one with the highest entropy/posterior variance, and pair it with its most uncertain neighbor, as before, but now with principled uncertainty estimates.
Formal EI is not too well documented at an implementation level, but the most general form of the algorithm seems to go: for each possible action (in this case, each possible pair of media), draw a sample from the posterior as a hypothetical action (from the MCMC object, draw a score estimate for both media and compare which is higher), rerun & update the analysis (add the result of the comparison to the dataset & run the MCMC again on it to yield a new MCMC object), and on these two old and new MCMC objects, calculate the entropy of the distributions (in this case, the sum of all the logs of probability of the higher in each possible pair of media being higher?), and subtract new from old; do that ~10 times for each so the estimated change in entropy is reasonably stable; then find the possible action with the largest change in entropy, and execute that.
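A cheap stand-in for that full expected-information-gain loop (my simplification, not something resorter implements) is to ask about the pair whose predicted outcome is most uncertain: under Bradley-Terry, the pair whose estimated win probability is closest to 50%, computed directly from posterior ability samples without any refitting:

```python
import itertools, math, random

def win_prob(samples_a, samples_b):
    """Approximate P(A beats B) by averaging the Bradley-Terry win
    probability logistic(ability_A - ability_B) over posterior samples."""
    ps = [1 / (1 + math.exp(b - a)) for a, b in zip(samples_a, samples_b)]
    return sum(ps) / len(ps)

def most_uncertain_pair(posterior):
    """posterior: dict mapping item -> list of posterior ability samples.
    Return the pair whose outcome entropy is maximal, ie. whose win
    probability is nearest 0.5."""
    best, best_gap = None, float("inf")
    for a, b in itertools.combinations(posterior, 2):
        gap = abs(win_prob(posterior[a], posterior[b]) - 0.5)
        if gap < best_gap:
            best, best_gap = (a, b), gap
    return best

# toy posterior: 'X' and 'Y' nearly tied, 'Z' clearly worse
random.seed(0)
posterior = {
    "X": [random.gauss(1.0, 0.3) for _ in range(500)],
    "Y": [random.gauss(0.9, 0.3) for _ in range(500)],
    "Z": [random.gauss(-2.0, 0.3) for _ in range(500)],
}
print(most_uncertain_pair(posterior))  # ('X', 'Y'): the near-tie
```

(Choosing the maximum-entropy comparison is only a heuristic for expected information gain, since it ignores how much the posterior would actually tighten, but it costs milliseconds rather than an hour.)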
Computational Requirement
Even without implementing it, we can see this is infeasible. From a computational perspective: there are (media choose 2) possible actions; each MCMC run takes ~1s; there must be ~10 MCMC runs per action; and the entropy calculations require a bit of time. With just 20 media, that’d require ((20 choose 2) × 10 × 1s) / 60 ≈ 32 minutes of computation to choose each single question.
We could run these in parallel, or could try to cut down the MCMC iterations even more and risk the occasional non-converged estimate; but each such fix makes resorter harder to install and much more likely to hit opaque runtime errors.
If asking the user to compare two things were rare and expensive data, if we absolutely had to maximize the informational value of every single question, if we were doing hardcore science like astronomical observations where every minute of telescope time might be measured in thousands of dollars, then a runtime of 1 hour to optimize question choice would be fine. But we’re not. If the user needs to be asked 5 additional questions because the standard-error heuristic chose slightly suboptimal comparisons, that costs them a few seconds, not an hour of waiting.
See Also
External Links
- I think this could be done by sampling the posterior: draw a random sample of the estimated score from the posterior for each media and discretize them all down; do this, say, 100 times; compare the 100 discretized rankings and see how often any media changes categories. If every media is in the same bucket, say, in 95 of those samples—The Godfather is in the 5-star bucket in 95 of the 100 samples and in the 4-star bucket in the other 5—then the remaining uncertainty in the estimated scores must be small.
- We could also speculatively execute while waiting for the user to make the current rating, as there are only 3 possible choices (better/worse/tie) and we have to compute 1 of them eventually anyway; if we compute the optimal next choice for all 3 hypothetical choices so we already know what next to ask, then as far as the user is concerned, it may appear instantaneous with zero latency.