Darknet Market Archives (2013–2015)
Mirrors of ~89 Tor–Bitcoin darknet markets & forums, 2011–2015.
Dark Net Markets (DNM) are online markets typically hosted as Tor hidden services, providing escrow services between buyers & sellers transacting in Bitcoin or other cryptocoins, usually for drugs or other illegal/regulated goods; the most famous DNM was Silk Road 1, which pioneered the business model in 2011. Between 2013 and 2015, I scraped/mirrored on a weekly or daily basis all existing English-language DNMs as part of my research into their usage, lifetimes/characteristics, & legal riskiness; these scrapes covered vendor pages, feedback, images, etc. In addition, I made or obtained copies of as many other datasets & documents related to the DNMs as I could. This uniquely comprehensive collection is now publicly released as a 50GB (~1.6TB uncompressed) collection covering 89 DNMs & 37+ related forums, representing <4,438 mirrors, and is available for any research.
This page documents the download, contents, interpretation, and technical methods behind the scrapes.
Dark net markets have thrived since June 2011, when Adrian Chen published his famous Gawker article proving that Silk Road 1 was, contrary to my assumption when it was announced in January 2011, not a scam but a real, functioning marketplace.
This idyllic period ended with the raid on SR1 in October 2013, which ushered in a new age of chaos in which centralized markets battled for dominance and would-be successors rose and fell in quick succession.
And so, starting with the SR1 forums, which had not been taken down by the raid (to help the mole? I wondered at the time), I began scraping all the new markets, doing so weekly and sometimes daily starting in December 2013. These are the results.
Download
The full archive is available for download in multiple ways:
from the Internet Archive as a .torrent1 (item page; full file listing). This is the primary method, but possibly not the most convenient. (If the download does not start, it may be a torrent-client problem related to GetRight-style webseeding support; if the torrent does not work, all files can be downloaded normally over HTTP from the IA item page, but if possible, torrents are recommended for reducing the bandwidth burden & error-checking.) The ‘padding’ and tormarket-elpresidente.tar.mp3 / tormarket-elpresidente.tar.ogg files appear to be spurious and/or for IA internal use only, and can be safely deleted.
via Gwern.net over HTTP; see the DNM Archives file directory.
Research
Possible Uses
Here are some suggested uses:
providing information on vendors across markets, like their PGP key and feedback ratings
identifying arrested and flipped sellers (eg. the Weaponsguy sting on Agora)
individual drug and category popularity
total sales per day, with consequent turnover and commission estimates; correlates with Bitcoin or DNM-related search traffic, subreddit traffic, Bitcoin price or volume, etc
seller lifetimes, ratings, over time and by product sold
losses to DNM exit scams, or seller exit scams
reactions to exogenous shocks like Operation Onymous
survival analysis, and predictors of exit-scams (early finalization volume; site downtime; new vendors; etc); a sketch follows this list
topic modeling of forums
compilations of forum posts on lab tests estimating purity and safety
compilations of forum-posted Bitcoin addresses to examine the effectiveness of market tumblers
stylometric analysis of posters, particularly site staff (what is staff turnover like? do any markets ever change hands?)
deanonymization and information leaks (eg. GPS coordinates in metadata, usernames reused on the clearnet, valid emails in PGP public keys)
security practices: use of PGP, lifetime of individual keys, accidental posts of private rather than public keys, malformed or unusable public keys, etc
anthologies of real-world photos of particular drugs compiled from all sellers of them
simply browsing old listings, remembering the good times and bad times, the fallen and the free
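As a minimal sketch of the survival-analysis idea (assuming a hypothetical markets.csv tabulating, from the mirrors, one row per market with ISO-date opened/closed columns; all names here are illustrative), in R with the survival library:

library(survival)
markets <- read.csv("markets.csv", stringsAsFactors=FALSE) # hypothetical lifetimes table
markets$opened <- as.Date(markets$opened)
markets$closed <- as.Date(markets$closed) # NA = still alive at end of observation
end <- as.Date("2015-07-01")              # end of the observation window
last <- markets$closed
last[is.na(last)] <- end                  # still-alive markets are right-censored
markets$days  <- as.numeric(last - markets$opened)
markets$event <- !is.na(markets$closed)   # TRUE = observed death (raid, exit-scam, etc)
fit <- survfit(Surv(days, event) ~ 1, data=markets)
plot(fit, xlab="Days open", ylab="Fraction of markets surviving")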
Citing
Please cite this resource as:
Gwern Branwen, Nicolas Christin, David Décary-Hétu, Rasmus Munksgaard Andersen, StExo, El Presidente, Anonymous, Daryl Lau, Sohhlz, Delyan Kratunov, Vince Cakic, Van Buskirk, Whom, Michael McKenna, Sigi Goode. “Dark Net Market archives, 2011–2015”, 2015-07-12. Web. [access date] https://gwern.net/dnm-archive

@misc{dnmArchives,
    author = {Gwern Branwen and Nicolas Christin and David Décary-Hétu and Rasmus Munksgaard Andersen and StExo and El Presidente and Anonymous and Daryl Lau and Sohhlz and Delyan Kratunov and Vince Cakic and Van Buskirk and Whom and Michael McKenna and Sigi Goode},
    title = {Dark Net Market archives, 2011--2015},
    howpublished = {\url{https://gwern.net/dnm-archive}},
    url = {https://gwern.net/dnm-archive},
    type = {dataset},
    year = {2015},
    month = {July},
    timestamp = {2015-07-12},
    note = {Accessed: DATE}
}
Donations
A dataset like this owes its existence to many parties:
the DNMs could not exist without volunteers and nonprofits spending the money to pay for the bandwidth used by the Tor network; these scrapes collectively represent terabytes of consumed bandwidth. If you would like to donate towards keeping Tor servers running, you can donate to Torservers.net or the Tor Project itself
the Internet Archive hosts countless amazing resources, of which this is but one, and is a unique Internet resource; they accept many forms of donations
collating and creating these scrapes has absorbed an enormous amount of my time & energy, due to the need to solve CAPTCHAs, launch crawls on a daily or weekly basis, debug subtle glitches, work around site defenses, periodically archive scrapes to make disk space available, provide hosting for some scrapes released publicly, etc (my arbtt time-logs suggest >200 hours since 2013); I thank my supporters for their patience during this long project.
Contents
There are ~89 markets, >37 forums and ~5 other sites, representing <4,438 mirrors of >43,596,420 files in ~49.4GB of 163 compressed files, unpacking to >1,548GB; the largest single archive decompresses to <250GB. (It can be burned to 3×25GB BDs or 2×50GB BDs; if the former, it may be worth generating additional FEC.)
These archives are xz-compressed tarballs; each site's mirrors are organized by the date of the crawl (YYYY-MM-DD), each generally representing one crawl using wget, with the default directory structure wget creates. Included PAR2 redundancy files can be used to check & repair the archives with par2, QuickPar, Par Buddy, MultiPar, or others depending on one's OS.
If you don’t want to uncompress all of a particular archive, as they can be large, you can try extracting specific files using archiver wildcard options; for example, with tar:

tar --verbose --extract --xz --file='silkroad2-forums.tar.xz' --no-anchored --wildcards '*topic=49187*'

Kaggle versions:
“Dark Net Marketplace Data (Agora 2014–2015): Includes over 100,000 unique listings of drugs, weapons and more” (mirror)
“Drug Listing Dataset: Drug listing dataset from several darknet marketplaces”, Mun Hou Won (CSV of 1776 / Abraxas / Agora / Evolution / Nucleus / Outlaw Market / Silk Road 2 / The Marketplace; mirror)

Other: “Exploration and Analysis of Darknet Markets”, Daniel Petterson
Overall Coverage
Most of the material dates 2013–2015; some archives sourced from other people (before I began crawling) may date 2011–2012.
Specifically:
Markets:
1776
Abraxas
Agape
Agora
Alpaca
AlphaBay
Amazon Dark
Anarchia
Andromeda
Area51
Armory2
Atlantis
BlackBank Market
Black Goblin
BlackMarket Reloaded
Black Services Market
Bloomsfield
Blue Sky Market
Breaking Bad
bungee54
BuyItNow
Cannabis Road 1
Cannabis Road 2
Cannabis Road 3
Cantina
Cloud9
Crypto Market / Diabolus
DarkBay
Darklist
Darknet Heroes
DBay
Deepzon
Doge Road
Dream Market
Drugslist
East India Company
Evolution
FreeBay
Freedom Marketplace
Free Market
GreyRoad
Havana / Absolem
Haven
Horizon
Hydra
Ironclad
Kiss
Middle Earth
Mr Nice Guy 2
Nucleus
Onionshop
Outlaw Market
Oxygen
Panacea
Pandora
Pigeon
Pirate Market
Poseidon
Project Black Flag
Sheep
Silk Road 1
Silk Road 2
Silk Road Reloaded (I2P)
Silkstreet
Simply Bear
The BlackBox Market
The Majestic Garden
The Marketplace
The RealDeal
Tochka
TOM
Topix 2
TorBay
TorBazaar
TorEscrow
TorMarket
Tortuga 2
Underground Market
Utopia
Vault43
White Rabbit
Zanzibar Spice
Forums:
Abraxas
Agora
Andromeda
Black Market Reloaded
BlackBank Market
bungee54
Cannabis Road 2
Cannabis Road 3
DarkBay
Darknet Heroes
Diabolus
Doge Road
Evolution
Gobotal
GreyRoad
Havana / Absolem
Hydra
Kingdom
Kiss
Mr Nice Guy 1
Nucleus
Outlaw Market
Panacea
Pandora
Pigeon
Project Black Flag
Revolver
Silk Road 1
Silk Road 2
TOM
The Cave
The Hub
The Majestic Garden
The RealDeal
TorEscrow
TorBazaar
Tortuga 1
Underground Market
Unitech
Utopia
Miscellaneous:
Assassination Market
Cryuserv
DNM-related documents3
DNStats
Grams
Pedofunding
SR2doug’s leaks
Missing or incomplete
BMR
SR1
Blue Sky
TorMarket
Deepbay
Red Sun Marketplace
Sanitarium Market
EXXTACY
Mr Nice Guy 2
Interpreting & Analyzing
Scrapes can be difficult to analyze. They are large, complicated, redundant, and highly error-prone. No matter how much work one puts into it, one will never get an exact snapshot of a market at a particular instant: listings will go up or down as one crawls, vendors will be banned and their entire profile & listings & all feedback vanish instantly, Tor connection errors will cause a nontrivial % of page requests to fail, the site itself will go down (Agora especially), and Internet connections are imperfect. Scrapes can get bogged down in a backwater of irrelevant pages, or spend all their time downloading a morass of on-site dynamically-generated pages. So any analysis must take seriously the incompleteness of each crawl and the fact that there is a lot, and always will be a lot, of missing data, and do things like focus on what can be inferred from ‘random’ sampling, or explicitly model incompleteness by using markets’ self-reported category listing counts.
The contents cannot be taken at face value, either: listings, feedback, and prices may be manipulated by sellers or by the market itself. Knowing this, analyses should have some strategy to deal with missingness. There are a couple of tacks:
attempt to exploit “ground truths” to explicitly model and cope with varying degrees of missingness; there are a number of ground-truths available in the form of leaked seller data (screenshots & data), databases (leaked, hacked), and official statements (eg. the FBI’s quoted numbers about Silk Road 1’s total sales, number of accounts, number of transactions, etc). For one validation of this set of archives, see Bradley 2019’s “On the Resilience of the Dark Net Market Ecosystem to Law Enforcement Intervention”, which is able to compare the SR2 scrapes to data extracted from SR2 by UK law enforcement post-seizure, and finds that any scrape is incomplete (as expected) but that scrapes in general appear to be incomplete in similar ways and usable for analysis. For another attempt at validation, see Soska & Christin 2015’s “Measuring the Longitudinal Evolution of the Online Anonymous Marketplace Ecosystem”, which compares crawl-derived estimates to SR1 sales records produced at Ross Ulbricht’s trial (CSV / discussion), sales figures in the Blake Benthall SR2 criminal complaint, and an Agora seller’s leaked vendor profile; in all cases, the estimates are reasonably close to the ground-truth.
assume missing-at-random and use analyses insensitive to that, focusing on things like ratios
work with the data as is, writing results such that the biases and lower-bounds are explicit & emphasized
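To illustrate the category-count modeling approach, a minimal sketch in R (the numbers & column names are hypothetical, not drawn from any particular market): compare the listing counts a market displays on its category pages against the item pages actually present in the mirror, and reweight estimates accordingly:

cats <- data.frame(category   = c("Cannabis", "Stimulants", "Opioids"), # hypothetical
                   reported   = c(1200, 800, 350),  # count shown on the category page
                   downloaded = c(1100, 640, 330))  # item pages present in the mirror
cats$coverage <- cats$downloaded / cats$reported    # per-category completeness
sum(cats$downloaded) / sum(cats$reported)           # overall crawl completeness
cats$weight <- 1 / cats$coverage # inverse-probability weights to correct totals for missingness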
Individual Archives
Some of the archives are unusual and need to be described in more detail.
Aldridge & Décary-Hétu SR1
The September 2013 SR1 crawl is processed data stored in SPSS .sav data files. There are various libraries available for reading this format; in R, use the foreign library:

library(foreign)
sellers <- read.spss("Sellers---2013-09-15.sav", to.data.frame=TRUE)
Alpha2017 (McKenna & Goode)
A crawl of AlphaBay 2017-01-26–2017-01-28 and data extraction (using a Python script) provided by Michael McKenna & Sigi Goode. They also tried to crawl AB’s historical inactive listings in addition to the usual live listings. Due to IA upload problems, it is currently hosted separately.
DNStats
DNStats is a service which periodically pings hidden services and records the response & latency, generating graphs of uptime and allowing users to see how long a market has been down and if an error is likely to be transient. The owner has provided me with three SQL exports of the ping database up to 2017-03-25; this database could be useful for comparing downtime across markets, examining the effect of DoS attacks, or regressing downtime against things like the Bitcoin exchange rate (presumably if the markets still drive more than a trivial amount of the Bitcoin economy, downtime of the largest markets or market deaths should predict falls in the exchange rate).
For example, to graph an average of site uptime per day and highlight as an exogenous event Operation Onymous, the R code would go like this:
dnmUptime <- read.delim("dnstats-20150712.sql", na.strings="NULL",
nrows=6000000, colClasses=c("factor", "factor", "factor", "integer",
"factor", "numeric", "numeric", "POSIXct"))
markets <- dnmUptime[dnmUptime$type==1,] # type 1 = markets
dnmUptime <- NULL # save RAM due to dataset size
markets$Date <- as.Date(markets$timestamp)
markets$Up <- markets$httpcode == 200
daily <- aggregate(Up ~ Date + sitename, markets, mean)
library(ggplot2)
qplot(Date, sitename, color=Up, data=daily) +
    geom_vline(xintercept=as.Date("2014-11-05"), color="red")

The service is a useful one and accepts donations: 1DNstATs59JANuXjbpS5ngWHqvApAhYHBS.
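The downtime-vs-exchange-rate regression suggested above might be sketched like so, reusing the daily data-frame from the previous snippet; btc-price.csv is a hypothetical Date,Price file from whatever price source one prefers:

btc <- read.csv("btc-price.csv", colClasses=c("Date", "numeric")) # hypothetical price series
meanUptime <- aggregate(Up ~ Date, daily, mean) # average uptime across all markets per day
both <- merge(meanUptime, btc, by="Date")
both$Return <- c(NA, diff(log(both$Price)))     # daily log-returns of the exchange rate
summary(lm(Return ~ Up, data=both))             # do downtime-heavy days predict price falls?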
Grams
Grams (http://) (subreddit) was a service primarily specializing in searching market listings; they can pull listings from API exports provided by markets (Evolution, Cloud9, Middle Earth, Bungee54, Outlaw), or they may use their own custom crawls (the rest). They have generously given me near-daily exports of their crawl & API databases. Markets covered by the first archive:
1776
Abraxas
ADM
Agora
Alpaca
AlphaBay
BlackBank
Bungee54
Cloud9
Evolution
Haven
Middle Earth
NK
Outlaw
Oxygen
Pandora
Silkkitie
Silk Road 2
TOM
TPM
second archive:
Abraxas
Agora
AlphaBay
Dream Market
Hansa
Middle Earth
Nucleus
Oasis
Oxygen
RealDeal
Silkkitie
Tochka
Valhalla
The Grams archive has three virtues:
while it doesn’t have any raw data, the CSVs are easy to work with. For example, to read in all the Grams SR2 crawls, then count & graph total listings by day in R:
DIR <- "blackmarket-mirrors/archive/grams"
# Grams's SR2 crawls are named like "grams/2014-06-13/SilkRoad.csv"
gramsFiles <- list.files(path=DIR, pattern="SilkRoad.csv", all.files=TRUE,
                         full.names=TRUE, recursive=TRUE)
# schema of SR2 crawls eg:
## "hash","market_name","item_link","vendor_name","price","name","description","image_link","add_time","ship_from",
## "2-11922","Silk Road 2","http://silkroad6ownowfk.onion/items/220-fe-only-tw-x-mb","$220for28grams","0.34349900",
## "220 FE Only TW X MB","1oz of the same tw x mb as my other listing FE only. Not shipped until finalized.
##  Price is higher for non FE listing.","","1404258628","United States",...
# most fields are self-explanatory; 'add_time' is presumably a Unix timestamp
# read in each CSV, note what day it is from, and combine into a single data-frame:
grams <- data.frame()
for (i in 1:length(gramsFiles)) {
    log <- read.csv(gramsFiles[i], header=TRUE)
    log$Date <- as.Date(gsub("/SilkRoad.csv", "", gsub(paste0(DIR,"/"), "", gramsFiles[i])))
    grams <- rbind(grams, log)
}
totalCounts <- aggregate(hash ~ Date, length, data=grams)
summary(totalCounts)
#       Date                 hash
#  Min.   :2014-06-09   Min.   : 2846.00
#  1st Qu.:2014-07-05   1st Qu.: 9584.25
#  Median :2014-08-26   Median :10527.50
#  Mean   :2014-08-21   Mean   : 9651.44
#  3rd Qu.:2014-09-29   3rd Qu.:11165.00
#  Max.   :2014-11-07   Max.   :19686.00
library(ggplot2)
qplot(Date, hash, data=totalCounts)
# https://i.imgur.com/ucPMvJQ.png

Other included datasets are in structured formats that may be easier to deal with for prototyping: the Aldridge & Décary-Hétu 2013 SR1 crawl; the SR1 sales spreadsheet (the original is a PDF, but I’ve created a usable CSV of it); the BMR feedback dumps are in SQL, as are DNStats and Christin 2013’s public data (but note the last is so heavily redacted & anonymized as to support few analyses); and Daryl Lau’s SR2 work may be in a structured format.
the crawls were conducted independently of other crawls, and so they can be used to check each other
the market data sourced from the APIs can be considered close to 100% complete & accurate, which is rare
The main drawbacks are:
the largest markets can be split across multiple CSVs (eg. EVO.csv & EVO2.csv), complicating reading the data in somewhat
each export is of the current listings, which means that different days can repeat the same identical crawl data if there was not a successful crawl by Grams in between (a check for this is sketched below)
exports are not available for every day, and some gaps are large. The 2015-01-09 to 2015-02-21 gap is due to a broken Grams export during this period before I noticed the problem and requested it be fixed; other gaps may be due to transient errors with the cron job:

@daily ping -q -c 5 google.com && torify wget --quiet --continue "http://grams7enufi7jmdl.onion/gwernapi/$SECRETKEY" -O ~/blackmarket-mirrors/grams/`date '+\%Y-\%m-\%d'`.zip

so if my Internet was down, or Grams was down, or the download was corrupted halfway through, then there would be nothing that day.
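The repeated-export drawback can be checked for mechanically; a minimal R sketch, hashing each day's export and flagging days byte-identical to the previous one (paths follow the grams/YYYY-MM-DD/SilkRoad.csv layout used earlier):

files <- sort(list.files("grams", pattern="SilkRoad.csv",
                         recursive=TRUE, full.names=TRUE))
sums  <- tools::md5sum(files)
# an export identical to its predecessor means no new Grams crawl happened in between:
dupes <- files[-1][sums[-1] == sums[-length(sums)]]
dupes # drop these before any time-series analysis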
Kilos
The owner of Kilos, a DNM search engine much like Grams, released a CSV on 2020-01-13 of 235,668 reviews scraped from 6 DNMs (Apollon, CannaHome, Cannazon, Cryptonia, Empire, & Samsara):
The data is in the format
site,vendor,timestamp,score,value_btc,comment
site, vendor, and comment are strings. site and vendor are both alphanumeric, while comment may have punctuation and whatnot. Line breaks are explicit “\n” in the comment field, and the comment field has quotation marks around it to make it easier to sort through. All the data uses Latin characters only, no Unicode. timestamp is an integer indicating the number of seconds since the Unix epoch. score is 1 for a positive review, 0 for a neutral review, and −1 for a negative review. value_btc is the bitcoin value of the product being reviewed, calculated at the time of the review. There are some slight problems with the data set as a result of the pain that is scraping these marketplaces. All reviews from Cryptonia market have their timestamp as 0 because I forgot to decode the dates listed and just used 0 as a placeholder. Cryptonia reviews’ score variable is unreliable, as I accidentally rewrote all scores to 0 on the production database. To correct for this, I rewrote the scores to match a sentiment analysis of the review text, but this is not a perfect solution, as some reviews are classified incorrectly. eg. “this shit is the bomb!” might be classified negatively despite context telling us that this is a positive review.
There are a decent number of duplicates, some of which are proper (eg. “Thanks” as a review appears many many times) and some of which are improper (detailed reviews being indexed multiple times by mistake).
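A minimal R sketch of loading the Kilos CSV while respecting the caveats above (the filename, and the exact site label for Cryptonia, are assumptions):

kilos <- read.csv("kilos-reviews.csv", stringsAsFactors=FALSE) # hypothetical filename
kilos$time <- as.POSIXct(kilos$timestamp, origin="1970-01-01", tz="UTC")
kilos$time[kilos$timestamp == 0] <- NA          # Cryptonia placeholder timestamps
kilos$score[kilos$site == "Cryptonia"] <- NA    # sentiment-imputed scores are unreliable
# drop improper duplicates (the same review indexed multiple times by mistake):
kilos <- kilos[!duplicated(kilos[, c("site", "vendor", "timestamp", "comment")]), ]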
Information Leaks
Diabolus/Crypto Market
Diabolus/
Simply Bear
Upon launch, the market Simply Bear made the amateur mistake of failing to disable the default Apache server-status page, which shows information about the server such as what HTML pages are being browsed and the connecting IPs. Being a Tor hidden service, most IPs were localhost connections from the Tor daemon, but I noticed the administrator was logging in from a local IP (the 192.168.1.x range) and, curious whether I could de-anonymize him, I set up a script downloading the status page every minute or so, increasing the interval as time passed. After two or three days, no naked IPs had appeared yet and I killed the script.
TheRealDeal
TheRealDeal was reported on Reddit in late June 2015 to have an info leak where any logged-in user could view other users’ information.
Modafinil
As part of my interest in the stimulant modafinil, I have been collecting by hand, on a monthly basis, scrapes of all modafinil/armodafinil listings from the markets carrying them:
Abraxas
Agora
Alpaca
AlphaBay
Andromeda
Black Bank
Blue Sky
Cloud Nine
Crypto / Diabolus
Dream
East India Company
Evolution
Haven
Hydra
Middle Earth
Nucleus
Outlaw
Oxygen
Pandora
Sheep
SR2
TOM
Pedofunding
A crowdfunding site for child pornography, “Pedofunding”, was launched in November 2014. It seemed like possibly the birth of a new DNM business model, so I set up a logged-in crawl of it.
Silk Road 1 (SR1)
Sources:
“Down the silk rabbit hole” (source), Delyan Kratunov
appendix to Van Buskirk et al
“Traveling the Silk Road: Datasets” (recompressed), supporting information for Christin 2013, “Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace”
2013 scrape provided me by anonymous
source data for “Not an ‘Ebay for Drugs’: The Cryptomarket ‘Silk Road’ as a Paradigm Shifting Criminal Innovation”, Aldridge & Décary-Hétu 2014
SR1F
Files: silkroad1-forums-20130703-anonymous.tar.xz, silkroad1-forums-20131103-gwernrasmusandersen.tar.xz, silkroad1-forums-anonymous.tar.xz, silkroad1-forums-stexo.tar.xz, silkroad1-forums.tar.xz.
These archives of the Silk Road 1 forums are composed of 3 parts, all created during October 2013 after Silk Road 1 was shut down but before the Silk Road 1 forums went offline some months later:
StExo’s archive, released anonymously
This excludes the Vendor Roundtable (VRT) subforum, and is believed to have been censored in various respects such as removing many of StExo’s own posts.
Moustache’s archived pages
Unknown source, may be based on StExo archives
My & qwertyoruiop’s consolidated scrape
After the SR1 bust and StExo’s archiving, I began mirroring the SR1F with wget, logged in as a vendor with access to the Vendor Roundtable; unfortunately, due to my inexperience with the forum software Simple Machines, I did not know it was possible to revoke your own access to subforums with wget (due to SMF violating HTTP standards, which require GET requests to be side-effect-free of things like ‘delete special permissions’), and failed to blacklist the revocation URL. Hence the VRT was incompletely archived. I combined my various archives into a single version. Simultaneously, qwertyoruiop was archiving the SR1F with a regular user account and a custom Node.js script. I combined his spider with my version to produce a final version with reasonable coverage of the forums (perhaps 3⁄4s of what was left after everyone began deleting & censoring their past posts).
David Décary-Hétu
SR2
Sources:
in January 2014, Sohhlz made & distributed a scrape of SR2 vendor pages akin to StExo’s SR1 vendor dump
“Analyzing Trends in the Silk Road 2.0” (source), Daryl Lau
SR2Doug
In 2015, a pseudonym claiming to be a SR2 programmer offered for sale, using the Darkleaks protocol, what he claimed was the username/password database of SR2.
Copyright
The copyright status of crawls of websites, particularly ones engaged in illegal activities, is unclear.
to the extent I hold any copyright in the contents, I release my work under the Creative Commons CC0 “No Rights Reserved” license
the SR1 Christin 2013 dataset is licensed under the CC BY-NC license
other authors may reserve other rights
(I request that users respect the spirit of this archive and release their own source code & derived datasets to public, but I will not legally demand it.)
Previous Releases
Some of these archives have been released publicly before and are now obsoleted by this torrent.
How To Crawl Markets
The bulk of the crawls are my own work, and were generally all created in a similar way.
My setup was a Debian testing Linux system with Tor, Privoxy, and Polipo installed. For browsing, I used Iceweasel; useful FF extensions included LastPass, Flashblock & NoScript, Live HTTP Headers, Mozilla Archive Format, User Agent Switcher & SwitchProxy, and RECAP. See the Tor guides.
when a new market opens, I learn of it typically from Reddit or The Hub, and browse to it in Firefox configured to proxy through 127.0.0.1:8123 (Polipo)
create a new account. The username/password are not particularly important, but using a password manager to create & store strong passwords for throwaway accounts has the advantage of making it easier to authenticate any hacks or database dumps later. (Given the poor security record of many markets, it should go without saying that you should not use your own username or any password which is used anywhere else.)
I locate various ‘action’ URLs: login, logout, ‘report vendor’, ‘settings’, ‘place order’, ‘send message’, and add the URL prefixes (sometimes they need to be regexps) into /etc/privoxy/user.action; Privoxy, a filtering proxy running on 127.0.0.1:8118, will then block any attempt to download URLs which match those prefixes/regexps. A good blacklist is critical to avoid logging oneself out and immediately ending the crawl, but it’s also important to avoid triggering any on-site actions which might cause your account to be banned or prompt the operators to put in anti-crawl measures you may have a hard time working around. A blacklist is also invaluable for avoiding downloading superfluous pages like the same category page sorted 15 different ways; Tor is high latency and you cannot afford to waste a request on redundant or meaningless pages, which there can be many of. Simple Machine Forums are particularly dangerous in this regard, requiring at least 39 URLs blacklisted to get an efficient crawl, and implementing many actions as simply HTTP links that a crawler will browse (for example, if you have managed to get access to a private subforum on a SMF, you will delete your access to it if you simply turn a crawler like wget or HTTrack loose, which I learned the hard way).
where possible, configure the site to simplify crawling: request as many listings as possible on each page, hide clutter, disable any options which might get in the way, etc. Forums often default to showing 20 posts on a page, but options might let you show 100; if you set it to display as much as possible (maximum number of posts per page, subforums listed, etc), the crawls will be faster, save disk space, and be more reliable because the crawl is less likely to suffer from downtime. So it is a good idea to go into the SMF forum settings and customize it for your account.
in Firefox, I export a cookies.txt using the FF extension Export Cookies. (I also recommend NoScript to avoid JavaScript shenanigans, Live HTTP Headers to assist in debugging by showing the HTTP headers and requests FF is actually sending to the market, and User Agent Switcher to lock your FF into showing a consistent TorBrowser user-agent.)
with a valid cookie in the cookies.txt and a proper blacklist set up, mirrors can now be made with wget, using commands like this:

alias today="date '+%F'" # prints out current date like "2015-07-05"
cat ~/blackmarket-mirrors/user-agent.txt
## Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Firefox/30.0
cd ~/blackmarket-mirrors/cryptomarket/
grep -F --no-filename '.onion' ~/cookies.txt ~/`today`/cookies.txt > ./cookies.txt
http_proxy="localhost:8118" wget --mirror --tries=5 --retry-connrefused \
    --waitretry=1 --read-timeout=20 --timeout=15 --tries=10 \
    --load-cookies=cookies.txt --keep-session-cookies --max-redirect=1 \
    --referer="http://cryptomktgxdn2zd.onion" \
    --user-agent="$(cat ~/blackmarket-mirrors/user-agent.txt)" \
    --append-output=log.txt --server-response \
    'http://cryptomktgxdn2zd.onion/category.php?id=Weed'
mv ./cryptomktgxdn2zd.onion/ `today`
mv log.txt ./`today`/
rm cookies.txt

To unpack the commands:
the grep -F invocation minimizes the size of the local cookies.txt and helps prevent accidental release of a full cookies.txt while packing up archives and sharing them with other people
wget:
we direct it to download only through Privoxy in order to benefit from the blacklist.
wget blacklist failure: wget has a blacklist option but it does not work, because it is implemented in a bizarre fashion where it downloads the blacklisted URL (!) and only then deletes it; this is a known >21-year-old bug in wget. For other crawlers, this behavior should be double-checked so you don’t wind up inadvertently logging yourself out of a market and downloading gigabytes of worthless front pages.
we throw in a number of options to encourage wget to ignore connection failures and retry; hidden servers are slow and unreliable
we load the cookies file with the authentication for the market, and in particular, we need --keep-session-cookies to keep around all cookies a market might give us, particularly the ones which change on each page load. --max-redirect=1 helps deal with a nasty market behavior where, when one’s cookie has expired, they then quietly redirect, without errors or warnings, all subsequent page requests to a login page. Of course, the login page should also be in the blacklist as well, but this is extra insurance and can save one round-trip’s worth of time, which will add up. (This isn’t always a cure, since a market may serve a requested page without any redirects or error codes but the content will be a transcluded login page; this apparently happened with some of my crawls, such as Black Bank Market. There’s not much that can be done about this except some sort of post-download regexp check or a similar post-processing step; one such check is sketched below.)
some markets seem to snoop on the “referer” part of a HTTP request specifying where you come from; putting in the market page seems to help
the user-agent, as mentioned, should exactly match however one logged in, as some markets record that and block accesses if the user-agent does not match exactly. Putting the current user-agent into a centralized text file helps avoid scripts getting out of date and specifying an old user-agent
logging of requests and particularly errors is important; --server-response prints out headers, and --append-output stores them to a log file. Most crawlers do not keep an error log around, but this is necessary to allow investigation of incompleteness and observe where errors in a crawl started (perhaps you missed blacklisting a page); for example, “Evaluating drug trafficking on the Tor Network: Silk Road 2, the sequel”, Dolliver 2015, failed to log errors in their few HTTrack crawls of SR2, and so wound up with a grossly incomplete crawl which led to nonsense conclusions like 1–2% of SR2’s sales being drugs. (I speculate the HTTrack crawl was stuck in the ebooks section, which was always clogged with spam, and then SR2 went down for an hour or two, leading to HTTrack’s default behavior of quickly erroring out and finishing the crawl; but the lack of logging means we may never know what went wrong.)
once the wget crawl is done, we name it whatever day it terminated on, store the log inside the mirror, clean up the probably-now-expired cookies, and perhaps check for any unusual problems.
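For the post-download regexp check suggested above, a minimal R sketch: scan a finished mirror for pages that are really transcluded login pages (the marker string is market-specific and merely illustrative):

pages <- list.files("2015-07-05", pattern="\\.(html|php)", recursive=TRUE, full.names=TRUE)
hasLogin <- sapply(pages, function(f)
    any(grepl("name=\"password\"", readLines(f, warn=FALSE), fixed=TRUE)))
sum(hasLogin) / length(pages) # a high fraction suggests the session cookie expired mid-crawl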
This method will permit somewhere around 18 simultaneous crawls of different DNMs or forums before you begin to risk Privoxy throwing errors about “too many connections”. A Privoxy bug may also lead to huge logs being stored on each request. Between these two issues, I’ve found it helpful to have a daily cron job deleting the Privoxy logfiles, so as to keep the logfile mess under control, and to occasionally start a fresh Privoxy.
Crawls can be quickly checked by comparing the downloaded sizes to past downloads (a sketch of such a check follows the list below); markets typically do not grow or shrink more than 10% in a week, and forums’ downloaded size should monotonically increase. (Incidentally, that implies that it’s more important to archive markets than forums.) If the crawls are no longer working, one can check for problems:
is your user-agent no longer in sync?
does the crawl error out at a specific page?
do the headers shown by wget match the headers you see in a regular browser using Live HTTP Headers?
has the target URL been renamed?
do the URLs in the blacklist match the URLs of the site, or did you log in at the right URL? (for example, a blacklist of “www.abraxas…onion” is different from “abraxas…onion”; and if you logged in at an onion with a www. prefix, the cookie may be invalid on the prefix-free onion)
did the server simply go down for a few hours while crawling? Then you can simply restart and merge the crawls.
has your account been banned? If the signup process is particularly easy, it may be simplest to just register a fresh account each time.
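The size-comparison check mentioned before this list can be sketched in R, comparing total bytes across the dated crawl directories of one market (path as in the wget example above):

dirs  <- list.dirs("blackmarket-mirrors/cryptomarket", recursive=FALSE) # dated mirrors sort chronologically
sizes <- sapply(dirs, function(d)
    sum(file.info(list.files(d, recursive=TRUE, full.names=TRUE))$size, na.rm=TRUE))
round(diff(sizes) / head(sizes, -1), 2) # flag week-to-week swings beyond ~10%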
Despite all this, not all markets can be crawled, and some present other difficulties:

Blue Sky Market did something with HTTP headers which defeated all my attempts to crawl it; it rejected all my wget attempts at the first request, before anything even downloaded, but I was never able to figure out exactly how the wget HTTP headers differed in any respect from the (working) Firefox requests
Mr Nice Guy 2 breaks the HTTP standard by returning all pages gzip-encoded, whether or not the client says it can accept gzip-encoded HTML; as it happens, wget cannot read gzip-encoded HTML and parse the page for additional URLs to download, and so mirroring breaks
AlphaBay, during the DoS attacks of mid-2015, began doing something odd with its HTTP responses which makes Polipo error out; one must browse AlphaBay after switching to Privoxy; Poseidon also did something similar for a time
Middle Earth rate-limits crawls per session, limiting how much can be downloaded without investing a lot of time or in a CAPTCHA-breaking service
Abraxas leads to peculiarly high RAM usage by wget, which can lead to the OOM killer ending the crawl prematurely
See also the comments on crawling in “Measuring the Longitudinal Evolution of the Online Anonymous Marketplace Ecosystem”, Soska & Christin 2015, and et al 2020.
Crawler Wishlist
In retrospect, had I known I was going to be scraping so many sites for 3 years, I probably would have worked on writing a custom crawler. A custom crawler could have simplified the blacklist part and allowed some other desirable features (in descending order of importance):
CAPTCHA library: if CAPTCHAs could be solved automatically, then each crawl could be scheduled and run on its own. The downside is that one would need to occasionally manually check in to make sure that none of the possible problems mentioned previously have happened, since one wouldn’t be getting the immediate feedback of noticing a manual crawl finishing suspiciously quickly (eg. a big site like SR2 or Evolution or Agora should take a single-threaded normal crawl at least a day, and easily several days if images are downloaded as well; if a crawl finishes in a few hours, something went wrong).
supporting parallel crawls using multiple accounts on a site
optimized tree traversal: ideally one would download all category pages on a market first, to maximize information gain from initial crawls & allow estimates of completeness, and then either randomly sample items or prioritize items which are new/changed compared to previous crawls; this would be better than generic crawlers’ defaults of depth- or breadth-first
removing initial hops in connecting to the hidden service, speeding it up and reducing latency (does not seem to be a config option in the Tor daemon, but I’m told something like this is done in Tor2web)
post-download checks: a market may not visibly error out but start returning login pages or warnings. If these could be detected, the custom crawler could log back in (particularly with CAPTCHA-solving) or at least alert the user to the problem so they can decide whether to log back in, create a new account, slow down crawling, split over multiple accounts, etc
Other Datasets
One publicly available full dataset is:
Sarah Jamie Lewis 2016, “Dark Web Data Dumps” (Valhalla Marketplace scrapes, as of 2016-12-11)
A number of other datasets are known to exist but are unavailable or available only in restricted form, including:
law enforcement scrapes (see the Force briefing), seized server images
Interpol (eg. “Pharmaceutical Crime on the Darknet: A study of illicit online marketplaces”, February 2015; based on monthly scrapes by INTERPOL IGCI June 2014–December 2014; this is probably drawing on their ongoing comprehensive scraping activities)
Europol (eg. European Monitoring Center for Drugs and Drug Addiction’s (EMCDDA) 2017 report “Drugs and the darknet: Perspectives for enforcement, research and policy” which also draws on Christin)
National Drug and Alcohol Research Center (NDARC) in Sydney, Australia; Australian-vendor-focused crawls; non-release may be due to concerns over Australian police interest in them as documentation of sales volume to use against the many arrested Australian sellers
unknown Princeton grad student
Christin CMU group: uncensored SR1 crawls (available on request via IMPACT), large number of other markets crawled 2012–2015 (see Soska & Christin 2015; Europol; 2018; possibly 2018; “Plug and Prey? Measuring the Commoditization of Cybercrime via Online Anonymous Markets”, van Wegberg et al 2018; and “Enabling Learning in Resilient Adaptive Systems: From Network Fortification to Mindful Organising”, 2019)
The Soska & Christin 2015 dataset is available in a censored form publicly, and the uncensored dataset is available on request to qualified researchers via IMPACT. Similarly, there are anonymized & non-anonymized versions of their in-depth AlphaBay crawls used in 3 papers. The group notes: “Upcoming data (as of July 2018): We are monitoring a number of other markets as of 2018. We expect to make this data available in 2019, with a six-month to one-year delay.” et al 2019 (“Adversarial Matching of Dark Net Market Vendor Accounts”) uses additional data from “Dream, Berlusconi, Valhalla, and Traderoute”. Gañán et al 2020 use the IMPACT dataset to study “Agora, Alphabay, BlackMarket Reloaded, Evolution, Hydra, Pandora, Silk Road 1 and Silk Road 2 from 2011 to May 2017, and consists of 44,671 listings and 564,204 transactions made on digital goods, grouped in 17 categories.”
“Analysis of the supply of drugs and new psychoactive substances by Europe-based vendors via darknet markets in 2017–2018”, 2019, uses a rewritten crawler and does analysis of presumably the same dataset, but gives a time period: “…we collected 35 scrapes of four markets—Dream Market, Traderoute, Valhalla, and Berlusconi Market—between summer 2017 and summer 2018”
Digital Citizens Alliance (?)
Dolliver 2015 (claimed NDA prevents sharing SR2 crawls, despite serious anomalies & absurd results in the published analysis)
2015/2019: HTTrack-based scrapes of SR2 marijuana listings, November 2013 to October 2014
et al 2016 (AlphaBay/Nucleus/East India Company, monthly crawls for 4 months in 2015)
Nucleus/ East India Company, monthly crawls for 4 months in 2015) Project CASSANDRA: et al 2017, et al 2018 (22 DNMs, every 2 months from October 2015 to 2016)
“The Economic Functioning of Online Drug Markets”, et al 2017 (CEP crawls: SR1 2013-08-01, SR2 2013-12-02–November 2014, Agora December 2013, Evolution January 2014, Nucleus November 2014)
“Dark Market Regression: Calculating the Price Distribution of Cocaine from Market Listings”, David Everling (2017-07-14–2017-07-21, Dream Market; used in 2019)
“Sex, Drugs, and Bitcoin: How Much Illegal Activity Is Financed Through Cryptocurrencies?”, et al 2019 (active DNMs 2016–2017)
“Challenging the techno-politics of anonymity: the case of cryptomarket users”, 2017 (1 major but unspecified DNM forum, crawled March–May 2015, covering ~2 years of forum posts)
DATACRYPTO: Paquet-Clouston 2016 / Paquet-Clouston et al 2018 (AlphaBay: September 2015–February 2016); et al 2018: “31 cryptomarkets in operation between 2013-09-12 and 2016-07-18, including all the largest English language sites (Alphabay, Nucleus, Dream market, Agora, Abraxas, Evolution, Silk Road 2 (Silk Road Reloaded), and SR1).”
et al 2017, “Coordination problems in cryptomarkets: Changes in cooperation, competition and valuation”: SR2 seller profiles snapshot, 2014-09-15
2018, “Comparing cryptomarkets for drugs. A characterisation of sellers and buyers over time” (AlphaBay: September 2015–August 2016)
2018, “Leaving on a jet plane: the trade in fraudulently obtained airline tickets” (unknown “blackmarket”, “December 2014 to August 2016”, but probably using the Christin/Soska crawl and AlphaBay)
et al 2018, “A Framework for More Effective Dark Web Marketplace Investigations”
2018, “Flying in Cyberspace: Policing Global Travel Fraud”: unspecified large DNM 2014–2016 (Evolution?)
2018, “This place is like the jungle: discussions about psychoactive substances on a cryptomarket” (July 2016, single AlphaBay forum scrape)
2018, “Dark Web Markets: Turning the Lights on AlphaBay” (June–September 2017 AlphaBay market scrapes)
et al 2018, “Identifying, Collecting, and Presenting Hacker Community Data: Forums, IRC, Carding Shops, and DNMs” (51 forums / 13 IRC channels / 12 DNMs (0day/Alphabay/Apple Market/Dream Market/French Deep Web/Hansa/Minerva/Russian Silk Road) / 26 carding shops, 2016–2018)
2018, “Darknet Markets: Competitive Strategies In The Underground Of Illicit Goods” (Berlusconi Market (2018-06-02, 2018-06-29–2018-06-30), Dream Market (2018-03-06–2018-04-06, 2018-06-30–2018-07-03, 2018-08-19–2018-08-20), Empire Market (2018-06-06), & Olympus Market (2018-06-06))
et al 2018, “Drogues sur Internet: Etat des lieux sur la situation en Suisse” (Google Translate: “This is data collected by several police forces during the seizure and closing of two important crypto markets of the Silk Road 2.0 era and Pandora. The data on Swiss buyers was transmitted to us anonymously. The data concerns 724 purchases made between November 26, 2013 and August 12, 2014. 11 of them are from the Pandora platform, while the other 713 come from the Silk Road cryptomarket 2.0.”)
et al 2019, “Data Capture & Analysis of Darknet Markets”, Australian National University’s Cybercrime Observatory; Apollon Market, 2018-12-17–2019-02-25.
et al 2019, “Your Style Your Identity: Leveraging Writing and Photography Styles for Drug Trafficker Identification in Darknet Markets over Attributed Heterogeneous Information Network”: Valhalla/Dream Market, “weekly snapshots from June 2017 to August 2017”
2019, “Darknet Drug Markets In A Swedish Context: A Descriptive Analysis Of Wall Street Market And Flugsvamp 3.0”: “Wall Street Market and Flugsvamp 3.0, in March of 2019”
et al 2019, “Python Scrapers for Scraping Cryptomarkets on Tor”: one-off scrapes of 7 markets (Dream/Berlusconi/Wall Street/Valhalla/Empire/Point Tochka/Silk Road 3.1)
2019, “Scamming and the Reputation of Drug Dealers on Darknet Markets”: Hansa, March 2017
Červený & van Ours 2019, “Cannabis Prices on the Dark Web” (AlphaBay: first two weeks of “early October 2015”)
et al 2019, “Identifying High-Impact Opioid Products and Key Sellers in Dark Net Marketplaces: An Interpretable Text Analytics Approach” (AlphaBay 2016–2017 & Dream Market 2016–2018)
Bradley 2019, “On the Resilience of the Dark Net Market Ecosystem to Law Enforcement Intervention” (SR2, raided server images; Bradley received copies of the SR2 server data from an unspecified UK LE agency); 2019, “A Qualitative Evaluation of Two Different Law Enforcement Approaches on Dark Net Markets” (Reddit datasets)
et al 2019, “An Analysis Framework for Product Prices and Supplies in Darknet Marketplaces” (“Dream Market, Wall Street Market, and Tochka Market…We performed the data collection over a duration of a little more than 7 weeks, starting from September 18, 2018 until November 19, 2018”)
et al 2019, “Identifying Hidden Buyers in Darknet Markets via Dirichlet Hawkes Process” (Dream Market, Wall Street Market, & Empire Market; single 2019 crawl?)
et al 2019, “Producing Trust Among Illicit Actors: A Techno-Social Approach to an Online Illicit Market” (The Majestic Garden; selected forum posts, 2017–2018?)
et al 2020, “Knowledge Sharing Network in a Community of Illicit Practice: A Cybermarket Subreddit Case” (/r/AlphaBay scrape from a “cybersecurity firm”, June 2016–July 2017)
2020, “Dishing the Deets: How dark-web users teach each other about international drug shipments” (“Data collected for this research was obtained from two forums and one cryptomarket between the period of November 2017 and April 2018.”: /r/DNM, Dread, & Dream Market respectively)
2019, “Characterization of illegal dark web arms markets” (Berlusconi, weapon lists, May–June 2019)
et al 2020, “Reputation transferability across contexts: Maintaining cooperation among anonymous cryptomarket actors when moving between markets” (aside from using this & Soska, “We collected data from AlphaBay in June and July 2017, shortly before the cryptomarket was seized.”)
2020, “Open Market or Ghost Town? The Curious Case of OpenBazaar” (OpenBazaar crawls: June 25, 2018–September 3, 2019)
et al 2020, “A Market in Dream: the Rapid Development of Anonymous Cybercrime” (Dream Market: 2018-10-30–2019-03-01)
et al 2019, “Anonymous market product classification based on deep learning” (“In order to conduct research on the anonymous trading market, a one-month crawler was used, and anonymous market data was collected by OnionScan.”)
et al 2018, “Technical Note: Characterizing the online weapons trafficking on cryptomarkets” (“Weapons related webpages from nine cryptomarkets were manually duplicated in February 2016…The selected markets are: Aflao marketplace (AFL), AlphaBay (ALB), Dr D’s multilingual market (DDM), Dream market (DMA), French Darknet (FRE), The Real Deal (TRD), Oasis (OAS), Outlaw market (OUT), Valhalla (aka Silkkitie) (VAL).”)
2019, “‘We must work together for the good of all’: An examination of conflict management on two popular cryptomarkets” (Tochka Free Market (TFM) / Wall Street Market (WSM) forum posts, vendor profiles, reviews, and market rules, unspecified 2019 (?) date)
2019, “Repeat Buying Behavior of Illegal Drugs on Cryptomarkets” (single scrape, AlphaBay July 2017?)
2020: announcement (Grams DNM search engine successor?)
et al 2020, “Fentanyl availability on darknet markets” (“Data were collected over 84 days (from 2 January to 27 March 2019) from 64 ‘scrapes’ of six omnibus darknet markets: Berlusconi, Dream Market, Empire, Tochka, Valhalla (‘Silkkitie’) and Wall Street.”)
et al 2021, “Impact of darknet market seizures on opioid availability” (“Data were collected over 352 days, from 2 January to 20 December 2019 [2019-01-02–2019-12-20] (excluding weekends), combining 251 scrapes from initially 8 darknet markets: Apollon, Empire, Dream, Nightmare, Tochka (also known as Point), Berlusconi, Valhalla (also called Silkitie), and Wall Street. In April three ‘new’ markets (Agartha, Dream Alt and Samsara) were added after Wall Street and Valhalla were seized by law enforcement and Dream voluntarily closed. In July Cryptonia was added as a substitute for Nightmare, which closed in an exit scam (where a business stops sending orders but continues to accept payment for new orders). Cryptonia operated until a planned (voluntary) closure in November.”)
“To estimate the scale of encryption-signing, information hub activity, and seller migration, I downloaded and extracted data from key original sources using python and wget. For the encryption-signing analysis, I collected data from the discussion forums associated with 5 cryptomarkets: Silk Road and Silk Road 2; BlackMarket (another early cryptomarket); and the 2 largest cryptomarkets in 2014–2015: Agora and Evolution. I supplemented collected files with data from public archives (2016). For the analysis of information hub activity, I collected data from 3 market-independent forums, and visitor data from 2 additional websites were shared with me by their operators. Last, I collected data on post-intervention trade and seller migration from the 3 largest markets after Silk Road was shut down: Silk Road 2, Evolution, and Agora. I collected these data daily, from October 2014 until September 2015. Agora lasted throughout the period, but Silk Road 2 was shut down in early November 2014, and Evolution closed in medio March 2015. (Most of these data are available at darkdata.bc.edu or upon request.)”
et al 2020, “Exploring the use of Zcash cryptocurrency for illicit or criminal purposes” (uses DWO, RAND’s ongoing “Dark Web Observatory”)
et al 2020, “A tight scrape: methodological approaches to cybercrime research data collection in adversarial environments” (“Concretely, we have been crawling various cybercrime communities for more than four years, including web forums…We have scraped 26 forums (described in Table 1), around 300 chat channels across Discord and Telegram, and an archive of files.”)
et al 2020, “Listed for sale: Analyzing data on fentanyl, fentanyl analogs and other novel synthetic opioids on one cryptomarket” (eDarkTrends scrape of Dream Market: 2018-03-22–2019-01-26)
2020, “The responsiveness of criminal networks to intentional attacks: Disrupting darknet drug trade” (“Data for our study come from one of the largest currently operating darknet drug markets, Silk Road 3.1. They contain information on 16,847 illicit drug transactions between 7,126 buyers and 169 vendors, representing the entire population of drug transactions on the Silk Road 3.1 during its first 14 months of activity”)
et al 2020, “Illicit drug prices and quantity discounts: A comparison between a cryptomarket, social media, and police data” (“Data from Flugsvamp 2.0 was collected in collaboration with the DATACRYPTO project (see Décary-Hétu 2015) between May and September in 2018, yielding 826 advertisements. Flugsvamp 2.0 provided specified categories for drug types and prices, but we also verified and coded them manually.”)
2020, “Essays in Demand Estimation: Illicit Drugs and Commercial Mushrooms” (Agora, 2014-11-04–2015-09-05)
Barr-Smith 2020, “Phishing With a Darknet: Imitation of Onion Services” (spidering of all Tor hidden services, May–July 2019)
2020, “A Crime Script Analysis of Counterfeit Identity Document Procurement Online”
2020, “Voting for Authorship Attribution Applied to Dark Web Data” (forums: DNM Avengers, The Majestic Garden, The Hub, Dread; ~2019-10–2019-12)
et al 2021, “Dark Web Marketplaces and COVID-19: before the vaccine” (2020-01-01–2020-11-16; Flashpoint Intelligence commercial crawls of Atshop, Black Market Guns, CanadaHQ, Cannabay, Cannazon, Connect, Cypher, DarkBay, DBay, DarkMarket, Darkseid, ElHerbolario, Empire, Exchange, Genesis, Hydra, MEGA Darknet, MagBO, Monopoly, Mouse In Box, Plati.market, Rocketr, Selly, Shoppy.gg, Skimmer Device, Tor Market, Torrez, Venus Anonymous, White House, Willhaben, Yellow Brick)
2021, “Darknet Data Mining—A Canadian Cyber-crime Perspective” (early 2020-07: EliteMarket, Icarus, AESAN)
et al 2021, “Introducing A Dark Web Archival Framework” (MITRE, ongoing?)
van 2021, “Reputation in AlphaBay: the effect of forum discussions on the business success of cryptomarket sellers” (unpublished Dutch AlphaBay dataset)
et al 2021, “Extracting Threat Intelligence Related IoT Botnet From Latest Dark Web Data Collection” (ASAP, DarkMarket, DarkFox; Dread; 2021?)
et al 2023, “Keeping Pace With the Evolution of Illicit Darknet Fentanyl Markets: Using a Mixed Methods Approach to Identify Trust Signals and Develop a Vendor Trustworthiness Index” (2020–2022: Vice City, Versus, Cartel, ASAP)
et al 2023, “Hydra: Lessons from the World’s Largest Darknet Market” (Hydra: 1 April 2020–2 May 2020; uses the Christin scrapes and an anonymous third-party scrape)
External Links
Footnotes

1. Internet Archive Upload Limits: Something that might be useful for those seeking to upload large datasets or derivatives to the IA: there is a mostly-undocumented ~25GB size limit on its torrents (as of mid-2015). Past that, the background processes will no longer update the torrent to cover the additional files, and one will be handed valid but incomplete torrents. Without IA support staff intervention to remove the limit, the full set of files will then only be downloadable over HTTP, not through the torrent.
2. Not to be confused with the original Silk Road 1 weapons site, which closed for lack of sales; this is a much later, independent site which was probably a scam.
3. eg. the Ross Ulbricht trial evidence exhibits; for the trial transcript, see Moustache.
4. It appears to be based on one or more of the SR1F scrapes in this archive, but amusingly, we don’t know which.
Backlinks
S.U.S. You’re SUS!—Identifying influencer hackers on dark web social networks:
INSPECT has been evaluated using CrimeBB dataset [and Kaggle and DNM Archives] comprising user profiles and activities within dark web forums to assess its effectiveness in identifying influential users on the dark web forums.
This project used data from the Internet Archive collection of publicly available darknet market scrapes 2011–2015 from Branwen et al 2015.
…For the analysis presented in this study, we made use of over 2.5 million posts drawn from over 150,000 accounts from 35 cybercriminal communities, drawn from the DNM Corpus: a large dataset collected 2013–2015. All the DNMs have English language as their main medium of communication. In particular, we targeted discussion fora within this collection, which acted as support areas for underground marketplaces dealing in a number of different illicit goods. Communities ranged from successfully established markets with thousands of accounts (though not all were always active posters) to small sites that never moved beyond a handful of initial accounts.
…§III. Data: A. Overview: For this analysis, we make use of the DNM Corpus: a large dataset collected 2013–2015 and publicly available. In particular, we targeted a discussion forum within this collection, the Evolution forum, which acted as support area for the eponymous underground marketplace dealing in a number of different illicit goods, especially drugs.
Get Rich or Keep Tryin': Trajectories in dark net market vendor careers:
In this paper, we leverage the use of PGP-keys to map careers of dark net market vendors. We parse and analyze scraped data from over 90 dark net markets (2011–2015) [DNM Archives + Soska & Christin], and discern 2,925 unique careers.

Method: I realize the aims of this research by using a buyer-seller dataset from the Abraxas cryptomarket (Branwen et al 2015). Given the differences between the topics and the research questions featured, this thesis employs a variety of methodological techniques:

Cryptomarket Forums: Self-advertisement and rumors on Silk Road: …For this thesis, datasets are used that contain data on item listings that dealers sold on Silk Road 1 and forum conversations that were posted on the cryptomarket. The data on item listings was retrieved by Christin 2013 between 3 February 2012 and 24 July 2012. Data on forums was compiled by 2015, derived from the data collected by Christin 2013. This thesis takes the research on reputation scores on item price and item sales of et al 2017 as a starting point, which is why the data from that study will be used. The relevant data for this thesis will be summarized in the section below. For more details on the data, the study of et al 2017 can be consulted. It will be mentioned when additional changes are made to the data.
Internet Search Tips (full context):
Crawling Websites: sometimes having copies of whole websites might be useful, either for more flexible searching or for ensuring you have anything you might need in the future. (example: “Darknet Market Archives (2013–2015)”).
Laws of Tech: Commoditize Your Complement (full context):
Archives of Grams listings are available.
Research Ideas (full context):
Dark net markets (2017-03-19): use a longitudinal crawl of DNM sellers to estimate survival curves, outstanding escrow + orders, and listed product prices/type/language to try to predict exit scams.

Danbooru2021: A Large-Scale Crowdsourced & Tagged Anime Illustration Dataset (full context):

This project is not officially affiliated or run by Danbooru; however, the site founder Albert (and his successor, Evazion) has given his permission for scraping. I have registered the accounts gwern and gwern-bot for use in downloading & participating on Danbooru; it is considered good research ethics to try to offset any use of resources when crawling an online community (eg. DNM scrapers try to run Tor nodes to pay back the bandwidth), so I have donated $20 (2015) to Danbooru via an account upgrade.

The sort --key Trick (full context):

I show how to do this with the standard Unix command-line sort tool, using the so-called “sort --key trick”, and give examples of the large space-savings possible from my archiving work for personal website mirrors and for making darknet market mirror datasets, where the redundancy at the file level is particularly extreme and the sort --key trick shines compared to the naive approach.

Darknet Market mortality risks (full context):
Historical archive of DNstats’s statistics is available in my DNM archives.
Silk Road 1: Theory & Practice (full context):
BlackMarket Reloaded, since the fall, has been marked by a pattern of arrogance, technical incompetence, dismissal of problems, tolerance for sellers keeping buyer addresses & issuing threats, astounding tolerance for information leaks (all the implementation information, and particularly the VPS incident with the user data leak; mirror: 2), etc. We know his code is shitty and smells like vulnerabilities (programmers in 3 different IRC channels I frequent quoted bits of the leaked code with a mixture of hilarity & horror), yet somehow backopy expects to rewrite it better, despite being the same person who wrote the first version and the basic security principle that new versions have lots of bugs. (I’m not actually bothered by the DoS attacks; they’re issues for any site, much less hidden services.)
Archiving URLs (full context):
As an additional flourish, my local archives are efficiently cryptographically timestamped using Bitcoin in case forgery is a concern, and I demonstrate a simple compression trick for substantially reducing sizes of large web archives such as crawls (particularly useful for repeated crawls such as my DNM archives).
Miscellaneous (full context):
For readers who may think my phrasing is a bit hyperbolic, it’s worth mentioning that I had at this point spent several years researching darknet markets (although I had finished my work by releasing my darknet markets archive in 2015), and the FBI had in fact paid me a friendly (but unannounced) visit in March 2016.
Status Spill-Over in Cryptomarket for Illegal Goods:

The dataset contains 6,033 vendor profiles collected in January 2017. Using 3 generalized additive models (GAMs), we show that:

…Data & methods: Our study analysed a dataset of 114,385 items, 6,033 sellers, and 1,270,000 reviews collected on AlphaBay’s darknet market 26–28 January 2017 by McKenna & Goode 2017. Most listings on the AlphaBay platform were included in the dataset, even if the items were not purchased. However, 1,636 pages from Tor could not be downloaded, resulting in around 700 missing listings, but these only accounted for 0.01% of all listings and were therefore unlikely to affect our results. We focused our analysis on cocaine listings for two main reasons. First, given the high price and potential dangers associated with the drug, consumers were expected to carefully examine the information in sellers’ listing descriptions. Second, the text mining technique used in this study required a certain degree of homogeneity in the text content. Therefore, we began by selecting all products that fell within the ‘cocaine’ category (5,485). Subsequently, we eliminated listings that lacked quantity information in their item descriptions (258). Lastly, we eliminated products that, despite being categorized as cocaine, were not genuine cocaine-related items, such as ‘lidocaine’ and similar substances (956), as well as products for which the listed weight in grams was not clearly expressed (109). Consequently, the final dataset encompassed 4,160 cocaine listings by 714 distinct vendors.

Archiving URLs (full context):
For example, instead of a big local-archiver run, I have archiver run wget on each individual URL:

screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt gwern@gwern.net "cd ~/www && wget --unlink --continue --page-requisites --timestamping -e robots=off --reject .iso,.exe,.gz,.xz,.rar,.7z,.tar,.bin,.zip,.jar,.flv,.mp4,.avi,.webm --user-agent='Firefox/3.6' 120"; done'

(For private URLs which require logins, such as darknet markets, wget can still grab them with some help: installing the Firefox extension Export Cookies, logging into the site in Firefox like usual, exporting one’s cookies.txt, and adding the option --load-cookies cookies.txt to give it access to the cookies.)

Design Of This Website (full context):
Tags are a key way of organizing large numbers of annotations. In some cases, they replace sections of pages or entire pages, where there would otherwise be a hand-maintained bibliography. For example, I try to track uses of the DNM Archive & Danbooru20xx datasets to help establish their value & archive uses of them; I used to hand-link each reverse-citation, while having to also tag/annotate them manually. But with tags+transclusions, I can simply set up a tag solely for URLs involving uses of the dataset (darknet-market/dnm-archive & ai/anime/danbooru), and transclude the tag into a section. Now each URL will appear automatically when I tag it, with no further effort.
Similar Links
Examining the trends and operations of modern Dark-Web marketplaces
Price Formation of Illicit Drugs on Dark Web Marketplaces
A geographical analysis of trafficking on a popular darknet market
The Influence Of Technological Factors On Dark Web Marketplace Closure
Information Extraction from Darknet Market Advertisements and Forums