chan2vec
The methods and experiment results are described in the following paper:
Directions for running experiments from the paper and generating the https://transparency.tube/ channel classifications data are here.
Directions for doing basic classification with pre-existing embeddings can be found here.
Political Channel Discovery
Commands from the following docs were used for political channel discovery (some of the resulting data is also used in the Dinkov experiments):
- Find candidate channels:
- experiments/docs/political_channel_discovery_init_round.txt
- experiments/docs/political_channel_discovery_round1.txt
- experiments/docs/political_channel_discovery_round2.txt
- experiments/docs/political_channel_discovery_round3.txt
- Final political channel classification, out-of-sample performance stats, and language prediction stats
- experiments/docs/channel_language_prediction.txt
- experiments/docs/political_channel_classification_knn_only.txt
- Channel discovery hold out analysis
- experiments/docs/political_channel_classification_coverage_analysis.txt
Note: for data collection, every script that takes --ec2-ip-fp needs a file listing the IPs of the AWS instances that have been launched. An example can be found here: data_collection/configs/comment_scrape_instances.SAMPLE.txt
Dinkov Media Bias / Fact Check Experiment
The results of comparing chan2vec to Dinkov's model can be generated by running the commands in the following doc:
- experiments/docs/dinkov_political_preds.txt
Note, the Dinkov predictions were generated by modifying their code so that they would output predictions for individual channels.
Soft Tag Predictions
Experiment results and commands for scoring all out-of-sample channels are in this doc:
- experiments/docs/political_soft_tags.txt
Newly Discovered Political Channel Traffic Analysis
Experiment results can be found here:
- experiments/docs/political_soft_tags_traffic_analysis.txt
- experiments/docs/political_soft_tags_traffic_analysis_trends.txt
Transparency.tube Data
The latest data was generated using commands in the following doc:
- experiments/docs/latest_site_data_20201006.txt
Tag definitions can be found here:
Tag metrics from hold-one-out cross validation:
Tag | # Channels | Precision | Recall |
---|---|---|---|
AntiSJW | 271 | 0.786 | 0.827 |
PartisanRight | 250 | 0.746 | 0.832 |
PartisanLeft | 146 | 0.733 | 0.678 |
SocialJustice | 141 | 0.770 | 0.617 |
Conspiracy | 118 | 0.856 | 0.805 |
MainstreamNews | 96 | 0.693 | 0.823 |
ReligiousConservative | 69 | 0.696 | 0.232 |
Socialist | 49 | 0.683 | 0.837 |
AntiTheist | 47 | 0.857 | 0.766 |
Educational | 44 | 0.909 | 0.227 |
Libertarian | 41 | 0.739 | 0.415 |
MissingLinkMedia | 39 | 0.286 | 0.051 |
StateFunded | 39 | 0.850 | 0.436 |
WhiteIdentitarian | 37 | 0.676 | 0.676 |
QAnon | 34 | 0.784 | 0.853 |
Provocateur | 21 | 0.500 | 0.143 |
MRA | 21 | 0.818 | 0.429 |
LateNightTalkShow | 10 | 0.700 | 0.700 |
Revolutionary | 9 | 0.500 | 0.111 |
Political lean metrics from hold-one-out cross validation:
Political Lean | # Channels | Precision | Recall |
---|---|---|---|
Left | 263 | 0.891 | 0.779 |
Center | 236 | 0.633 | 0.644 |
Right | 415 | 0.863 | 0.923 |
The predictions are available here:
- data/site_preds/labels_20201006/all_political_soft_tags_20201006.txt
Columns are:
- Channel ID
- Probability that the channel is political (all values in this file are above 0.8)
- Soft tag or political lean
- Probability of soft tag or political lean (use threshold of 0.5)
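As an illustration of the column layout above, here is a minimal Python sketch that reads such rows and keeps only the soft tags or political leans at or above the 0.5 threshold. The tab delimiter, the `>=` comparison, the function name `read_soft_tags`, and the sample rows are all assumptions made for this example; check the actual file before relying on them.

```python
def read_soft_tags(lines, tag_thresh=0.5, delimiter="\t"):
    """Return {channel_id: [(tag, prob), ...]} for tags at or above tag_thresh.

    Assumes four delimiter-separated columns per line: channel ID,
    political probability, tag or lean, tag or lean probability.
    """
    tags_by_channel = {}
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chan_id, _pol_prob, tag, tag_prob = fields[:4]
        try:
            prob = float(tag_prob)
        except ValueError:
            continue  # skip a header line, if the file has one
        if prob >= tag_thresh:
            tags_by_channel.setdefault(chan_id, []).append((tag, prob))
    return tags_by_channel


# Hypothetical rows for illustration only, not taken from the real file.
sample = [
    "UC_example_a\t0.95\tConspiracy\t0.72",
    "UC_example_a\t0.95\tRight\t0.88",
    "UC_example_b\t0.91\tSocialist\t0.31",
]
kept = read_soft_tags(sample)
print(kept)
```

To run it on the real predictions, pass an open file handle instead of the sample list, since the function only needs an iterable of lines.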
Chan2vec KNN Basic Example
Install numpy and faiss
pip3 install faiss numpy
Download the pre-existing embeddings (reach out to the chan2vec author for these) and place them at the locations specified below. Adding "--use-gpu True" speeds up the command below; without it, the command takes more than 3 minutes.
python3 chan2vec/python/chan2vec_knn.py \
--vec-fp data/pol_chan_disc/chan2vec_training_data/chan2vec_round3_channels_ds.vectors.txt \
--chan-info-fp data/pol_chan_disc/chan2vec_training_data/chan2vec_round3_channels_ds.chan_info.txt \
--label-fp data/datasets/tt_ds_20201031.is_pol.txt \
--score-chan-fp data/pol_chan_disc/chan2vec_training_data/chan2vec_round3_channels_ds.chan_info.channel_ids.txt \
--out-fp ./all_political_predictions.txt \
--num-neighbs 10 --bin-prob True
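To give a feel for what a binary KNN score over embeddings looks like, here is a minimal pure-Python sketch. It is not the code in chan2vec_knn.py: the cosine similarity, the unweighted vote, and the function names are assumptions chosen for illustration, and the real script may use a different distance, weighting, or a faiss index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_binary_prob(labeled, query_vec, num_neighbs=10):
    """Fraction of positive labels among the nearest labeled vectors.

    labeled is a non-empty list of (vector, label) pairs, label in {0, 1}.
    """
    ranked = sorted(labeled, key=lambda item: cosine(item[0], query_vec), reverse=True)
    nearest = ranked[:num_neighbs]
    return sum(label for _, label in nearest) / len(nearest)


# Toy 2-d "embeddings": positives point along x, negatives along y.
labeled = [([1.0, 0.01 * i], 1) for i in range(1, 13)]
labeled += [([0.01 * i, 1.0], 0) for i in range(1, 13)]

prob_x = knn_binary_prob(labeled, [1.0, 0.0])
prob_y = knn_binary_prob(labeled, [0.0, 1.0])
print(prob_x, prob_y)
```

In this toy setup the ten nearest neighbors of each query all share one label, so the scores sit at the two extremes. Real embeddings would give intermediate values.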
Get performance of model
python3 chan2vec/python/gen_pred_stats_bin.py \
--lab-fp data/datasets/tt_ds_20201031.is_pol.txt \
--score-fp ./all_political_predictions.txt \
--no-fold-lab-col True --pred-thresh 0.5
Output:
Num instances: 6615
AUC: 0.9906
Accuracy: 0.9587
Precision: 0.8153
Recall: 0.9708
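For readers who want to see how thresholded metrics of this kind are defined, here is a small sketch that computes accuracy, precision, and recall from labels and scores at a 0.5 threshold. It is an assumption-laden illustration rather than the logic of gen_pred_stats_bin.py: the function name, the `>=` comparison, and the toy data are hypothetical, and AUC is left out.

```python
def binary_stats(labels, scores, pred_thresh=0.5):
    """Accuracy, precision, and recall for scores thresholded at pred_thresh."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        pred = 1 if score >= pred_thresh else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "num_instances": total,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels and scores for illustration only.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1, 0.8, 0.2]
stats = binary_stats(toy_labels, toy_scores)
print(stats)
```

Precision is the share of predicted-political channels that are labeled political, and recall is the share of labeled-political channels that were predicted as such.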
The columns for chan-info-fp are:
- Channel ID
- Assigned int for channel ID
- Channel Name
- Scraped comment subscriptions
- Total subscriptions
The larger the number of "scraped comment subscriptions", the more useful the channel embedding is likely to be for a given task. For political channel classification we filter out all channels with fewer than 20 "scraped comment subscriptions".
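The filter on "scraped comment subscriptions" can be sketched as follows, using the five chan-info-fp columns listed above. The tab delimiter, the function name `filter_channels`, and the sample rows are assumptions for illustration; the repository's own scripts may apply the filter differently.

```python
def filter_channels(lines, min_scraped_subs=20, delimiter="\t"):
    """Keep channels with at least min_scraped_subs scraped comment subscriptions.

    Assumes five delimiter-separated columns per line: channel ID, assigned int,
    channel name, scraped comment subscriptions, total subscriptions.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        chan_id, _chan_int, name, scraped_subs, _total_subs = fields[:5]
        try:
            scraped = int(scraped_subs)
        except ValueError:
            continue  # skip a header line or a non-numeric count
        if scraped >= min_scraped_subs:
            kept.append((chan_id, name, scraped))
    return kept


# Hypothetical rows for illustration only.
sample_info = [
    "UC_example_a\t0\tExample Channel A\t35\t1200",
    "UC_example_b\t1\tExample Channel B\t7\t90000",
    "UC_example_c\t2\tExample Channel C\t20\t450",
]
usable = filter_channels(sample_info)
print(usable)
```

Note that the second sample channel is dropped despite its large total subscription count, because only the scraped count is compared against the cutoff.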