Elasticsearch aggregation for top domains
Create an ES aggregation
Significant terms analysis
Find all URLs
Provide a mechanism for flagging top-level domains, being careful not to ban good domains
Flagging a URL results in the content being removed, not deleted
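As a rough illustration of the aggregation described above, here is a minimal sketch of a search body that combines a `terms` aggregation (most frequent domains) with a `significant_terms` aggregation (domains that are unusually frequent in recent content compared with the whole index). The field names `domain` and `@timestamp`, the 7-day window, and the helper name `top_domains_query` are assumptions for illustration, not the actual Minds index mapping.

```python
import json
from typing import Any, Dict


def top_domains_query(domain_field: str = "domain",
                      time_field: str = "@timestamp",
                      size: int = 50,
                      since: str = "now-7d") -> Dict[str, Any]:
    """Build a search body that returns only aggregations, no hits.

    All field names here are hypothetical; adjust them to the real mapping.
    """
    return {
        "size": 0,  # skip the hits, we only want the buckets
        # The query defines the foreground set for significant_terms.
        "query": {"range": {time_field: {"gte": since}}},
        "aggs": {
            # Most frequently shared domains in the window.
            "top_domains": {
                "terms": {"field": domain_field, "size": size}
            },
            # Domains over-represented in the window vs. the whole index.
            "unusual_domains": {
                "significant_terms": {"field": domain_field, "size": size}
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(top_domains_query(), indent=2))
```

The resulting dictionary could be sent as the body of a search request by whichever Elasticsearch client is in use.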
- added scoped label 
- changed the description 
- added to epic &102 
- assigned to @ramialbatal 
- assigned to @brianhatchet 
- Need to talk to @ramialbatal about what to do with the data when we ban content, in terms of full-text analysis
- @brianhatchet here are my thoughts:
URL queue
- Admins, like any Minds user, can report channels as spam. Admins can also report a URL/domain as spam.
- Once a URL/domain is reported, it is immediately sent to a URL queue.
- Once a channel is reported, we need to automatically scan the channel, extract its URLs, and send them to a URL queue.
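The channel scan described above could look roughly like the sketch below. It assumes posts are available as plain strings and uses an in-memory `deque` as a stand-in for the real queue; the regular expression and the helper names (`extract_urls`, `domain_of`, `enqueue_channel_urls`) are illustrative, not existing Minds code.

```python
import re
from collections import deque
from urllib.parse import urlsplit

# Deliberately simple pattern: anything starting with http(s):// up to whitespace.
URL_PATTERN = re.compile(r"https?://[^\s<>]+", re.IGNORECASE)


def extract_urls(text):
    """Return the URLs found in a piece of text, without trailing punctuation."""
    return [u.rstrip(".,;:!?)\"'") for u in URL_PATTERN.findall(text)]


def domain_of(url):
    """Return the lower-cased host name of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def enqueue_channel_urls(posts, queue, seen):
    """Scan a reported channel's posts and queue each new URL once."""
    added = 0
    for post in posts:
        for url in extract_urls(post):
            if url in seen:
                continue  # already queued earlier
            seen.add(url)
            queue.append({"url": url, "domain": domain_of(url)})
            added += 1
    return added


if __name__ == "__main__":
    url_queue, seen_urls = deque(), set()
    posts = ["Buy now http://spam.example/offer!",
             "See https://Good.example/page and http://spam.example/offer"]
    enqueue_channel_urls(posts, url_queue, seen_urls)
    for item in url_queue:
        print(item)
```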
UI and data to retrieve
- Each time an admin loads the moderation page for potential spam URLs, they will see the following information and metrics:
- URL
- Is_Spam
- date of last occurrence
- number of occurrences (all time)
- number of occurrences (in the 7 days preceding the last occurrence)
- number of channels that mentioned this URL (all time)
- number of channels that mentioned this URL (in the 7 days preceding the last occurrence)
 
- 4 radio buttons should be displayed beside each URL:
- Flag the URL as Spam.
- Flag the Domain as Spam.
- Flag the URL as healthy.
- Flag the full domain as healthy.
 
- Beside the 4 radio buttons we should have a "submit" button.
How to display the list of URLs?
The URLs can be ordered by decreasing score. This score can be a combination of the metrics above:
Score = (alpha * number of days since last occurrence) + (beta * number of occurrences in the 7 days preceding the last occurrence) + (gamma * number of all occurrences) + (delta * number of channels that mentioned this URL in the 7 days preceding the last occurrence) + (epsilon * number of all channels that mentioned this URL)
Suggested parameter values (we will change them later based on feedback from the admins):
- alpha = 25
- beta = 5
- gamma = 1
- delta = 10
- epsilon = 5
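A minimal sketch of the scoring and ordering, assuming the five weighted terms are summed and using the suggested parameter values. The `UrlMetrics` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UrlMetrics:
    url: str
    days_since_last_occurrence: int
    occurrences_7d: int   # in the 7 days preceding the last occurrence
    occurrences_all: int  # all time
    channels_7d: int      # channels mentioning the URL in the same 7 days
    channels_all: int     # channels mentioning the URL, all time


# Suggested starting weights, to be tuned from admin feedback.
WEIGHTS = {"alpha": 25, "beta": 5, "gamma": 1, "delta": 10, "epsilon": 5}


def score(m, w=WEIGHTS):
    """Weighted sum of the five metrics (assumes every term is added)."""
    return (w["alpha"] * m.days_since_last_occurrence
            + w["beta"] * m.occurrences_7d
            + w["gamma"] * m.occurrences_all
            + w["delta"] * m.channels_7d
            + w["epsilon"] * m.channels_all)


def rank(metrics):
    """Order URLs by decreasing score for the moderation page."""
    return sorted(metrics, key=score, reverse=True)


if __name__ == "__main__":
    rows = [
        UrlMetrics("http://a.example/x", 1, 10, 100, 3, 20),
        UrlMetrics("http://b.example/y", 0, 2, 5, 1, 2),
    ]
    for row in rank(rows):
        print(score(row), row.url)
```

With the formula as written, the alpha term grows with the number of days since the last occurrence, so URLs last seen longer ago rank higher. If recent activity should raise the score instead, the sign of alpha would need to change.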
Actions
- Once an admin clicks the submit button:
- the metrics mentioned above and the URL will be stored in Elasticsearch or Cassandra along with the admin's decision. This is necessary for two reasons:
- for ML training
- to avoid displaying the same URLs/domains to the admins in the future.
 
- any post/comment/blog containing a URL or a domain marked as spam will be removed (not deleted).
- any channel or group that shared a spam URL/domain will be forwarded to the admins for a spam check.
 
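The actions above can be sketched as a small in-memory decision store. It is only an illustration: the real storage would be Elasticsearch or Cassandra, the verdict names are made up to mirror the four radio options, and the rule that a URL-level decision overrides a domain-level one is an assumption rather than something decided in this thread.

```python
from urllib.parse import urlsplit

# Hypothetical verdict names, one per radio option.
VERDICTS = {"url_spam", "domain_spam", "url_healthy", "domain_healthy"}


def domain_of(url):
    return (urlsplit(url).hostname or "").lower()


class DecisionStore:
    """In-memory stand-in for the store holding admin decisions."""

    def __init__(self):
        self.by_url = {}
        self.by_domain = {}

    def record(self, url, verdict):
        if verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {verdict}")
        if verdict.startswith("domain_"):
            self.by_domain[domain_of(url)] = verdict
        else:
            self.by_url[url] = verdict

    def already_reviewed(self, url):
        """True if this URL or its domain should not be shown to admins again."""
        return url in self.by_url or domain_of(url) in self.by_domain

    def is_spam(self, url):
        # Assumption: a decision on the exact URL overrides the domain decision.
        if url in self.by_url:
            return self.by_url[url] == "url_spam"
        return self.by_domain.get(domain_of(url)) == "domain_spam"


def visibility(urls_in_content, store):
    """Content with a spam URL is marked removed (hidden), never deleted."""
    return "removed" if any(store.is_spam(u) for u in urls_in_content) else "visible"


if __name__ == "__main__":
    store = DecisionStore()
    store.record("http://spam.example/a", "domain_spam")
    print(visibility(["http://spam.example/b"], store))
```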
 
- changed milestone to %Sprint::02/26 Calculated Cricket 
- assigned to @markeharding 
- Need to break this down into cards for implementing Rami's suggestions. Any thoughts on this, @markeharding?
- @brianhatchet as a temporary solution, I have a simple script that extracts the popular URLs. I can run it a couple of times per week and check whether there are any suspicious URLs, unless we have the resources to implement a better solution like the one I described above.
