Elasticsearch aggregation for top domains
Create an ES aggregation
Significant terms analysis
Find all URLs
Provide a mechanism for flagging top-level domains, being careful not to ban good domains
Flagging a URL results in the content getting removed, not deleted
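A minimal sketch of what the aggregation request could look like, written as a Python function that only builds the request body. The field names `domain.keyword` and `@timestamp`, the 7-day window, and the function name are assumptions for illustration; the actual index mapping is not described in this issue. It combines a `terms` aggregation (most frequent domains) with a `significant_terms` aggregation (domains that are unusually frequent in the recent window compared with the whole index).

```python
import json


def build_top_domains_query(domain_field="domain.keyword",
                            time_field="@timestamp",
                            days=7,
                            size=50):
    """Build an Elasticsearch search body (hypothetical field names)."""
    return {
        # No hits are needed, only the aggregation buckets.
        "size": 0,
        # The query defines the foreground set: documents from the last N days.
        "query": {"range": {time_field: {"gte": f"now-{days}d/d"}}},
        "aggs": {
            # Most frequently shared domains in the window.
            "top_domains": {
                "terms": {"field": domain_field, "size": size}
            },
            # Domains over-represented in the window relative to the index.
            "unusual_domains": {
                "significant_terms": {"field": domain_field, "size": size}
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_top_domains_query(), indent=2))
```

The resulting dictionary can be sent as the body of a search request with whichever Elasticsearch client is in use.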
added scoped label
changed the description
added to epic &102
assigned to @ramialbatal
assigned to @brianhatchet
- Developer
Need to talk to @ramialbatal about what to do with the data when we ban content in terms of full text analysis
- Developer
@brianhatchet here are my thoughts:
URL queue
Admins, like any Minds user, are able to report channels as spam. Admins are also able to report a URL/domain as spam.
- Once a URL/domain is reported, it is immediately sent to a URL queue.
- Once a channel is reported, we need to automatically scan the channel, extract its URLs, and send them to the URL queue.
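The two entry points into the URL queue could be sketched as follows. This is an in-memory illustration only: the class `UrlQueue`, its method names, and the simple URL regex are assumptions, and a real deployment would presumably use a persistent queue.

```python
import re
from collections import deque
from urllib.parse import urlsplit

# Deliberately simple pattern: anything starting with http(s):// up to whitespace.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def extract_urls(text):
    """Return URLs found in free text, with trailing punctuation stripped."""
    return [m.rstrip(".,;:!?)\"'") for m in URL_PATTERN.findall(text)]


def domain_of(url):
    """Return the lowercased host part of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


class UrlQueue:
    """Hypothetical in-memory queue of URLs awaiting moderation."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def report_url(self, url):
        # A directly reported URL goes straight into the queue (once).
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def report_channel(self, posts):
        # A reported channel is scanned and every URL it shared is queued.
        for text in posts:
            for url in extract_urls(text):
                self.report_url(url)

    def pending(self):
        return list(self._queue)
```

Calling `report_channel` with the text of a channel's posts queues each distinct URL once, so a URL that was already reported directly is not duplicated.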
UI and data to retrieve
- Each time an admin loads the moderation page of potential spam URLs, they will see the following information and metrics:
- URL
- Is_Spam
- date of last occurrence
- number of occurrences (all time)
- number of occurrences (in the 7 days preceding the last occurrence)
- number of channels that mentioned this URL (all time)
- number of channels that mentioned this URL (in the 7 days preceding the last occurrence)
- 4 radio buttons should be displayed beside each URL:
- Flag the URL as Spam.
- Flag the Domain as Spam.
- Flag the URL as healthy.
- Flag the full domain as healthy.
- Beside the 4 radio buttons we should have a "submit" button.
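The per-URL metrics listed above could be derived from raw occurrence records roughly like this. The `Occurrence` record and the `url_metrics` function are hypothetical names, and the sketch assumes the 7-day window is inclusive of the last occurrence itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Occurrence:
    """One sighting of a URL in a post, comment, or blog (hypothetical record)."""
    url: str
    channel_id: str
    seen_at: datetime


def url_metrics(url, occurrences, window_days=7):
    """Compute the moderation-page metrics for one URL, or None if unseen."""
    hits = [o for o in occurrences if o.url == url]
    if not hits:
        return None
    last = max(o.seen_at for o in hits)
    # Window covers the 7 days up to and including the last occurrence.
    window_start = last - timedelta(days=window_days)
    recent = [o for o in hits if o.seen_at >= window_start]
    return {
        "url": url,
        "last_occurrence": last,
        "occurrences_all": len(hits),
        "occurrences_recent": len(recent),
        "channels_all": len({o.channel_id for o in hits}),
        "channels_recent": len({o.channel_id for o in recent}),
    }
```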
How to display the list of URLs?
The URLs can be listed in decreasing order of a score. This score can be a weighted combination of the metrics above.
Score = (alpha * number of days since last occurrence) + (beta * number of occurrences in the 7 days preceding the last occurrence) + (gamma * number of all occurrences) + (delta * number of channels that mentioned this URL in the 7 days preceding the last occurrence) + (epsilon * number of all channels that mentioned this URL)
Suggested values of parameters (we will change them later based on the feedback from the Admins):
- alpha = 25
- beta = 5
- gamma = 1
- delta = 10
- epsilon = 5
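As a worked illustration, the score can be sketched as below. This reads the formula as a plain weighted sum of its five terms, which is an assumption about its intent, and uses the suggested parameter values as defaults. The names `Weights` and `spam_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Suggested starting values from this thread; expected to be tuned later."""
    alpha: float = 25    # days since last occurrence
    beta: float = 5      # occurrences in the 7 days before the last occurrence
    gamma: float = 1     # occurrences, all time
    delta: float = 10    # channels in the 7 days before the last occurrence
    epsilon: float = 5   # channels, all time


def spam_score(days_since_last, occurrences_recent, occurrences_all,
               channels_recent, channels_all, w=Weights()):
    """Weighted sum of the five terms (assumed reading of the formula)."""
    return (w.alpha * days_since_last
            + w.beta * occurrences_recent
            + w.gamma * occurrences_all
            + w.delta * channels_recent
            + w.epsilon * channels_all)


def rank_urls(rows):
    """Sort (url, metrics-tuple) pairs by decreasing score."""
    return sorted(rows, key=lambda row: spam_score(*row[1]), reverse=True)


if __name__ == "__main__":
    # 25*2 + 5*10 + 1*40 + 10*3 + 5*8 = 210
    print(spam_score(2, 10, 40, 3, 8))
```

Keeping the weights in one small object makes it straightforward to adjust them once admins have given feedback.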
Actions
- Once an admin clicks the submit button:
- the metrics mentioned above and the URL will be stored in Elasticsearch or Cassandra along with the admin's decision. This is necessary for two reasons:
- for ML training
- to avoid displaying the same URLs/domains to the admins in the future.
- any post/comment/blog containing a URL or a domain marked as spam will be removed (not deleted).
- Channels or groups that shared a spam URL/domain will be forwarded to the admins for a spam check.
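The submit actions could be sketched in memory as follows. Everything here is hypothetical (`Post`, `ModerationState`, `apply_decision`, `needs_review`), and plain dictionaries and lists stand in for the Elasticsearch or Cassandra storage; the point is only to show the soft-removal flag, the stored decision, and the channel review queue working together.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Post:
    post_id: str
    channel_id: str
    urls: list
    removed: bool = False  # soft-removal flag; the record itself is kept


@dataclass
class ModerationState:
    decisions: dict = field(default_factory=dict)     # (scope, target) -> record
    review_queue: list = field(default_factory=list)  # channels to check for spam


def _domain(url):
    return (urlsplit(url).hostname or "").lower()


def apply_decision(state, posts, target, scope, is_spam, metrics):
    """Record an admin decision; scope is 'url' or 'domain'."""
    # Store the decision with its metrics (for ML training and de-duplication).
    state.decisions[(scope, target)] = {"is_spam": is_spam, "metrics": metrics}
    if not is_spam:
        return
    for post in posts:
        keys = post.urls if scope == "url" else [_domain(u) for u in post.urls]
        if target in keys:
            post.removed = True  # removed from view, not deleted
            if post.channel_id not in state.review_queue:
                state.review_queue.append(post.channel_id)


def needs_review(state, url):
    """True if neither the URL nor its domain has been decided already."""
    return (("url", url) not in state.decisions
            and ("domain", _domain(url)) not in state.decisions)
```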
changed milestone to %Sprint::02/26 Calculated Cricket
assigned to @markeharding
- Developer
Need to break this down into cards for implementing Rami's suggestions. Any thoughts on this, @markeharding?
- Developer
@brianhatchet as a temporary solution, I have some simple code that extracts the popular URLs. I can run it a couple of times per week and check whether there are any suspicious URLs, unless we have the resources to implement a better solution like the one I described above.