Elastic search aggregation for top domains (#1236) · Issues · Minds / Minds Backend - Engine

added scoped label

changed the description

added to epic &102

assigned to @ramialbatal

assigned to @brianhatchet

Need to talk to @ramialbatal about what to do with the data, in terms of full-text analysis, when we ban content.

@brianhatchet here are my thoughts:

URL queue

Admins, like any Minds user, can report channels as spam.

Admins can also report a URL/domain as spam.

Once a URL/domain is reported, it is immediately sent to a URL queue.
Once a channel is reported, we need to automatically scan the channel, extract its URLs, and send them to the URL queue.
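
The channel scan described above could be sketched roughly as follows. This is only an illustration: `extract_urls`, `UrlQueue`, and `enqueue_reported_channel` are hypothetical names, the queue is in memory rather than a real message queue, and the URL pattern is deliberately naive.

```python
import re
from collections import deque

# Naive pattern: "http://" or "https://" followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

# Punctuation that often trails a URL inside a sentence.
TRAILING = ".,;:!?)\"'"


def extract_urls(text):
    """Return the URLs found in a piece of text, in order of appearance."""
    return [match.rstrip(TRAILING) for match in URL_PATTERN.findall(text)]


class UrlQueue:
    """In-memory FIFO queue that ignores URLs it has already seen."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def enqueue(self, url):
        """Add a URL; return False if it was already queued before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def pop(self):
        """Remove and return the oldest pending URL."""
        return self._pending.popleft()

    def __len__(self):
        return len(self._pending)


def enqueue_reported_channel(queue, posts):
    """Scan the post texts of a reported channel and enqueue their URLs."""
    added = 0
    for text in posts:
        for url in extract_urls(text):
            if queue.enqueue(url):
                added += 1
    return added
```

A directly reported URL would go through `queue.enqueue(url)` without the scan step.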

UI and data to retrieve

Each time an admin loads the moderation page for potential spam URLs, they will see the following information and metrics for each URL:
- URL
- Is_Spam
- date of last occurrence
- number of occurrences (all time)
- number of occurrences (in the 7 days preceding the last occurrence)
- number of channels that mentioned this URL (all time)
- number of channels that mentioned this URL (in the 7 days preceding the last occurrence)
Four radio buttons should be displayed beside each URL:
- Flag the URL as spam.
- Flag the domain as spam.
- Flag the URL as healthy.
- Flag the full domain as healthy.
Beside the four radio buttons there should be a "Submit" button.
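
To make the metric definitions concrete, here is a minimal sketch of how they could be computed for a single URL from a list of sightings. The `Occurrence` record and the dictionary keys are assumptions for illustration, not an existing Minds schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Occurrence:
    """One sighting of a URL (hypothetical record shape)."""
    url: str
    channel: str
    seen_at: datetime


def url_metrics(occurrences, window_days=7):
    """Compute the moderation-page metrics from one URL's occurrences."""
    if not occurrences:
        raise ValueError("need at least one occurrence")
    last_seen = max(o.seen_at for o in occurrences)
    # The window is the 7 days preceding (and including) the last occurrence.
    window_start = last_seen - timedelta(days=window_days)
    recent = [o for o in occurrences if o.seen_at >= window_start]
    return {
        "last_occurrence": last_seen,
        "occurrences_all": len(occurrences),
        "occurrences_recent": len(recent),
        "channels_all": len({o.channel for o in occurrences}),
        "channels_recent": len({o.channel for o in recent}),
    }
```

Note that the window is anchored at the last occurrence, not at the current date, which matches the wording of the list above.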

How to display the list of URLs?

The URLs can be sorted by decreasing score. The score can be a weighted combination of the metrics above.

Score = (alpha * number of days since the last occurrence)
      + (beta * number of occurrences in the 7 days preceding the last occurrence)
      + (gamma * number of all occurrences)
      + (delta * number of channels that mentioned this URL in the 7 days preceding the last occurrence)
      + (epsilon * number of all channels that mentioned this URL)

Suggested values of parameters (we will change them later based on the feedback from the Admins):

alpha = 25
beta = 5
gamma = 1
delta = 10
epsilon = 5
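
As a rough sketch, the score and the ordering could look like the code below. It reads the formula as a weighted sum of the five terms, which is an assumption, and the metric key names and function names are hypothetical.

```python
# Suggested starting weights from this thread; expected to be tuned later.
WEIGHTS = {"alpha": 25, "beta": 5, "gamma": 1, "delta": 10, "epsilon": 5}


def spam_score(m, weights=WEIGHTS):
    """Weighted sum of the five metrics for one URL.

    `m` is a dict with the (hypothetical) keys used below.
    """
    return (
        weights["alpha"] * m["days_since_last"]
        + weights["beta"] * m["occurrences_recent"]
        + weights["gamma"] * m["occurrences_all"]
        + weights["delta"] * m["channels_recent"]
        + weights["epsilon"] * m["channels_all"]
    )


def rank_urls(metrics_by_url, weights=WEIGHTS):
    """Return the URLs ordered by decreasing score."""
    return sorted(
        metrics_by_url,
        key=lambda url: spam_score(metrics_by_url[url], weights),
        reverse=True,
    )
```

With the suggested weights, a URL last seen 2 days ago, with 10 recent and 40 total occurrences across 3 recent and 8 total channels, scores 25*2 + 5*10 + 1*40 + 10*3 + 5*8 = 210.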

Actions

Once the admin clicks the Submit button:
- the metrics mentioned above and the URL will be stored in Elasticsearch or Cassandra along with the admin's decision. This is necessary for two reasons:
  - for ML training
  - to avoid displaying the same URLs/domains to the admins in the future.
- any post/comment/blog containing a URL or domain marked as spam will be removed (not deleted).
- any channels or groups that shared a spam URL/domain will be forwarded to the admins for a spam check.
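
The actions above could be sketched as follows. Everything here is hypothetical: the `Post` shape, the function names, and the plain substring matching, which is a simplification (it would also match a domain that merely contains the flagged one). Storage in Elasticsearch or Cassandra is left out; `decision_record` only builds the document that would be stored.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class Post:
    """A post, comment, or blog (hypothetical shape)."""
    post_id: str
    channel: str
    text: str
    removed: bool = False  # soft removal flag; the content is not deleted


def domain_of(url):
    """Return the lower-cased host part of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def decision_record(url, metrics, decision, admin):
    """Build the record to store for ML training and de-duplication."""
    return {"url": url, "domain": domain_of(url), "metrics": dict(metrics),
            "decision": decision, "admin": admin}


def apply_spam_decision(url, scope, posts):
    """Soft-remove matching posts; return the channels to forward to admins.

    `scope` is "url" to match the exact URL or "domain" to match its domain.
    """
    if scope not in ("url", "domain"):
        raise ValueError("scope must be 'url' or 'domain'")
    needle = url if scope == "url" else domain_of(url)
    channels = set()
    for post in posts:
        # Naive substring match; a real implementation would parse URLs.
        if needle and needle in post.text:
            post.removed = True
            channels.add(post.channel)
    return sorted(channels)
```

Stored records could then be checked before a URL is shown on the moderation page again.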

changed milestone to %Sprint::02/26 Calculated Cricket

assigned to @markeharding

Need to break this down into cards for implementing Rami's suggestions. Any thoughts on this, @markeharding?

@brianhatchet as a temporary solution, I have some simple code that extracts the popular URLs. I can run it a couple of times per week and see whether any suspicious URLs show up, unless we have the resources to implement a better solution like the one I described above.
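
For the aggregation named in the issue title, a request body along these lines might be a starting point. It is a sketch under assumptions: the field names `domain`, `channel`, and `@timestamp` are hypothetical, and the domain and channel fields would need to be indexed as keyword fields for the aggregations to work.

```python
import json


def top_domains_query(domain_field="domain", channel_field="channel",
                      time_field="@timestamp", days=7, size=20):
    """Build an Elasticsearch search body for the most frequent domains.

    All field names are assumptions, not the actual Minds index mapping.
    """
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "query": {"range": {time_field: {"gte": f"now-{days}d/d"}}},
        "aggs": {
            "top_domains": {
                # Terms buckets are ordered by document count, highest first.
                "terms": {"field": domain_field, "size": size},
                "aggs": {
                    # Approximate number of distinct channels per domain.
                    "channels": {"cardinality": {"field": channel_field}}
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(top_domains_query(), indent=2))
```

The body would be sent to the search endpoint of whichever index holds the posts; the bucket counts correspond to the occurrence metric and the `channels` sub-aggregation to the channel metric for the last 7 days.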
