Data Streams are Glorious – IRCTHULU is Back Online

So, around the new year I disabled the feeds during an operation to shut off staff eyeballs for one of the networks that’s been targeting the users running the IRCTHULU runners.

Only.

I didn’t.  I built a new tool called Leptin, which, like Synapse, pulls from the MQ, only it dumps to disk instead of the database.  Leptin will eventually be integrated into Synapse.

This was to give the runners a break while I focused on some work stuff without having to worry about someone finding something in the logs to identify the runners again.

Unfortunately, during the development of Leptin I had to drop about 80,000 messages over a pretty stupid typo, so we lost a couple days of logs.  I’ve got some safeguards in the code now that will prevent that from even being possible in the future.

As for Leptin, what’s especially cool about the design for this part is that you can use the existing tool, Nerve, to replay the on-disk dump back into the queue, and it’ll get slurped right back up.
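For the curious, the replay half is only a few lines.  This is a rough sketch, not the real Nerve: it assumes RabbitMQ behind the MQ, one raw message per line in the dump, and a queue named ircthulu.logs.

```python
import pika

# Sketch only: assumes RabbitMQ behind the MQ, one raw message per line in the
# dump file, and a queue called "ircthulu.logs".  None of these names come from
# the real Nerve/Leptin code.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="ircthulu.logs", durable=True)

with open("leptin.dump", "rb") as dump:
    for line in dump:
        channel.basic_publish(
            exchange="",
            routing_key="ircthulu.logs",
            body=line.rstrip(b"\n"),
            properties=pika.BasicProperties(delivery_mode=2),  # persist the message
        )

connection.close()
```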

Leptin itself is a bottleneck though right now.  I should have written it to be asynchronous like Synapse, as Pika in Python is very, very slow.  So, you’ll see logs trickle in over the next few days as it catches up to itself, which should provide a moving data stream offset enough to drive the staff crazy trying to analyze it.
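When I get around to the rewrite, the asynchronous shape would look roughly like this sketch on aio-pika; the queue name and dump file are the same placeholder assumptions as above, not Leptin’s actual internals.

```python
import asyncio

import aio_pika

async def main() -> None:
    # Robust connection reconnects on broker hiccups; the URL is a placeholder.
    connection = await aio_pika.connect_robust("amqp://guest:guest@localhost/")
    async with connection:
        channel = await connection.channel()
        await channel.set_qos(prefetch_count=100)
        queue = await channel.declare_queue("ircthulu.logs", durable=True)
        async with queue.iterator() as messages:
            async for message in messages:
                async with message.process():  # acks on successful exit
                    with open("leptin.dump", "ab") as dump:
                        dump.write(message.body + b"\n")

asyncio.run(main())
```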

 

Pattern Diagram for T-ORCH

I’ve finally decided on a usable pattern for general service orchestration in SOA.

I’ll be using this for the T-ORCH set of updates mentioned previously.  As well as probably in every other solution I ever create when I have a choice, to be honest.

Here’s a fabulous diagram made by a fabulous person:

You’ve got a controller, an API, and the service you want to control.  Behind the API is an MQ, a consumer service and a database.

Here’s how it works:

Controller

The controller registers a request or cancels a request.  It can also check on the state of a request.
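In sketch form, the controller is just a thin client.  The base URL and routes below are assumptions for illustration, not the real T-ORCH surface.

```python
import requests

# Sketch of the controller role against an assumed REST surface.
API = "http://torch-api.example.local"

def register(payload: dict) -> dict:
    # Register a new request with the API.
    return requests.post(f"{API}/requests", json=payload).json()

def cancel(request_id: str) -> dict:
    # Cancel a previously registered request.
    return requests.delete(f"{API}/requests/{request_id}").json()

def state(request_id: str) -> dict:
    # Check on the state of a request.
    return requests.get(f"{API}/requests/{request_id}").json()
```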

API

The API does all kinds of stuff:

  • It reports request state back to the controller.
  • It receives request registrations and cancellations from the controller and relays them to the MQ.
  • It polls the state of a request from the DB.
  • It sends new requests to the service being controlled.
  • It relays request state updates from the service to the MQ.
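A minimal sketch of the API role, leaving out the service-facing push for brevity.  The routes, queue name, and table layout are placeholders rather than the real T-ORCH code.

```python
import json
import sqlite3

import pika
from flask import Flask, jsonify, request

app = Flask(__name__)

def publish(event: dict) -> None:
    # Relay registrations, cancellations, and state updates to the MQ.
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="torch.requests", durable=True)
    channel.basic_publish(exchange="", routing_key="torch.requests",
                          body=json.dumps(event))
    conn.close()

@app.post("/requests")
def register_request():
    publish({"action": "register", "request": request.get_json()})
    return jsonify({"status": "queued"}), 202

@app.delete("/requests/<request_id>")
def cancel_request(request_id):
    publish({"action": "cancel", "request_id": request_id})
    return jsonify({"status": "queued"}), 202

@app.get("/requests/<request_id>")
def request_state(request_id):
    # State is only ever read from the DB, which only the consumer writes to.
    db = sqlite3.connect("torch.db")
    row = db.execute("SELECT state FROM requests WHERE id = ?",
                     (request_id,)).fetchone()
    return jsonify({"id": request_id, "state": row[0] if row else "unknown"})
```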

SERVICE

The service, besides performing its normal function in the environment, picks up new requests once they’ve been registered, acknowledges them, and marks them as complete.

MQ

The MQ receives request registrations and request updates (state changes), including request cancellations, from the API.

CONSUMER

The consumer pulls those same messages off the MQ and inserts them into the database.  This greatly simplifies the interactions in the whole design.
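Sketched out, the consumer is just a drain from the queue into the request table.  The queue name, table, and message shape here are my own placeholders.

```python
import json
import sqlite3

import pika

# Sketch of the consumer role: drain the queue and write request state into the DB.
db = sqlite3.connect("torch.db")
db.execute("CREATE TABLE IF NOT EXISTS requests (id TEXT PRIMARY KEY, state TEXT)")

def handle(ch, method, properties, body):
    event = json.loads(body)
    if event["action"] == "register":
        db.execute("INSERT OR REPLACE INTO requests VALUES (?, ?)",
                   (event["request"]["id"], "registered"))
    elif event["action"] == "cancel":
        db.execute("UPDATE requests SET state = 'cancelled' WHERE id = ?",
                   (event["request_id"],))
    else:  # acknowledgments and completions reported by the service
        db.execute("UPDATE requests SET state = ? WHERE id = ?",
                   (event["state"], event["request_id"]))
    db.commit()
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="torch.requests", durable=True)
channel.basic_consume(queue="torch.requests", on_message_callback=handle)
channel.start_consuming()
```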

DB

The database receives creation, update, or cancellation requests from the consumer only.  It also provides the table used to report on request state to the API.

Request Lifecycle

The request lifecycle is:  registration, acknowledgment, and completion.
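Expressed as a tiny state machine, with cancellation added as an assumed terminal branch since the controller can cancel a registered request, it looks like this:

```python
from enum import Enum

class RequestState(Enum):
    REGISTERED = "registered"
    ACKNOWLEDGED = "acknowledged"
    COMPLETED = "completed"
    CANCELLED = "cancelled"   # assumed terminal branch for controller cancellations

# Legal transitions for the lifecycle described above.
VALID_TRANSITIONS = {
    RequestState.REGISTERED: {RequestState.ACKNOWLEDGED, RequestState.CANCELLED},
    RequestState.ACKNOWLEDGED: {RequestState.COMPLETED, RequestState.CANCELLED},
    RequestState.COMPLETED: set(),
    RequestState.CANCELLED: set(),
}
```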

Open for Adspace

Would you like to sell adspace in the presenta example UI?

You got it.

Send me an email at punches.chris@gmail.com

*All revenue obtained will fund the surro linux project.

Identity Research Dataset Boost

I’ve got early drafts in for a new tool that will collect identity research data at a much larger scale.

It basically connects to a server and scans through everything and collects the user profiles before disconnecting.  This should give us about a 60,000 user profile benefit.
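The sweep itself is plain IRC protocol traffic.  Here’s a rough sketch with reply parsing, PING handling, and rate limiting left out; the host, port, and nick are placeholders, not the tool itself.

```python
import socket

# Rough sketch of the sweep: connect, enumerate channels, pull member profiles,
# disconnect.  A real run would parse replies, answer PING, and wait out limits.
HOST, PORT, NICK = "irc.example.net", 6667, "surveyor"

sock = socket.create_connection((HOST, PORT))

def send(line: str) -> None:
    sock.sendall((line + "\r\n").encode())

send(f"NICK {NICK}")
send(f"USER {NICK} 0 * :{NICK}")
send("LIST")              # channels come back as 322 (RPL_LIST) replies
send("WHO #somechannel")  # member user/host/ident/realname arrive as 352 replies
send("QUIT :done")
sock.close()
```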

Of course, I can’t run it.  I can only develop it and wait for someone to use it.

 

Sneak Peek at new orch features.

Data Feeds Need Control and Reporting Signal Channels in SOA

I’ll expand on this later after I’ve slept, but this is the high-level design for the next update for tenta (besides some obvious gotchas pertaining to field of vision obfuscation).

This will allow me to control and report on tenta clients so I know what feeds are going where, and even control feed client state.  It’ll also provide some great dashboarding capability.

Update 1:

During the buildout of a new component called “Leptin”, due to being tired and making the perfect typo, I had to drop about 8,000 messages in transport while I rewired the changes in place.  And that’s why you use a pre-prod, and that’s what I get for trying to cut costs with cowboy maneuvers.  More details to follow.

New Standalone: Distributed Endpoint Identity Generation Engine

Introducing DEIGE

I’ve got most pieces of this already built, which I’ve been using for testing, but automated identity creation on IRC networks is super easy, even for highly restrictive environments.

They’re not going to change the IRC protocol, so the constraints of the IRC protocol are givens.

NickServ varies from network to network and depends on which services bots they use and how they’re configured.  So some components will need to be network-dependent.

Given input:

  • email
  • host (meta, registered vs used)
  • user
  • ident
  • password

This should be able to operate as a single command that connects and does everything.
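As a sketch of that single-command shape, the inputs bundle up like this.  The field names follow the list above; the NickServ line in the comment is just the usual Anope/Atheme-style syntax, and, as noted, the real thing is network-dependent.

```python
from dataclasses import dataclass

@dataclass
class Identity:
    email: str
    host: str      # meta: registered vs used
    user: str
    ident: str
    password: str

def create(identity: Identity) -> None:
    # Connect, set NICK/USER from the identity, then hand off to the
    # network-dependent piece, which on Anope/Atheme-style services is roughly:
    #   PRIVMSG NickServ :REGISTER <password> <email>
    # followed by whatever confirmation step that network requires.
    raise NotImplementedError("services syntax is network-dependent")
```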

There are some problems to solve there:

N, R and E are separate hosts.

In addition to that, R and E need to be random and disparate between iterations for a system like this to really work.

The two problems introduced there are:

  • orchestration
  • endpoint creation

My two major pain points in all things.

Endpoint creation for R is relatively easy.  Endpoint creation for E is slightly more complicated as you need to be able to open ports on a host to do that, which requires root access.  For dynamic endpoint creation you’d almost need to generate the OS image and spawn dynamically.

I have an idea that I want to test.

Field of Visibility Change

 

Updated Field Of Visibility

Joins and Parts are Removed from API Returns

The data is still present in the backend so that identity research pages still work; it just isn’t displayed in the log viewer.  There is almost nothing lost by this and plenty gained.

IRCTHULU Logs

Metadata

  • The time the logs resume (correlates to joins in local log)
  • The time the logs stop after a kline (correlates to kline in local log)
  • [any entries present in local logs not present in ircthulu logs — mitigated by random capture delay at client level]

Their Local Logs

Data

  • joins (completely mitigated)
    • user
    • host
    • ident
    • realname
  • registration
    • email (if I reduce this to the last one I’ll be able to confirm they’re tracking users’ emails)
    • host
    • user
    • ident

Metadata

  • age of registration on join
  • profile of joined channels
  • vps provider profile

This Blog

This blog is a great source of information for predictive analytics now.  As it is intended to be.

What I Learned

Network Field of Visibility

There are defined datapoints visible in the network staff’s field of vision:

  • [IRCTHULU’s Logs]
  • [Their Local Logs]
    • [Registration Data]

Among that there is relevant metadata.

IRCTHULU Logs

Data

  • joins
    • user
    • host
    • ident

Metadata

  • The time the logs resume.
  • The time the logs stop after a kline.

Their Local Logs

Data

  • joins
    • user
    • host
    • ident
    • realname
  • registration
    • email
    • host
    • user
    • ident

Metadata

  • age of registration on join
  • profile of joined channels
  • vps provider profile

Conclusion

There will be varying levels of misanalysis that this will need to depend on.  One notable example involves the scary police-state perspective some of the staff will take, which will involve klining random suspected users with no evidence or usable data, something I’ve already seen them start to do.  They’ll end up klining whole network blocks, which will start to have a user impact.  This is to my advantage.  The hidden pressure that they place on themselves, their communities, and their users will manifest as curbed growth and even shrinkage in some communities.  There will be missed hits based on metadata manipulation on my end.  VPS usage on the study network will drop to almost nothing after a while.

This is a ‘good’ masked as a ‘bad’ because those people need to go and eventually will over it; it will help identify them to network owners, who generally do care about such things.  And if they don’t, they’ll appropriately lose hard.  If their priorities are right, I win.  If they aren’t, I win.

I’m still analyzing their field of vision and ways to scramble it, but there’s been pretty good progress after yesterday’s bait-and-tackle.

Needs

At this point I have identified a need to introduce a control channel in the message bus being used for the data feeds.  This should relay control commands to the tenta clients to turn feeds on and off.

There will also need to be a reporting channel that is processed differently from the feeds, giving a wide view of client status so that I can start development on orchestration components.  Using syslog was a bad idea for this use case.
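A sketch of what that split could look like on the existing MQ; the queue names and message shapes are placeholders, not the actual tenta update.

```python
import json

import pika

# Sketch of the split: one queue carries control commands down to the tenta
# clients, another carries status reports back for dashboarding.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="tenta.control", durable=True)
channel.queue_declare(queue="tenta.reports", durable=True)

# Orchestrator side: tell a specific client to stop feeding.
channel.basic_publish(
    exchange="",
    routing_key="tenta.control",
    body=json.dumps({"client": "runner-01", "command": "feed_off"}),
)

# The client side would consume tenta.control, apply the command, and publish
# something like {"client": "runner-01", "feed": "off", "uptime": 3600} to
# tenta.reports for the dashboard consumer to pick up.
connection.close()
```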

Deployment needs to be re-evaluated.

Identity generation needs to be re-evaluated.

New Year Trouble

Well, I’ve got good news and bad news.

The operation that was conducted pretty much all day today to break the feedback loop for Freenode and OFTC staff unveiled a minor but critical vulnerability in the data shape produced by tenta.

Now that it’s mostly over,  or mitigated at least, I can reveal the details.

Problem

Technically it’s not a bug, as the issue is in the “negative spaces” in the data that’s created.  When the tenta client joins, it currently omits its own user from the logs.  This is actually bad, as it can be used to root out the runners.  I’ll explain more below.

Certainty

I’ve been able to confirm that this is the method the staff were using to identify the “bait bots”.  I’d originally thought they were processing some server-side information, and I’m sure they did in some cases, but I was able to conduct thorough A:B and isolation tests to verify that they are also cross-referencing local logs with presenta logs.  This was found by making minor adjustments in their field of vision and then repeatedly waiting for a bait to hook in a controlled manner, comparing page views to klined bots in a predetermined way after assessing what their visible data points were.  They were processing the joins listed in the presenta logs, checking for missing user data there, and comparing against local logs.

Impact

This has to be fixed before we can use any more runner data.  When I first suspected it, I went ahead and deleted random rows from the database to obfuscate the already existing data, so we don’t have to lose the whole database, but I will not be turning the feeds back on until the next update to tenta.  All pooled data is unusable without compromising the runners.

Otherwise, A Relative Success

In other, better news, the staff used approximately 6,086 IP addresses in total during the operation to view the logs.  I think we’ve just about got their loop compromised.

Here is a list of those IP addresses in case you’d like to do something similar if you host a rogue clone of IRCTHULU PRESENTA on your PHP-Apache server.  Dropping this in an include should pretty much ghost out the whole Tor network, most VPNs known for being abused, and almost all the relevant staff’s various proxies and owned IP addresses:

http://paste.silogroup.org/axohacugej.apache
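If you’re starting from a plain list of addresses rather than a ready-made include, a sketch like this (assuming Apache 2.4, one address or CIDR block per line, and example file names) produces an equivalent block to drop into the vhost:

```python
# Sketch only: turns a plain address list into an Apache 2.4 access block.
with open("axohacugej.apache") as src, open("banlist.conf", "w") as dst:
    dst.write("<RequireAll>\n    Require all granted\n")
    for line in src:
        addr = line.strip()
        if addr and not addr.startswith("#"):
            dst.write(f"    Require not ip {addr}\n")
    dst.write("</RequireAll>\n")
```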

The process for adding them to the ban list was automated about 10 minutes in, but I needed to disable the banning for a good long stretch or they’d have caught on to what was really going on.  One of them was really smart and added in some well-crafted characters to try to slide through a grep, and I didn’t see what they were doing until about an hour in; whoever that was knew exactly what was up.

There will still be some of them that can access, but it’s pretty straightforward now.  This will buy plenty of time since I can’t use runners until the Tenta update.  A new version of Nerve will accompany it to add the feature of clearing out the pooled messages on restart.

I’m pretty excited — this was a total blast.  This whole project’s been like that.

Recap

  • This operation did indeed confirm that the OFTC and FNODE networks are actively targeting the runners.
  • The FNODE and OFTC feedback loop is mostly broken, so they won’t be able to for much longer.
  • They did my bug testing and risk analysis for me today, which identified the vulnerability they’d use to find the runners.
  • Unfortunately it was significant enough that I can’t turn the runners back on without compromising their identities.
  • I obtained excellent data leverageable for conducting “further WTF”.  Which I will certainly be doing.