Inside the hood — machine learning enhanced real time alarms with ZoneMinder
A week ago, I posted a brief article on how I integrated machine learning with ZoneMinder for better alarm notifications. Since then I've received emails and questions on how things work, some of them from folks who know what they are doing (as I said earlier, this is not going to be 'drop-in-and-it-works'), so I thought I'd write up this article on what I did.
Before I continue, here is the sort of outdoor image I need to deal with. Best of luck defining pixel blobs and zones to detect objects of interest!
Key goals
My primary goals were the following:
- I want to enable notifications for alarms in zmNinja but don't want rubbish pushed to my phone (currently the case with ZM's detection algorithms, which rely on blob/pixel groups)
- I only wanted to be notified of “vehicles” and “people” who come up to my driveway/house
- It needs to run at an acceptable speed on non-GPU machines
- I needed this to be as real time as possible — a script running in the background and notifying me after the alarm completes isn’t of particular use
Enter the EventNotification server
Obviously, I decided to add this support to my EventNotification server (ES) since it already supports the full infrastructure needed to push to my phone as well as WebSocket interfaces.
General approach of hooking into ZM
`zmeventnotification.ini` now contains a `hook` attribute. If you specify a script file there, every time the notification server detects an alarm (it interacts with ZM using shared memory and checks the `shared_data` structure for events), it passes the event ID, monitor ID and monitor name to that script. The script can do whatever it chooses: if it returns `0`, the event server processes the notification and sends it out to listeners; if it returns `1`, it does not. So the script is where all the action is. I then used this framework to do object detection and, based on the results, either notify or not. Note that this does not affect what ZM records. ZM has no idea of object detection at the moment (which I am hoping will change soon). This integration is only for notifications.
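To make that contract concrete, here is a minimal, hypothetical hook written in Python (the hook I actually ship is a shell script; the only parts taken from the description above are the three arguments and the 0/1 exit-code convention, the "driveway" rule is made up):

```python
#!/usr/bin/env python
# Hypothetical hook illustrating the contract described above.
# The ES invokes: <hook script> <event id> <monitor id> <monitor name>
# Exit code 0 -> the ES sends the notification, 1 -> it is suppressed.
import sys

def main():
    if len(sys.argv) < 4:
        sys.exit(1)  # malformed invocation: suppress rather than spam

    event_id, monitor_id, monitor_name = sys.argv[1:4]

    # Real detection logic goes here; this stub only notifies for one camera.
    interesting = monitor_name.lower() == "driveway"  # made-up rule

    sys.exit(0 if interesting else 1)

if __name__ == "__main__":
    main()
```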
Choice of Algorithms
Person/vehicle detection was my primary goal. All other types are gravy. I had the following options:
- HOG: Very fast, inaccurate
- Inception: Very accurate, very slow, many labels
- YOLOv3: Accurate, not too slow, sufficient labels
- YOLOv3 with Tiny YOLO configuration/weights: Reasonably accurate, almost as fast as HOG, sufficient labels
Fortunately, OpenCV supports all of these models.
My approach therefore became:
- When an alarm occurs in ZM:
  - invoke the hook script; the hook script is a shell script that will first get an image to be analyzed (more on this later)
  - the hook script will then pass this image to `detect_hog.py` (which uses HOG) or `detect_yolo.py` (which uses YOLO/Tiny YOLO); these Python scripts pass back a string with `detected:` followed by the label(s) they detected
  - the wrapper script parses the returned string, matches it against the labels you are interested in, and accordingly returns `1` (fail) or `0` (success), based on which the notification is suppressed or processed
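Compressed into one hypothetical Python sketch, that flow looks roughly like this (the `--image` flag and paths are illustrative, not the actual interface of the shipped scripts):

```python
# Hypothetical condensed view of the hook flow; paths and flags are illustrative.
import re
import subprocess
import sys

def should_notify(image_path, pattern=r"(person|car)"):
    # Hand the already-downloaded alarm frame to one of the detection scripts.
    result = subprocess.run(
        ["python", "/var/detect/detect_yolo.py", "--image", image_path],
        capture_output=True, text=True,
    )
    # The scripts print something like "detected:person:99%,car:97%".
    return bool(re.search(pattern, result.stdout))

if __name__ == "__main__":
    # Exit 0 -> the ES sends the notification, 1 -> it is suppressed.
    sys.exit(0 if should_notify(sys.argv[1]) else 1)
```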
Performance Notes
My ZM server doesn't have a GPU. On an Intel Xeon 3.16 GHz 4-core machine:
- HOG takes 0.24s
- YOLOv3 with tiny-yolo takes 0.32s
- YOLOv3 takes 2.4s
Given these scripts are triggered only when an alarm occurs, I think these numbers are perfectly usable for home systems. For systems with many more cameras (I have 8) or faster requirements, I’d strongly recommend using GPUs. But if neural nets are your thing and you are looking at blazing performance, read this.
From a memory consumption perspective, YOLOv3 takes 4GB to load the model while YOLOv3 with tiny weights takes around 1GB (I haven't verified these values myself; I read them here). Yeah, a lot, I know. This is also why I exit the processes after detection instead of keeping them alive. This results in a longer processing time (the 2.4s includes load time) but relieves memory pressure while no alarms are occurring.
Architectural Changes to Event Notification Server
Prior to the 2.x release, the ES was a sequential, event-based operation: event polling occurred only after processing of the current event was done. All good, but if you plan to use machine learning models that take time, locking up the main event loop is a bad idea. So I changed the architecture to fork and continue. However, this causes a problem: if you are using `WSS` and the parent manages the connection, the child gets a copy of the SSL connection, and when it exits, the parent's SSL state for that connection goes out of sync and the remote connection is terminated. Not good. To work around this issue, the websocket handling code was removed from the child, and pipes are used to communicate from child to parent before the child exits, so that the parent sends out the message. That, in a nutshell, is the new architecture.
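The fork-and-pipe idea, sketched below in Python purely for illustration (the ES itself is a Perl script; the point is only that the child never touches the SSL socket and reports its verdict to the parent over a pipe):

```python
# Illustration of the fork-and-pipe pattern described above, not the real ES code.
# The parent keeps sole ownership of the SSL/WebSocket connection; the forked
# child runs the slow detection hook and writes its exit code back over a pipe.
import os

def handle_alarm(run_hook, send_notification):
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                          # child: never touches the socket
        os.close(read_fd)
        code = run_hook()                 # 0 -> notify, 1 -> suppress
        os.write(write_fd, str(code).encode())
        os._exit(0)
    else:                                 # parent: main loop stays in control
        os.close(write_fd)
        # The real server folds this read into its event loop instead of blocking.
        if os.read(read_fd, 16) == b"0":
            send_notification()
        os.close(read_fd)
        os.waitpid(pid, 0)
```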
Understanding how machine learning based model detection works with ZM
Step 1: detect_wrapper.sh — the right frame to analyze
When an alarm occurs, ZM starts recording right away. However, how do we know which frame to send for analysis? Note this is real time: the alarm is still being recorded. Fortunately, Isaac Connor implemented a lovely image wrapper in `image.php` that takes away the complexity of retrieving images. However, my needs were more specific:
- I should be able to retrieve the first “alarmed” frame in an ongoing event
- I should be able to retrieve a frame with the “highest score” in an ongoing event
- I should be able to retrieve a frame by its ID
- I should be able to do all of this without having to call a separate login API because this is real time
The reason we need so many options is that which frame you pass for detection depends on your situation and how you have configured your detection zones. In my case, I always use the first 'alarmed' frame, but imagine your zone maps to a garage door opening and what you really want to pass is an image of the person entering. That may be better suited to the 'highest score' frame. Options are good. Test and decide.
To be able to support these new modes, changes were made to `web/views/image.php` and `web/includes/frame.php`. They were merged into master on Oct 11, 2018 (US ET date/time), so you may need to pull them manually if your build is older.
Right, so after all these changes, `detect_wrapper.sh` basically grabs the image to analyze using:

```bash
wget "${PORTAL}/index.php?view=image&eid=$1&fid=${FID}&width=800&username=${USERNAME}&password=${PASSWORD}" -O /var/detect/images/$1.jpg
```
Where:
- `${PORTAL}` = your ZM portal, https://myserver:port/zm
- `$1` = event ID of the event that just triggered
- `${FID}` = `alarm` (aka first alarmed frame) OR `snapshot` (aka max score) or a specific frame number
Note: If you use ‘alarm’, you need to either enable “frame” storage in the camera or wait for this PR to get merged.
Boo Yah! We now have the right image to analyze in /var/detect/images/<EventId>.jpg
I realize all we did so far is grab an image. But it’s the right image. Boo Yah! I’m easily excitable.
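If you would rather pull the frame from Python than shell out to wget, a rough (untested) equivalent using the same `index.php?view=image` endpoint and parameters could look like this:

```python
# Rough Python equivalent of the wget call above; same endpoint and parameters.
import requests

def fetch_alarm_frame(portal, event_id, fid, username, password,
                      out_dir="/var/detect/images"):
    """Download the frame ZM picks for fid ('alarm', 'snapshot' or a frame number)."""
    resp = requests.get(
        f"{portal}/index.php",
        params={
            "view": "image",
            "eid": event_id,
            "fid": fid,          # 'alarm' = first alarmed frame, 'snapshot' = max score
            "width": 800,
            "username": username,
            "password": password,
        },
        verify=False,            # many home ZM portals use self-signed certs
        timeout=10,
    )
    resp.raise_for_status()
    path = f"{out_dir}/{event_id}.jpg"
    with open(path, "wb") as f:
        f.write(resp.content)
    return path
```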
Step 2: detect_wrapper.sh — the right algorithm
By default, I provide 3 algorithms that you can select in this section. Simply put:
- If you use `DETECTION_SCRIPT` to point to the HOG script, that's all you need
- If you use `DETECTION_SCRIPT` to point to YOLOv3, select whether you want the Tiny YOLO configuration or the full YOLOv3 configuration. The former is much faster but less accurate. See my notes on performance earlier.
- You will need to download the configuration/weights/labels for these Python scripts and place them in a `www-data`/`apache` readable location
In the same script, `DETECT_PATTERN` holds the patterns you want to be notified for. The default value is `(person|car)`, but you can change that to `(person|car|giraffe)` or something else. Just make sure the labels are part of this data set; these are the labels that the model(s) I am using are trained to detect. Note that these are substring searches, so if the script detected multiple objects in addition to a person and returned "person:99%,umbrella:86%", that would still match, because "person" matched.
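A quick way to convince yourself of that substring behavior (a standalone Python check, not code from the wrapper itself):

```python
import re

pattern = r"(person|car)"                                    # the default DETECT_PATTERN
print(bool(re.search(pattern, "person:99%,umbrella:86%")))   # True  -> notify
print(bool(re.search(pattern, "umbrella:86%,dog:91%")))      # False -> suppress
```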
Also note that `${IMAGE_PATH}` needs to be created before you run the script and needs to be `www-data`/`apache` readable and writable, including directory write permissions!
Finally, the wrapper also uses the `--delete` parameter to tell the detection scripts they should delete the image after analysis. You may want to remove this during testing if you want to see which images the wrapper pulls, so you know which `FID` mode is useful for you.
Step 3: detect_hog.py and detect_yolo.py — the smart ones
So far, we're just talking about hooks and wrappers and how to grab the right image to analyze. The actual analysis happens in these files.
The first thing you should know is that these scripts require additional libraries. Given I already have the libs installed, I did not note down exactly which dependencies you will need, but be ready to use Python `pip` to get the core modules. Note that you will need a version of OpenCV recent enough to support the `DNN` module and the YOLOv3 integration, which I believe is >= 3.4.
`detect_hog.py` uses OpenCV's `cv2.HOGDescriptor_getDefaultPeopleDetector()`, which loads a pre-trained histogram-of-oriented-gradients detector that is meant to look for people. It then uses `hog.detectMultiScale()` to analyze the image against this model and come up with a confidence score. A good explanation of the parameters of this function (and from where I copied the code) is here. Note that you'll find significant improvements, at the cost of speed, as you tweak them. That being said, given I found YOLOv3 with the tiny config to be almost as fast as HOG and much more accurate, I don't use HOG. However, if you have memory constraints that don't allow you to load the Tiny YOLO model (1GB+), then you may have to use HOG.
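For reference, the core of that HOG path boils down to a few standard OpenCV calls; the sketch below is illustrative (the parameter values are mine, not necessarily what the shipped script uses):

```python
# Minimal people detection with OpenCV's built-in HOG person detector.
# winStride/padding/scale are illustrative; tuning them trades speed for accuracy.
import sys
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread(sys.argv[1])
rects, weights = hog.detectMultiScale(
    image,
    winStride=(8, 8),   # smaller stride = more accurate, slower
    padding=(8, 8),
    scale=1.05,         # image pyramid scale factor
)

if len(rects) > 0:
    print("detected:person")   # the wrapper looks for this string
```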
`detect_yolo.py` uses OpenCV's DNN module to run a neural net on the images. You can pass it different configs, weights and labels. By default, I provide instructions on how to get the YOLOv3 standard and YOLOv3 tiny configs/weights/labels here. But you can use many other models. Briefly, the "weights" file is the pre-trained model (and why it's so large), the "config" file is basically the neural net structure that can be fed those weights, and the "labels" are the set of labels the model was trained for. In short, using these three files, you can load a neural net that is "already trained" to detect the labels and avoid the full training process (which takes a lot of time, as well as the original training images).
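The heart of that script, sketched with OpenCV's standard Darknet loading calls (the file names and the 0.5 confidence threshold below are illustrative assumptions, not the shipped defaults):

```python
# Sketch of running a Darknet YOLO model through OpenCV's DNN module.
# yolov3.cfg / yolov3.weights / coco.names are the usual Darknet file names.
import sys
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
labels = open("coco.names").read().strip().split("\n")

image = cv2.imread(sys.argv[1])
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)

# Forward pass through the output layers, collecting labels above a threshold.
layer_names = net.getLayerNames()
out_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers().flatten()]

detected = set()
for output in net.forward(out_layers):
    for detection in output:
        scores = detection[5:]              # per-class probabilities
        class_id = int(np.argmax(scores))
        if scores[class_id] > 0.5:
            detected.add(labels[class_id])

if detected:
    print("detected:" + ",".join(sorted(detected)))
```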
Hope this gives you some more visibility and enjoy getting the alarms you deserve!
(As I was editing this article, I got an alarm that someone came by my front door. Hey, with that many boxes, I'd love to know! I love how accurate it is. Note the time — I got the alarm at 1:03PM and the ZM time shows 1:03:24.)