
IoT orchestration – Managing the IoT fleet with chains of automation

The Device Chronicle interviewed Matthias Luescher, Principal Engineer at Schindler and developer of EDI, on managing IoT device deployments, IoT orchestration, automation and the importance of OTA updates for IoT devices.

Matthias has managed large IoT device deployments in excess of 200,000 devices, so he knows a thing or two about the pitfalls of managing these devices. He also runs his own private projects where he can experiment with new innovations and learn. Where IoT projects are concerned, Matthias points out that the great limitation is connectivity: “There is limited bandwidth, devices are not always available; enterprises have to manage what are often globally distributed fleets and you might have the great Chinese Firewall in between.” It is therefore important for enterprises to plan for delta updates, update rollback schemes (such as Mender’s A/B partition scheme) and update retries, so that the devices in their fleets stay resilient in narrowband conditions such as 3G/4G cellular or satellite connectivity.
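The update retries mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical `fetch` callable that downloads an update and raises `ConnectionError` when the link drops; the function name, parameters and delay values are illustrative only and are not taken from Mender or any other tool.

```python
import random
import time
from typing import Callable


def download_with_retries(
    fetch: Callable[[], bytes],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it succeeds, backing off exponentially with jitter.

    `fetch` is a hypothetical download function that raises ConnectionError
    on a dropped link. `sleep` is injectable so the logic can be tested
    without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up; the device keeps running its current software
            # Double the wait after each failure, capped at max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Jitter keeps a large fleet from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
    raise ValueError("max_attempts must be at least 1")
```

The jitter matters at fleet scale: if thousands of devices lose connectivity at the same moment, randomised delays spread their retries out instead of sending them all back to the server at once.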

GitOps, IoT orchestration and continuous deployment

Matthias is very invested in the promise of integrating GitOps with IoT orchestration. His perspective is that the IoT world is far behind the server world, and that there is much to learn from the server world and port to IoT. One very promising approach that he believes can be carried over from the DevOps world is continuous deployment. He says, “The typical IoT solution is still a point and click tool. But if you look at software automation platforms, there is no clicking and the developers are working closely with those operators who are rolling the software images out to the fleet. Site reliability engineers are also immersed in DevOps. They are testing and rolling out and not depending on a handful of fleet managers.”

However, Matthias admits that server-native tools bring challenges, mainly around scalability. For instance, he explains that the Ansible Tower automation platform can orchestrate a fleet, but it is ordinarily designed to serve a small scale of 200 servers. Matthias emphasizes, “These tools are not really designed to be scalable to 200,000 IoT devices.”

Robustness in the IoT fleet

When managing IoT devices remotely, achieving robustness is a prime consideration. Matthias explains the major challenge: “If a software update is performed on a device that is far away, and the device gets bricked (where a device fails and freezes upon updating) then you must go onsite and replace it, and at an average cost of €100 per device this could get very expensive for an enterprise very soon. If, on the other hand, something goes wrong in server software update deployment, then you can re-provision it on the virtual server and you are ready to go again.” So with IoT devices, an A/B partition mechanism is required so that the software can be rolled back to the previous version, which prevents IoT devices from bricking.
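The idea behind an A/B partition scheme can be shown with a toy model. This is a simplified sketch and not Mender's implementation: the `ABDevice` class, the slot names and the `health_check` callable are all hypothetical, and a real device would involve a bootloader and an actual reboot.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ABDevice:
    """Toy model of a device with two root filesystem slots, "A" and "B"."""

    active: str = "A"
    versions: Dict[str, str] = field(default_factory=lambda: {"A": "1.0", "B": ""})

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def update(self, new_version: str, health_check: Callable[[str], bool]) -> bool:
        """Install into the inactive slot, try it, and commit only if healthy."""
        target = self.inactive()
        # The running slot is never overwritten, so a known-good copy survives.
        self.versions[target] = new_version
        # Simulated reboot into the freshly written slot.
        previous, self.active = self.active, target
        if health_check(new_version):
            return True  # commit: the new slot stays active
        # Roll back: simulated reboot into the untouched previous slot.
        self.active = previous
        return False
```

For example, `ABDevice().update("2.0", lambda version: False)` returns `False` and leaves the device running version 1.0 from slot A, which is exactly the property that avoids a site visit.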

Matthias also points out another key difference between IoT devices and servers. He says, “It is not one to one in an IoT device fleet as it is in the server world.” There are many different LTE cellular networks to consider, specific localisations are required for gateway devices in China, and many different hardware generations must be supported. Some IoT hardware in service is many years old.

Continuous development meets IoT orchestration

Matthias believes that the increased use of GitOps for IoT would be a big step forward for a heterogeneous fleet with various management tools. The Git repository, hosted on GitHub, Bitbucket or GitLab, would act as a single source of truth for the software provisioned to the IoT devices. The great benefit of using a Git repository is that all changes can be tracked and users across the enterprise collaborate with the same tools. They all contribute to Git, so it provides an entry point where all users can speak the same language, whether they are developers, site reliability engineers, QA engineers or managers.

[Figure: IoT orchestration with GitOps]

Matthias stresses that “Big bang releases are not the way to go, continuous deployment is the way to go.” He also points out that automation in GitOps helps to prevent human errors, and that a release should be more robust as it moves from a single device to the canary device and then to the full fleet, because every step is fully reproducible.
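The progression from a single device to a canary device to the full fleet can be sketched as a staged rollout that halts as soon as a stage fails too often. This is a minimal sketch under stated assumptions: `deploy` is a hypothetical callable that returns `True` when a device updates successfully, and the stage layout and failure threshold are illustrative.

```python
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> Dict[str, List[str]]:
    """Deploy stage by stage; stop before the next stage if too many devices fail."""
    result: Dict[str, List[str]] = {"succeeded": [], "failed": [], "skipped": []}
    halted = False
    for devices in stages:
        if halted:
            # An earlier stage failed, so later stages are never touched.
            result["skipped"].extend(devices)
            continue
        failures = 0
        for device_id in devices:
            if deploy(device_id):
                result["succeeded"].append(device_id)
            else:
                result["failed"].append(device_id)
                failures += 1
        if devices and failures / len(devices) > max_failure_rate:
            halted = True
    return result
```

With stages such as `[["dev-1"], ["canary-1", "canary-2"], ["fleet-1", "fleet-2"]]`, a failure on a canary device leaves the fleet devices in `skipped`, so a bad release stops before it reaches the bulk of the fleet.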

A Git repository can be used quite effectively as the entry point for IoT fleet management. CI/CD tools can feed back to GitLab, and good automation means that you no longer have to touch the point-and-click systems “below” it. IoT gateways should pull the changes. The challenge is that a standard orchestration tool designed for servers expects the device to be always on, which is not necessarily the case with IoT devices: they can be turned off entirely for short periods of time. Matthias stresses that continuous monitoring of the IoT fleet is also required to get feedback on device status, so that development and rollout can be improved.
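A pull-based gateway can be pictured as a small reconcile step that runs whenever the device happens to be online. The sketch below is an assumption about how such a step might look, not a description of any specific tool: `fetch_desired` stands in for reading the desired revision from the Git repository, and `apply` stands in for installing it.

```python
from typing import Callable, Optional


def reconcile_once(
    installed: Optional[str],
    fetch_desired: Callable[[], str],
    apply: Callable[[str], bool],
) -> Optional[str]:
    """Run one pull cycle and return the revision the device runs afterwards.

    `fetch_desired` and `apply` are hypothetical callables supplied by the caller.
    """
    try:
        desired = fetch_desired()
    except ConnectionError:
        # Offline: keep running the current revision and try again next cycle.
        return installed
    if desired == installed:
        return installed  # already converged, nothing to do
    # Only report the new revision if applying it actually succeeded.
    return desired if apply(desired) else installed
```

Because the device initiates every cycle, a gateway that was switched off simply catches up on its next pull, and the server side never needs to assume the device is reachable.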

Role of OTA updates in the project

An OTA software update solution such as Mender fits in well for robustness and automation thanks to its API. Automation could be handled by the orchestration platform, which would interact directly with the OTA update solution. Matthias has built an example project using the orchestration platform Ansible in combination with Mender. In this use case, Matthias can turn a device into a kiosk screen with a configuration command. He explains that “Configuration key value pairs are used in Mender and it puts a configuration artifact to the Raspberry Pi 4-based device. A playbook is cloned from Ansible to the device, and additional roles are pulled in from Ansible Galaxy and GitHub, and Debian APT operating system packages are applied to the device.” Matthias says that both GitHub and the Debian APT repository for the device operating system scale very well in a global environment.

In the project, Mender allows the deployment to be offloaded to the device, which really helps if the device has limited bandwidth or is offline for any period of time. Mender’s A/B update allows a device to roll back to the previous installation and to be re-provisioned from scratch. LTE operators are tricky to deal with, but Mender checks for the connection and, if there is no connection, rolls the software back to the previous version. Devices can be mounted underground with unreliable cellular connectivity, or a power plug could be accidentally pulled on a device on a construction site. Matthias stresses that “Bricks will happen, and if they do then the operations team will have to dispatch technicians to the site at least to reboot the device, depending on the role and mission-critical importance of the work that the device carries out.”

In a larger enterprise project in which Matthias is involved, the team operating the IoT fleet might all be interacting with GitHub. The fleet orchestration gets activated, the software update is requested, some additional configurations are performed on Azure IoT Hub, some additional values are fetched from a company ERP database, and it is all brought together. The CI/CD pipeline also interacts with Git and builds Debian packages and Mender artifacts. Everything is automated, and Mender helps to deal with the weak connectivity on the IoT devices and performs both system and application updates. Mender also provides critical feedback on the status of the devices. Matthias concludes, “You need the feedback from the remote device to understand, for the 50 software updates requested, how many of those updates failed. Mender plays a critical role in this regard.”
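The kind of feedback Matthias describes can be reduced to a small aggregation over per-device status reports. The sketch below assumes the reports are already available as `(device_id, status)` pairs; the status labels `"success"` and `"failure"` and the function name are hypothetical and do not reflect the output format of Mender or any other tool.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_deployment(reports: Iterable[Tuple[str, str]]) -> Dict[str, object]:
    """Summarise per-device update results for one deployment.

    Each report is a (device_id, status) pair. The status labels are
    illustrative assumptions, not a real tool's vocabulary.
    """
    reports = list(reports)  # allow generators; we iterate more than once
    counts = Counter(status for _, status in reports)
    failed = sorted(device for device, status in reports if status == "failure")
    total = len(reports)
    return {
        "total": total,
        "counts": dict(counts),
        "failed_devices": failed,
        "failure_rate": counts["failure"] / total if total else 0.0,
    }
```

Listing the failed device IDs, and not only the failure count, is what lets an operations team decide whether a retry is enough or a technician has to be dispatched.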

We wish Matthias well as he continues to explore improved forms of IoT fleet management through chains of automation. 

You can explore Matthias’s thoughts and work further on his personal blog.
