Shit's gone wrong? You're in the right place!
This document is a 1-pager for dealing with issues with Hyperion services. See any of the following chapters for dealing with specific issues.
Issues with HyperionDB
Neo4j is down
If any of the Hyperion services are down, the first step is to open Portainer and check whether the Docker containers for the associated service are stopped or just erroring out. If Neo4j's Docker container is down, you can attempt to stop the stack, duplicate it, start the new stack, and remove the old one.
To access Portainer you need to be part of the oobtel-admins group in oobtel's GitHub org, then go to port.oobtel.network and authenticate with GitHub.
All the data for Hyperion is stored on NAS storage, which is not affected by the stack migration/duplication. Make sure NOT to start both stacks at the same time; both Neo4j engines would attempt to attach to the same data store, which causes issues.
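Before duplicating the stack, it can help to confirm whether the Neo4j engine itself is answering on its HTTP port. A minimal probe sketch, assuming the default `localhost:7474` endpoint (adjust the URL to your deployment):

```python
# Minimal liveness probe for Neo4j's HTTP port.
# The host and port (localhost:7474) are assumptions; substitute your own.
import urllib.request
import urllib.error


def neo4j_http_alive(url: str = "http://localhost:7474",
                     timeout: float = 5.0) -> bool:
    """Return True if the Neo4j HTTP discovery endpoint answers with a 2xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused / timeout: the engine is down or unreachable.
        return False
```

If this returns False while the container is running, the engine is likely stuck rather than stopped, and the stack-duplication procedure above is worth trying.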
Neo4j's Docker container is unhealthy
The docker-compose file for Neo4j contains a health check that uses Neo4j's built-in HTTP health-check endpoint to determine whether the database is healthy. Consequently, an unhealthy container can have multiple causes, including:
- Most common: the DB is overloaded and queries are either failing or timing out. To triage, check the timing of recent ETL jobs and see if any of them either took very long or failed. Neo4j can report as unhealthy when it needs to handle many concurrent requests that use MERGE statements. If that is the case, try batching the data loading or adding delays between requests.
- Sometimes happens: when restarted, Neo4j rechecks all the data against the deployed constraints (i.e. the schema). On one of the staging deployments this failed, for reasons I didn't have time to look into, even though the data was valid. I was not able to replicate this issue on other deployments, but an easy "fix" was to re-pull Neo4j at a lower version, start the DB, stop the DB, and pull the newer Neo4j version again.
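The batching/delay mitigation for the overload case can be sketched as follows. `run_batch` stands in for whatever executes one batch (e.g. an `UNWIND ... MERGE` query via the Neo4j driver); its name, the batch size, and the delay are illustrative assumptions, not the pipeline's actual values:

```python
# Sketch: feed rows to the DB in chunks, pausing between chunks, so Neo4j
# is not hit with many concurrent MERGE statements at once.
import time
from typing import Callable, Iterable, List


def load_in_batches(rows: Iterable[dict],
                    run_batch: Callable[[List[dict]], None],
                    batch_size: int = 500,
                    delay_s: float = 0.5) -> int:
    """Pass `rows` to `run_batch` in chunks of `batch_size`, sleeping between
    chunks. Returns the total number of rows sent."""
    batch: List[dict] = []
    sent = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            run_batch(batch)
            sent += len(batch)
            batch = []
            time.sleep(delay_s)  # give the DB room to breathe
    if batch:  # flush the remainder
        run_batch(batch)
        sent += len(batch)
    return sent
```

With the real driver, `run_batch` would wrap something like `session.run("UNWIND $rows AS r MERGE (...)", rows=batch)` so each chunk becomes a single query instead of many concurrent MERGEs.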
Issues with the Data Pipeline
ETL scripts are failing
If multiple errors are being reported for an ETL script in Windmill, you need to check the cause of the failure directly in Windmill (although the error message sent by Windmill will contain the same stack trace). Specifically, you need to:
- Check which Flow is causing the issues
- Check which part of the Flow caused the error
- Check the stack trace to see which part of the script caused the error
- Attempt to fix whatever the issue is and retest the script.
Some known issues that can happen include:
- OpenCTI-related pipelines failing due to OpenCTI API issues. This cannot be resolved on the Windmill side, since OpenCTI's API is not exactly stable at times. Best to add a retry mechanism and troubleshoot on the OpenCTI side.
- VirusTotal ingest pipelines failing due to non-ASCII characters being included in the JSON returned by the VT API and passed to the next stage of the flow. This can be resolved by stripping these characters or Base64 encoding the JSON before sending it to the next flow step.
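Both known-issue mitigations above can be sketched like this. The function names, retry counts, and backoff are illustrative assumptions, not the pipeline's actual code:

```python
# Sketches of the two mitigations: a retry wrapper for flaky OpenCTI API
# calls, and sanitisation of the VT JSON before handing it to the next
# flow step (strip non-ASCII characters, or Base64-encode the whole blob).
import base64
import json
import time
from typing import Any, Callable


def with_retries(call: Callable[[], Any],
                 attempts: int = 3,
                 backoff_s: float = 2.0) -> Any:
    """Retry `call` with linear backoff; re-raise after the last attempt."""
    for i in range(attempts):
        try:
            return call()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(backoff_s * (i + 1))


def strip_non_ascii(data: dict) -> str:
    """Serialise to JSON and drop any non-ASCII characters.
    Structural JSON characters are all ASCII, so only string content changes."""
    text = json.dumps(data, ensure_ascii=False)
    return text.encode("ascii", "ignore").decode("ascii")


def b64_payload(data: dict) -> str:
    """Alternatively, Base64-encode the whole JSON blob; the next flow step
    decodes it before parsing."""
    raw = json.dumps(data).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")
```

Stripping is simplest but silently alters string values; Base64 keeps the payload intact at the cost of a decode step in the receiving flow stage.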
You can find the docs on how to access Windmill here.
Windmill is down
If Windmill is down, you can follow essentially the same guidelines as the ones for Neo4j above.
Windmill needs to be recreated from scratch
If Windmill needs to be recreated from scratch (or needs to be migrated to a new server), you have 2 options:
- If the PostgreSQL database powering Windmill is intact: you can attempt to attach the TrueNAS data volume to the new server and recreate the stack on the new server using the attached volume. Make sure to use the /mnt/truenas/mount path.
- If PostgreSQL is not recoverable: you can restore the scripts (but not the historical run data) for Windmill by using the dedicated guide in the hyperion-etl-scripts repo.
Issues with the Binary Pipeline
Currently there are no recovery guides for the Binary Pipeline, as it's still in its Alpha stage.