A common concern when exploring new architectural designs, concepts, or technologies is sourcing data that is similar enough to the live data you will be ingesting. Though there are plenty of dummy datasets out there, they generally lack the specificity, depth, or volume required for comprehensive testing of enterprise-scale solutions, particularly at the intersection of bandwidth, throughput, and compute.
Sure, you’ve done the research, the documentation is thorough, and sales reps have never been wrong (wink wink), but perhaps the solution you envision has crossed the threshold into the gray space between what’s ‘documented common practice’ and what you *know* is possible, or what your solutions providers claim is possible. Alternatively, you could just be the type of architect who likes to verify performance before committing to a recommendation.
As a Principal Architect, POC warrior, and avid homelab tinkerer, I cross this threshold more often than is reasonable, but I didn’t consider tackling it until I was forced to by another homelab project, which is also written in Go (more on this later).
Use Case
The most obvious use case is probably for my fellow POC warriors following along with the Big Data SIEM series, especially those who may not have immediate access to a supported data source. Ideally, Babel can be used to benchmark your ETL, storage, and ingestion layer design recommendations.
Aside from this however, there are two primary scenarios that I run into frequently:
Professional Settings: In professional settings, we are often able to copy large sets of sample data from PRD to NPE environments for testing (often, but not always). Though the dataset is exactly what we’re developing a solution for, many team members will have “termed access” to the required resources for testing, and consultants are typically not allowed to bring these samples back to their own environments for further development. I’ve also (frequently) been in scenarios where the client had an unreasonably low budget for testing, and the company I was working with did not want to invest a dime into R&D.
The Homelab: This is where things get a little more challenging, especially when it comes to security research, because we have to go out and collect the data ourselves or start paying subscription fees, which I’m not keen to justify to my wife. With Babel, we can generate as much data as needed, locally, to test our new projects.
Capabilities
Currently at version 1.0, Babel supports 48 of Fortinet’s FortiGate/FortiOS 7.4.0 log message fields, which are either generated randomly (e.g., NAT IP, source MAC, port numbers) or selected at random from a list, as is the case with “Destination Country”, “Virtual Domains”, and the like. Fields such as “OS Name” are selected randomly from options made available by a parent condition, such as “Device Type”, so that we do not get output like devtype="Windows Server" osname="Ubuntu".
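To illustrate that parent-child constraint, here is a minimal sketch of how dependent fields can be generated in Go. This is illustrative only, not Babel’s actual implementation; the deviceOS map, its contents, and the pick helper are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Hypothetical parent-to-child mapping: each device type only
// permits operating systems that make sense for it.
var deviceOS = map[string][]string{
	"Windows Server": {"Windows Server 2019", "Windows Server 2022"},
	"Linux Server":   {"Ubuntu", "Debian", "RHEL"},
	"Workstation":    {"Windows 11", "macOS"},
}

// pick returns a random element of the slice.
func pick(options []string) string {
	return options[rand.Intn(len(options))]
}

func main() {
	// Choose the parent field first, then constrain the child to it.
	devtypes := make([]string, 0, len(deviceOS))
	for d := range deviceOS {
		devtypes = append(devtypes, d)
	}
	devtype := pick(devtypes)
	osname := pick(deviceOS[devtype]) // never pairs Windows Server with Ubuntu

	fmt.Printf("devtype=%q osname=%q\n", devtype, osname)
}
```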
I chose to start with the Fortinet FortiGate log message fields not out of loyalty to Fortinet, but because of the depth, detail, and variety they offer. Fortinet also has great documentation and field definitions, which were a huge boost to the project along the way.
Additionally, I’ve built 8 flags to tailor the output for a wider variety of use cases (a sketch of how these can be declared in Go follows the list). Currently, this includes the following:
- n : Number of log events to generate (default: 10)
- fmt : Output format: csv, json, raw (default: csv)
- f : Output filename for CSV (default: logs.csv)
- dir : Output directory location (default: project folder)
- r : Maximum events per second (0 = unlimited, default: 0)
- vB : Maximum volume in bytes before stopping (overrides -n)
- vT : Maximum volume in terabytes before stopping (overrides -n)
- help : Shows the help message and exits
Note that for the json and raw formats, Babel creates individual output files, whereas the csv format writes everything to a single CSV file.
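For reference, this is roughly how flags like these are declared with Go’s standard flag package. It is a sketch based on the list above; Babel’s actual variable names and types may differ.

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flag names and defaults mirror the list above; variable names are illustrative.
	n := flag.Int("n", 10, "number of log events to generate")
	format := flag.String("fmt", "csv", "output format: csv, json, raw")
	file := flag.String("f", "logs.csv", "output filename for CSV")
	dir := flag.String("dir", ".", "output directory location")
	rate := flag.Int("r", 0, "maximum events per second (0 = unlimited)")
	vB := flag.Int64("vB", 0, "maximum volume in bytes before stopping (overrides -n)")
	vT := flag.Float64("vT", 0, "maximum volume in terabytes before stopping (overrides -n)")

	// The flag package generates the -help output from these definitions.
	flag.Parse()

	fmt.Println(*n, *format, *file, *dir, *rate, *vB, *vT)
}
```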
Planned Improvements
Improvements may not land as quickly as I would like, as I am currently working on two red team certifications as well as two homelab projects, all of which I hope to complete before the end of April. Regardless, I would break the planned improvements into three tiers:
Tier One: The first set of improvements will revolve around enabling Babel to operate as a continuous data source, rather than a data dump, for POCs in enterprise cloud environments or the homelab. This will include (at least) API access and enhanced rate limiting.
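As a sketch of what basic rate limiting can look like in Go (the emit function and the hard-coded rate are stand-ins; the existing -r flag would supply the real value), a time.Ticker releases one event per interval:

```go
package main

import (
	"fmt"
	"time"
)

// emit stands in for whatever writes a single log event.
func emit(i int) { fmt.Printf("event %d\n", i) }

func main() {
	const eventsPerSecond = 5 // stand-in; would come from the -r flag

	// Tick once per interval; each tick releases exactly one event,
	// capping throughput at eventsPerSecond.
	ticker := time.NewTicker(time.Second / eventsPerSecond)
	defer ticker.Stop()

	for i := 0; i < 10; i++ {
		<-ticker.C
		emit(i)
	}
}
```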
Tier Two: With the first set of improvements operational, I plan to add more data sources to run in parallel. As of right now, this includes the following:
- Zscaler DNS Logs: >25 fields
- MISP Core Format: >40 fields
Tier Three: The final tier is about adding focus to the randomized data, to create more convincing log messages, or at least giving the development team the ability to do so with quick modifications. By this, I mean that controls will be in place to increase or decrease the rate at which targeted events occur, so that the data generated by Babel can also be used to test your SIEM.
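One common way to implement that kind of control is weighted random selection, where raising an event’s weight increases how often it appears in the output. The event names and weights below are hypothetical, purely to show the shape of the technique:

```go
package main

import (
	"fmt"
	"math/rand"
)

// weightedEvent pairs an event type with a relative frequency.
type weightedEvent struct {
	name   string
	weight int
}

// Hypothetical events: raising a weight makes that event appear more often.
var events = []weightedEvent{
	{"traffic-allowed", 90},
	{"traffic-denied", 8},
	{"ips-alert", 2}, // rare event you might want your SIEM to catch
}

// pickWeighted selects an event in proportion to its weight.
func pickWeighted(evs []weightedEvent) string {
	total := 0
	for _, e := range evs {
		total += e.weight
	}
	r := rand.Intn(total)
	for _, e := range evs {
		if r < e.weight {
			return e.name
		}
		r -= e.weight
	}
	return evs[len(evs)-1].name
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Println(pickWeighted(events))
	}
}
```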
Performance
As you will (probably) see in the main.go file, I did not invest time in a counter that prints “You made x events in y seconds!” because this conflicts with future plans for continuous runtime operations. However, Babel took less than 1 second to create a CSV file with 1,000,000 events, 48 header columns deep. Performance did not change when the output format was switched between raw, CSV, and JSON.
Some of my friends and readers are probably thinking “so what?” Well, I’m actually pretty excited about this, because I had recently created 500GB of similar data (only 30 header columns) using a similar script in Python, and it took quite a while. How long, I don’t even know, because I left for lunch after a few minutes.
Closing notes
As mentioned above, Babel was initially intended to be a high-performance “dummy data load utility” to help benchmark a security project I am building in the homelab. However, as development continued, it became clear that this project has several possible use cases, both in enterprise environments and for other security researchers. If you want to laugh: I actually wrote the program twice, the second time to facilitate a more modular design for upcoming features and data sources.
Babel is published to my GitHub account (https://github.com/TylerG01/Babel), and I hope you’ll share your feedback with me via LinkedIn. Special thanks to Tyson Barber for feedback, brainstorming, and sanity checks along the way.