A Generalist AI Won't Fix Your Raw Logs Mess
There's a category of work in security that doesn't get much attention but that everything depends on.
It's not detection engineering and it's not threat intel. It's getting your security data to speak the same language across every source you collect.
Every security tool, cloud platform, and SaaS vendor structures its logs differently. Different field names, different nesting, different conventions, all describing the same things. And if you want to correlate across any of them, someone has to translate.
If you've been following our series on Zoom as a security data source (Part 1 explored the telemetry, Part 2 built a location anomaly detection on top of it), this is the layer underneath: how to map raw vendor data into the schemas your tools expect. We’ll compare three schemas that keep coming up in conversations with security teams: ECS, UDM, and OCSF.
But first, a topic worth addressing.
AI Is Only as Good as Its Understanding of the Data
Security AI is only as good as its understanding of the data underneath. And security data is a mess to understand.
Every vendor structures logs differently. Same concepts. Thousands of representations. IP addresses alone show up in hundreds of forms across security tools. Raw vendor logs? That number climbs into the thousands. And IP is just one entity. Security data has hundreds more (devices, users, processes, network sessions), each tangled in its own representations.
An AI agent hitting raw vendor logs has to make semantic judgment calls on every field. Is this the actor or the target? Is this a string or an enum? Get these wrong and it's not a formatting issue. It's an investigation quality issue. A human analyst will see a malformed field and ask for clarification. An AI agent won't. And when you need judgment on every field, you get different answers every time. Good luck reproducing that investigation.
Schemas are how the industry solves this today. They're where analysts and AI agents meet the data. Your SIEM is built around one. Your detections are coupled to it. Getting the schema mapping right isn't optional. Everything downstream depends on it.
Here's what separates a specialized normalization agent from a generalist pointed at raw logs: the generalist sees a field and has to infer what it means. Every time. A normalization agent trained on vendor-specific schemas knows before it encounters the data. It's translating from a known source structure to a known target.
This is why the "one AI agent does everything" pitch falls apart in security. You need agents scoped to specific jobs (normalization, detection, triage, investigation, remediation) each with real depth. The normalization layer gives the rest of the chain something consistent to reason over.
Now let's look at what that normalization work actually involves.
The Babel Problem
The problem shows up the moment you try to correlate across sources. Take a single Zoom meeting event and try to correlate it with sign-in activity from Okta and endpoint telemetry from CrowdStrike.
Zoom gives you user_id, email, ip_address, os, qos[].details.avg_latency.
Okta gives you actor.id, client.ipAddress, client.userAgent.
CrowdStrike gives you UserName, aip, event_simpleName.
Same concepts. Different field names. Different nesting. Different conventions.
If you want to correlate across these sources ("show me everything this user did across Okta, Zoom, and CrowdStrike"), you have two choices: write custom joins every time, hunting through each vendor's schema for equivalent fields. Or translate everything into a common language first and query once. The rest of this post is about the second option.
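To make option two concrete, here's a minimal sketch of the translation step. The vendor field names come from the examples above; the common keys (`user_email`, `src_ip`), the mapping tables, and the helpers are illustrative, not any real schema:

```python
# Minimal sketch: translate vendor field names into one common vocabulary
# so a single query works across sources. The common keys ("user_email",
# "src_ip") and the mapping tables are illustrative, not a real schema.
VENDOR_MAPS = {
    "zoom": {"email": "user_email", "ip_address": "src_ip"},
    "okta": {"actor.alternateId": "user_email", "client.ipAddress": "src_ip"},
    "crowdstrike": {"UserName": "user_email", "aip": "src_ip"},
}

def get_path(event, dotted):
    """Walk a dot-notation path (e.g. 'client.ipAddress') through nested dicts."""
    node = event
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def normalize(vendor, event):
    """Return only the fields we know how to map, under common names."""
    return {
        common: get_path(event, source)
        for source, common in VENDOR_MAPS[vendor].items()
        if get_path(event, source) is not None
    }

print(normalize("zoom", {"email": "alice@corp.com", "ip_address": "203.0.113.42"}))
print(normalize("okta", {"actor": {"alternateId": "alice@corp.com"},
                         "client": {"ipAddress": "198.51.100.7"}}))
```

Real normalization carries far more judgment than a lookup table, but the shape is the same: known source structure in, known common vocabulary out.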
Meet the Three Schemas
We're going to map a single Zoom event into ECS, UDM, and OCSF. But before we get to the mapping, it's worth understanding what makes each schema different: not the spec details, but the philosophy behind it and the tradeoffs that follow. These aren't competing products; they're different ways of expressing data understanding, each making a different call about where the hard judgment happens.
ECS: Search-First and Flexible
Cares most about: "Put it where analysts will look for it."
ECS uses flat dot-notation. Fields live in families: user.* for identity, source.* for network origin, host.* for device, event.* for classification.
The philosophy is search-first: put fields where analysts will look for them. If you're used to typing user.email: in Kibana, ECS makes that work across any data source.
ECS is also flexible. You pick the event classification. You decide whether to mirror fields (putting IP in both source.ip and client.ip for different query patterns). There's no schema police rejecting your data if you make unconventional choices.
That flexibility comes with a tradeoff: the same concept can validly live in multiple places, which is an ambiguity that both humans and agents have to resolve every time they query. Different teams or vendors can map the same source in different ways and both be "correct" by ECS standards. When that happens, you end up writing queries that account for multiple possible field locations for the same data, which is exactly the kind of complexity a schema is supposed to eliminate.
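Here's a rough sketch of what that ambiguity costs consumers in practice: a lookup that has to probe every place the IP might validly live. The field paths are real ECS names, but the candidate list and helper are illustrative assumptions:

```python
# Illustrative only: because ECS allows the same concept to validly live
# in several places, consumers end up probing all of them.
ECS_IP_CANDIDATES = ["source.ip", "client.ip", "host.ip"]

def first_ip(event):
    """Return the first IP found, checking every field it might live in."""
    for path in ECS_IP_CANDIDATES:
        node = event
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if node is not None:
            return node
    return None

# Two "correct" ECS mappings of the same source event:
mapped_by_team_a = {"source": {"ip": "203.0.113.42"}}
mapped_by_team_b = {"client": {"ip": "203.0.113.42"}}
```

Both mappings pass as valid ECS, so every downstream consumer inherits the probing logic.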
UDM: Role-Based and Structured
Cares most about: "Who did what to whom?"
UDM thinks in nouns: principal (who did it), target (what was affected), src, dst, observer. Every event is fundamentally an actor doing something to a target.
The philosophy is role-based: before you map a field, you have to answer "is this the actor or the target?" For a Zoom meeting, that's straightforward — the participant is the principal. For a firewall log, it gets more interesting.
UDM is more constrained than ECS. Event types come from an enum. Field paths are deeper and more verbose. The upside is semantic clarity: once you learn the principal/target model, both analysts and agents reason about any UDM data the same way. The cost is verbosity — field paths get long, and events that don't fit a clean actor/target model (like system health checks or passive telemetry) can require awkward workarounds.
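As a rough sketch, here's how the principal/target judgment might play out for a Zoom participant event. The paths are modeled on UDM's noun layout and the event type is a plausible enum choice; treat both as illustrative, not a validated mapping:

```python
# Illustrative sketch of UDM's principal/target judgment call for a Zoom
# participant event. Paths are modeled on UDM's noun layout; the
# event_type is a plausible enum choice, not a validated mapping.
def zoom_to_udm(p):
    return {
        "metadata": {"event_type": "USER_RESOURCE_ACCESS"},
        "principal": {  # the participant is the actor
            "user": {"userid": p["user_id"], "email_addresses": [p["email"]]},
            "ip": [p["ip_address"]],
            "asset": {"hostname": p["pc_name"]},
        },
        "target": {  # the meeting is what was acted on
            "resource": {
                "name": p["meeting"]["topic"],
                "product_object_id": str(p["meeting"]["id"]),
            },
        },
    }
```

Notice that the actor/target decision is baked into the mapping once, instead of being re-derived at query time.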
OCSF: Classification-First and Validated
Cares most about: "What category of event is this?"
OCSF thinks in classes. Every event belongs to exactly one class: Authentication (3002), Network Activity (4001), API Activity (6003), etc. The class you pick determines which fields exist and whether they're required, recommended, or optional.
The philosophy is classification-first: before you can map anything, you have to answer "what category of event is this?" That commitment unlocks structure. OCSF can validate that you've filled in required fields for your chosen class. The tradeoff is that the semantic choices happen upfront and are hard to change later.
OCSF also uses integer enums where ECS uses strings. activity_id: 1 instead of event.action: "logon". More precise, less readable.
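A small sketch of what those integer enums mean for consumers: every reader of the data carries lookup tables. The activity values shown are a small illustrative subset for the Authentication class; check the OCSF spec before relying on them:

```python
# Integer enums are compact but opaque; consumers carry lookup tables to
# read the data. An illustrative subset of activity values for the OCSF
# Authentication class (consult the spec for the full, current list).
AUTH_ACTIVITY = {0: "Unknown", 1: "Logon", 2: "Logoff"}

def describe(event):
    """Render an OCSF event's activity as a human-readable label."""
    name = AUTH_ACTIVITY.get(event.get("activity_id"), "Other")
    return f"class {event.get('class_uid')}: {name}"

print(describe({"class_uid": 3002, "activity_id": 1}))
```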
OCSF is newer and still evolving. Its class system gives you strong structure, but it gets tricky when one source spans multiple event types. With Zoom, for example, you may need to map participation data and QoS telemetry into different classes, which can lead to duplicated context (user/device repeated across events) and higher volume (sometimes up to ~3×). Since the class choice dictates the event shape and required fields, a wrong choice can force a remap later. And because many values are integer enums, the raw data is harder to interpret without reference tables.
Same Zoom Event, Three Translations
Let's come back to our Zoom example. Here's a single Zoom participant record from the QoS Summary API. It's a composed event that combines meeting participation data with network quality metrics. The QoS fields might look unusual for security data, but they turn out to be surprisingly useful for detection. Latency, jitter, and packet loss encode physical distance in ways that IP geolocation can't, which is how we used them to build a location anomaly detection.
{
"user_id": "abc123",
"user_name": "Alice Smith",
"email": "alice@corp.com",
"ip_address": "203.0.113.42",
"internal_ip_addresses": ["10.0.1.50"],
"os": "Win",
"os_version": "10.0.19045",
"pc_name": "ALICE-LAPTOP",
"mac_addr": "00:1A:2B:3C:4D:5E",
"join_time": "2024-01-15T14:30:00Z",
"leave_time": "2024-01-15T15:45:00Z",
"health": "good",
"qos": [
{
"type": "audio_input",
"details": {
"avg_latency": "126 ms",
"avg_jitter": "12 ms",
"avg_loss": "0.03%",
"avg_bitrate": "27.15 kbps"
}
}
],
"meeting": {
"id": 98765432101,
"uuid": "abc123xyz",
"topic": "Weekly Sync"
}
}

Note: Zoom’s QoS Summary returns quality metrics as strings with units (e.g., “126 ms”, “0.03%”). We parse these into proper numerics so aggregations, thresholds, and percentiles work correctly.
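That parsing step is a small but real chore. A minimal sketch, assuming the unit strings follow the pattern visible in the sample payload above:

```python
import re

# Parse Zoom's "126 ms" / "0.03%" / "27.15 kbps" strings into floats so
# thresholds and percentiles work. The units handled are the ones visible
# in the sample payload; anything unrecognized is passed through as-is.
UNIT_RE = re.compile(r"^\s*([\d.]+)\s*(ms|%|kbps)\s*$")

def parse_metric(value):
    if isinstance(value, str):
        m = UNIT_RE.match(value)
        if m:
            return float(m.group(1))
    return value

qos = {"avg_latency": "126 ms", "avg_jitter": "12 ms",
       "avg_loss": "0.03%", "avg_bitrate": "27.15 kbps"}
numeric = {k: parse_metric(v) for k, v in qos.items()}
```

A threshold like `avg_latency > 150` silently matches nothing against the string form; after parsing, it behaves as expected.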
The mappings below were built using Beacon's AI normalization agent, which handles the schema translation for each target format. Now let's see where each field lands.
Not all fields are always available. Zoom's API may not return participant emails for guests or external users, and some fields depend on account-level privacy settings.
This is the judgment surface for one event from one source. Each row is a semantic decision: what does this field mean, where does it belong, what type should it be. Multiply by every vendor in your environment.
Why This Matters: Cross-Source Detection
The schema comparison above might feel academic — who cares whether the email lives in user.email or principal.user.email_addresses? It matters the moment you try to combine sources.
Say you want to add Okta logs alongside your Zoom data. Okta captures something Zoom doesn't: where users authenticated from. Now you can ask a question neither source can answer alone: does the user's Zoom network behavior match where they logged in?
Without a unified schema, you're translating inside every query:
-- Without unified schema: different field names, different paths
SELECT *
FROM okta_logs o
JOIN zoom_sessions z
ON o.actor.alternateId = z.participant_email
WHERE o.client.ipAddress != z.ip_address
AND o.published BETWEEN z.join_time AND z.leave_time

Every new source means learning another vendor's naming conventions. Now imagine doing that across ten or twenty sources.
With a unified schema, the join is implicit:
-- With unified schema: same anchors across sources
SELECT *
FROM events
WHERE user.email = 'alice@corp.com'
AND event.module IN ('okta', 'zoom')
AND @timestamp BETWEEN '2024-01-15T09:00:00Z' AND '2024-01-15T10:00:00Z'
Both sources use user.email for identity, source.ip for network origin, @timestamp for time. Each new source you add works the same way: same anchors, same query patterns.
This matters for two reasons, and they're converging.
For analysts, a unified schema means writing one query instead of learning every vendor's conventions. Investigations that span five sources don't require five different field dictionaries. You can layer in HR data, VPN logs, endpoint telemetry. Each source adds signals, but only if user.email means the same thing everywhere.
For AI agents, the impact is more direct. An agent querying normalized data works with a known schema. It doesn't need vendor-specific context in its prompt, doesn't carry field mapping tables, doesn't burn tokens re-deriving what each field means. Instead of figuring out that actor.alternateId, participant_email, and UserName all refer to the same person, the agent sees user.email everywhere and reasons about behavior across sources. Normalized data doesn't make agents smarter; it removes an entire class of errors they're prone to with raw data.
This is also why we apply AI to normalization itself. The semantic judgment needed to map vendor data correctly is exactly the kind of work that benefits from specialized AI.
Two Use Cases for AI-Powered Normalization
There are two ways to bring normalization intelligence to your security data, and they serve different needs.
Normalized at ingest (stored). Translate data into a common schema before it hits storage, so everything that flows through your SIEM is already clean, correctly typed, and ready for detections and correlation rules. This is Beacon's core. Your detections fire reliably because user.email always means the same thing. Your correlation rules work because timestamps are in the same format and IP roles are consistent.
Normalized on read (ad hoc). Not every organization has a clean pipeline for every source. Sometimes the data is already sitting in a lake, raw and messy, and you need answers now, not after building a full pipeline. Beacon's AI assistant brings schema intelligence to the query layer, navigating unnormalized data and surfacing what you need on the fly.
Both depend on the same foundation: deep knowledge of security schemas and the entities within them.
So Which Schema Should You Be On?
In practice, you rarely choose a schema. You inherit one from the platform you build on. Every detection, dashboard, enrichment pipeline, and correlation rule you build is coupled to that schema's field paths and conventions.
That coupling also goes deeper than field names. UDM and OCSF enforce type correctness at ingestion: send a string where an integer is expected and the event gets rejected or dropped. ECS is more forgiving — it'll accept the wrong type without complaint, but your query will silently return no results. Either way, the failure mode is the same: detections that look right but don't fire.
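A minimal sketch of the kind of pre-ingest type check that catches this failure mode early; the field list and expected types are an illustrative subset, not any particular schema's contract:

```python
# Illustrative guard: verify field types against the target schema before
# shipping, so a string-typed latency doesn't silently break detections.
EXPECTED_TYPES = {            # illustrative subset of a target schema
    "user.email": str,
    "source.ip": str,
    "zoom.qos.avg_latency_ms": float,
}

def type_errors(flat_event):
    """Return (field, expected, actual) for every mistyped field."""
    return [
        (field, expected.__name__, type(flat_event[field]).__name__)
        for field, expected in EXPECTED_TYPES.items()
        if field in flat_event and not isinstance(flat_event[field], expected)
    ]

bad = {"user.email": "alice@corp.com", "zoom.qos.avg_latency_ms": "126 ms"}
print(type_errors(bad))  # the latency arrives as a string, not a float
```

Whether your schema rejects the event or silently accepts it, a check like this turns an invisible query failure into a visible pipeline error.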
Understanding these tradeoffs matters in two places. First, if you're evaluating a SIEM or data lake, the schema should be part of the decision. Prefer one with broad security coverage, an active community, and a track record of keeping up with new source types. Second, consider whether you want to own the mapping layer instead of depending on your SIEM vendor's defaults. Most out-of-the-box mappings are incomplete or inconsistent across sources, and for a source like Zoom, there's usually no default mapping at all.
But the schema itself matters less than the quality of understanding behind the mapping. Get that right, and you can re-target to any schema — or to whatever comes after schemas.
Bridging the Gap
Most teams inherit a schema from their SIEM but don't control the mapping quality. Default mappings are incomplete, inconsistent, or, for sources like Zoom, missing entirely. The semantic work either doesn't get done, or gets done manually and inconsistently.
Beacon applies specialized AI to the hard part: the semantic understanding of security data. What role does each field play? What type does each schema expect? Where do vendor conventions diverge from schema intent? This isn't a generalist agent; it's purpose-built intelligence trained on the domain. Today that powers schema normalization from any source to any destination, including sources your SIEM doesn't natively support.
That mapping layer also decouples you from your SIEM choice. If you're happy with your SIEM, you get better data quality without changing anything else. If you're migrating, you re-target the mappings instead of rebuilding every detection and dashboard from scratch. And if you're routing data to multiple destinations (a SIEM and a data lake, for example) each one gets the data in its native format.
We'll go deeper on how this works, and what it unlocks beyond normalization, in an upcoming post.
Want to see it in action? Get in touch!

