Sunday, June 7, 2026

Reading the Wire — Protobuf Without a Map

Moin! 👋

Protobuf turns up constantly in DFIR work. Android apps, iOS apps, Chrome internals, sync engines, health databases — wherever Google's tooling reaches, Protobuf follows. And yet it is one of those formats where a lot of examiners open the hex, see a wall of binary, and move on. That is understandable. Without the schema, Protobuf does not announce what it is holding.

This post is a deep dive into what Protobuf actually is at the wire level — byte by byte — and what you can and cannot recover from it without access to the original .proto definition. Where tool output is shown, it comes from crush — an open-source DFIR workbench I develop in my personal time. The forensic concepts apply regardless of which tool you use.

What Is Protobuf?

Protocol Buffers (Protobuf) is a binary serialisation format developed by Google. The design goal is compact, fast, schema-driven serialisation — the opposite of a human-readable format like JSON or XML. A .proto file defines the message structure: field names, types, and field numbers. The compiled schema is used by both the writer (to encode) and the reader (to decode). Without the schema, you get the wire format — and the wire format is deliberately sparse.

That sparseness is the core forensic problem. The wire format encodes field numbers and wire types, but not field names, not semantic types beyond a handful of primitives, and not the structure of nested messages. A value of 1 in a varint field could be a boolean true, an enum value, an integer count, or a Unix timestamp in seconds — the wire format cannot tell you which.

What makes Protobuf interesting for forensics is that every field is self-delimiting, even though the message as a whole is not. A decoder that does not know the schema can still walk the byte stream, identify field boundaries, and extract raw values. That is exactly what schema-less decode tools, including protoc --decode_raw and the crush BLOB Inspector, do.

The Wire Format — From the Ground Up

A serialised Protobuf message is a sequence of fields. There is no message header, no length prefix for the overall message, no magic bytes. The stream starts immediately with the first field. Each field has two components: a tag and a payload.

The Tag Byte

The tag encodes two things in a single varint: the field number and the wire type. It is constructed as:

tag = (field_number << 3) | wire_type

The three low-order bits carry the wire type (values 0–5). The remaining bits carry the field number. There are six wire types in use:

Wire Type        | Value | Used For
Varint           | 0     | int32, int64, uint32, uint64, sint32, sint64, bool, enum
64-bit           | 1     | fixed64, sfixed64, double
Length-delimited | 2     | string, bytes, embedded messages, packed repeated fields
Start group      | 3     | proto2 only, deprecated; group and contents silently skipped by crush
End group        | 4     | proto2 only, deprecated; consumed as part of group skip
32-bit           | 5     | fixed32, sfixed32, float

So a tag byte of 0x08 decodes as: 0x08 = 0b00001000 → low 3 bits = 000 = wire type 0 (Varint), remaining bits = 00001 = field number 1. A tag of 0x12 = 0b00010010 → wire type 2 (Length-delimited), field number 2.

Tags themselves are encoded as varints, which matters when field numbers exceed 15 (the tag no longer fits in a single byte).
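
The tag arithmetic can be sketched in a few lines of Python. The helper names split_tag and make_tag are illustrative only, and the sketch assumes the tag varint has already been decoded to an integer.

```python
def split_tag(tag):
    """Split a decoded tag value into (field_number, wire_type)."""
    return tag >> 3, tag & 0x07


def make_tag(field_number, wire_type):
    """Inverse of split_tag: combine field number and wire type."""
    return (field_number << 3) | wire_type


print(split_tag(0x08))   # (1, 0): field 1, varint
print(split_tag(0x12))   # (2, 2): field 2, length-delimited
print(make_tag(15, 0))   # 120: still fits in one varint byte
print(make_tag(16, 0))   # 128: needs a second varint byte
```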

Varints — The Core Encoding

Varint is the most important encoding to understand, because it is used for both tags and for all integer-typed field values. The encoding is variable-length little-endian with a continuation bit:

  • Each byte contributes 7 bits of value data.
  • The most significant bit (MSB) is the continuation flag: 1 means another byte follows; 0 means this is the last byte.
  • Bytes are in little-endian order — the first byte holds the least significant 7 bits.

Let us decode 0x96 0x01 step by step:

Byte 1: 0x96 = 1001 0110
        MSB = 1 → continuation, more bytes follow
        Value bits: 001 0110 = 0x16 = 22 (least significant 7 bits)

Byte 2: 0x01 = 0000 0001
        MSB = 0 → final byte
        Value bits: 000 0001 = 0x01 (next 7 bits)

Assembled (little-endian 7-bit groups):
  [ 000 0001 ] [ 001 0110 ]
   bits 13–7     bits 6–0

Result: 0b 0000001 0010110 = 0x96 & 0x7F | (0x01 << 7)
      = 22 | 128 = 150

So 0x96 0x01 is the varint encoding of 150. A single-byte varint (MSB = 0) encodes values 0–127 directly. Values from 128–16383 require two bytes. The practical maximum for a 64-bit varint is 10 bytes.
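
A minimal Python sketch of the same decoding loop follows. The function name read_varint and its (value, next_pos) return shape are choices made for this illustration, not taken from any particular tool.

```python
def read_varint(buf, pos=0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift      # low 7 bits carry data
        if not byte & 0x80:                  # MSB clear: this was the last byte
            return value & 0xFFFFFFFFFFFFFFFF, pos
        shift += 7
        if shift >= 70:                      # an 11th byte would be needed
            raise ValueError("varint longer than 10 bytes")


print(read_varint(bytes.fromhex("9601")))    # (150, 2)
```

Returning the next position alongside the value makes it easy to chain calls while walking a buffer.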

A Complete Field Walkthrough

Take this 8-byte sequence:

08 96 01 12 03 74 65 73

Breaking it down field by field:

Field 1:
  Tag:   0x08 = wire type 0 (Varint), field number 1
  Value: 0x96 0x01 → varint → 150

Field 2:
  Tag:   0x12 = wire type 2 (Length-delimited), field number 2
  Length: 0x03 → 3 bytes follow
  Data:  0x74 0x65 0x73 → UTF-8 → "tes"

Without a schema, we know: field 1 holds the integer 150, field 2 holds 3 bytes that happen to be valid UTF-8. We do not know if field 1 is a count, a status code, an enum, or a timestamp. We do not know if field 2 is a string, a serialised sub-message, or arbitrary bytes. That ambiguity is inherent — and unavoidable.
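
The walkthrough above can be reproduced with a small top-level field walker. This is a sketch under simplifying assumptions: the names read_varint and walk are hypothetical, group wire types are not handled, and nothing is interpreted beyond the raw value.

```python
def read_varint(buf, pos):
    """Decode one varint; an IndexError signals a truncated buffer."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def walk(buf):
    """Yield (field_number, wire_type, raw_value) for each top-level field."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field, wtype = tag >> 3, tag & 7
        if wtype == 0:                       # varint
            value, pos = read_varint(buf, pos)
        elif wtype == 1:                     # 8 raw bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wtype == 2:                     # length prefix, then payload
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wtype == 5:                     # 4 raw bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"wire type {wtype} not handled in this sketch")
        if pos > len(buf):
            raise ValueError("field runs past end of buffer")
        yield field, wtype, value


for field in walk(bytes.fromhex("0896011203746573")):
    print(field)
# (1, 0, 150)
# (2, 2, b'tes')
```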

ZigZag Encoding — Signed Integers

Varint is efficient for small positive integers. Negative integers are a problem: in two's complement, -1 is 0xFFFFFFFF for int32, and because Protobuf sign-extends negative values to 64 bits before encoding, it takes 10 bytes as a varint. Protobuf offers two ways of handling signed values:

int32 / int64: negative values are sign-extended to 64 bits and then varint-encoded. -1 encodes as 10 bytes. Inefficient, so these types are best kept for fields that rarely hold negative values.

sint32 / sint64: ZigZag encoding maps signed integers to unsigned integers such that small-magnitude values — both positive and negative — produce small varints. The mapping is:

ZigZag(n) = (n << 1) ^ (n >> 31)   // sint32
ZigZag(n) = (n << 1) ^ (n >> 63)   // sint64

 0 → 0
-1 → 1
 1 → 2
-2 → 3
 2 → 4
-3 → 5

To decode: if the raw varint value is n, the ZigZag-decoded signed value is (n >> 1) ^ -(n & 1).

The forensic implication cuts both ways. If a schema-less decoder shows an implausibly large varint (e.g., 18446744073709551615 where you expected something small), the field may well be an int32 or int64 holding -1, sign-extended to 64 bits. Conversely, a small raw value may be ZigZag-encoded: a raw 1 in a sint32 field means -1, and a raw 3 means -2. Without the schema, you cannot tell; you can only note the raw value and be aware of the possibility.
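
The two signed readings of a raw varint can be compared side by side in Python. This is a minimal sketch: the helper names are illustrative, and Python's unbounded integers are masked to 64 bits by hand.

```python
MASK64 = (1 << 64) - 1


def zigzag_encode(n):
    """Map a signed 64-bit integer to its unsigned ZigZag form (sint64)."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(u):
    """Inverse mapping: raw unsigned varint value back to a signed integer."""
    return (u >> 1) ^ -(u & 1)


def as_int64(u):
    """Reinterpret a raw varint value as a two's-complement int64."""
    return u - (1 << 64) if u >= 1 << 63 else u


for n in (0, -1, 1, -2, 2, -3):
    print(n, "->", zigzag_encode(n))         # 0, 1, 2, 3, 4, 5

print(as_int64(18446744073709551615))        # -1 (plain int32/int64 reading)
print(zigzag_decode(1))                      # -1 (sint32/sint64 reading)
```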

64-bit and 32-bit Fixed-Width Fields

Wire type 1 (64-bit) and wire type 5 (32-bit) are fixed-width little-endian. They are used for double, float, fixed64, sfixed64, fixed32, and sfixed32. Because the width is fixed, no varint encoding is involved — the bytes are read directly.

A forensically common case: Unix timestamps stored as double (wire type 1, 8 bytes, IEEE 754 double precision) or as fixed64 (wire type 1, 8 bytes, unsigned 64-bit integer). Both look identical at the wire level. A schema-less decoder will show you the raw 8 bytes and typically interpret them as a 64-bit integer — it cannot know whether you should read it as a double instead.
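
To make the ambiguity concrete, the same 8 bytes can be unpacked both ways with Python's struct module. The sample value is field 2 from the SEGB payload shown later in this post. The helper name read_fixed64 is hypothetical, and the Cocoa conversion assumes the double counts seconds since 2001-01-01 UTC.

```python
import struct
from datetime import datetime, timedelta, timezone

COCOA_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def read_fixed64(raw):
    """Return both readings of one little-endian wire type 1 value."""
    (as_uint,) = struct.unpack("<Q", raw)
    (as_double,) = struct.unpack("<d", raw)
    return {"uint64": as_uint, "double": as_double}


raw = struct.pack("<Q", 4739523391776620544)  # the 8 bytes as they sit on the wire
views = read_fixed64(raw)
print(views["uint64"])                        # 4739523391776620544
print(views["double"])                        # 743887831.0
print(COCOA_EPOCH + timedelta(seconds=views["double"]))
```

Neither reading is wrong at the wire level; only context says which one the application meant.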

Length-Delimited Fields

Wire type 2 covers strings, byte arrays, embedded sub-messages, and packed repeated fields. The format is:

[tag: varint] [length: varint] [payload: length bytes]

A packed repeated field is a wire type 2 field where the payload itself is a concatenated sequence of varints or fixed-width values — no tags between them. This is how arrays of integers are efficiently encoded. In a schema-less decoder, a packed repeated field is indistinguishable from a byte string or an embedded message without attempting to parse the payload.
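
A short Python sketch shows how a packed varint payload is read, and why it is ambiguous. The helper names are illustrative. The first payload is the packed example from the official encoding documentation; the second is simply the string "tes".

```python
def read_varint(buf, pos):
    """Decode one varint; an IndexError signals a truncated buffer."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def decode_packed_varints(payload):
    """Read a length-delimited payload as back-to-back varints, no tags."""
    values, pos = [], 0
    while pos < len(payload):
        value, pos = read_varint(payload, pos)
        values.append(value)
    return values


print(decode_packed_varints(bytes.fromhex("038e029ea705")))  # [3, 270, 86942]
print(decode_packed_varints(b"tes"))                         # [116, 101, 115]
```

The second call succeeds just as cleanly as the first, which is the point: any ASCII string is also a syntactically valid run of single-byte varints.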

An embedded message is also wire type 2. Its payload is itself a valid Protobuf message. A schema-less decoder can recurse into it and decode it as a nested message — this is what protoc --decode_raw does, and what crush does in its Protobuf decode mode. However, whether a length-delimited payload actually is a valid sub-message, or just happens to parse as one, cannot be determined from the wire format alone.

What You Get Without a Schema

With a solid understanding of the wire format, the limits of schema-less decoding become precise. The table below reflects what is recoverable from the wire format alone — and what remains genuinely ambiguous even after decoding:

What you can recover | What remains ambiguous
Field numbers | Field names
Wire types (varint / 64-bit / length-delimited / 32-bit) | Enum value names and meaning
All numeric interpretations of a varint simultaneously: uint64, int64, sint64 (ZigZag), bool (if 0/1), Unix timestamp (if in range), Chrome/WebKit µs timestamp (if in range) | Which of those interpretations the application actually wrote
All numeric interpretations of fixed-width fields: uint64/int64/double/timestamps for wire type 1; uint32/int32/float/timestamp for wire type 5 | Which numeric type was intended
Nested message structure (recursive decode) | Whether a length-delimited field is a string, bytes, sub-message, or packed repeated; crush applies a nested-first heuristic to choose the most plausible interpretation (described below)
Message field boundaries | Whether optional fields are absent vs. set to their default value

The practical upshot: schema-less decoding gives you structure and all plausible numeric interpretations. What it cannot give you is the correct one. Context — from the application, the schema, or known app behaviour — is always the final arbiter.

Finding the Schema

Before treating a Protobuf blob as permanently opaque, it is worth checking whether the schema is publicly available — but realistically, for most apps encountered in casework, it is not. Protobuf schemas are internal implementation details. Unless the application is open source, or the developer has explicitly published the .proto files, you are working without one.

The exceptions worth checking:

  • Chrome / Chromium: Chrome sync, history, and IndexedDB schemas are in the open-source Chromium repository. The sync protocol definitions are under components/sync/protocol/ — this is one of the most complete publicly available Protobuf schema sets relevant to forensics.
  • Android (AOSP): system-level Android components that are part of AOSP sometimes have schemas in the source. This does not extend to Google apps distributed separately through the Play Store — including Device Health Services, which is a Google app, not an AOSP component.
  • Open-source apps: if the app is open source, the .proto files are usually in the repository. Signal Desktop is a notable example.
  • Embedded descriptors in APKs: some apps ship compiled Protobuf descriptors (.pb files) inside the APK or app bundle. Extracting and decompiling the APK and searching for descriptor.pb or .proto.bin files is worth attempting — but success depends entirely on whether the developer included them, and many do not.

For everything else — which is most things — schema-less decode is what you have. The goal then shifts: extract as much structure as possible from the wire format, surface all plausible interpretations, and document the basis for any interpretation that ends up in a report.

Protobuf in crush

Unlike SQLite or binary plist, Protobuf cannot be detected automatically from the data itself — there are no magic bytes, no file extension convention, no self-describing header. Whether a blob is Protobuf or not is something only the examiner can know, based on context: the app, the table, the column name, prior research. crush therefore does not attempt to auto-classify Protobuf in the file panel or database viewer. Instead, it provides two explicit entry points.

The first is the BLOB Inspector, accessible by right-clicking any binary value in the SQLite viewer or LevelDB viewer and selecting Inspect Value… (or Inspect Key…). In Auto mode the inspector attempts several format detections via magic bytes: binary plist, PNG, JPEG, and others. Protobuf has no magic bytes to match on, so Protobuf mode is selected manually when context tells you what you are looking at. Importantly, the BLOB Inspector goes beyond what protoc --decode_raw produces: for every numeric field it surfaces all plausible interpretations inline (ZigZag-decoded values, Cocoa and Unix timestamps, bool) directly in the output. No separate decode pass required.

The second is the dedicated Protobuf Viewer, invoked explicitly from the context menu on any file or blob. This opens the full Tree Viewer interface — an expandable tree structure designed for working through larger or more complex messages field by field. The Tree Viewer shows all interpretations for every field including uint64 and uint32; the BLOB Inspector omits these from the inline annotations since they already appear as the primary field value, keeping the output less cluttered for quick triage.

The BLOB Inspector

Protobuf blobs surface in crush anywhere binary data appears: SQLite BLOB columns, LevelDB values, or any file opened in the hex viewer. The BLOB Inspector handles all of these through a consistent workflow.

To invoke it: right-click any cell or record that contains binary data and select Inspect Value… (or Inspect Key… in the LevelDB viewer). The inspector opens as a dialog with a mode selector at the top.

In Auto mode, the inspector checks for magic bytes first — binary plist (62706C69 73743030), PNG (89504E47), JPEG (FFD8FF), and others. If none match, the blob is displayed in hex. Protobuf mode must be selected manually — there are no magic bytes to detect it on.
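
Magic-byte detection of this kind amounts to a prefix comparison. The following Python sketch illustrates the idea only and is not crush's implementation; the names MAGICS and sniff are hypothetical, and the signature list is limited to the three formats named above.

```python
MAGICS = [
    (bytes.fromhex("62706C6973743030"), "binary plist"),   # "bplist00"
    (bytes.fromhex("89504E47"), "PNG"),
    (bytes.fromhex("FFD8FF"), "JPEG"),
]


def sniff(blob):
    """Return the first format whose magic bytes prefix the blob."""
    for magic, name in MAGICS:
        if blob.startswith(magic):
            return name
    return "unknown (show hex)"


print(sniff(b"bplist00\x01\x02"))                 # binary plist
print(sniff(bytes.fromhex("0896011203746573")))   # unknown (show hex)
```

A Protobuf blob falls through to the last branch every time, because its first byte is just the tag of whichever field happens to come first.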

The schema-less output surfaces field numbers, wire types, nested message structure, and — beyond what protoc --decode_raw produces — all plausible numeric interpretations inline. Here is a real example: a Biome SEGB entry payload from an iOS acquisition, opened in the BLOB Inspector with Protobuf mode selected manually:

1 {
  1: "/app/inFocus"
  2 {
    1 [varint]: 2
    # sint64 (zigzag): 1
    2 [varint]: 6584185901589580638
    # sint64 (zigzag): 3292092950794790319
  }
}
2 [fixed64]: 4739523391776620544
# double: 743887831.0
# Cocoa timestamp: 2024-07-28 19:30:31 UTC
3 [fixed64]: 4739523391826952192
# double: 743887837.0
# Cocoa timestamp: 2024-07-28 19:30:37 UTC
4 {
  1 {
    1 [varint]: 2
    # sint64 (zigzag): 1
    2 [varint]: 6584185901589580638
    # sint64 (zigzag): 3292092950794790319
  }
  3: "com.apple.MobileSMS"
}
5: "2E565DD6-9B69-448D-B374-AAB614442F03"
7 {
  1: <>
  2 {
    1 {
      1 [varint]: 2
      # sint64 (zigzag): 1
      2 [varint]: 0
      # sint64 (zigzag): 0
      # bool: false
    }
    3: "com.apple.SpringBoard.transitionReason.homescreen"
  }
  3 [varint]: 6002
  # sint64 (zigzag): 3001
}
7 {
  1: <>
  2 {
    1 {
      1 [varint]: 2
      # sint64 (zigzag): 1
      2 [varint]: 0
      # sint64 (zigzag): 0
      # bool: false
    }
    3: "14.0"
  }
  3 [varint]: 6007
  # sint64 (zigzag): -3004
}
7 {
  1: <>
  2 {
    1 {
      1 [varint]: 2
      # sint64 (zigzag): 1
      2 [varint]: 0
      # sint64 (zigzag): 0
      # bool: false
    }
    3: "1262.400.41.2.3"
  }
  3 [varint]: 6008
  # sint64 (zigzag): 3004
}
8 [fixed64]: 4739523391831073926
# double: 743887837.491349
# Cocoa timestamp: 2024-07-28 19:30:37 UTC
10 [varint]: 18446744073709537216
# int64: -14400
# sint64 (zigzag): 9223372036854768608

Several things are immediately readable without a schema. Fields 2, 3, and 8 are fixed64 — the BLOB Inspector detects the double value falls in the Cocoa epoch range and surfaces the decoded timestamp alongside the raw bytes: 2024-07-28 19:30:31 through 19:30:37 UTC. Field 4.3 contains "com.apple.MobileSMS" — the app in focus. Field 5 is a UUID identifying this entry. Field 7 is a repeated embedded message — it appears three times, each carrying a string value: the homescreen transition reason, an iOS version ("14.0"), and an app build version ("1262.400.41.2.3"). All three surface as plain strings without any schema knowledge.

What remains uncertain: the large varint in field 1.2.2 (6584185901589580638), which recurs in field 4.1.2, does not resolve to anything meaningful and is likely an internal identifier or hash. Field 10 decodes as int64: -14400; appearing consistently across entries, this looks like a UTC offset in seconds (−4 hours / EDT), but that requires contextual confirmation. The small varints with raw value 2 (annotated sint64 (zigzag): 1) that recur throughout are plausibly type or state flags; the schema would be needed to confirm. Field 7.1 is empty bytes in all three repeated blocks.

This is a realistic picture of what schema-less decode gives you: timestamps, app identifiers, version strings, and transition reasons surface immediately; internal identifiers and small integers require inference or schema cross-reference.

The Protobuf Viewer — Navigating the Tree

The Protobuf Viewer is invoked via the context menu on any file or blob — not on an individual cell value. It presents the same schema-less decode and the same multi-interpretation output as the BLOB Inspector, but in an expandable tree structure rather than a flat inline view. The difference is navigation, not capability: the BLOB Inspector shows everything at once, annotations inline; the Tree Viewer lets you work through the message field by field and expand only what you need.

Each field in the tree is expandable. Expanding a numeric field reveals every valid interpretation for its wire type — the same set described in the Interpretation Rules tables below. Take field 10 from the SEGB payload above. In the flat BLOB Inspector output you already see the annotations inline, but in the Tree Viewer it renders as:

10 ▼ varint
    uint64:               18446744073709537216
    int64:                -14400
    sint64 (ZigZag):      9223372036854768608
    bool:                 —  (not 0 or 1, not shown)
    Unix timestamp (s):   —  (out of range, not shown)

The three interpretations are visible simultaneously. int64: -14400 is the one that makes contextual sense — a UTC offset in seconds. The ZigZag value is a large meaningless number, the timestamp check fails. The examiner decides; the tree just ensures nothing is hidden. For a large message with many repeated fields, being able to collapse and expand individual subtrees makes this significantly more practical than scrolling through a flat output.

[Figure: Tree Viewer showing a Protobuf message with a varint field expanded, displaying all interpretations]

Interpretation Rules

The following tables document exactly which interpretations crush surfaces per wire type, and under what conditions. These apply to the Tree Viewer; the BLOB Inspector shows the same set minus uint64 and uint32, which are already shown as the primary field value.

Wire Type 0 — varint

Interpretation               | Condition
uint64                       | always
int64                        | only when value ≥ 2⁶³ (negative as int64)
sint64 (ZigZag)              | always
bool                         | only when value = 0 or 1
Unix timestamp (s)           | 946 684 800 ≤ value ≤ 4 102 444 800 (2000–2100)
Chrome/WebKit timestamp (µs) | 12 591 158 400 000 000 ≤ value ≤ 15 778 800 000 000 000
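
The varint rules translate almost directly into code. The Python sketch below is an independent illustration of the conditions in the table, not crush's source; the function name interpret_varint and the dictionary keys are hypothetical.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
WEBKIT_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def interpret_varint(u):
    """Collect every reading of a raw varint value that passes its range check."""
    out = {"uint64": u, "sint64": (u >> 1) ^ -(u & 1)}
    if u >= 1 << 63:                                   # negative as int64
        out["int64"] = u - (1 << 64)
    if u in (0, 1):
        out["bool"] = bool(u)
    if 946_684_800 <= u <= 4_102_444_800:              # seconds, 2000 to 2100
        out["unix_s"] = UNIX_EPOCH + timedelta(seconds=u)
    if 12_591_158_400_000_000 <= u <= 15_778_800_000_000_000:
        out["webkit_us"] = WEBKIT_EPOCH + timedelta(microseconds=u)
    return out


for name, value in interpret_varint(18446744073709537216).items():
    print(f"{name}: {value}")
# uint64: 18446744073709537216
# sint64: 9223372036854768608
# int64: -14400
```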

Wire Type 1 — fixed64 (8 bytes)

Interpretation               | Condition
uint64                       | always
int64                        | only when negative
double                       | when not NaN and not ±inf
Cocoa timestamp              | double not NaN/inf AND 0 < double ≤ 3 155 673 600 (2001–2101)
Unix timestamp (double, s)   | double not NaN/inf AND 946 684 800 ≤ double ≤ 4 102 444 800
Unix timestamp (uint64, s)   | 946 684 800 ≤ uint64 ≤ 4 102 444 800
Chrome/WebKit timestamp (µs) | 12 591 158 400 000 000 ≤ uint64 ≤ 15 778 800 000 000 000

Wire Type 5 — fixed32 (4 bytes)

Interpretation             | Condition
uint32                     | always
int32                      | only when negative
float                      | when not NaN and not ±inf
Unix timestamp (uint32, s) | 946 684 800 ≤ uint32 ≤ 4 102 444 800

Loading a Schema

If you have located the .proto definition for the data you are examining, the Protobuf Viewer can load it directly, either as a raw .proto source file or as a compiled FileDescriptorSet (.pb), produced with protoc --descriptor_set_out. With a schema loaded, field numbers resolve to names, enum values carry their labels, and type ambiguity is eliminated. Schema loading is only available in the Protobuf Viewer, not in the BLOB Inspector, but I am actively working on the BLOB Inspector of crush, so look out for updates ;-).

Protobuf in Non-Obvious Places

Protobuf does not only appear as standalone values in databases or LevelDB stores. A few places worth checking on acquisitions where it is easy to miss:

  • SQLite BLOB columns: any column typed as BLOB in an app database is a candidate. Android health apps, sync clients, and messaging apps (including Signal, which has historically stored some internal state as serialised Protobuf) are worth checking.
  • LevelDB values: as covered in Deep Dive #3, Chrome Sync Data is almost entirely Protobuf. Chrome IndexedDB values are serialised in Chromium's own V8-based format rather than Protobuf, but they can carry Protobuf-encoded byte buffers depending on the web application.
  • Raw files in app containers: some apps write Protobuf directly to files — no database wrapper. Common in Android under /data/data/<package>/files/ or /data/data/<package>/cache/. These have no extension convention.
  • Apple Biome / SEGB files (iOS): this one surprises people. Apple's Biome framework stores behavioural telemetry — app usage, screen time, notifications, location visits — in SEGB files under /private/var/mobile/Library/Biome/. SEGB is a proprietary Apple format with no public specification, but its structure has been reverse-engineered: each entry has a fixed binary header (magic bytes, timestamps, flags) followed by a Protobuf-encoded payload. The header is Apple-proprietary; the payload follows the standard Protobuf wire format exactly.

    In crush, opening a SEGB file in the SEGB viewer triggers Decode from Table automatically — the viewer knows the SEGB structure and parses header fields and entry boundaries without any manual intervention. What it does not do automatically is decode the Protobuf payload inside each entry, because there is no way to detect Protobuf without context. To examine the payload, right-click the entry and open it in the BLOB Inspector, then select Protobuf mode manually. The real SEGB payload shown earlier in this post came from exactly that workflow.

    The takeaway: Protobuf is not just an Android or Chrome concern. It is embedded inside Apple's own proprietary formats, one layer down. If you are already examining SEGB files, you are already working with Protobuf — just with a header in front of it. Cross-referencing the output against Chris Vance's research on Biome (https://blog.d204n6.com/2022/09/ios-16-now-you-c-it-now-you-dont.html) gives partial field mappings for several known stream types, which is the closest thing to a public schema available.

  • Network captures: not a crush use case, but Protobuf is common in gRPC traffic and some REST APIs that use application/x-protobuf. Worth keeping in mind for network forensics.

A Note on Heuristic Detection

Unlike binary plists (magic bytes 62706C697374) or PNG (magic bytes 89504E47), Protobuf has no magic bytes. Any sequence of bytes that is internally consistent with the wire format grammar will pass a Protobuf heuristic check. This means false positives are possible — particularly with short blobs, or with data that happens to contain valid-looking varint sequences.

The heuristic crush applies when parsing a blob as Protobuf is conservative: the entire byte sequence must parse without encountering any wire type value outside 0–5. Values outside that range (6 and 7) have no defined meaning and indicate either corrupt data or a format that is not Protobuf. Any length-delimited field claiming a length that would exceed the remaining bytes also aborts the parse immediately.

Wire types 3 and 4 — the deprecated proto2 group start and end markers — are handled differently. Rather than treating them as parse errors, crush skips group fields entirely: the group and its contents are silently consumed and parsing continues with the next field. Group fields do not appear in the Tree Viewer or BLOB Inspector output. This is the correct forensic trade-off: aborting on wire type 3 would cause an entire blob to be reported as non-Protobuf simply because it contains a legacy field type that still appears in older proto2-serialised data and some Google-internal formats. Silent skip means you see the fields that can be decoded; you do not lose the whole message. The limitation is worth knowing: if a blob contains exclusively group fields, the output will be empty — which looks identical to a failed parse.
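
The checks described in this section can be sketched as a strict walk over the buffer. This is an independent Python illustration of the stated rules, not crush's code: the function names are hypothetical, and treating an empty buffer as "not Protobuf" is an assumption of this sketch.

```python
def read_varint(buf, pos):
    """Decode one varint; an IndexError signals a truncated buffer."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def skip_group(buf, pos, group_field):
    """Consume fields up to and including the matching end-group tag."""
    while True:
        tag, pos = read_varint(buf, pos)
        field, wtype = tag >> 3, tag & 7
        if wtype == 4:
            if field != group_field:
                raise ValueError("mismatched end-group tag")
            return pos
        pos = skip_value(buf, pos, field, wtype)


def skip_value(buf, pos, field, wtype):
    """Advance past one field payload, validating it on the way."""
    if wtype == 0:
        _, pos = read_varint(buf, pos)
    elif wtype == 1:
        pos += 8
    elif wtype == 2:
        length, pos = read_varint(buf, pos)
        pos += length
    elif wtype == 3:
        pos = skip_group(buf, pos, field)    # group and contents are skipped
    elif wtype == 5:
        pos += 4
    else:                                    # 6, 7, or a stray end-group
        raise ValueError(f"unexpected wire type {wtype}")
    if pos > len(buf):
        raise ValueError("field exceeds remaining bytes")
    return pos


def looks_like_protobuf(buf):
    """True only if the whole buffer parses under the rules above."""
    pos = 0
    try:
        while pos < len(buf):
            tag, pos = read_varint(buf, pos)
            pos = skip_value(buf, pos, tag >> 3, tag & 7)
    except (ValueError, IndexError):
        return False
    return len(buf) > 0                      # assumption: empty input is rejected


print(looks_like_protobuf(bytes.fromhex("0896011203746573")))  # True
print(looks_like_protobuf(bytes.fromhex("1b08011c1002")))      # True, group skipped
print(looks_like_protobuf(bytes.fromhex("0e01")))              # False, wire type 6
print(looks_like_protobuf(bytes.fromhex("12057465")))          # False, length overrun
```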

There is a subtler heuristic problem specific to length-delimited fields. A wire-type-2 payload can be decoded as a UTF-8 string, as raw bytes, or as a nested Protobuf message — and the wire format alone cannot tell you which interpretation is correct. An earlier version of the crush decoder checked for valid UTF-8 first: if the payload was valid UTF-8, it was displayed as a string and the nested decode was never attempted. This produced a silent data loss: a payload that was simultaneously valid UTF-8 and a valid nested message was always shown as a flat string, with its internal structure invisible.

The current decoder reverses the order: for every length-delimited field, a nested Protobuf parse is attempted first. Only if that parse yields no entries — or raises an error — does the decoder fall back to UTF-8 string or raw bytes display. Forensically, this is the correct priority: a nested message that happens to be valid UTF-8 has structure worth surfacing; a flat string that happens to fail the nested parse is just a string.
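
The nested-first ordering can be sketched in Python as follows. This illustrates the priority described above under simplifying assumptions (no group handling, no depth limit); the names parse and decode_payload are hypothetical and do not come from crush.

```python
def read_varint(buf, pos):
    """Decode one varint; an IndexError signals a truncated buffer."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def parse(buf):
    """Strict parse; raises ValueError or IndexError on malformed input."""
    fields, pos = [], 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field, wtype = tag >> 3, tag & 7
        if wtype == 0:
            value, pos = read_varint(buf, pos)
        elif wtype in (1, 5):
            width = 8 if wtype == 1 else 4
            value, pos = buf[pos:pos + width], pos + width
        elif wtype == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wtype} not handled in this sketch")
        if pos > len(buf):
            raise ValueError("field exceeds remaining bytes")
        if wtype == 2:
            value = decode_payload(value)
        fields.append((field, wtype, value))
    return fields


def decode_payload(payload):
    """Try nested message first, then UTF-8 string, then raw bytes."""
    try:
        nested = parse(payload)
        if nested:                           # empty result counts as "no entries"
            return nested
    except (ValueError, IndexError):
        pass
    try:
        return payload.decode("utf-8")
    except UnicodeDecodeError:
        return payload


print(parse(bytes.fromhex("0896011203746573")))  # [(1, 0, 150), (2, 2, 'tes')]
print(parse(bytes.fromhex("12022841")))          # [(2, 2, [(5, 0, 65)])]
```

The second call shows the cost of this priority: the two payload bytes are also the string "(A", yet they parse cleanly as field 5 = 65, so the string reading is never shown.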

If the decoded Protobuf output looks structurally inconsistent (deeply nested messages where none are expected, or field numbers that jump around implausibly), it is worth switching to Hex mode and checking whether this is actually Protobuf or something else that happens to parse.

Further Reading

Happy examining. 🐢
