Understanding Data Quality Fields: data_incomplete and null_percentage

This feature applies to data from November 1, 2025 (UTC) onward. Data before that date may not follow the same threshold logic and should be interpreted accordingly. 

 

This article explains how Butlr indicates data usability in API responses using the fields data_incomplete and null_percentage.

These indicators help distinguish between:

  • Valid occupancy values (including zero)

  • Intervals where data collection was incomplete

This allows dashboards, analytics systems, and integrations to correctly interpret sensor data.


Why These Fields Matter

In some time windows, the system may receive incomplete data from one or more sensors.

Without additional indicators, it can be difficult to distinguish between:

  • A space that is truly empty (occupancy = 0)

  • A space that shows empty (occupancy = 0) but should not be used due to incomplete data

  • A space that shows occupancy (occupancy > 0) but should not be used due to incomplete data

To address this, the Butlr API provides data quality indicators that help determine whether the returned value is based on complete data coverage.


Data Quality Fields

data_incomplete

A boolean field that indicates whether the data for a time window is complete.

| Value | Meaning |
|---|---|
| false | Data coverage for the interval is complete |
| true | Some underlying data was not collected during the interval |

If data_incomplete = true, the value may still be computed from available inputs, but the interval should not be used for long-term analytics or reporting.

null_percentage (0–1)

null_percentage is a diagnostic field and should be interpreted using a threshold of 0.30.

  • ≤ 0.30: data can still be considered usable; a value in this range does not necessarily indicate poor quality

  • > 0.30: data quality is considered insufficient and the data should generally not be treated as reliable


For deciding whether data should be used, Butlr recommends treating data_incomplete as the primary signal and null_percentage as a secondary diagnostic only.
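
The relationship between the two fields can be sketched as a small helper. This is a minimal illustration, not part of the Butlr API: the function name assess_quality and the assumption that each record is a dict carrying data_incomplete and null_percentage keys are hypothetical.

```python
NULL_PERCENTAGE_THRESHOLD = 0.30  # threshold stated in this article


def assess_quality(record):
    """Return (usable, note) for one interval record.

    Assumption: `record` is a dict with `data_incomplete` (bool) and
    `null_percentage` (float between 0 and 1). The record shape is
    illustrative and may differ from the real API response.
    """
    # Primary signal: data_incomplete decides whether the interval is used.
    usable = record.get("data_incomplete") is False

    # Secondary diagnostic: null_percentage only describes the quality level.
    null_pct = record.get("null_percentage")
    if null_pct is None:
        note = "null_percentage not reported"
    elif null_pct > NULL_PERCENTAGE_THRESHOLD:
        note = "null_percentage above 0.30: quality insufficient"
    else:
        note = "null_percentage within 0.30: still considered usable"
    return usable, note
```

The decision to keep or drop an interval comes only from data_incomplete; the note is there for debugging and never overrides it.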


How to Interpret Results

null_on_incomplete is a filter parameter that controls how the API represents intervals with incomplete underlying data coverage.

| Scenario | Value | data_incomplete | Interpretation |
|---|---|---|---|
| Complete interval | > 0 | false | Value is reliable. |
| Valid zero occupancy | 0 | false | Space was empty; value is reliable. |
| Partial data coverage | 0 or > 0 | true | Some raw data is missing, but the value was computed from the available portion. |
| No data collected | null | true | No usable data in the window. |
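
The four scenarios above can be expressed as a simple classification. The sketch below is illustrative only: classify_interval and its return labels are hypothetical names, and it assumes a null value arrives in Python as None.

```python
def classify_interval(value, data_incomplete):
    """Map one interval to a scenario from the interpretation table.

    Assumption: `value` is a number or None, `data_incomplete` is a bool.
    """
    if data_incomplete:
        if value is None:
            return "no_data_collected"      # null + true
        return "partial_data_coverage"      # 0 or > 0, + true
    if value is None:
        # Combination not listed in the table; flag it rather than guess.
        return "unexpected_null"
    if value == 0:
        return "valid_zero_occupancy"       # 0 + false
    return "complete_interval"              # > 0 + false
```

Only the two labels produced when data_incomplete is false correspond to values the table marks as reliable.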

Use Case 1 — Occupancy = 0 vs No Data

  • Scenario
    • Occupancy value = 0, and you need to know whether it is real or missing data.
  • Validation Rule
    • If data_incomplete = false → Valid occupancy zero.
    • If data_incomplete = true → Do not trust the value.
    • Do not rely on null_percentage alone.
  • Expected Outcome
    • Clear distinction between “empty space” and “no reporting.”
    • Reduced false alarms in operational workflows.

Use Case 2 — Validating Data Before Using It in Charts

  • Scenario
    • You want to build a dashboard chart and ensure the data is reliable.
  • Recommended Approach
    • Query at 15-minute granularity.
    • If data_incomplete = true:
      • Exclude that 15-minute interval.
    • After validation, aggregate to hourly/daily if needed.
  • Expected Outcome
    • Short degraded periods are visible at 15-minute resolution.
    • Aggregated charts (hourly/daily) reflect only validated intervals.
    • Occupancy = 0 with data_incomplete = false can be trusted as a valid zero.

Example request body:
{
  "window": {
    "every": "15m",
    "function": "max",
    "timezone": "America/New_York"
  },
  "filter": {
    "start": "2026-02-01T05:00:00Z",
    "stop": "2026-02-13T05:00:00Z",
    "null_on_incomplete": true,
    "spaces": { "eq": ["space_XXXXXXXXX"] },
    "measurements": ["traffic_floor_occupancy"],
    "value": { "gte": 0 }
  }
}
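
The validate-then-aggregate step in Use Case 2 could look like the sketch below. It assumes each returned record is a dict with time (ISO 8601 string), value, and data_incomplete keys; that shape and the helper name hourly_max are assumptions for illustration. It uses max to match the "function": "max" setting in the request.

```python
from collections import defaultdict
from datetime import datetime


def hourly_max(records):
    """Aggregate validated 15-minute records into hourly maxima.

    Intervals flagged data_incomplete, or with a null value, are skipped
    before aggregation, so hourly values reflect only validated intervals.
    """
    buckets = defaultdict(list)
    for rec in records:
        if rec["data_incomplete"] or rec["value"] is None:
            continue
        # Accept a trailing "Z" on Python versions older than 3.11.
        ts = datetime.fromisoformat(rec["time"].replace("Z", "+00:00"))
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(rec["value"])
    return {hour: max(values) for hour, values in sorted(buckets.items())}
```

An hour whose intervals were all excluded is absent from the result rather than reported as zero, which keeps "no validated data" distinct from "empty space".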

Use Case 3 — Long-term Data Availability (e.g., Last 6 Months)

  • Scenario
    • You need to evaluate data reliability over a long period (e.g., 6 months).
  • Recommended Approach
    1. Query data day-by-day.
    2. Use UTC boundaries.
    3. Keep the query granularity between 15 minutes and 1 day.
    4. Exclude intervals where data_incomplete = true.
    5. Exclude known sleep-mode/offline hours.
    6. Aggregate after validation.
  • Expected Outcome
    • Localized degradation is preserved and not hidden by aggregation.
    • SLA metrics reflect actual degraded intervals.
    • Month-level summaries are built only from validated data.

Example script:
import requests
from datetime import datetime, timedelta, timezone
import pandas as pd

API_URL = "https://api.butlr.io/api/v3/reporting"
API_TOKEN = "YOUR_API_TOKEN"

SPACE_ID = "space_....."

# Note: data before 2025-11-01 (UTC) may not follow the same threshold logic
START_DATE = datetime(2025, 8, 1, tzinfo=timezone.utc)
END_DATE   = datetime(2026, 2, 1, tzinfo=timezone.utc)

SLEEP_START_UTC = 0   # Example: exclude 00:00–05:00 UTC
SLEEP_END_UTC   = 5

def query_day(day_start):
    day_end = day_start + timedelta(days=1)

    payload = {
        "window": {
            "every": "15m",
            "function": "max",
            "timezone": "UTC"
        },
        "filter": {
            "start": day_start.isoformat(),
            "stop": day_end.isoformat(),
            "null_on_incomplete": True,
            "spaces": { "eq": [SPACE_ID] },
            "measurements": ["traffic_floor_occupancy"],
            "value": { "gte": 0 }
        }
    }

    headers = {
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json"
    }

    # A timeout keeps a stalled request from blocking the whole day-by-day loop
    response = requests.post(API_URL, json=payload, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()

all_records = []
current = START_DATE

while current < END_DATE:
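    data = query_day(current)
    # Assumes the response body is a list of interval records;
    # adjust this line if the records are nested under a key
    all_records.extend(data)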
    current += timedelta(days=1)

# Convert to DataFrame
df = pd.DataFrame(all_records)

# --- Validation Step ---

# Convert time to datetime
df["time"] = pd.to_datetime(df["time"], utc=True)

# 1️⃣ Exclude sleep-mode hours first (example: 00:00–05:00 UTC) so they
#    do not count against availability
in_sleep = df["time"].dt.hour.between(SLEEP_START_UTC, SLEEP_END_UTC - 1)
df_awake = df[~in_sleep]

# 2️⃣ Exclude incomplete intervals (copy so later column assignments are safe)
df_valid = df_awake[df_awake["data_incomplete"] == False].copy()

# --- SLA Calculation Example ---

# Availability is measured over non-sleep intervals only
total_intervals = len(df_awake)
valid_intervals = len(df_valid)
incomplete_intervals = len(df_awake[df_awake["data_incomplete"] == True])

sla_percentage = (valid_intervals / total_intervals) * 100 if total_intervals else 0.0

print(f"Total intervals: {total_intervals}")
print(f"Incomplete intervals: {incomplete_intervals}")
print(f"SLA Availability: {sla_percentage:.2f}%")

# --- Monthly Summary (after validation) ---

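# Drop the timezone (values stay in UTC) before converting to monthly periods;
# assign() returns a new DataFrame instead of writing into a slice
df_valid = df_valid.assign(
    month=df_valid["time"].dt.tz_localize(None).dt.to_period("M")
)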

monthly_summary = df_valid.groupby("month").agg({
    "value": "mean"
})

print(monthly_summary)


Recommended Usage

When processing data returned by the Butlr API, follow the recommended evaluation order below:

  1. Check data_incomplete

    1. If data_incomplete = true, exclude the interval from analytics

    2. If data_incomplete = false, treat the value as usable

  2. Use null_percentage only as an additional diagnostic

    1. This field is intended for debugging data quality.

This is the recommended usage pattern for building reliable charts, summaries, and integrations.


API Behavior and Constraints


Supported Query Windows

This feature is intended for 15-minute intervals or larger.

Requests using granularity smaller than 15 minutes may return output, but those results are not considered meaningful for analysis. Webhook delivery also has a minimum interval of 15 minutes.
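
A client can guard against unsupported granularity before sending a request. The sketch below assumes the "every" value is a whole number followed by a single unit letter (s, m, h, or d), as in the "15m" example in this article; other formats, and the helper names parse_every and is_supported_window, are assumptions.

```python
import re
from datetime import timedelta

MIN_WINDOW = timedelta(minutes=15)  # minimum meaningful window per this article
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_every(every):
    """Convert a window string such as "15m" or "1h" to a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", every)
    if match is None:
        raise ValueError(f"unsupported window format: {every!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def is_supported_window(every):
    """True when the window is 15 minutes or larger."""
    return parse_every(every) >= MIN_WINDOW
```

Calling is_supported_window("5m") returns False, so such a query can be rejected or widened before its results reach a chart.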


Sleep Mode Behavior

During scheduled sleep periods:

  • Data may not be collected

  • null_percentage and data_incomplete may vary depending on the query window

For long-term availability analysis, Butlr recommends excluding known sleep periods from the evaluation window.
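
Sleep schedules often span midnight (for example 22:00 to 06:00), which a simple start-to-end range check does not handle. The helper below is a sketch under the assumption that the sleep window is defined by whole hours and that timestamps are already expressed in the same timezone as the schedule; in_sleep_window is a hypothetical name.

```python
from datetime import datetime, timezone


def in_sleep_window(ts, sleep_start_hour, sleep_end_hour):
    """Return True if ts falls in the half-open hour range [start, end).

    Handles windows that wrap past midnight, such as 22 -> 6.
    """
    hour = ts.hour
    if sleep_start_hour == sleep_end_hour:
        return False  # empty window: nothing is excluded
    if sleep_start_hour < sleep_end_hour:
        return sleep_start_hour <= hour < sleep_end_hour
    # Overnight window: late evening or early morning
    return hour >= sleep_start_hour or hour < sleep_end_hour


# Example: keep only timestamps outside a 22:00-06:00 UTC sleep window
timestamps = [
    datetime(2026, 2, 1, 23, 0, tzinfo=timezone.utc),
    datetime(2026, 2, 2, 3, 15, tzinfo=timezone.utc),
    datetime(2026, 2, 2, 9, 30, tzinfo=timezone.utc),
]
awake = [ts for ts in timestamps if not in_sleep_window(ts, 22, 6)]
```

Applying this filter before counting intervals keeps scheduled sleep periods from being counted as degraded data.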


Long Offline Periods

If a Hive or sensor remains offline for more than two consecutive weeks, it may be excluded from certain internal calculation paths used in data aggregation.

When this happens:

  • The device may no longer be included in null_percentage calculations

  • The interval may no longer trigger data_incomplete

  • As a result, null_percentage may appear as 0, even if the device previously had a prolonged outage

Because of this behavior, a “clean” null_percentage does not always guarantee that no prolonged outages occurred, particularly if the outage exceeded the two-week threshold.


Placeholder Sensors

If a space contains unused placeholder sensors in the configuration:

  • They may appear as offline

  • This can inflate null_percentage or trigger data_incomplete

Recommended action:
Remove unused placeholder sensors from the configuration to ensure accurate data quality metrics.
 

Virtually Mirrored or Cloned Sensors

In rare cases, a sensor may be “cloned” or “mirrored” in the configuration.

When this occurs:

  • The cloned/mirrored entry may appear offline

  • The physical device itself may still be operating normally

This situation does not negatively affect null_percentage or data_incomplete.


Scope of This Feature

These data quality indicators apply to:

  • Presence sensor occupancy data

  • Traffic sensor IN/OUT measurements

They do not apply to floor-level occupancy estimates generated by traffic sensors, which follow a separate calibration process.

 

Best Practice for Analytics

For the most reliable reporting:

  1. Query data at 15-minute granularity

  2. Exclude intervals where data_incomplete = true

  3. Aggregate the remaining validated intervals into hourly, daily, or longer summaries

This preserves short gaps in data quality while keeping long-term reporting based on usable intervals.
