Antenna SQL Tutorial: Reach and Frequency
Antenna is a new method of gaining seamless access to your log-level data for querying in a SQL interface without the need for ingesting, cleaning and transforming your log files.
In this tutorial, we’ll explore a few ways you can use the log-level data that Antenna makes available to analyze your data in ways that are unavailable to you in Beeswax’s Query Tool. Throughout this process we’ll work our way up from basic queries to queries that are more advanced.
While we’ll work our way up in difficulty, this tutorial assumes a basic working knowledge of SQL. For a tutorial on SQL basics, one resource is W3Schools’ SQL Tutorial.
Example 1: Calculating Reach and Frequency
Let’s start with a relatively basic analysis: finding the Reach and Frequency of our campaigns. For this analysis we only need data on won impressions, so we’ll use the table that houses this:
Let’s start with the basic framework of the query, selecting the fields we need and getting a basic impression count before we bring in Reach and Frequency. We’ll aggregate over the
line_item_id dimensions for the last 30 days of data. We’re pulling a count of all the rows here, as this table represents impressions.
SELECT campaign_id, line_item_id, COUNT(*) AS impressions FROM shared_impression_details_view WHERE bid_time >= CURRENT_DATE() - 30 GROUP BY campaign_id, line_item_id ORDER BY impressions DESC
Now we have all our campaigns broken down by line items sorted by impressions in descending order. This isn’t anything too special though — you can pull this in Query Tool today!
Let’s start getting closer to the analysis we actually want, with Reach and Frequency calculations. To calculate Reach we’ll use Snowflake’s HLL function which runs a HyperLogLog approximate count analysis. This will trade off a small degree of accuracy (+/- 1.62%) for a much more cost-efficient query.
SELECT campaign_id, line_item_id, COUNT(*) AS impressions, HLL(user_id) AS reach, -- The below could be used instead of HLL with worse performance but an exact count. We've commented it out for now. -- COUNT(DISTINCT user_id) AS alt_reach_calculation SUM(CASE WHEN user_id IS NOT null THEN 1 ELSE 0 END) AS measurable_reach_impressions, HLL(user_id)/SUM(CASE WHEN user_id IS NOT null THEN 1 ELSE 0 END) AS frequency FROM shared_impression_details_view WHERE bid_time >= CURRENT_DATE() - 30 GROUP BY campaign_id, line_item_id ORDER BY reach DESC
Great! We’ve now got a report calculating reach and frequency across our line items. However, for good measure let’s swap this out for a dimension not available in Query Tool today —
SELECT deal_id, COUNT(*) AS impressions, HLL(user_id) AS reach, SUM(CASE WHEN user_id IS NOT null THEN 1 ELSE 0 END) AS measurable_reach_impressions, HLL(user_id)/SUM(CASE WHEN user_id IS NOT null THEN 1 ELSE 0 END) AS frequency FROM shared_impression_details_view WHERE bid_time >= CURRENT_DATE() - 30 GROUP BY deal_id ORDER BY reach DESC
With a few small steps in Antenna, we’ve now gone from a query you could fully run in Query Tool to one that contains both dimensions and metrics in a frequency analysis that were previously unavailable to you.
Limiting to Converters
Let’s do one final alteration that may offer a little bit more insight. Let’s find out the distribution of impression frequency for converters. Let’s start by getting the frequency for each
user_id, but limited to only impressions that had a conversion. We’ll use a slightly different methodology than before. This analysis doesn’t handle all of the niche cases, but will pull the frequency of impressions for any user who made at least one conversion and saw at least one impression in the last 30 days.
SELECT user_id, COUNT(*) AS impressions FROM shared_impression_details_view WHERE conversions > 0 AND bid_time >= CURRENT_DATE() - 30 GROUP BY user_id
Close, but we’re not done yet to get the frequency distribution. Let’s wrap this in a subquery to take these results and get a distribution. Because there may be a long-tail of users with a large number of impressions. Let’s group all the users with 10+ impressions together into a single bucket. To do this we need to also cast
impressions_per_user to the string datatype using
SELECT CASE WHEN impressions_per_user >= 10 THEN '10+' ELSE TO_CHAR(impressions_per_user) END AS impressions_per_user, COUNT(*) AS frequency FROM (SELECT user_id, COUNT(*) AS impressions_per_user FROM shared_impression_details_view WHERE conversions > 0 AND bid_time_utc >= CURRENT_DATE() - 30 GROUP BY user_id) GROUP BY impressions_per_user ORDER BY impressions_per_user
With these results, we can take the results into a spreadsheet program and create a nice visualization of the frequency distribution (log scale).
Updated almost 3 years ago