SQL-based bot filtering
This SQL-based bot filtering example demonstrates how to use SQL queries to define thresholds and detect bot activity based on predefined rules. This foundational approach helps identify anomalies in web traffic by removing unusual activity. By customizing detection rules with defined thresholds and intervals, you can effectively tailor bot filtering to suit your specific traffic patterns.
Define thresholds for bot activity
Start by analyzing your dataset to identify and categorize user behavior. Focus on attributes like ECID, timestamp, and webPageDetails.name
(the name of the webpage visited) to group user actions and detect patterns that indicate bot activity.
The SQL query below demonstrates how to apply the threshold of more than 60 clicks in one minute to identify suspicious activity:
SELECT *
FROM analytics_events_table
WHERE enduserids._experience.ecid NOT IN (
SELECT enduserids._experience.ecid
FROM analytics_events_table
GROUP BY Unix_timestamp(timestamp) / 60, enduserids._experience.ecid
HAVING Count(*) > 60
);
Expand to multiple intervals
Next, define different time intervals for thresholds. These intervals could include:
- 1-minute interval: Up to 60 clicks.
- 5-minute interval: Up to 300 clicks.
- 30-minute interval: Up to 1800 clicks.
The following SQL query creates a view named analytics_events_clicks_count_criteria
to handle thresholds across multiple intervals. The statement consolidates click counts for 1-minute, 5-minute, and 30-minute intervals into a structured dataset and flags potential bot activity based on predefined thresholds.
CREATE VIEW analytics_events_clicks_count_criteria as
SELECT struct (
cast(count_1_min AS int) one_minute,
cast(count_5_mins AS int) five_minute,
cast(count_30_mins AS int) thirty_minute
) count_per_id,
id,
struct (
struct (name) webpagedetails
) web,
CASE
WHEN count.one_minute > 50 THEN 1
ELSE 0
END AS isBot
FROM (
SELECT table_count_1_min.mcid AS id,
count_1_min,
count_5_mins,
count_30_mins,
table_count_1_min.name AS name
FROM (
(SELECT mcid, Max(count_1_min) AS count_1_min, name
FROM (SELECT enduserids._experience.mcid.id AS mcid,
Count(*) AS count_1_min,
web.webPageDetails.name AS name
FROM delta_table
WHERE TIMESTAMP BETWEEN TO_TIMESTAMP('2019-09-01 00:00:00')
AND TO_TIMESTAMP('2019-09-01 23:00:00')
GROUP BY UNIX_TIMESTAMP(timestamp) / 60,
enduserids._experience.mcid.id,
web.webPageDetails.name)
GROUP BY mcid, name) AS table_count_1_min
LEFT JOIN
(SELECT mcid, Max(count_5_mins) AS count_5_mins, name
FROM (SELECT enduserids._experience.mcid.id AS mcid,
Count(*) AS count_5_mins,
web.webPageDetails.name AS name
FROM delta_table
WHERE TIMESTAMP BETWEEN TO_TIMESTAMP('2019-09-01 00:00:00')
AND TO_TIMESTAMP('2019-09-01 23:00:00')
GROUP BY UNIX_TIMESTAMP(timestamp) / 300,
enduserids._experience.mcid.id,
web.webPageDetails.name)
GROUP BY mcid, name) AS table_count_5_mins
ON table_count_1_min.mcid = table_count_5_mins.mcid
LEFT JOIN
(SELECT mcid, Max(count_30_mins) AS count_30_mins, name
FROM (SELECT enduserids._experience.mcid.id AS mcid,
Count(*) AS count_30_mins,
web.webPageDetails.name AS name
FROM delta_table
WHERE TIMESTAMP BETWEEN TO_TIMESTAMP('2019-09-01 00:00:00')
AND TO_TIMESTAMP('2019-09-01 23:00:00')
GROUP BY UNIX_TIMESTAMP(timestamp) / 1800,
enduserids._experience.mcid.id,
web.webPageDetails.name)
GROUP BY mcid, name) AS table_count_30_mins
ON table_count_1_min.mcid = table_count_30_mins.mcid
)
)
The statement joins data from table_count_1_min
, table_count_5_mins
, and table_count_30_mins
using the mcid
value and webpage. It then consolidates click counts for each user across multiple time intervals to provide a complete view of user activity. Finally, the flagging logic then identifies users that exceed 50 clicks in one minute and marks them as bots (isBot = 1
).
Output dataset structure
The output dataset is structured with both flat and nested fields. This structure allows flexibility when detecting anomolous traffic and supports advanced filtering strategies that use SQL and machine learning. The nested fields include count
and web
which encapsulate granular details about activity thresholds and webpages. Capturing these metrics means that features can be easily extracted for training and prediction tasks.
The following schema diagram outlines the structure of the resultant dataset and highlights how nested and flat fields can be used for efficient processing and bot detection.
root
|-- count: struct (nullable = false)
| |-- one_minute: integer (nullable = true)
| |-- five_minute: integer (nullable = true)
| |-- thirty_minute: integer (nullable = true)
|-- id: string (nullable = true)
|-- web: struct (nullable = false)
| |-- webpagedetails: struct (nullable = false)
| | |-- name: string (nullable = true)
|-- isBot: integer (nullable = false)
The output dataset to be used for training
The result of this expression might look similar to the table provided below. In the table, the isBot
column acts as a label that distinguishes between bot and non-bot activity.
| `id` | `count_per_id` |`isBot`| `web` |
|--------------|-----------------------------------------------------|-------|------------------------------------------------------------------------------------------------------------------------|
| 2.5532E+18 | {"one_minute":99,"five_minute":1,"thirty_minute":1} | 1 | {"webpagedetails":{"name":"KR+CC8TQzPyK4ord6w1PfJay1+h6snSF++xFERc4ogrEX4clJROgzkGgnSTSGWWZfNS/Ouz2K0VtkHG77vwoTg=="}} |
| 2.5532E+18 | {"one_minute":99,"five_minute":1,"thirty_minute":1} | 1 | {"webpagedetails":{"name":"KR+CC8TQzPyK4ord6w1PfJay1+h6snSF++xFERc4ogrEX4clJROgzkGgnSTSGWWZfNS/Ouz2K0VtkHG77vwoTg=="}} |
| 2.5532E+18 | {"one_minute":99,"five_minute":1,"thirty_minute":1} | 1 | {"webpagedetails":{"name":"KR+CC8TQzPyK4ord6w1PfJay1+h6snSF++xFERc4ogrEX4clJROgzkGgnSTSGWWZfNS/Ouz2K0VtkHG77vwoTg=="}} |
| 2.5532E+18 | {"one_minute":99,"five_minute":1,"thirty_minute":99}| 1 | {"webpagedetails":{"name":"KR+CC8TQzPyK4ord6w1PfJay1+h6snSF++xFERc4ogrEX4clJROgzkGgnSTSGWWZfNS/Ouz2K0VtkHG77vwoTg=="}} |
| 2.5532E+18 | {"one_minute":99,"five_minute":1,"thirty_minute":1} | 1 | {"webpagedetails":{"name":"KR+CC8TQzPyK4ord6w1PfJay1+h6snSF++xFERc4ogrEX4clJROgzkGgnSTSGWWZfNS/Ouz2K0VtkHG77vwoTg=="}} |
| 2.5532E+18 | {"one_minute":99,"five_minute":1,"thirty_minute":1} | 1 | {"webpagedetails":{"name":"KR+CC8TQzPyK4ord6w1PfJay1+h6snSF++xFERc4ogrEX4clJROgzkGgnSTSGWWZfNS/Ouz2K0VtkHG77vwoTg=="}} |
| 2.5532E+18 | {"one_minute":99,"five_minute":1,"thirty_minute":1} | 1 | {"webpagedetails":{"name":"KR+CC8TQzPyK4ord6w1PfJay1+h6snSF++xFERc4ogrEX4clJROgzkGgnSTSGWWZfNS/Ouz2K0VtkHG77vwoTg=="}} |
| 2.5532E+18 | {"one_minute":99,"five_minute":1,"thirty_minute":1} | 1 | {"webpagedetails":{"name":"KR+CC8TQzPyK4ord6w1PfJay1+h6snSF++xFERc4ogrEX4clJROgzkGgnSTSGWWZfNS/Ouz2K0VtkHG77vwoTg=="}} |
| 2.5532E+18 | {"one_minute":99,"five_minute":1,"thirty_minute":1} | 1 | {"webpagedetails":{"name":"KR+CC8TQzPyK4ord6w1PfJay1+h6snSF++xFERc4ogrEX4clJROgzkGgnSTSGWWZfNS/Ouz2K0VtkHG77vwoTg=="}} |
| 2.5532E+18 | {"one_minute":99,"five_minute":1,"thirty_minute":1} | 1 | {"webpagedetails":{"name":"KR+CC8TQzPyK4ord6w1PfJay1+h6snSF++xFERc4ogrEX4clJROgzkGgnSTSGWWZfNS/Ouz2K0VtkHG77vwoTg=="}} |
| 1E+18 | {"one_minute":1,"five_minute":1,"thirty_minute":1} | 0 | {"webpagedetails":{"name":"KR+CC8TQzPyMOE/bk7EGgN3lSvP8OsxeI2aLaVrbaeLn8XK3y3zok2ryVyZoiBu3"}} |
| 1.00007E+18 | {"one_minute":1,"five_minute":1,"thirty_minute":1} | 0 | {"webpagedetails":{"name":"8DN0dM4rlvJxt4oByYLKZ/wuHyq/8CvsWNyXvGYnImytXn/bjUizfRSl86vmju7MFMXxnhTBoCWLHtyVSWro9LYg0MhN8jGbswLRLXoOIyh2wduVbc9XeN8yyQElkJm3AW3zcqC7iXNVv2eBS8vwGg=="}} |
| 1.00008E+18 | {"one_minute":1,"five_minute":1,"thirty_minute":1} | 0 | {"webpagedetails":{"name":"KR+CC8TQzPyMOE/bk7EGgN3lSvP8OsxeI2aLaVrbaeLn8XK3y3zok2ryVyZoiBu3"}} |