# Dataset statistics computation

You can now compute column-level statistics on Azure Data Lake Storage (ADLS) datasets with the `COMPUTE STATISTICS`

SQL command. The SQL commands that compute dataset statistics are an extension of the `ANALYZE TABLE`

command. Full details on the `ANALYZE TABLE`

command can be found in the SQL reference documentation.

To see the statistics that were computed with the `ANALYZE TABLE COMPUTE STATISTICS`

command, you can use a SELECT query on the alias name or Statistics ID. You can also limit the scope of the statistical analysis to either the entire dataset, a subset of a dataset, all columns, or a subset of columns.

`COMPUTE STATISTICS`

, `FILTERCONTEXT`

, and `FOR COLUMNS`

commands are not supported on accelerated store tables. These extensions for the `ANALYZE TABLE`

command are currently only supported for ADLS tables. For more information, see the ANALYZE TABLE section of the SQL syntax guide.This guide helps you structure your queries so that you can compute the column statistics of an ADLS dataset. Using these commands, you can see the statistics generated in your session through a PSQL client using an SQL query.

## Compute statistics compute-statistics

Additional constructs have been added to the `ANALYZE TABLE`

command that allows you to **compute statistics for a subset of a dataset and for certain columns**. To compute dataset statistics, you must use the `ANALYZE TABLE <tableName> COMPUTE STATISTICS`

format.

**entire dataset**and for

**all columns**. To compute statistics on all columns, you would use the query format

`ANALYZE TABLE COMPUTE STATISTICS`

. You are **not**recommended to use the

`COMPUTE STATISTICS`

command without filters on an ADLS dataset, as the size of the dataset can be very large (potentially petabytes of data). Instead, you should always consider running the analyze command using `FILTERCONTEXT`

and a specified list of columns. See the sections on limiting analyzed columns and adding a filter condition for more details.The example seen below computes statistics for the `adc_geometric`

dataset and for **all** columns in the dataset.

```
ANALYZE TABLE adc_geometric COMPUTE STATISTICS;
```

`COMPUTE STATISTICS`

command does not support the array or map data types. You can set a `skip_stats_for_complex_datatypes`

flag to be notified or to error out if the input data frame has columns with arrays and map data types. By default, the flag is set to true. To enable notifications or errors, use the following command: `SET skip_stats_for_complex_datatypes = false`

.## Create an alias name alias-name

Since the results of calculations can be a large amount of data, it is unreasonable to return the computed data directly in the console output. Although alias names are optional, you are recommended to use them as best practice when you compute statistics. Provide an alias name in the statement to descriptively reference the results in your SQL queries. Alternatively, an automatically generated `Statistics ID`

is generated and used to store the calculated information.

The example below stores the output computed statistics in the `alias_name`

for later reference. The alias name used in the query is available for reference as soon as the `ANALYZE TABLE`

command has been run.

```
ANALYZE TABLE adc_geometric COMPUTE STATISTICS AS alias_name;
```

The output for the above example is `SUCCESSFULLY COMPLETED, alias_name`

. The console output does not display the statistics in the response to the analyze table compute statistics command. To see the detailed results, you must use a SELECT query on the alias name or Statistics ID.

## View the output of computed statistics view-output-of-computed-statistics

If you do not provide an alias name in advance, Query Service automatically generates a name for the `Statistics ID`

that follows the format of `<tableName_stats_{incremental_number}>`

. If an alias name is provided it appears in the `Statistics ID`

column.

An example output of a `COMPUTE STATISTICS`

query is as follows:

```
| Statistics ID |
| --------------------- |
| adc_geometric_stats_1 |
(1 row)
```

You can then **query the computed statistics directly** by referencing the `Statistics ID`

. The example statement below allows you to view the output in full when used with the `Statistics ID`

or the alias name.

```
SELECT * FROM adc_geometric_stats_1;
```

The computed statistics output might look similar to the example below.

```
columnName | mean | max | min | standardDeviation | approxDistinctCount | nullCount | dataType
------------------------------------------------------------+----------------+----------------+----------------+-------------------+---------------------+-----------+-----------
marketing.trackingcode | 0.0 | 0.0 | 0.0 | 0.0 | 1213.0 | 0 | String
_experience.analytics.customdimensions.evars.evar13 | 0.0 | 0.0 | 0.0 | 0.0 | 8765.0 | 20 | String
_experience.analytics.customdimensions.evars.evar74 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 0 | String
web.webpagedetails.name | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | String
_experience.analytics.event1to100.event8.value | 5.0 | 9077.0 | 123.0 | 10.0 | 1001.0 | 80 | Double
search.ispaid | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | Boolean
commerce.productlistviews.value | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10 | Double
device.typeid | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10 | String
commerce.purchases.value | 765.0 | 98760.0 | -980.0 | 32.0 | 99.0 | 90 | Double
_experience.analytics.customdimensions.props.prop45 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | String
environment.browserdetails.javaenabled | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | Boolean
timestamp | 0.0 | 0.0 | 0.0 | 0.0 | 98.0 | 3 | Timestamp
(12 rows)
```

## Show the statistical analysis metadata show-statistics

You can use the `SHOW STATISTICS`

command to display the metadata for all the temporary statistics generated in the session. This command can help you refine the scope of your statistical analysis.

An example output of `SHOW STATISTICS`

is seen below.

```
statsId | tableName | columnSet | filterContext | timestamp
----------------------+---------------+-----------+-----------------------------+--------------------
adc_geometric_stats_1 | adc_geometric | (age) | | 25/06/2023 09:22:26
demo_table_stats_1 | demo_table | (*) | ((age > 25)) | 25/06/2023 12:50:26
age_stats | castedtitanic | (age) | ((age > 25) AND (age < 40)) | 25/06/2023 09:22:26
```

A description of the metadata column names is provided below.

`statsId`

`COMPUTE STATISTICS`

command.`tableName`

`columnSet`

`filterContext`

`timestamp`

You can use the statistics ID or alias name to look up the computed statistics with a SELECT statement at any time within that session. The statistics ID and the statistics generated are only valid for this particular session and cannot be accessed across different PSQL sessions. The computed statistics are not currently persistent. See the section on how to view the output of your computed statistics for more details.

## Limit the included columns limit-included-columns

To focus your analysis, you can compute statistics for particular dataset columns by referencing them by name. Use the `FOR COLUMNS (<col1>, <col2>)`

syntax to target specific columns. The example below computes statistics for the columns `commerce`

, `id`

, and `timestamp`

for the dataset `tableName`

.

```
ANALYZE TABLE tableName COMPUTE STATISTICS FOR columns (commerce, id, timestamp);
```

You can calculate the statistics for any root level or nested column. The following example demonstrates these references.

```
ANALYZE TABLE adcgeometric COMPUTE STATISTICS FOR columns (commerce, commerce.purchases.value, commerce.productListAdds.value);
```

## Add a timestamp filter condition filter-condition

To focus the analysis of your columns based on chronology, you can add a timestamp filter condition. This condition can be used to filter out historical data or focus your data analysis on a specific period. The `FILTERCONTEXT`

command calculates statistics on a subset of the dataset based on the filter condition that you provide.

In the example below, statistics are computed on all columns for the dataset `tableName`

, where the column timestamp has values between the specified range of `2023-04-01 00:00:00`

and `2023-04-05 00:00:00`

.

```
ANALYZE TABLE tableName FILTERCONTEXT (timestamp >= to_timestamp('2023-04-01 00:00:00') and timestamp <= to_timestamp('2023-04-05 00:00:00')) COMPUTE STATISTICS FOR ALL COLUMNS;
```

You can combine the column limit and the filter to create highly specific computational queries for your dataset columns. For example, the following query computes statistics on the columns `commerce`

, `id`

, and `timestamp`

for the dataset `tableName`

, where the column timestamp has values between the specified range of `2023-04-01 00:00:00`

and `2023-04-05 00:00:00`

.

```
ANALYZE TABLE tableName FILTERCONTEXT (timestamp >= to_timestamp('2023-04-01 00:00:00') and timestamp <= to_timestamp('2023-04-05 00:00:00')) COMPUTE STATISTICS FOR columns (commerce, id, timestamp);
```

## Next steps next-steps

By reading this document, you now have a better understanding of how to generate column-level statistics from an ADLS dataset using an SQL query. You are recommended to read the SQl syntax guide to discover more features of the Adobe Experience Platform Query Service.