Advanced Data Analysis
Table Calculations vs. Database Calculations
Overview
Understanding when to perform calculations in the visualization layer versus the database layer is crucial for performance and accuracy.
Table Calculations
- Processing Location: Performed locally in the visualization tool
- Timing: After aggregations have occurred
- Pros:
- Respond to filters and context changes
- More flexible and dynamic
- Can be adjusted without modifying source data
- Cons:
- May be slower with large datasets
- Limited by local processing power
Common Table Calculations:
- Running totals
- Percent of total
- Difference from previous
- Moving averages
- Rank and percentile
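The first three of these can be sketched in a few lines. This is a minimal, illustrative Python version (the function names are mine, not any tool's API); visualization tools apply the same logic over the rows of the rendered table.

```python
def running_total(values):
    """Cumulative sum at each position."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out

def percent_of_total(values):
    """Each value's share of the overall sum."""
    total = sum(values)
    return [v / total for v in values]

def moving_average(values, window):
    """Trailing moving average; windows are shorter at the start of the series."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Because these run over the already-aggregated rows, any filter that changes which rows appear also changes the results, which is exactly the "responds to filters" behavior described above.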
Database Calculations
- Processing Location: Performed in the source database
- Timing: As part of the query to the data source
- Pros:
- Faster for large datasets
- Leverages database optimization
- Consistent results across different views
- Cons:
- Fixed results that ignore filters applied after the query runs
- Require database query modification
When to Use Each:
- Use table calculations when results need to adapt to filters and user interactions
- Use database calculations when dealing with large datasets or when filter context should not affect the calculation
Ranking Functions
Types of Rank Calculations
Standard Rank (RANK)
- Behavior: Identical values receive identical rank
- Gap Handling: Skips ranks after ties
- Example: Values [100, 90, 90, 80] → Ranks [1, 2, 2, 4]
Dense Rank (RANK_DENSE)
- Behavior: Identical values receive identical rank
- Gap Handling: No gaps in ranking sequence
- Example: Values [100, 90, 90, 80] → Ranks [1, 2, 2, 3]
Modified Rank (RANK_MODIFIED)
- Behavior: Identical values receive identical rank
- Tie Breaking: Ties share the highest-numbered of the tied positions
- Example: Values [100, 90, 90, 80] → Ranks [1, 3, 3, 4]
Unique Rank (RANK_UNIQUE)
- Behavior: Every value receives unique rank
- Tie Breaking: Based on additional sort criteria (e.g., alphabetical)
- Example: Values [100, 90, 90, 80] → Ranks [1, 2, 3, 4]
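The four tie rules differ only in how tied positions are assigned. A compact Python sketch (mirroring the behavior of functions like Tableau's RANK family, not their actual implementation; for "unique", ties here fall back to input order rather than a named sort field):

```python
def ranks(values, method="standard"):
    """Rank a list (highest value = rank 1) under the four common tie rules."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for pos, i in enumerate(order):
        if method == "unique":
            result[i] = pos + 1                      # every row gets its own rank
        elif pos > 0 and values[i] == values[order[pos - 1]]:
            result[i] = result[order[pos - 1]]       # share the tied rank
        else:
            result[i] = pos + 1
    if method == "dense":
        distinct = sorted(set(values), reverse=True)
        result = [distinct.index(v) + 1 for v in values]   # no gaps after ties
    elif method == "modified":
        # ties take the last (highest-numbered) of the tied positions
        result = [sum(1 for w in values if w >= v) for v in values]
    return result
```

Running it on [100, 90, 90, 80] reproduces the four example outputs above.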
Ranking Applications
- Top N Analysis: Identify top performers
- Competitive Analysis: Compare entities directly
- Percentile Ranking: Relative performance assessment
Statistical Analysis Techniques
Reference Lines and Bands
Reference Lines
- Purpose: Mark specific values on visualizations
- Types:
- Constant: Fixed value (e.g., target = 100)
- Average: Mean of the data
- Median: Middle value
- Minimum/Maximum: Range boundaries
- Percentile: Specific percentile value
Reference Bands
- Purpose: Highlight ranges of values
- Configuration:
- Per Cell: Band for each individual data point
- Per Pane: Band for each category/group
- Per Table: Single band across entire visualization
- Usage:
- Show acceptable ranges
- Highlight variance zones
- Create bullet charts
Distribution Analysis
Box Plots (Box and Whisker)
Components:
- Box: Interquartile range (25th to 75th percentile)
- Median Line: 50th percentile
- Whiskers: Typically extend to the most extreme points within 1.5 × IQR of the box
- Outliers: Points beyond whiskers
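The box-plot components reduce to a handful of summary statistics. A small Python sketch using the standard library (Tukey's 1.5 × IQR fence is one common outlier convention; tools may use others):

```python
import statistics

def box_plot_stats(values):
    """Quartiles, IQR, and Tukey outlier fences for a sample."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)   # 25th, 50th, 75th pct
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr            # whisker fences
    outliers = [v for v in data if v < lo or v > hi]
    return {"q1": q1, "median": median, "q3": q3,
            "iqr": iqr, "outliers": outliers}
```

Note that `statistics.quantiles` defaults to the "exclusive" interpolation method, so exact quartile values can differ slightly from other tools.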
Histogram Analysis
Key Insights:
- Central Tendency: Where data clusters
- Spread: Width of distribution
- Skewness: Asymmetry in distribution
- Outliers: Unusual values
- Bimodal/Multimodal: Multiple peaks
Forecasting and Trend Analysis
Trend Lines
Types of Trend Models
Linear: Constant rate of change
- Formula: y = mx + b
- Use: Steady growth or decline
Exponential: Accelerating growth/decay
- Formula: y = ae^(bx)
- Use: Viral growth, compound interest
Logarithmic: Rapid initial change, then leveling
- Formula: y = a + b·ln(x)
- Use: Diminishing returns
Polynomial: Multiple inflection points
- Formula: y = a + bx + cx² + …
- Use: Cyclical patterns, complex relationships
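For the linear model, the slope m and intercept b come from an ordinary least-squares fit. A minimal Python sketch of the closed-form solution (the exponential, logarithmic, and polynomial models are usually fitted by transforming the data or with a numerical library):

```python
def linear_trend(xs, ys):
    """Least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b
```

Fitting perfectly linear data such as y = 2x + 1 recovers m = 2 and b = 1 exactly (up to floating-point error).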
Forecasting Techniques
Time Series Forecasting
- Moving Averages: Smooth short-term fluctuations
- Seasonal Decomposition: Separate trend, seasonality, and residuals
- Exponential Smoothing: Weight recent data more heavily
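Simple exponential smoothing, the last technique above, is a one-line recurrence: each smoothed point blends the newest observation with the previous smoothed value. A minimal sketch (single smoothing only; trend and seasonal variants add further equations):

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing.

    alpha in (0, 1]: higher alpha weights recent observations more heavily.
    The first smoothed value is initialized to the first observation.
    """
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed
```

With alpha = 0.5, the series [10, 20, 30] smooths to [10, 15.0, 22.5]: each point sits halfway between the new observation and the running estimate.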
Forecast Components
- Trend: Long-term direction
- Seasonality: Regular periodic patterns
- Cyclical: Longer-term economic cycles
- Residual/Error: Random variation
Confidence Intervals
- Show range of likely future values
- Typically 80% or 95% confidence
- Wider intervals indicate more uncertainty
Conditional Logic and Segmentation
Calculated Fields with Logic
IF-THEN-ELSE Logic
Structure:
IF condition THEN result
ELSE IF condition THEN result
ELSE default_result
END
Applications:
Categorization: Classify data into groups
- Example: IF profit > 0 THEN “Profitable” ELSE “Loss” END
Threshold Analysis: Flag values meeting criteria
- Example: IF sales > 100000 THEN “High” ELSE “Low” END
Multi-level Classification: Complex business rules
- Example: Tiered ratings derived from multiple business criteria
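The same IF/ELSEIF/ELSE structure translates directly into a general-purpose language. A Python sketch of a multi-level classification (the thresholds and function name are illustrative assumptions, not from any real rule set):

```python
def classify_order(profit, sales):
    """Multi-level classification mirroring nested IF/ELSEIF logic.

    Order of the branches matters: the first matching condition wins,
    just as in an IF ... ELSEIF ... ELSE ... END calculated field.
    """
    if profit <= 0:
        return "Loss"                # threshold analysis: flag losses first
    elif sales > 100000:
        return "High-value"          # illustrative threshold
    else:
        return "Profitable"
```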
CASE Statements
Structure:
CASE expression
WHEN value1 THEN result1
WHEN value2 THEN result2
ELSE default_result
END
Use When: Multiple discrete conditions with specific values
Segmentation Analysis
Cohort Analysis
- Group data by shared characteristics
- Track behavior over time
- Common cohorts: Acquisition date, first purchase month
ABC Analysis (Pareto Segmentation)
- A Items: Top 20% (generate 80% of value)
- B Items: Next 30% (generate 15% of value)
- C Items: Bottom 50% (generate 5% of value)
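ABC segmentation is a cumulative-share calculation: sort items by value descending and cut the running share at two thresholds. A Python sketch (the 80%/95% cutoffs match the tiers above but are conventions, not fixed rules):

```python
def abc_segments(values_by_item, a_cut=0.8, b_cut=0.95):
    """Label items A/B/C by cumulative share of total value (descending)."""
    total = sum(values_by_item.values())
    labels, running = {}, 0.0
    for item, value in sorted(values_by_item.items(), key=lambda kv: -kv[1]):
        running += value
        share = running / total
        labels[item] = "A" if share <= a_cut else "B" if share <= b_cut else "C"
    return labels
```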
RFM Analysis (for customer data)
- Recency: How recently customer purchased
- Frequency: How often they purchase
- Monetary: How much they spend
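A toy RFM score can simply award a point per dimension when the customer clears a threshold; real implementations usually bin each dimension into quintiles instead. A hedged Python sketch (function name and thresholds are illustrative assumptions):

```python
from datetime import date

def rfm_score(last_purchase, n_orders, total_spend, today,
              r_days=30, f_min=5, m_min=500):
    """Toy RFM scoring: 1 point each for recent, frequent, and high-spend.

    Thresholds are illustrative; production RFM typically scores each
    dimension 1-5 by quantile rather than pass/fail.
    """
    r = 1 if (today - last_purchase).days <= r_days else 0
    f = 1 if n_orders >= f_min else 0
    m = 1 if total_spend >= m_min else 0
    return r + f + m
```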
Data Joins and Relationships
Types of Joins
Inner Join
- Result: Only matching rows from both tables
- Use Case: When you need complete information from both sources
- Venn Diagram: Intersection of two sets
Left Join (Left Outer Join)
- Result: All rows from left table + matching rows from right
- Use Case: Preserve all primary data even without matches
- Handling: Unmatched right table values = NULL
Right Join (Right Outer Join)
- Result: All rows from right table + matching rows from left
- Use Case: When right table is primary focus
- Handling: Unmatched left table values = NULL
Full Outer Join
- Result: All rows from both tables
- Use Case: Need complete data from both sources
- Handling: Unmatched values from either side = NULL
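The inner/left distinction is easy to see on rows represented as dicts. A minimal Python sketch (assumes the join key is unique in the right table; duplicate right-side keys would multiply rows, as noted in the best practices below):

```python
def inner_join(left, right, key):
    """Keep only rows whose key appears in both tables (lists of dicts)."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def left_join(left, right, key):
    """Keep every left row; fill right-side columns with None when unmatched."""
    index = {row[key]: row for row in right}
    nulls = {k: None for row in right for k in row if k != key}
    return [{**l, **index.get(l[key], nulls)} for l in left]
```

A right join is a left join with the tables swapped, and a full outer join is a left join plus the unmatched right rows.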
Join Best Practices
- Join on indexed columns when possible
- Filter data before joining (reduce row count)
- Avoid joining on calculated fields
- Use appropriate join type to minimize data duplication
Data Quality
- Verify join keys are clean and consistent
- Check for duplicate keys (they multiply rows toward a cartesian product)
- Validate join results with spot checks
- Document join logic for maintenance
Unions vs. Joins
Unions (Append Operations)
- Purpose: Stack rows from multiple tables vertically
- Requirement: Same number and type of columns
- Use Cases:
- Combine monthly data files
- Merge historical and current data
- Aggregate similar datasets
Key Differences
| Aspect | Join | Union |
|---|---|---|
| Direction | Horizontal (adds columns) | Vertical (adds rows) |
| Result Size | Multiplicative | Additive |
| Matching | Requires common keys | Requires same structure |
| Use Case | Relate different entities | Combine similar entities |
Geographic Analysis
Geographic Data Types
Recognized Geographic Fields
- Country: Full names or ISO codes
- State/Province: Names or abbreviations
- City: City names
- Postal Code: ZIP codes
- Latitude/Longitude: Coordinates
Geographic Hierarchy
Country
└── State/Province
    └── City
        └── Postal Code
Map Visualization Types
Filled Maps (Choropleth)
- Regions colored by data values
- Best for: Comparing regions directly
- Considerations:
- Larger areas visually dominate
- Normalize by population/area when appropriate
Symbol Maps (Proportional Symbol)
- Symbols sized by data values
- Best for: Showing magnitude at specific locations
- Advantage: Symbol size directly represents value
Heat Maps (Density)
- Color intensity shows concentration
- Best for: Showing density or clustering
- Use: Customer locations, event concentrations
Flow Maps
- Lines show movement between locations
- Best for: Migration, trade routes, network flows
- Features: Line thickness represents volume
Spatial Analysis Techniques
Distance Calculations
- Calculate distances between points
- Find nearest locations
- Create distance-based filters
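Point-to-point distances on lat/lon data are usually computed with the haversine formula, which gives great-circle distance on a spherical Earth. A self-contained Python sketch:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

Finding the nearest location is then a `min()` over candidate points keyed by this distance; a distance-based filter keeps rows where the result falls under a threshold.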
Territory Analysis
- Aggregate data by custom regions
- Compare performance across territories
- Identify underserved areas
Time Intelligence
Time-Based Calculations
Period-to-Date Calculations
- Year-to-Date (YTD): Cumulative from year start
- Quarter-to-Date (QTD): Cumulative from quarter start
- Month-to-Date (MTD): Cumulative from month start
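All three period-to-date calculations are a running total that resets at the period boundary. A Python sketch of YTD (QTD and MTD only change the reset condition; input rows are assumed sorted by date):

```python
def year_to_date(rows):
    """Cumulative total that resets at each new year.

    rows: list of (year, month, value) tuples, sorted chronologically.
    """
    out, running, current_year = [], 0, None
    for year, month, value in rows:
        if year != current_year:            # period boundary: reset
            running, current_year = 0, year
        running += value
        out.append(running)
    return out
```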
Period Comparisons
- Year-over-Year (YoY): Compare to same period last year
- Quarter-over-Quarter (QoQ): Sequential quarter comparison
- Month-over-Month (MoM): Sequential month comparison
Moving Calculations
- Moving Average: Average over trailing N periods
- Moving Sum: Total over trailing N periods
- Exponential Moving Average: Weighted average favoring recent data
Date Hierarchy and Drill-Down
Standard Hierarchy
Year
└── Quarter
    └── Month
        └── Week
            └── Day
Fiscal Calendar
- Non-standard year start (e.g., fiscal year begins July 1)
- Custom quarters and periods
- 4-4-5 or 5-4-4 week accounting periods
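Mapping calendar dates onto a shifted fiscal year is a small modular-arithmetic exercise. A Python sketch for a fiscal year starting July 1 (labelling convention is an assumption: the fiscal year is named for the calendar year in which it ends, which is common but not universal):

```python
def fiscal_year_quarter(year, month, fy_start_month=7):
    """Map a calendar (year, month) to (fiscal year, fiscal quarter).

    Assumes the fiscal year is labelled by the calendar year it ends in.
    """
    offset = (month - fy_start_month) % 12   # months elapsed since FY start
    fy = year + 1 if month >= fy_start_month else year
    return fy, offset // 3 + 1
```

So with a July 1 start, July 2024 falls in FY2025 Q1 and June 2025 closes out FY2025 Q4.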
Advanced Segmentation
Sets and Groups
Static Sets
- Manually defined collections of members
- Fixed membership unless edited
- Use: Track specific customers, products, or regions
Dynamic Sets (Conditional Sets)
- Membership based on conditions
- Automatically update as data changes
- Examples:
- Top 10 customers by revenue
- Products with negative profit
- High-priority orders
Groups
- Combine multiple dimension members
- Create custom categories
- Simplify complex dimensions
Parameters
Types of Parameters
Numeric Parameters
- Integer or decimal values
- Range controls (min, max, step)
- Example: Top N selector (5, 10, 15, 20)
String Parameters
- Free text or predefined list
- Example: Select metric (Revenue, Profit, Quantity)
Date Parameters
- Single dates or date ranges
- Dynamic date options (Today, Yesterday, etc.)
Boolean Parameters
- True/False switches
- Example: Show details Yes/No
Parameter Applications
- Dynamic Metrics: Allow users to select measures
- What-If Analysis: Adjust assumptions
- Threshold Controls: Set alert levels
- Date Ranges: Custom time period selection
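Under the hood, a Top N parameter is just a user-supplied value plugged into a rank-and-filter step. A minimal Python sketch of the effect (function name is illustrative; ties at the cutoff are broken arbitrarily here, whereas a real tool would apply a defined tie rule):

```python
def top_n(sales_by_product, n):
    """Keep the n largest entries, as a Top N parameter would filter a view."""
    ranked = sorted(sales_by_product.items(), key=lambda kv: -kv[1])
    return dict(ranked[:n])
```

Changing the parameter value (5, 10, 15, ...) re-runs only this filter, which is what makes parameters cheap to expose to end users.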
Annotations and Context
Types of Annotations
Mark Annotations
- Attached to specific data points
- Contextual information about that value
- Updates position if data changes
Point Annotations
- Fixed position on visualization
- Highlight specific locations
- Manual placement
Area Annotations
- Cover regions of the visualization
- Explain patterns or groupings
- Call out sections
Best Practices for Annotations
- Keep text concise and clear
- Use to explain anomalies or key insights
- Don’t overcrowd the visualization
- Update or remove outdated annotations
- Use consistent formatting
Data Storytelling
Narrative Structure
The Data Story Arc
- Setup: Context and baseline
- Conflict: Problem or question
- Rising Action: Analysis and exploration
- Climax: Key insight or discovery
- Falling Action: Implications and recommendations
- Resolution: Action items or conclusions
Story Points
Sequential Storytelling
- Guide users through analysis step-by-step
- Build context gradually
- Progressive disclosure of complexity
- Each point reveals new information
Interactive Storytelling
- Allow user exploration within narrative
- Branching paths based on interests
- Drill-down capabilities at key points
- Filter and highlight to focus attention
Effective Data Communication
Key Principles
- Start with Questions: What does the audience need to know?
- Show, Don’t Tell: Let visualizations speak
- Highlight Insights: Direct attention to findings
- Provide Context: Baseline comparisons matter
- Recommend Actions: Data without action is just numbers
Common Pitfalls to Avoid
- Chart Junk: Excessive decoration distracts
- Misleading Scales: Can distort perceptions
- Overloading: Too much information at once
- Missing Context: Numbers without comparison
- Ignoring Baseline: Changes without reference points
Strategies for Large Datasets
Extract vs. Live Connections
- Extracts: Cached snapshots of the data, often pre-aggregated; typically faster
- Live: Real-time data, always current
- Choose based on data size and freshness requirements
Data Source Filters
- Filter at source before loading
- Reduces volume transferred
- Improves processing speed
Aggregation Strategy
- Pre-aggregate when possible
- Use appropriate granularity
- Avoid unnecessary detail levels
Loading Optimization
- Limit Initial Load: Show summary first
- Progressive Loading: Load critical elements first
- Pagination: Break large tables into pages
- Data Blending: Only blend when necessary
- Debouncing: Wait for user to finish input before updating
- Caching: Store frequently accessed calculations
- Simplification: Reduce complexity for real-time interactions
Key Takeaways
Calculations and Analysis
- Choose calculation type based on context needs (table vs. database)
- Use appropriate ranking for your analysis goals
- Apply statistical methods to understand distributions
- Leverage forecasting for predictive insights
Data Relationships
- Understand join types and their impact on results
- Use unions for similar data, joins for related data
- Validate join logic to prevent data duplication or loss
Geographic and Time Analysis
- Normalize map data for accurate comparison
- Use appropriate time hierarchies for your analysis
- Apply period comparisons to identify trends
Advanced Techniques
- Dynamic sets and parameters enable flexible analysis
- Annotations provide context and highlight insights
- Data storytelling transforms analysis into action
- Optimize early in the design process
- Balance interactivity with performance
- Test with realistic data volumes