Advanced Data Analysis
Table Calculations vs. Database Calculations
Overview
Understanding when to perform calculations in the visualization layer versus the database layer is crucial for performance and accuracy.
Table Calculations
- Processing Location: Performed locally in the visualization tool
- Timing: After aggregations have occurred
- Pros:
- Respond to filters and context changes
- More flexible and dynamic
- Can be adjusted without modifying source data
- Cons:
- May be slower with large datasets
- Limited by local processing power
Common Table Calculations:
- Running totals
- Percent of total
- Difference from previous
- Moving averages
- Rank and percentile
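The first three of these can be sketched in a few lines. This is a minimal, illustrative Python version (the function names are mine, not any tool's API); visualization tools apply the same logic over the rows of the rendered table.

```python
def running_total(values):
    """Cumulative sum at each position."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out

def percent_of_total(values):
    """Each value's share of the overall sum."""
    total = sum(values)
    return [v / total for v in values]

def moving_average(values, window):
    """Trailing moving average; windows are shorter at the start of the series."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Because these run over the already-aggregated rows, any filter that changes which rows appear also changes the results, which is exactly the "responds to filters" behavior described above.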
Database Calculations
- Processing Location: Performed in the source database
- Timing: As part of the query to the data source
- Pros:
- Faster for large datasets
- Leverages database optimization
- Consistent results across different views
- Cons:
- Fixed results that ignore filters applied after the query runs
- Require database query modification
When to Use Each:
- Use table calculations when results need to adapt to filters and user interactions
- Use database calculations when dealing with large datasets or when filter context should not affect the calculation
Ranking Functions
Types of Rank Calculations
Standard Rank (RANK)
- Behavior: Identical values receive identical rank
- Gap Handling: Skips ranks after ties
- Example: Values [100, 90, 90, 80] → Ranks [1, 2, 2, 4]
Dense Rank (RANK_DENSE)
- Behavior: Identical values receive identical rank
- Gap Handling: No gaps in ranking sequence
- Example: Values [100, 90, 90, 80] → Ranks [1, 2, 2, 3]
Modified Rank (RANK_MODIFIED)
- Behavior: Identical values receive identical rank
- Tie Breaking: Ties share the highest-numbered of the tied positions
- Example: Values [100, 90, 90, 80] → Ranks [1, 3, 3, 4]
Unique Rank (RANK_UNIQUE)
- Behavior: Every value receives unique rank
- Tie Breaking: Based on additional sort criteria (e.g., alphabetical)
- Example: Values [100, 90, 90, 80] → Ranks [1, 2, 3, 4]
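The four tie rules differ only in how tied positions are assigned. A compact Python sketch (mirroring the behavior of functions like Tableau's RANK family, not their actual implementation; for "unique", ties here fall back to input order rather than a named sort field):

```python
def ranks(values, method="standard"):
    """Rank a list (highest value = rank 1) under the four common tie rules."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for pos, i in enumerate(order):
        if method == "unique":
            result[i] = pos + 1                      # every row gets its own rank
        elif pos > 0 and values[i] == values[order[pos - 1]]:
            result[i] = result[order[pos - 1]]       # share the tied rank
        else:
            result[i] = pos + 1
    if method == "dense":
        distinct = sorted(set(values), reverse=True)
        result = [distinct.index(v) + 1 for v in values]   # no gaps after ties
    elif method == "modified":
        # ties take the last (highest-numbered) of the tied positions
        result = [sum(1 for w in values if w >= v) for v in values]
    return result
```

Running it on [100, 90, 90, 80] reproduces the four example outputs above.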
Ranking Applications
- Top N Analysis: Identify top performers
- Competitive Analysis: Compare entities directly
- Percentile Ranking: Relative performance assessment
Statistical Analysis Techniques
Reference Lines and Bands
Reference Lines
- Purpose: Mark specific values on visualizations
- Types:
- Constant: Fixed value (e.g., target = 100)
- Average: Mean of the data
- Median: Middle value
- Minimum/Maximum: Range boundaries
- Percentile: Specific percentile value
Reference Bands
- Purpose: Highlight ranges of values
- Configuration:
- Per Cell: Band for each individual data point
- Per Pane: Band for each category/group
- Per Table: Single band across entire visualization
- Usage:
- Show acceptable ranges
- Highlight variance zones
- Create bullet charts
Distribution Analysis
Box Plots (Box and Whisker)
Components:
- Box: Interquartile range (25th to 75th percentile)
- Median Line: 50th percentile
- Whiskers: Typically extend to the most extreme points within 1.5 × IQR of the box
- Outliers: Points beyond whiskers
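The box-plot components reduce to a handful of summary statistics. A small Python sketch using the standard library (Tukey's 1.5 × IQR fence is one common outlier convention; tools may use others):

```python
import statistics

def box_plot_stats(values):
    """Quartiles, IQR, and Tukey outlier fences for a sample."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)   # 25th, 50th, 75th pct
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr            # whisker fences
    outliers = [v for v in data if v < lo or v > hi]
    return {"q1": q1, "median": median, "q3": q3,
            "iqr": iqr, "outliers": outliers}
```

Note that `statistics.quantiles` defaults to the "exclusive" interpolation method, so exact quartile values can differ slightly from other tools.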
Histogram Analysis
Key Insights:
- Central Tendency: Where data clusters
- Spread: Width of distribution
- Skewness: Asymmetry in distribution
- Outliers: Unusual values
- Bimodal/Multimodal: Multiple peaks
Forecasting and Trend Analysis
Trend Lines
Types of Trend Models
Linear: Constant rate of change
- Formula: y = mx + b
- Use: Steady growth or decline
Exponential: Accelerating growth/decay
- Formula: y = ae^(bx)
- Use: Viral growth, compound interest
Logarithmic: Rapid initial change, then leveling
- Formula: y = a + b·ln(x)
- Use: Diminishing returns
Polynomial: Multiple inflection points
- Formula: y = a + bx + cx² + …
- Use: Cyclical patterns, complex relationships
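For the linear model, the slope m and intercept b come from an ordinary least-squares fit. A minimal Python sketch of the closed-form solution (the exponential, logarithmic, and polynomial models are usually fitted by transforming the data or with a numerical library):

```python
def linear_trend(xs, ys):
    """Least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b
```

Fitting perfectly linear data such as y = 2x + 1 recovers m = 2 and b = 1 exactly (up to floating-point error).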
Forecasting Techniques
Time Series Forecasting
- Moving Averages: Smooth short-term fluctuations
- Seasonal Decomposition: Separate trend, seasonality, and residuals
- Exponential Smoothing: Weight recent data more heavily
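Simple exponential smoothing, the last technique above, is a one-line recurrence: each smoothed point blends the newest observation with the previous smoothed value. A minimal sketch (single smoothing only; trend and seasonal variants add further equations):

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing.

    alpha in (0, 1]: higher alpha weights recent observations more heavily.
    The first smoothed value is initialized to the first observation.
    """
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed
```

With alpha = 0.5, the series [10, 20, 30] smooths to [10, 15.0, 22.5]: each point sits halfway between the new observation and the running estimate.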
Forecast Components
- Trend: Long-term direction
- Seasonality: Regular periodic patterns
- Cyclical: Longer-term economic cycles
- Residual/Error: Random variation
Confidence Intervals
- Show range of likely future values
- Typically 80% or 95% confidence
- Wider intervals indicate more uncertainty
Conditional Logic and Segmentation
Calculated Fields with Logic
IF-THEN-ELSE Logic
Structure:
IF condition THEN result
ELSE IF condition THEN result
ELSE default_result
END
Applications:
Categorization: Classify data into groups
- Example: IF profit > 0 THEN “Profitable” ELSE “Loss” END
Threshold Analysis: Flag values meeting criteria
- Example: IF sales > 100000 THEN “High” ELSE “Low” END
Multi-level Classification: Complex business rules
- Example: Tiered ratings derived from multiple business criteria
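The same IF/ELSEIF/ELSE structure translates directly into a general-purpose language. A Python sketch of a multi-level classification (the thresholds and function name are illustrative assumptions, not from any real rule set):

```python
def classify_order(profit, sales):
    """Multi-level classification mirroring nested IF/ELSEIF logic.

    Order of the branches matters: the first matching condition wins,
    just as in an IF ... ELSEIF ... ELSE ... END calculated field.
    """
    if profit <= 0:
        return "Loss"                # threshold analysis: flag losses first
    elif sales > 100000:
        return "High-value"          # illustrative threshold
    else:
        return "Profitable"
```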
CASE Statements
Structure:
CASE expression
WHEN value1 THEN result1
WHEN value2 THEN result2
ELSE default_result
END
Use When: Multiple discrete conditions with specific values
Segmentation Analysis
Cohort Analysis
- Group data by shared characteristics
- Track behavior over time
- Common cohorts: Acquisition date, first purchase month
ABC Analysis (Pareto Segmentation)
- A Items: Top 20% (generate 80% of value)
- B Items: Next 30% (generate 15% of value)
- C Items: Bottom 50% (generate 5% of value)
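ABC segmentation is a cumulative-share calculation: sort items by value descending and cut the running share at two thresholds. A Python sketch (the 80%/95% cutoffs match the tiers above but are conventions, not fixed rules):

```python
def abc_segments(values_by_item, a_cut=0.8, b_cut=0.95):
    """Label items A/B/C by cumulative share of total value (descending)."""
    total = sum(values_by_item.values())
    labels, running = {}, 0.0
    for item, value in sorted(values_by_item.items(), key=lambda kv: -kv[1]):
        running += value
        share = running / total
        labels[item] = "A" if share <= a_cut else "B" if share <= b_cut else "C"
    return labels
```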
RFM Analysis (for customer data)
- Recency: How recently customer purchased
- Frequency: How often they purchase
- Monetary: How much they spend
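A toy RFM score can simply award a point per dimension when the customer clears a threshold; real implementations usually bin each dimension into quintiles instead. A hedged Python sketch (function name and thresholds are illustrative assumptions):

```python
from datetime import date

def rfm_score(last_purchase, n_orders, total_spend, today,
              r_days=30, f_min=5, m_min=500):
    """Toy RFM scoring: 1 point each for recent, frequent, and high-spend.

    Thresholds are illustrative; production RFM typically scores each
    dimension 1-5 by quantile rather than pass/fail.
    """
    r = 1 if (today - last_purchase).days <= r_days else 0
    f = 1 if n_orders >= f_min else 0
    m = 1 if total_spend >= m_min else 0
    return r + f + m
```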
Data Joins and Relationships
Types of Joins
Inner Join
- Result: Only matching rows from both tables
- Use Case: When you need complete information from both sources
- Venn Diagram: Intersection of two sets
Left Join (Left Outer Join)
- Result: All rows from left table + matching rows from right
- Use Case: Preserve all primary data even without matches
- Handling: Unmatched right table values = NULL
Right Join (Right Outer Join)
- Result: All rows from right table + matching rows from left
- Use Case: When right table is primary focus
- Handling: Unmatched left table values = NULL
Full Outer Join
- Result: All rows from both tables
- Use Case: Need complete data from both sources
- Handling: Unmatched values from either side = NULL
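The inner/left distinction is easy to see on rows represented as dicts. A minimal Python sketch (assumes the join key is unique in the right table; duplicate right-side keys would multiply rows, as noted in the best practices below):

```python
def inner_join(left, right, key):
    """Keep only rows whose key appears in both tables (lists of dicts)."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def left_join(left, right, key):
    """Keep every left row; fill right-side columns with None when unmatched."""
    index = {row[key]: row for row in right}
    nulls = {k: None for row in right for k in row if k != key}
    return [{**l, **index.get(l[key], nulls)} for l in left]
```

A right join is a left join with the tables swapped, and a full outer join is a left join plus the unmatched right rows.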
Join Best Practices
- Join on indexed columns when possible
- Filter data before joining (reduce row count)
- Avoid joining on calculated fields
- Use appropriate join type to minimize data duplication
Data Quality
- Verify join keys are clean and consistent
- Check for duplicate keys (they multiply rows toward a cartesian product)
- Validate join results with spot checks
- Document join logic for maintenance
Unions vs. Joins
Unions (Append Operations)
- Purpose: Stack rows from multiple tables vertically
- Requirement: Same number and type of columns
- Use Cases:
- Combine monthly data files
- Merge historical and current data
- Aggregate similar datasets
Key Differences
| Aspect | Join | Union |
|---|---|---|
| Direction | Horizontal (adds columns) | Vertical (adds rows) |
| Result Size | Multiplicative | Additive |
| Matching | Requires common keys | Requires same structure |
| Use Case | Relate different entities | Combine similar entities |
Geographic Analysis
Geographic Data Types
Recognized Geographic Fields
- Country: Full names or ISO codes
- State/Province: Names or abbreviations
- City: City names
- Postal Code: ZIP codes
- Latitude/Longitude: Coordinates
Geographic Hierarchy
Country
└── State/Province
    └── City
        └── Postal Code
Map Visualization Types
Filled Maps (Choropleth)
- Regions colored by data values
- Best for: Comparing regions directly
- Considerations:
- Larger areas visually dominate
- Normalize by population/area when appropriate
Symbol Maps (Proportional Symbol)
- Symbols sized by data values
- Best for: Showing magnitude at specific locations
- Advantage: Symbol size directly represents value
Heat Maps (Density)
- Color intensity shows concentration
- Best for: Showing density or clustering
- Use: Customer locations, event concentrations
Flow Maps
- Lines show movement between locations
- Best for: Migration, trade routes, network flows
- Features: Line thickness represents volume
Spatial Analysis Techniques
Distance Calculations
- Calculate distances between points
- Find nearest locations
- Create distance-based filters
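Point-to-point distances on lat/lon data are usually computed with the haversine formula, which gives great-circle distance on a spherical Earth. A self-contained Python sketch:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

Finding the nearest location is then a `min()` over candidate points keyed by this distance; a distance-based filter keeps rows where the result falls under a threshold.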
Territory Analysis
- Aggregate data by custom regions
- Compare performance across territories
- Identify underserved areas
Time Intelligence
Time-Based Calculations
Period-to-Date Calculations
- Year-to-Date (YTD): Cumulative from year start
- Quarter-to-Date (QTD): Cumulative from quarter start
- Month-to-Date (MTD): Cumulative from month start
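All three period-to-date calculations are a running total that resets at the period boundary. A Python sketch of YTD (QTD and MTD only change the reset condition; input rows are assumed sorted by date):

```python
def year_to_date(rows):
    """Cumulative total that resets at each new year.

    rows: list of (year, month, value) tuples, sorted chronologically.
    """
    out, running, current_year = [], 0, None
    for year, month, value in rows:
        if year != current_year:            # period boundary: reset
            running, current_year = 0, year
        running += value
        out.append(running)
    return out
```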
Period Comparisons
- Year-over-Year (YoY): Compare to same period last year
- Quarter-over-Quarter (QoQ): Sequential quarter comparison
- Month-over-Month (MoM): Sequential month comparison
Moving Calculations
- Moving Average: Average over trailing N periods
- Moving Sum: Total over trailing N periods
- Exponential Moving Average: Weighted average favoring recent data
Date Hierarchy and Drill-Down
Standard Hierarchy
Year
└── Quarter
    └── Month
        └── Week
            └── Day
Fiscal Calendar
- Non-standard year start (e.g., fiscal year begins July 1)
- Custom quarters and periods
- 4-4-5 or 5-4-4 week accounting periods
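Mapping calendar dates onto a shifted fiscal year is a small modular-arithmetic exercise. A Python sketch for a fiscal year starting July 1 (labelling convention is an assumption: the fiscal year is named for the calendar year in which it ends, which is common but not universal):

```python
def fiscal_year_quarter(year, month, fy_start_month=7):
    """Map a calendar (year, month) to (fiscal year, fiscal quarter).

    Assumes the fiscal year is labelled by the calendar year it ends in.
    """
    offset = (month - fy_start_month) % 12   # months elapsed since FY start
    fy = year + 1 if month >= fy_start_month else year
    return fy, offset // 3 + 1
```

So with a July 1 start, July 2024 falls in FY2025 Q1 and June 2025 closes out FY2025 Q4.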
Advanced Segmentation
Sets and Groups
Static Sets
- Manually defined collections of members
- Fixed membership unless edited
- Use: Track specific customers, products, or regions
Dynamic Sets (Conditional Sets)
- Membership based on conditions
- Automatically update as data changes
- Examples:
- Top 10 customers by revenue
- Products with negative profit
- High-priority orders
Groups
- Combine multiple dimension members
- Create custom categories
- Simplify complex dimensions
Parameters
Types of Parameters
Numeric Parameters
- Integer or decimal values
- Range controls (min, max, step)
- Example: Top N selector (5, 10, 15, 20)
String Parameters
- Free text or predefined list
- Example: Select metric (Revenue, Profit, Quantity)
Date Parameters
- Single dates or date ranges
- Dynamic date options (Today, Yesterday, etc.)
Boolean Parameters
- True/False switches
- Example: Show details Yes/No
Parameter Applications
- Dynamic Metrics: Allow users to select measures
- What-If Analysis: Adjust assumptions
- Threshold Controls: Set alert levels
- Date Ranges: Custom time period selection
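Under the hood, a Top N parameter is just a user-supplied value plugged into a rank-and-filter step. A minimal Python sketch of the effect (function name is illustrative; ties at the cutoff are broken arbitrarily here, whereas a real tool would apply a defined tie rule):

```python
def top_n(sales_by_product, n):
    """Keep the n largest entries, as a Top N parameter would filter a view."""
    ranked = sorted(sales_by_product.items(), key=lambda kv: -kv[1])
    return dict(ranked[:n])
```

Changing the parameter value (5, 10, 15, ...) re-runs only this filter, which is what makes parameters cheap to expose to end users.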
Annotations and Context
Types of Annotations
Mark Annotations
- Attached to specific data points
- Contextual information about that value
- Updates position if data changes
Point Annotations
- Fixed position on visualization
- Highlight specific locations
- Manual placement
Area Annotations
- Cover regions of the visualization
- Explain patterns or groupings
- Call out sections
Best Practices for Annotations
- Keep text concise and clear
- Use to explain anomalies or key insights
- Don’t overcrowd the visualization
- Update or remove outdated annotations
- Use consistent formatting
Data Storytelling
Narrative Structure
The Data Story Arc
- Setup: Context and baseline
- Conflict: Problem or question
- Rising Action: Analysis and exploration
- Climax: Key insight or discovery
- Falling Action: Implications and recommendations
- Resolution: Action items or conclusions
Story Points
Sequential Storytelling
- Guide users through analysis step-by-step
- Build context gradually
- Progressive disclosure of complexity
- Each point reveals new information
Interactive Storytelling
- Allow user exploration within narrative
- Branching paths based on interests
- Drill-down capabilities at key points
- Filter and highlight to focus attention
Effective Data Communication
Key Principles
- Start with Questions: What does the audience need to know?
- Show, Don’t Tell: Let visualizations speak
- Highlight Insights: Direct attention to findings
- Provide Context: Baseline comparisons matter
- Recommend Actions: Data without action is just numbers
Common Pitfalls to Avoid
- Chart Junk: Excessive decoration distracts
- Misleading Scales: Can distort perceptions
- Overloading: Too much information at once
- Missing Context: Numbers without comparison
- Ignoring Baseline: Changes without reference points
Strategies for Large Datasets
Extract vs. Live Connections
- Extracts: Cached snapshots of the data, often pre-aggregated; typically faster
- Live: Real-time data, always current
- Choose based on data size and freshness requirements
Data Source Filters
- Filter at source before loading
- Reduces volume transferred
- Improves processing speed
Aggregation Strategy
- Pre-aggregate when possible
- Use appropriate granularity
- Avoid unnecessary detail levels
Loading Optimization
- Limit Initial Load: Show summary first
- Progressive Loading: Load critical elements first
- Pagination: Break large tables into pages
- Data Blending: Only blend when necessary
- Debouncing: Wait for user to finish input before updating
- Caching: Store frequently accessed calculations
- Simplification: Reduce complexity for real-time interactions
Key Takeaways
Calculations and Analysis
- Choose calculation type based on context needs (table vs. database)
- Use appropriate ranking for your analysis goals
- Apply statistical methods to understand distributions
- Leverage forecasting for predictive insights
Data Relationships
- Understand join types and their impact on results
- Use unions for similar data, joins for related data
- Validate join logic to prevent data duplication or loss
Geographic and Time Analysis
- Normalize map data for accurate comparison
- Use appropriate time hierarchies for your analysis
- Apply period comparisons to identify trends
Advanced Techniques
- Dynamic sets and parameters enable flexible analysis
- Annotations provide context and highlight insights
- Data storytelling transforms analysis into action
- Optimize early in the design process
- Balance interactivity with performance
- Test with realistic data volumes