Machine Learning Guide

MLA 009 Charting and Visualization Tools for Data Science

25 min • 6 November 2018

This episode covers the Python charting libraries Matplotlib, Seaborn, and Bokeh, explaining their strengths from quick EDA to interactive, HTML-exported visualizations, and clarifies where D3.js fits as a JavaScript alternative for end-user applications. It also evaluates major software solutions like Tableau, Power BI, QlikView, and Excel, detailing how modern BI tools now integrate drag-and-drop analytics with embedded machine learning, potentially allowing business users to automate entire workflows without coding.

Core Phases in Data Science Visualization
  • Exploratory Data Analysis (EDA):
    • EDA occupies an early stage in the Business Intelligence (BI) pipeline, positioned just before or sometimes merged with the data cleaning (“munging”) phase.
    • The outputs of EDA (e.g., correlation matrices, histograms) often serve as inputs to subsequent machine learning steps.
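As a rough illustration of EDA output feeding a later modeling step, the sketch below uses a correlation matrix to flag near-duplicate features before training. The data is synthetic, the column names are hypothetical, and the 0.95 threshold is an arbitrary assumption rather than anything recommended in the episode.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: "a_copy" is almost identical to "a".
rng = np.random.default_rng(1)
a = rng.normal(size=300)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=300),
    "b": rng.normal(size=300),
})

# EDA output: absolute pairwise correlations between columns.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Assumed threshold: flag a column if it correlates above 0.95 with an earlier one.
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]

# Input to the next step: the feature table without the flagged columns.
features = df.drop(columns=redundant)
```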
Python Visualization Libraries
1. Matplotlib
  • The foundational plotting library in Python, supporting static, basic chart types.
  • Requires substantial boilerplate code for custom visualizations.
  • Serves as the core engine for many higher-level visualization tools.
  • Common pandas EDA plotting calls (such as df.hist() and df.plot.scatter()) depend on Matplotlib under the hood; df.corr() only computes the correlation matrix, which is then typically plotted.
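To give a sense of the boilerplate involved, here is a minimal sketch of a single styled histogram written directly against Matplotlib. The data is synthetic, and the Agg backend and in-memory buffer are assumptions made so the script runs without a display or file system access.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=500)

# Each styling decision is an explicit call.
fig, ax = plt.subplots(figsize=(6, 4))
counts, bin_edges, _ = ax.hist(values, bins=20, color="steelblue", edgecolor="white")
ax.set_title("Distribution of a synthetic feature")
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.grid(axis="y", alpha=0.3)
fig.tight_layout()

# Render to an in-memory PNG; pass a filename to savefig to write a file instead.
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=100)
```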
2. Pandas Plotting
  • Pandas integrates tightly with Matplotlib and exposes simple, one-line commands for common plots (e.g., df.hist(), df.plot.scatter()), alongside one-line summaries such as df.corr().
  • Designed to make quick EDA accessible without requiring detailed knowledge of Matplotlib’s verbose syntax.
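The one-liners mentioned above might look like the following sketch. The DataFrame is synthetic and its column names are hypothetical; the Agg backend is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import numpy as np
import pandas as pd

# Synthetic data for illustration only: "y" is a noisy linear function of "x".
rng = np.random.default_rng(42)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": 2.0 * x + rng.normal(scale=0.5, size=200),
    "z": rng.uniform(size=200),
})

corr = df.corr()                            # numeric summary, not a plot
hist_axes = df.hist(bins=15)                # one histogram per numeric column
scatter_ax = df.plot.scatter(x="x", y="y")  # returns a Matplotlib Axes
```

The returned objects are ordinary Matplotlib axes, so any further customization falls back to the Matplotlib API.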
3. Seaborn
  • A high-level wrapper around Matplotlib, analogous to how Keras wraps TensorFlow.
  • Sets sensible defaults for chart styles, fonts, colors, and sizes, improving aesthetics with minimal effort.
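  • Applying Seaborn’s theme (sns.set_theme(), or sns.set() in older releases) restyles all Matplotlib plots globally, even without using Seaborn’s plotting functions. Versions before 0.8 applied this styling automatically on import; later versions require the explicit call.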
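A minimal sketch of the global styling effect, assuming a Seaborn release that provides sns.set_theme() (0.11 or later). The plot itself uses only Matplotlib calls; the data points are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import seaborn as sns

# Record one Matplotlib default before the theme is applied.
default_face = plt.rcParams["axes.facecolor"]

# Apply Seaborn's default theme to Matplotlib's global settings.
sns.set_theme()
themed_face = plt.rcParams["axes.facecolor"]

# A plain Matplotlib plot created afterwards picks up the Seaborn styling.
fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
ax.set_title("Plain Matplotlib plot with the Seaborn theme")
```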
4. Bokeh
  • A powerful library for creating interactive, web-ready plots from Python.
  • Enables user interactions such as hovering, zooming, and panning within rendered plots.
  • Exports visualizations as standalone HTML files or can operate as a server-linked app for live data exploration.
  • Supports advanced features like cross-filtering, allowing dynamic slicing and dicing of data across multiple axes or columns.
  • More suited for creating reusable, interactive dashboards rather than quick, one-off EDA visuals.
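The standalone HTML export can be sketched as follows. The data points and titles are made up for illustration, and loading BokehJS from the CDN is an assumption; the resulting string can be written to a .html file and opened in a browser.

```python
from bokeh.embed import file_html
from bokeh.models import HoverTool
from bokeh.plotting import figure
from bokeh.resources import CDN

# Figure with pan, zoom, and reset tools enabled.
p = figure(title="Synthetic scatter", tools="pan,wheel_zoom,reset")

# Made-up data; Bokeh stores these lists in columns named "x" and "y".
p.scatter([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], size=10)

# Hover tooltips that read from those columns.
p.add_tools(HoverTool(tooltips=[("x", "@x"), ("y", "@y")]))

# Build a self-contained HTML document as a string.
html = file_html(p, resources=CDN, title="Interactive scatter")
```

For the server-linked mode mentioned above, the same figure would instead be served by a running Bokeh server, which this sketch does not cover.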
5. D3.js
  • Unlike the Python libraries above, D3.js is a JavaScript library for creating complex, highly customized data visualizations for web and mobile apps.
  • Used predominantly on the client-side to build interactive front-end graphics for end users, not as an EDA tool for analysts.
  • Common in production-grade web apps, but not typically part of a Python-based data science workflow.
Dedicated Visualization and BI Software
Tableau
  • Leading commercial drag-and-drop BI tool for data visualization and dashboarding.
  • Connects to diverse data sources (CSV, Excel, databases), auto-detects column types, and suggests default chart types.
  • Users can interactively build visualizations, cross-filter data, and switch chart types without coding.
Power BI
  • Microsoft’s BI suite, similar to Tableau, supporting end-to-end data analysis and visualization.
  • Integrates data preparation, visualization, and increasingly, built-in machine learning workflows.
  • Focused on empowering business users or analysts to run the BI pipeline without programming.
QlikView
  • QlikView is another major BI offering, emphasizing interactive dashboards and data exploration.
Excel
  • Still widely used for basic EDA and visualizations directly on spreadsheets.
  • Offers limited but accessible charting tools for histograms, scatter plots, and simple summary statistics.
  • Data often originates from Excel/CSV files before being ingested for further analysis in Python/pandas.
Trends & Insights
  • Workflow Integration: Modern BI tools are converging, adding both classic EDA capabilities and basic machine learning modeling, often through a code-free interface.
  • Automation Risks and Opportunities: As drag-and-drop BI tools increase in capabilities (including model training and selection), some data science coding work traditionally required for BI pipelines may become accessible to non-programmers.
  • Distinctions in Use:
    • Python libraries (Matplotlib, Seaborn, Bokeh) excel in automating and scripting EDA, report generation, and static analysis as part of data pipelines.
    • BI software (Tableau, Power BI, QlikView) shines for interactive exploration and democratized analytics, integrated from ingestion to reporting.
    • D3.js stands out for tailored, production-level, end-user app visualizations, rarely leveraged by data scientists for EDA.

Key Takeaways

  • For quick, code-based EDA: Use Pandas’ built-in plotters (wrapping Matplotlib).
  • For pre-styled, pretty plots: Use Seaborn, either through its plotting functions or by applying its theme to plain Matplotlib plots.
  • For interactive, shareable dashboards: Use Bokeh for Python or BI tools for no-code operation.
  • For enterprise, end-user-facing dashboards: Choose BI software like Tableau or build custom apps using D3.js for total control.