Using Seaborn and Matplotlib for Data Exploration

Photo by Lukas on Pexels

Data visualization is important for many analytical tasks, including data summarization, exploratory data analysis and model output analysis. A good visualization is one of the easiest ways to communicate your findings to others. Fortunately, Python features many libraries that provide useful tools for gaining insights from data. The most well known of these, Matplotlib, enables users to generate visualizations such as histograms, scatterplots, bar charts, pie charts and more.

Seaborn is another useful visualization library that is built on top of Matplotlib. It provides data visualizations that are typically more aesthetic and statistically sophisticated. …
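To make the contrast concrete, here is a minimal sketch (with small synthetic data standing in for a real data set) showing a plain Matplotlib scatterplot next to Seaborn's `regplot`, which layers a regression fit on top of the same data with one call:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data standing in for a real data set
df = pd.DataFrame({
    "height": [160, 165, 170, 175, 180, 185, 190],
    "weight": [55, 60, 65, 70, 78, 85, 92],
})

# Plain Matplotlib: a scatterplot with labeled axes
fig, ax = plt.subplots()
ax.scatter(df["height"], df["weight"])
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")

# Seaborn: scatterplot plus a fitted regression line in one call
fig2, ax2 = plt.subplots()
sns.regplot(x="height", y="weight", data=df, ax=ax2)
```

The Seaborn version adds the statistical layer (the fit and its confidence band) for free, which is typical of how it builds on Matplotlib.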

Working with the Python Machine Learning Library

Photo by Burst on Pexels

Scikit-learn is a powerful machine learning library that provides a wide variety of modules for data access, data preparation and statistical model building. It has a good selection of clean toy datasets that are great for people just getting started with data analysis and machine learning. Easy access to these data sets removes the hassle of searching for and downloading files from an external data source. The library also enables data processing tasks such as imputation, data standardization and data normalization. These tasks can often lead to significant improvements in model performance.

Scikit-learn also provides a variety of packages for…
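As a brief illustration of the points above, the sketch below loads one of scikit-learn's built-in toy data sets (no external download required) and standardizes its features, one of the preprocessing steps mentioned:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load a built-in toy data set -- no searching or downloading needed
X, y = load_iris(return_X_y=True)

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

After fitting, each column of `X_scaled` has mean 0 and standard deviation 1, which many models benefit from.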

Understanding the Django Web Framework

Photo by Kevin Ku on Pexels

This post was originally published on the BuiltIn blog. The original piece can be found here.

Django is a high-level Python web framework that allows users to build web applications rapidly with minimal code. Django follows the model-template-view (MTV) architectural pattern, its take on the classic model-view-controller (MVC) design. This setup facilitates the development of complex, database-driven web apps. Through these design patterns, Django emphasizes the reusability of components. It also follows the don't-repeat-yourself (DRY) principle, which reduces repetition in software through abstraction and data normalization to avoid redundant code.

Django can be used in a variety of web applications, including customer relationship management systems…
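To give a flavor of the MTV pattern, here is a hypothetical sketch (the `Book` model, `book_list` view and template path are invented names for illustration, not from any real project) of how a model and a view divide the work:

```python
# models.py -- a hypothetical model; Django derives the database schema from it
from django.db import models

class Book(models.Model):
    title = models.CharField(max_length=200)
    published = models.DateField()

# views.py -- a view that queries the model and hands the results to a template
from django.shortcuts import render

def book_list(request):
    books = Book.objects.order_by("-published")
    return render(request, "books/list.html", {"books": books})
```

The model owns the data definition, the view owns the logic, and the template owns the presentation, which is where the reusability the framework emphasizes comes from.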

Understanding Function Caching in Python

Photo by Kaboompics .com on Pexels

Memoization is a method of storing the results of previous function calls to speed up future calculations. If a function is called repeatedly with the same arguments, we can return the stored result instead of repeating the calculation, which can speed things up significantly. In this post, we will use memoization to find factorials.

Let’s get started!

First, let’s define a recursive function that we can use to display the first factorials up to n. If you are unfamiliar with recursion, check out this article: Recursion in Python.

As a reminder, the factorial is defined for…
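One way to sketch the idea (this uses `functools.lru_cache`; a hand-rolled dictionary cache would work just as well) is a recursive factorial whose intermediate results are cached, so printing the first factorials up to n reuses earlier work:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def factorial(n):
    """Recursive factorial; lru_cache stores results of previous calls."""
    if n < 2:
        return 1
    return n * factorial(n - 1)

def first_factorials(n):
    # Each call past the first reuses cached intermediate results
    return [factorial(i) for i in range(1, n + 1)]
```

Because `factorial(5)` caches `factorial(4)`, `factorial(3)` and so on, computing the whole list costs little more than computing the last entry alone.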

Custom Python Classes for Generating Statistical Insights from Data

Photo by Max Fischer on Pexels

In computer programming, a class is a blueprint for a user-defined data type. Classes are defined in terms of attributes (data) and methods (functions). These data structures are a great way to organize data and methods such that they are easy to reuse and extend in the future. In this post, we will define a Python class that will allow us to generate simple summary statistics and perform some EDA on data.

Let’s get started!

For our purposes we will be working with the FIFA 19 data set, which can be found here.

To start, let’s import the pandas package:
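A minimal sketch of the kind of class this post builds might look like the following (the `DataSummary` class and its methods are hypothetical names, and a tiny synthetic DataFrame stands in for the FIFA 19 CSV):

```python
import pandas as pd

class DataSummary:
    """Hypothetical class bundling simple EDA helpers for a DataFrame."""

    def __init__(self, df):
        self.df = df

    def mean(self, column):
        return self.df[column].mean()

    def std(self, column):
        return self.df[column].std()

    def describe_column(self, column):
        # Package several summary statistics in one call
        return {"mean": self.mean(column), "std": self.std(column)}

# Synthetic stand-in for the FIFA 19 data
df = pd.DataFrame({"Age": [20, 25, 30, 35]})
summary = DataSummary(df)
```

Wrapping the DataFrame in a class like this makes the summary logic easy to reuse across columns and extend with new methods later.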

EDA and Sentiment Analysis of Reddit Data

Reddit WallStreetBets Posts is a data set available on the Kaggle website that contains posts from WallStreetBets, a subreddit used for discussing stock and option trading. WallStreetBets is most notable for its role in the GameStop short squeeze, which resulted in roughly $70 billion in losses on short positions in US firms. In this post we will explore the Reddit WallStreetBets Posts data in Python. The data was scraped using the Python Reddit API Wrapper (PRAW) in compliance with Reddit's rules around API usage. The data can be found here.

Let’s get started!

First, let’s read the data into a…
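The EDA side of the analysis can be sketched with Pandas alone; the tiny DataFrame below is a made-up stand-in for the real Kaggle CSV, with invented titles and scores:

```python
import pandas as pd

# Hypothetical stand-in for the Reddit WallStreetBets Posts CSV
df = pd.DataFrame({
    "title": ["GME to the moon", "Loss porn", "GME update", "Daily thread"],
    "score": [1200, 300, 950, 50],
})

# Simple EDA: title lengths and the highest-scoring posts
df["title_length"] = df["title"].str.len()
top_posts = df.sort_values("score", ascending=False).head(2)
```

With the real data, the same pattern (derived columns plus sorting and filtering) surfaces which posts and topics dominate the subreddit.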

A Short Survey of Healthcare Cost Data

Healthcare spending in the US continues to rapidly grow as the aging population and disease prevalence increase. A study published in the Journal of the American Medical Association (JAMA) reported that healthcare spending in the US rose by almost $1 trillion between 1996 and 2015.

Healthcare cost and quality are often opaque to consumers due to a lack of price and quality transparency. If consumers had access to quality healthcare information, they might have more agency in choosing their healthcare services. For example, another study published in the American Heart Journal found that, after adjusting for patient risk and length…

Millennium Prize Problem: Yang-Mills and Mass Gap

Photo by Pixabay on Pexels

The Millennium Prize problems are seven challenging problems in mathematics, each carrying a $1 million prize for a correct solution. In this post we will briefly discuss one of them, the Yang-Mills and Mass Gap problem.

All of the Millennium Prize problems are listed on the Clay Mathematics Institute's website here.

Yang-Mills and Mass Gap

The Yang-Mills theory describes elementary particles, which are particles with no substructure (quarks, leptons, Higgs boson), using algebraic objects. Specifically, non-abelian Lie groups are used to unify electromagnetic and weak forces. …
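For readers who want the standard textbook form, the theory's central objects can be written down compactly (this is the usual presentation, with gauge field $A^a_\mu$, coupling $g$, and structure constants $f^{abc}$ of the non-abelian Lie group):

```latex
% Yang-Mills field strength for a non-abelian gauge group:
F^{a}_{\mu\nu} = \partial_\mu A^a_\nu - \partial_\nu A^a_\mu + g f^{abc} A^b_\mu A^c_\nu

% Yang-Mills Lagrangian density:
\mathcal{L} = -\tfrac{1}{4}\, F^{a}_{\mu\nu} F^{a\,\mu\nu}
```

The $g f^{abc} A^b_\mu A^c_\nu$ term is what makes the theory non-abelian: the gauge fields interact with each other, unlike in electromagnetism.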

Building Machine Learning Models with Scikit-learn

Photo by Steve Johnson on Pexels

Scikit-learn is a powerful machine learning library in Python. It provides many tools for classification, regression and clustering tasks. In this post we will discuss some popular tools for building classification models using scikit-learn.

Let’s get started!

For our purposes we will be working with the Bank Churn Modeling data set. The data can be found here.

To start, let’s import the Pandas library, relax display limits and print the first five rows of data:

import pandas as pd

df = pd.read_csv("Bank_churn_modelling.csv")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print(df.head())
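From there, a classification model follows the usual scikit-learn pattern. The sketch below uses a synthetic data set from `make_classification` in place of the bank churn CSV, so it runs without the file:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the bank churn data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a baseline classifier and score it on held-out data
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
```

Swapping `LogisticRegression` for another estimator (random forests, gradient boosting and so on) leaves the fit/predict pattern unchanged, which is much of scikit-learn's appeal.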

Data selection, aggregation and statistics with Bank Churn Modeling data

Photo by Burst on Pexels

Pandas is a Python library used for wrangling data, generating statistics, aggregating data and much more. In this post we will discuss how to perform data selection, aggregation and statistical analysis using the Pandas library.

Let’s get started!

For our purposes we will be working with the Bank Churn Modeling data set. The data can be found here.

To start, let’s import the Pandas library, relax display limits and print the first five rows of data:

import pandas as pd

df = pd.read_csv("Bank_churn_modelling.csv")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print(df.head())
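The selection and aggregation operations the post covers can be sketched on a small synthetic DataFrame standing in for the bank churn data (the column names below mirror typical churn columns but the values are made up):

```python
import pandas as pd

# Synthetic stand-in for the bank churn data
df = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "Germany", "Spain"],
    "Balance": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Selection: rows where Balance exceeds a threshold
high_balance = df[df["Balance"] > 250]

# Aggregation: mean balance per country
mean_by_country = df.groupby("Geography")["Balance"].mean()
```

Boolean masks handle selection, and `groupby` followed by an aggregation method handles the statistics; the same two patterns cover most of the analysis in the post.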

Sadrach Pierre, Ph.D.

Data Scientist at WorldQuant Predictive. Writer for Built In & Towards Data Science. Cornell University Ph.D. in Chemical Physics.
