Using Streamlit to Build an ML-based Web Application

Photo by Negative Space on Pexels

Companies have great interest in clearly communicating their ML-based predictive analytics to their clients. No matter how accurate a model is, clients want to know how machine learning models make predictions from data. For example, if a subscription-based company is interested in finding customers who are at high risk of canceling their subscriptions, they can use their historical customer data to predict the likelihood of someone leaving.

From there, they would want to analyze the factors that drive this event. By understanding the driving factors, they can take actions like targeted promotions or discounts to prevent the customer from leaving…

Understanding Classification Performance Metrics

Photo by Pixabay on Pexels

Machine learning classification is a type of supervised learning in which an algorithm maps a set of inputs to a discrete output. Classification models have a wide range of applications across disparate industries and are one of the mainstays of supervised learning. This is because, across industries, many analytical questions can be framed in terms of mapping inputs to a discrete set of outputs. The simplicity of defining a classification problem makes classification models versatile and industry agnostic.

An important part of building classification models is evaluating model performance. In short, data scientists need a reliable way to test approximately how…

Using Pandas to Extract Information from Text

Photo by Pixabay on Pexels

Raw text data often comes in a form that is difficult to use directly for analysis, so it usually requires text processing first. Text processing is the practice of automating the generation and manipulation of text. It can be used for many data manipulation tasks including feature engineering from text, data wrangling, web scraping, search engines and much more. In this tutorial, we’ll take a look at how to use the Pandas library to perform some important data wrangling tasks.

Data wrangling is the process of gathering and transforming data to address an analytical question. It’s also often the most…

Using Seaborn and Matplotlib for Data Exploration

Photo by Lukas on Pexels

Data visualization is important for many analytical tasks including data summarization, exploratory data analysis and model output analysis. One of the easiest ways to communicate your findings with other people is through a good visualization. Fortunately, Python features many libraries that provide useful tools for gaining insights from data. The most well-known of these, Matplotlib, enables users to generate visualizations like histograms, scatterplots, bar charts, pie charts and much more.

Seaborn is another useful visualization library that is built on top of Matplotlib. It provides data visualizations that are typically more aesthetic and statistically sophisticated. …

Working with the Python Machine Learning Library

Photo by Burst on Pexels

Scikit-learn is a powerful machine learning library that provides a wide variety of modules for data access, data preparation and statistical model building. It has a good selection of clean toy data sets that are great for people just getting started with data analysis and machine learning. Easy access to these data sets removes the hassle of searching for and downloading files from an external data source. The library also supports data processing tasks such as imputation, data standardization and data normalization. These tasks can often lead to significant improvements in model performance.

Scikit-learn also provides a variety of packages for…

Understanding the Django Web Framework

Photo by Kevin Ku on Pexels

This post was originally published on the Built In blog. The original piece can be found here.

Django is a high-level Python web framework that lets users build web applications rapidly with minimal code. The Django framework follows the model-template-view (MTV) design pattern, a variant of the model-view-controller (MVC) pattern. This setup facilitates easy development of complex database-driven web apps. Through these design patterns, Django emphasizes reusability of components. It also follows the don’t-repeat-yourself (DRY) principle, which reduces repetition in software through abstraction and data normalization.

Django can be used in a variety of web applications, including customer relationship management systems…

Understanding Function Caching in Python

Photo by Kaboompics .com on Pexels

Memoization is a method used to store the results of previous function calls to speed up future calculations. If repeated function calls are made with the same parameters, we can reuse the stored values instead of repeating unnecessary calculations. This can significantly speed up calculations. In this post, we will use memoization to find factorials.

Let’s get started!

First, let’s define a recursive function that we can use to display the factorials of the integers up to n. If you are unfamiliar with recursion, check out this article: Recursion in Python.
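As a rough illustration of the idea, here is a minimal sketch that puts a plain recursive factorial next to a memoized one. It assumes the standard library's functools.lru_cache as the caching mechanism, which may differ from the approach the full post takes, and the names factorial, factorial_memo and first_factorials are hypothetical.

```python
from functools import lru_cache


def factorial(n: int) -> int:
    """Plain recursive factorial: every call recomputes the whole chain."""
    if n < 0:
        raise ValueError("factorial is not defined for negative integers")
    return 1 if n <= 1 else n * factorial(n - 1)


@lru_cache(maxsize=None)  # assumption: cache every result, with no size limit
def factorial_memo(n: int) -> int:
    """Same recursion, but each result is stored after it is first computed."""
    if n < 0:
        raise ValueError("factorial is not defined for negative integers")
    return 1 if n <= 1 else n * factorial_memo(n - 1)


def first_factorials(n: int) -> list:
    """Return [1!, 2!, ..., n!], reusing cached values between calls."""
    return [factorial_memo(i) for i in range(1, n + 1)]


print(first_factorials(5))  # [1, 2, 6, 24, 120]
```

With the cache in place, computing i! only needs one new multiplication once (i - 1)! has been stored, so listing the factorials up to n avoids recomputing each chain from scratch.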

As a reminder, the factorial is defined for…

Custom Python Classes for Generating Statistical Insights from Data

Photo by Max Fischer on Pexels

In computer programming, a class is a blueprint for a user-defined data type. Classes are defined in terms of attributes (data) and methods (functions). These data structures are a great way to organize data and methods such that they are easy to reuse and extend in the future. In this post, we will define a Python class that will allow us to generate simple summary statistics and perform some exploratory data analysis (EDA) on data.
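To make the attribute and method split concrete, here is a minimal sketch of such a class. It is not the class built in the full post: it uses the standard library's statistics module instead of pandas, the class name SummaryStats and its methods are hypothetical, and the sample values are made up for illustration rather than taken from the FIFA 19 data set.

```python
import statistics


class SummaryStats:
    """Hypothetical helper: holds one numeric column and reports summary statistics."""

    def __init__(self, name, values):
        # Attributes (data): a label for the column and its numeric values.
        self.name = name
        self.values = [float(v) for v in values]
        if not self.values:
            raise ValueError("values must contain at least one number")

    # Methods (functions) that operate on the stored data.
    def mean(self):
        return statistics.mean(self.values)

    def median(self):
        return statistics.median(self.values)

    def std(self):
        # The sample standard deviation needs at least two points.
        return statistics.stdev(self.values) if len(self.values) > 1 else 0.0

    def describe(self):
        """Collect the statistics in one dictionary for quick inspection."""
        return {
            "name": self.name,
            "count": len(self.values),
            "mean": self.mean(),
            "median": self.median(),
            "std": self.std(),
            "min": min(self.values),
            "max": max(self.values),
        }


# Illustrative values only, not real data.
ages = SummaryStats("age", [21, 25, 30, 34])
print(ages.describe())
```

Keeping the statistics as methods means a new summary, such as a quantile, can be added in one place and reused for every column.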

Let’s get started!

For our purposes we will be working with the FIFA 19 data set which can be found here.

To start, let’s import the pandas package:

EDA and Sentiment Analysis of Reddit Data

Reddit WallStreetBets Posts is a data set available on the Kaggle website that contains posts from WallStreetBets, a subreddit used for discussing stock and option trading. WallStreetBets is most notable for its role in the GameStop short squeeze, which resulted in $70 billion in losses on short positions in US firms. In this post, we will explore the Reddit WallStreetBets Posts data set in Python. The data was scraped using the Python Reddit API Wrapper (PRAW) in compliance with Reddit’s rules around API usage. The data can be found here.

Let’s get started!

First, let’s read the data into a…

A Short Survey of Healthcare Cost Data

Healthcare spending in the US continues to grow rapidly as the population ages and disease prevalence increases. A study published in the Journal of the American Medical Association (JAMA) reported that healthcare spending in the US rose by almost $1 trillion between 1996 and 2015.

Healthcare cost and quality are often opaque to consumers because of a lack of transparency. If consumers had access to quality healthcare information, they might have more agency in choosing their healthcare services. For example, another study published in the American Heart Journal found that, after adjusting for patient risk and length…

Sadrach Pierre, Ph.D.

Data Scientist at WorldQuant Predictive. Writer for Built In & Towards Data Science. Cornell University Ph.D. in Chemical Physics.
