Exploring Financial Data in Python

Photo by energepic.com on Pexels

Exploratory data analysis (EDA) is an important part of every data scientist’s workflow. EDA allows data scientists to summarize the most important characteristics of the data they’re working with. In the case of financial data analysis, this includes generating simple summary statistics such as the average and standard deviation of returns, visualizing relationships between stocks through correlation heatmaps, and generating stock price time series plots, boxplots, and more.

Let’s consider the analysis of three stocks: Amazon (AMZN), Google (GOOGL) and Apple (AAPL). We’ll see how to perform simple exploratory data analysis of these stocks by generating summary statistics and visualizations, risk…
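The kind of EDA described above can be sketched in a few lines of pandas. Since no data source is bundled with this teaser, the example below uses synthetic prices for the three tickers rather than real market data:

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for the three tickers (synthetic, not real quotes)
rng = np.random.default_rng(42)
dates = pd.date_range("2021-01-01", periods=250, freq="B")
prices = pd.DataFrame(
    {t: 100 * np.cumprod(1 + rng.normal(0.0005, 0.01, len(dates)))
     for t in ["AMZN", "GOOGL", "AAPL"]},
    index=dates,
)

# Daily percentage returns
returns = prices.pct_change().dropna()

# Simple summary statistics: average return and volatility (std of returns)
summary = returns.agg(["mean", "std"])

# Pairwise correlation between the stocks' returns (input to a heatmap)
corr = returns.corr()
print(summary)
print(corr)
```

With real data, `prices` would instead be loaded from an API or CSV; everything after that line is unchanged.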


Understanding model testing, feature selection, and model tuning

Photo by Lukas on Pexels

Building stable, accurate and interpretable machine learning models is an important task for many companies across industries. Machine learning model predictions have to remain stable over time as the underlying training data is updated. Drastic changes in data due to unforeseen events can lead to significant deterioration in model performance. Model hyperparameter tuning can help make necessary changes to machine learning models that account for statistical changes in data over time. It is also important to understand the various ways of testing your models depending on how much data you have and, consequently, the stability of your model predictions. Further…
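Hyperparameter tuning combined with cross-validated testing, as described above, might be sketched like this (using a built-in scikit-learn dataset as a stand-in for a company's own data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Tune the regularization strength C with 5-fold cross-validation
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

print("best C:", grid.best_params_["clf__C"])
print("held-out accuracy:", grid.score(X_test, y_test))
```

Re-running a search like this when the training data is refreshed is one simple way to keep a model's hyperparameters in step with statistical drift.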


Using Streamlit to Build an ML-based Web Application

Photo by Negative Space on Pexels

Companies have a strong interest in clearly communicating their ML-based predictive analytics to their clients. No matter how accurate a model is, clients want to know how machine learning models make predictions from data. For example, if a subscription-based company is interested in finding customers who are at high risk of canceling their subscriptions, it can use its historical customer data to predict the likelihood of someone leaving.

From there, they would want to analyze the factors that drive this event. By understanding the driving factors, they can take actions like targeted promotions or discounts to prevent the customer from leaving…
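The churn-prediction step underlying such an app might look like the sketch below. The customer table, its column names, and the churn rule are all hypothetical stand-ins for a company's real historical data (the Streamlit front end itself is omitted, since it needs a running server to demonstrate):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical customer data: tenure in months and monthly charges
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "tenure": rng.integers(1, 72, n),
    "monthly_charges": rng.uniform(20, 120, n),
})
# Synthetic rule: short-tenure, high-charge customers churn more often
logit = -0.05 * df["tenure"] + 0.03 * df["monthly_charges"] - 1.0
df["churn"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[["tenure", "monthly_charges"]], df["churn"], random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Likelihood of churn for each held-out customer
churn_prob = model.predict_proba(X_test)[:, 1]
print(churn_prob[:5])
```

In a Streamlit app, these probabilities (and the fitted coefficients, which indicate the driving factors) would be what gets surfaced to the client.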


Understanding Classification Performance Metrics

Photo by Pixabay on Pexels

Machine learning classification is a type of supervised learning in which an algorithm maps a set of inputs to a discrete set of outputs. Classification models have a wide range of applications across disparate industries and are one of the mainstays of supervised learning. This is because, across industries, many analytical questions can be framed in terms of mapping inputs to a discrete set of outputs. The simplicity of defining a classification problem makes classification models versatile and industry agnostic.

An important part of building classification models is evaluating model performance. In short, data scientists need a reliable way to test approximately how…
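Computing the standard classification metrics is straightforward with scikit-learn. A minimal sketch, using a built-in binary-classification dataset as the stand-in data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Common performance metrics for a binary classifier
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows: true class, cols: predicted
```

Which metric matters most depends on the cost of false positives versus false negatives for the problem at hand.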


Using Pandas to Extract Information from Text

Photo by Pixabay on Pexels

Raw text data often comes in a form that is difficult to use directly for analysis, and it typically requires text processing methods. Text processing is the practice of automating the generation and manipulation of text. It can be used for many data manipulation tasks including feature engineering from text, data wrangling, web scraping, search engines and much more. In this tutorial, we’ll take a look at how to use the Pandas library to perform some important data wrangling tasks.

Data wrangling is the process of gathering and transforming data to address an analytical question. It’s also often the most…
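The extraction side of this can be sketched with pandas' vectorized string methods. The raw text, column names, and regex patterns below are hypothetical examples, not from the original tutorial:

```python
import pandas as pd

# Hypothetical raw text column mixing order IDs and email addresses
df = pd.DataFrame({"raw": [
    "Order #1001 placed by alice@example.com",
    "Order #1002 placed by bob@example.com",
]})

# str.extract pulls structured fields out of text via regex capture groups
df["order_id"] = df["raw"].str.extract(r"#(\d+)", expand=False).astype(int)
df["email"] = df["raw"].str.extract(r"(\S+@\S+)", expand=False)
# str.split + positional indexing grabs the username part of the email
df["user"] = df["email"].str.split("@").str[0]
print(df[["order_id", "email", "user"]])
```

The same pattern scales to thousands of rows, which is the point of using the vectorized `.str` accessor rather than a Python loop.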


Using Seaborn and Matplotlib for Data Exploration

Photo by Lukas on Pexels

Data visualization is important for many analytical tasks including data summarization, exploratory data analysis and model output analysis. One of the easiest ways to communicate your findings to others is through a good visualization. Fortunately, Python features many libraries that provide useful tools for gaining insights from data. The most well-known of these, Matplotlib, enables users to generate visualizations like histograms, scatterplots, bar charts, pie charts and much more.

Seaborn is another useful visualization library that is built on top of Matplotlib. It provides data visualizations that are typically more aesthetic and statistically sophisticated. …
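Two of the plot types mentioned above can be sketched with Matplotlib alone (Seaborn calls such as `sns.histplot` or `sns.scatterplot` would be drop-in alternatives). The data is synthetic, and the `Agg` backend is used so the figure renders without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: x is normal, y is a noisy linear function of x
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(x, bins=30)
ax1.set(title="Histogram", xlabel="x", ylabel="count")
ax2.scatter(x, y, alpha=0.5)
ax2.set(title="Scatterplot", xlabel="x", ylabel="y")
fig.tight_layout()
fig.savefig("eda_plots.png")
```

Seaborn builds figures like these on top of the same Matplotlib objects, which is why the two libraries mix freely in one script.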


Working with the Python Machine Learning Library

Photo by Burst on Pexels

Scikit-learn is a powerful machine learning library that provides a wide variety of modules for data access, data preparation and statistical model building. It has a good selection of clean toy datasets that are great for people just getting started with data analysis and machine learning. Easy access to these datasets removes the hassle of searching for and downloading files from an external data source. The library also enables data processing tasks such as imputation, data standardization and data normalization. These tasks can often lead to significant improvements in model performance.

Scikit-learn also provides a variety of packages for…
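The data access and preparation tasks named above fit in a short sketch. A missing value is injected by hand here purely to demonstrate imputation, since the toy dataset is complete:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Built-in toy dataset: no external download needed
X, y = load_iris(return_X_y=True)

# Simulate a missing value, then impute it with the column mean
X[0, 0] = np.nan
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Standardization: zero mean, unit variance per feature
X_std = StandardScaler().fit_transform(X_imputed)
# Normalization: rescale each feature to the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X_imputed)

print(X_std.mean(axis=0).round(6))
print(X_norm.min(), X_norm.max())
```

In practice these steps are usually chained in a `Pipeline` so the same preprocessing is applied consistently at training and prediction time.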


Understanding the Django Web Framework

Photo by Kevin Ku on Pexels

This post was originally published on the BuiltIn blog. The original piece can be found here.

Django is a high-level web framework in Python that allows users to rapidly build web applications with minimal code. The Django framework follows a model-template-view (MTV) architecture, Django’s variant of the model-view-controller (MVC) design pattern. This setup facilitates easy development of complex database-driven web apps. Through these design patterns, Django emphasizes reusability of components. It also follows the don’t-repeat-yourself (DRY) principle, which reduces repetition in software through abstraction and data normalization to avoid redundant code.

Django can be used in a variety of web applications, including customer relationship management systems…


Understanding Function Caching in Python

Photo by Kaboompics.com on Pexels

Memoization is a method used to store the results of previous function calls to speed up future calculations. If repeated function calls are made with the same parameters, we can reuse the stored values instead of repeating unnecessary calculations. This can result in a significant speed-up. In this post, we will use memoization to find factorials.

Let’s get started!

First, let’s define a recursive function that we can use to compute the factorials up to n. If you are unfamiliar with recursion, check out this article: Recursion in Python.

As a reminder, the factorial is defined for…
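A memoized recursive factorial might look like the sketch below, both as an explicit dictionary cache and via the standard library's `functools.lru_cache` decorator:

```python
from functools import lru_cache

# Manual memoization: a dictionary caches previously computed results
memo = {}

def factorial(n):
    """Return n! recursively, reusing cached results when available."""
    if n in memo:
        return memo[n]
    result = 1 if n <= 1 else n * factorial(n - 1)
    memo[n] = result
    return result

# The standard library packages the same idea as a decorator
@lru_cache(maxsize=None)
def factorial_cached(n):
    return 1 if n <= 1 else n * factorial_cached(n - 1)

print(factorial(10))         # 3628800
print(factorial_cached(10))  # 3628800
```

After the first call, every factorial from 1! to 10! is cached, so a later call such as `factorial(12)` only performs the two new multiplications.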


Custom Python Classes for Generating Statistical Insights from Data

Photo by Max Fischer on Pexels

In computer programming, a class is a blueprint for a user-defined data type. Classes are defined in terms of attributes (data) and methods (functions). These data structures are a great way to organize data and methods such that they are easy to reuse and extend in the future. In this post, we will define a Python class that will allow us to generate simple summary statistics and perform some EDA on data.

Let’s get started!

For our purposes we will be working with the FIFA 19 data set which can be found here.

To start, let’s import the pandas package:
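A class along these lines might be sketched as follows. The class name, its methods, and the stand-in rows (mimicking two FIFA 19 columns) are hypothetical illustrations, not the post's actual code, since the CSV link isn't reproduced here:

```python
import pandas as pd

class SummaryStats:
    """Wrap a DataFrame and expose simple EDA helper methods."""

    def __init__(self, df):
        self.df = df

    def describe_column(self, col):
        # Mean, standard deviation, min and max for a numeric column
        s = self.df[col]
        return {"mean": s.mean(), "std": s.std(), "min": s.min(), "max": s.max()}

    def top_categories(self, col, n=3):
        # Most frequent values in a categorical column
        return self.df[col].value_counts().head(n)

# Stand-in rows mimicking FIFA 19 columns (the real CSV is linked in the post)
df = pd.DataFrame({
    "Nationality": ["Brazil", "Brazil", "Spain", "France"],
    "Overall": [88, 92, 85, 90],
})
stats = SummaryStats(df)
print(stats.describe_column("Overall"))
print(stats.top_categories("Nationality"))
```

Packaging the helpers in a class means the same methods can be reused on any DataFrame, and extended later with plotting or filtering methods.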

Sadrach Pierre, Ph.D.

Data Scientist at WorldQuant Predictive. Writer for Built In & Towards Data Science. Cornell University Ph.D. in Chemical Physics.
