Finding Patterns and Outcomes in Time Series Data - Hands-On with Python

Introduction

Let's analyze time-series data and assign outcome variables depending on pattern types. If you are looking to model raw time series for classification, this video is for you.



Code

ViralML-Hands-On-Time-Series-Pattern-Recognition-and-Assigning-Outcome-Variables
In [231]:
from IPython.display import Image
Image(filename='viralml-book.png')
Out[231]:

Fundamental and Technical Indicators - Hands-On Market Analysis

Companion book: "The Little Book of Fundamental Market Indicators":

https://amzn.to/2DERG3d

More at:

https://www.viralml.com/

Pattern Recognition on Time Series Data - Finding Outcomes using Matching Shapes

ViralML-Hands-On-Time-Series-Pattern-Recognition-and-Assigning-Outcome-Variables

Gold Price: London Fixing

https://www.quandl.com/data/LBMA/GOLD-Gold-Price-London-Fixing

In [187]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import io, base64, os, json, re 
import pandas as pd
import numpy as np
import datetime
import warnings
warnings.filterwarnings('ignore')
In [188]:
path_to_market_data = '/Users/manuel/Documents/financial-research/market-data/2019-08-03/'

Load Data

In [189]:
# https://www.quandl.com/data/LBMA/GOLD-Gold-Price-London-Fixing
gold_df = pd.read_csv(path_to_market_data + 'LBMA-GOLD.csv')
gold_df['Date'] = pd.to_datetime(gold_df['Date'])

gold_df = gold_df[['Date', 'USD (PM)']]
gold_df.columns = ['Date', 'GLD']
gold_df['GLD'] = pd.to_numeric(gold_df['GLD'], errors='coerce')

print(np.min(gold_df['Date'] ),np.max(gold_df['Date'] ))
gold_df = gold_df.sort_values('Date', ascending=True) 
gold_df = gold_df.dropna(how='any')

gold_df.head()
1968-01-02 00:00:00 2019-07-30 00:00:00
Out[189]:
Date GLD
12982 1968-04-01 37.70
12981 1968-04-02 37.30
12980 1968-04-03 37.60
12979 1968-04-04 36.95
12978 1968-04-05 37.00
In [190]:
# Price chart
fig, ax = plt.subplots(figsize=(16, 8))
plt.plot(gold_df['Date'], gold_df['GLD'], label='GLD', color='gold')
plt.title('Gold ' + str(np.min(gold_df['Date'])) + ' - ' + str(np.max(gold_df['Date'])))
plt.legend(loc='upper left')
plt.grid()
plt.show()
 
In [172]:
def split_seq(seq, num_pieces):
    # https://stackoverflow.com/questions/54915803/automatically-split-data-in-list-and-order-list-elements-and-send-to-function
    start = 0
    for i in range(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop
        
        
def pearson(s1, s2):
    """Take two pd.Series objects and return their Pearson correlation."""
    s1_c=s1-np.mean(s1)
    s2_c=s2-np.mean(s2)
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
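A quick sanity check of the two helpers, as a minimal sketch. The toy lists are made up for illustration, and the array conversion inside pearson is an addition so that plain lists behave the same as pd.Series. It shows how split_seq hands the extra items to the first pieces, and that pearson agrees with NumPy's own np.corrcoef.

```python
import numpy as np


def split_seq(seq, num_pieces):
    # same generator as in the cell above
    start = 0
    for i in range(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop


def pearson(s1, s2):
    # same formula as above; inputs are converted to float arrays first
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    s1_c = s1 - np.mean(s1)
    s2_c = s2 - np.mean(s2)
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))


# 10 items in 3 pieces: the first piece gets the extra item
chunks = list(split_seq(list(range(10)), 3))
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# toy series, not market data
a = [1.0, 2.0, 4.0, 7.0]
b = [0.0, 1.0, 1.5, 5.0]
ours = pearson(a, b)
reference = np.corrcoef(a, b)[0, 1]
print(ours, reference)
```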

Build time series out of daily data

30 trading day series

In [191]:
# already sorted above; repeated here because the windows below rely on ascending dates
gold_df = gold_df.sort_values('Date', ascending=True) 

lookback = 30
dates = gold_df['Date']
prices = list(gold_df['GLD'].values)
counter_ = -1
price_series = []
for day in dates:
    counter_ += 1
    # if counter_ % 1000 == 0: print(counter_)
    if counter_ >= lookback:
        price_series.append(prices[counter_-lookback:counter_])
                
timeseries_df = pd.DataFrame(price_series)              
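The loop above can also be written without a counter. This is a sketch on made-up prices, assuming NumPy 1.20 or later for sliding_window_view. Note that the loop never emits the window that ends on the very last price, so the vectorized version drops its final row to match.

```python
import numpy as np
import pandas as pd

lookback = 3
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]  # made-up values, not gold prices

# loop version, same indexing as the cell above
loop_windows = [prices[i - lookback:i] for i in range(lookback, len(prices))]

# vectorized version: every run of `lookback` consecutive prices
view = np.lib.stride_tricks.sliding_window_view(np.asarray(prices), lookback)
vector_windows = view[:-1]  # drop the last window so the result matches the loop

toy_df = pd.DataFrame(vector_windows)
print(toy_df)
```

If you want the most recent window included as well, keep `view` as is instead of slicing off the last row.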
 

Look for rises and build outcome

In [194]:
timeseries_df.shape
Out[194]:
(12867, 30)
In [195]:
timeseries_df.head()
Out[195]:
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 28 29
0 37.70 37.30 37.60 36.95 37.00 37.05 37.5 37.70 38.00 38.00 ... 39.20 39.45 39.10 39.75 39.30 39.75 39.70 39.60 39.50 39.80
1 37.30 37.60 36.95 37.00 37.05 37.50 37.7 38.00 38.00 37.80 ... 39.45 39.10 39.75 39.30 39.75 39.70 39.60 39.50 39.80 40.25
2 37.60 36.95 37.00 37.05 37.50 37.70 38.0 38.00 37.80 37.55 ... 39.10 39.75 39.30 39.75 39.70 39.60 39.50 39.80 40.25 41.25
3 36.95 37.00 37.05 37.50 37.70 38.00 38.0 37.80 37.55 37.65 ... 39.75 39.30 39.75 39.70 39.60 39.50 39.80 40.25 41.25 41.50
4 37.00 37.05 37.50 37.70 38.00 38.00 37.8 37.55 37.65 38.00 ... 39.30 39.75 39.70 39.60 39.50 39.80 40.25 41.25 41.50 42.30

5 rows × 30 columns

In [196]:
counter = 5
for index, row in timeseries_df.iterrows():
    counter -= 1
    # look for desired shape
    plt.plot(row.values)
    plt.grid()
    plt.show()
    if counter < 0:
        break

Pattern simplifier

Here we split each 30-value row into 'complexity' consecutive chunks and replace each chunk with its mean, so a noisy 30-point series becomes a short outline of its overall shape

In [197]:
counter = 5
complexity = 5
for index, row in timeseries_df.iterrows():
    counter -= 1
    # look for desired shape
    plt.plot([np.mean(r) for r in split_seq(list(row.values), complexity)])
    plt.grid()
    plt.show()
    if counter < 0:
        break
In [205]:
r = timeseries_df.iloc[0, :lookback].values  # first 30-day window
[np.mean(t) for t in split_seq(list(r), complexity)]
Out[205]:
[37.26666666666667,
 37.75833333333333,
 38.208333333333336,
 39.225,
 39.60833333333333]
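The same simplification can be sketched with np.array_split, which sizes its chunks the same way split_seq does (the first chunks take the extra items when the length does not divide evenly). The simplify helper and the toy window are hypothetical, just for illustration.

```python
import numpy as np


def simplify(values, complexity):
    """Average consecutive chunks of `values` (hypothetical helper)."""
    chunks = np.array_split(np.asarray(values, dtype=float), complexity)
    return [float(np.mean(chunk)) for chunk in chunks]


window = list(range(30))  # toy stand-in for one 30-day row

five_points = simplify(window, 5)
print(five_points)  # [2.5, 8.5, 14.5, 20.5, 26.5]

# uneven case: 30 values in 4 chunks
uneven_sizes = [len(c) for c in np.array_split(window, 4)]
print(uneven_sizes)  # [8, 8, 7, 7]
```

With 30 values, a complexity of 5 or 6 divides evenly, so every point of the outline averages the same number of days.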

Create an ideal shape pattern

Play around with the shape: you can select ups, downs, U's or V's. Anything goes.

In [207]:
# let's single out the shape we want: flat, then a late rise (six points, one per chunk)
correlate_against = [0,0,0,0,1,2] 
plt.plot(correlate_against)
plt.grid()
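A few more shapes to try, as a sketch. The template names and values are made up for illustration. Since Pearson correlation ignores offset and scale, only the outline of the template matters, not its actual numbers. Also worth knowing: a plain steady rise scores about 0.83 against the late-rise shape, so a loose threshold will let it through.

```python
import numpy as np

# hypothetical six-point templates; names and values are illustrative
templates = {
    'late_rise': [0, 0, 0, 0, 1, 2],   # the shape used above
    'late_drop': [2, 2, 2, 2, 1, 0],
    'v_shape':   [2, 1, 0, 0, 1, 2],
    'steady_up': [0, 1, 2, 3, 4, 5],
}


def pearson(s1, s2):
    # same formula as the notebook's pearson, with array conversion added
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    s1_c = s1 - np.mean(s1)
    s2_c = s2 - np.mean(s2)
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))


base = templates['late_rise']
shifted = [100 + 25 * v for v in base]  # same outline at a different price level

same_shape = pearson(base, shifted)
mirror = pearson(base, templates['late_drop'])
steady = pearson(base, templates['steady_up'])
print(same_shape, mirror, steady)
```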

Using the Pearson correlation function to flag the windows that match the shape

In [212]:
complexity = 6
outcome_list = []
for index, row in timeseries_df.iterrows():
    simplified_values = []
    for r in split_seq(list(row.values), complexity):
        simplified_values.append(np.mean(r))
    correz = pearson(simplified_values,correlate_against)
    if correz > 0.5:
        outcome_list.append(1)
    else:
        outcome_list.append(0)
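On 12,000+ rows the iterrows loop is slow. Here is a vectorized sketch of the same labeling. The function name label_windows and the toy rows are assumptions for illustration, and it only works when the window length divides evenly by complexity (30 and 6 do). Flat windows produce a NaN correlation, which fails the threshold test and gets a 0, the same as in the loop.

```python
import numpy as np
import pandas as pd


def label_windows(windows, template, complexity, threshold=0.5):
    """Return 0/1 labels; assumes window length is divisible by complexity."""
    x = np.asarray(windows, dtype=float)
    n_rows, n_cols = x.shape
    chunk = n_cols // complexity
    # average each run of `chunk` consecutive columns
    simplified = x.reshape(n_rows, complexity, chunk).mean(axis=2)
    t = np.asarray(template, dtype=float)
    s_c = simplified - simplified.mean(axis=1, keepdims=True)
    t_c = t - t.mean()
    denom = np.sqrt((s_c ** 2).sum(axis=1) * (t_c ** 2).sum())
    with np.errstate(divide='ignore', invalid='ignore'):
        corr = (s_c @ t_c) / denom
    # NaN > threshold is False, so flat windows are labeled 0
    return (corr > threshold).astype(int)


# toy rows, not market data
toy = pd.DataFrame([
    [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3],   # flat, then rising
    [3, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1],   # falling, then flat
    [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5],   # constant
])
labels = label_windows(toy.values, [0, 0, 0, 0, 1, 2], complexity=6)
print(labels)  # [1 0 0]
```

When applying this to timeseries_df, pass only the 30 price columns, not the outcome column.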
In [213]:
np.mean(outcome_list)
Out[213]:
0.35571617315613585
In [219]:
timeseries_df['outcome'] = outcome_list
timeseries_df.head(20)
Out[219]:
0 1 2 3 4 5 6 7 8 9 ... 21 22 23 24 25 26 27 28 29 outcome
0 37.70 37.30 37.60 36.95 37.00 37.05 37.50 37.70 38.00 38.00 ... 39.45 39.10 39.75 39.30 39.75 39.70 39.60 39.50 39.80 1
1 37.30 37.60 36.95 37.00 37.05 37.50 37.70 38.00 38.00 37.80 ... 39.10 39.75 39.30 39.75 39.70 39.60 39.50 39.80 40.25 1
2 37.60 36.95 37.00 37.05 37.50 37.70 38.00 38.00 37.80 37.55 ... 39.75 39.30 39.75 39.70 39.60 39.50 39.80 40.25 41.25 1
3 36.95 37.00 37.05 37.50 37.70 38.00 38.00 37.80 37.55 37.65 ... 39.30 39.75 39.70 39.60 39.50 39.80 40.25 41.25 41.50 1
4 37.00 37.05 37.50 37.70 38.00 38.00 37.80 37.55 37.65 38.00 ... 39.75 39.70 39.60 39.50 39.80 40.25 41.25 41.50 42.30 1
5 37.05 37.50 37.70 38.00 38.00 37.80 37.55 37.65 38.00 38.40 ... 39.70 39.60 39.50 39.80 40.25 41.25 41.50 42.30 42.40 1
6 37.50 37.70 38.00 38.00 37.80 37.55 37.65 38.00 38.40 38.25 ... 39.60 39.50 39.80 40.25 41.25 41.50 42.30 42.40 41.55 1
7 37.70 38.00 38.00 37.80 37.55 37.65 38.00 38.40 38.25 38.30 ... 39.50 39.80 40.25 41.25 41.50 42.30 42.40 41.55 41.40 1
8 38.00 38.00 37.80 37.55 37.65 38.00 38.40 38.25 38.30 38.65 ... 39.80 40.25 41.25 41.50 42.30 42.40 41.55 41.40 41.75 1
9 38.00 37.80 37.55 37.65 38.00 38.40 38.25 38.30 38.65 38.75 ... 40.25 41.25 41.50 42.30 42.40 41.55 41.40 41.75 41.50 1
10 37.80 37.55 37.65 38.00 38.40 38.25 38.30 38.65 38.75 39.10 ... 41.25 41.50 42.30 42.40 41.55 41.40 41.75 41.50 41.50 1
11 37.55 37.65 38.00 38.40 38.25 38.30 38.65 38.75 39.10 39.20 ... 41.50 42.30 42.40 41.55 41.40 41.75 41.50 41.50 41.60 1
12 37.65 38.00 38.40 38.25 38.30 38.65 38.75 39.10 39.20 39.45 ... 42.30 42.40 41.55 41.40 41.75 41.50 41.50 41.60 41.95 1
13 38.00 38.40 38.25 38.30 38.65 38.75 39.10 39.20 39.45 39.10 ... 42.40 41.55 41.40 41.75 41.50 41.50 41.60 41.95 41.95 1
14 38.40 38.25 38.30 38.65 38.75 39.10 39.20 39.45 39.10 39.75 ... 41.55 41.40 41.75 41.50 41.50 41.60 41.95 41.95 41.15 1
15 38.25 38.30 38.65 38.75 39.10 39.20 39.45 39.10 39.75 39.30 ... 41.40 41.75 41.50 41.50 41.60 41.95 41.95 41.15 41.20 1
16 38.30 38.65 38.75 39.10 39.20 39.45 39.10 39.75 39.30 39.75 ... 41.75 41.50 41.50 41.60 41.95 41.95 41.15 41.20 41.20 1
17 38.65 38.75 39.10 39.20 39.45 39.10 39.75 39.30 39.75 39.70 ... 41.50 41.50 41.60 41.95 41.95 41.15 41.20 41.20 41.25 1
18 38.75 39.10 39.20 39.45 39.10 39.75 39.30 39.75 39.70 39.60 ... 41.50 41.60 41.95 41.95 41.15 41.20 41.20 41.25 41.30 0
19 39.10 39.20 39.45 39.10 39.75 39.30 39.75 39.70 39.60 39.50 ... 41.60 41.95 41.95 41.15 41.20 41.20 41.25 41.30 41.55 0

20 rows × 31 columns

In [226]:
timeseries_df_tmp = timeseries_df[timeseries_df['outcome']==1]
timeseries_df_tmp.tail()
Out[226]:
0 1 2 3 4 5 6 7 8 9 ... 21 22 23 24 25 26 27 28 29 outcome
12862 1332.35 1335.90 1351.25 1341.30 1341.35 1344.05 1379.50 1397.15 1405.70 1431.40 ... 1413.75 1407.60 1412.40 1409.85 1410.35 1417.45 1439.70 1427.75 1425.55 1
12863 1335.90 1351.25 1341.30 1341.35 1344.05 1379.50 1397.15 1405.70 1431.40 1403.95 ... 1407.60 1412.40 1409.85 1410.35 1417.45 1439.70 1427.75 1425.55 1426.95 1
12864 1351.25 1341.30 1341.35 1344.05 1379.50 1397.15 1405.70 1431.40 1403.95 1402.50 ... 1412.40 1409.85 1410.35 1417.45 1439.70 1427.75 1425.55 1426.95 1416.10 1
12865 1341.30 1341.35 1344.05 1379.50 1397.15 1405.70 1431.40 1403.95 1402.50 1409.00 ... 1409.85 1410.35 1417.45 1439.70 1427.75 1425.55 1426.95 1416.10 1420.40 1
12866 1341.35 1344.05 1379.50 1397.15 1405.70 1431.40 1403.95 1402.50 1409.00 1390.10 ... 1410.35 1417.45 1439.70 1427.75 1425.55 1426.95 1416.10 1420.40 1419.05 1

5 rows × 31 columns

In [227]:
timeseries_df_tmp = timeseries_df_tmp.tail()
# pull one example and remove the outcome variable
example = timeseries_df_tmp.values[0][:-1]
plt.plot(example)
Out[227]:
[<matplotlib.lines.Line2D at 0x12bb9a7b8>]
In [229]:
simplified_values = []
for r in split_seq(list(example), complexity):
    simplified_values.append(np.mean(r))  # average each chunk, not the whole example
plt.plot(simplified_values)
Out[229]:
[<matplotlib.lines.Line2D at 0x12bc6bac8>]
In [230]:
vals = [np.mean(r) for r in split_seq(list(example), complexity)]
# shift so the lowest chunk mean sits at zero, like the template
floor = np.min(vals)
vals2 = [val - floor for val in vals]
plt.plot(vals2)
Out[230]:
[<matplotlib.lines.Line2D at 0x12bfd1908>]
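Subtracting the minimum fixes the offset but not the scale. To overlay a matched window on the template, one option is to rescale both to the 0..1 range. This is a sketch: the min_max helper is hypothetical and the chunk means are made-up numbers, not values from the gold series.

```python
import numpy as np


def min_max(values):
    """Rescale values to the 0..1 range; a flat input maps to all zeros."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    if span == 0:
        return np.zeros_like(v)
    return (v - v.min()) / span


template = [0, 0, 0, 0, 1, 2]
chunk_means = [1340.0, 1345.0, 1342.0, 1350.0, 1390.0, 1430.0]  # made-up values

scaled_template = min_max(template)
scaled_means = min_max(chunk_means)
print(scaled_template)
print(scaled_means)
```

Plotting scaled_template and scaled_means on the same axes makes it easy to eyeball how close the match is.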
