Optimizing Python Code for Financial Data Science

 

A few tips & tricks for accelerating your data processing

Handling financial data in Python is relatively easy compared with other programming languages because:

  1. Python is an interpreted language, which makes it easy to inspect and modify data at runtime

  2. There are specialized Python libraries (like pandas, numpy, …) for handling financial/time-series data

But it also has disadvantages:

  1. Relatively low performance of pure Python

  2. Some library pitfalls can make a program a lot slower

  3. Imperfect support for threading (the GIL)

  4. Weak type annotations and vulnerability to unexpected errors

To handle these weaknesses while exploiting the advantages, you should consider some modifications to your code. Below are a few example tricks for speeding up and stabilizing financial data processing that our team actually uses frequently.


Avoid pitfalls of libraries

Most Python libraries are more concerned with ease of use than with performance. This lowers the barrier to entry and makes them accessible to many users, but by paying attention to the details you can make your program much faster.

Pandas is a good example of this. It is a great library for data science and time-series processing in Python, but there are a few things to avoid. A typical example is the handling of datetime objects.

df = pd.read_csv('./data/2020-Jan.csv', index_col='event_time', parse_dates=['event_time'])

The example above is the most convenient and natural way to read datetime-indexed data in pandas, but it takes quite a lot of time.

# read / parse
CPU times: user 7min 39s, sys: 1.12 s, total: 7min 40s
Wall time: 7min 40s

When you use parse_dates in pandas, pandas infers the datetime format of the given columns (e.g. [2020/01/01], [Jan 6, 2020], …). Inferring the format from arbitrary strings takes a long time. You can avoid this by telling read_csv exactly how the datetimes should be parsed.

head = pd.read_csv("./data/2020-Jan.csv", index_col='event_time', nrows=1)
print(head.index[0])
"""
2020-01-01 00:00:00 UTC
"""

Looking at the first row of the data, we can see that it follows the ISO 8601 format except for the last four characters, " UTC". With this in mind, we can implement a more efficient parser.

import ciso8601

def my_parser(date_str):
    # strip the trailing " UTC" and parse the remaining ISO 8601 timestamp
    return ciso8601.parse_datetime(date_str[:-4])

df = pd.read_csv("./data/2020-Jan.csv", index_col='event_time', parse_dates=['event_time'], date_parser=my_parser)

Giving this function as the date_parser argument to read_csv reduces the time taken to read the csv file dramatically, by a factor of roughly 70.

# read / parse
CPU times: user 5.58 s, sys: 880 ms, total: 6.46 s
Wall time: 6.46 s

If your datetime format doesn't follow ISO 8601, you can write a custom parser using datetime.datetime.strptime with an explicit format string.
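For instance, here is a minimal sketch assuming a hypothetical column that looks like "06/Jan/2020 14:30:00" (the file name and format string are made up for illustration):

import datetime as dt
import pandas as pd

# Hypothetical format "06/Jan/2020 14:30:00"; an explicit strptime pattern
# avoids pandas' per-value format inference.
def my_strptime_parser(date_str):
    return dt.datetime.strptime(date_str, "%d/%b/%Y %H:%M:%S")

df = pd.read_csv("./data/my_data.csv", index_col='event_time',
                 parse_dates=['event_time'], date_parser=my_strptime_parser)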

This kind of problem is relatively minor, since reading the data happens only once. More often, the critical performance degradation is one that occurs continuously while the program runs, such as datetime indexing.

The convenient and natural way to slice data in pandas looks like this:

%%time
# from 2020-01-03 00:00:00 to 2020-01-09 00:00:00 (exclusive)
sliced = df.loc['2020-01-03': '2020-01-08']

The main problem with this code is that resolving the string timestamps against the pandas.DatetimeIndex can take a while.

# slice
CPU times: user 383 ms, sys: 20 ms, total: 403 ms
Wall time: 402 ms

If all you need is to slice a sorted index by date, one alternative is to keep the raw strings instead of parsing them into a DatetimeIndex. You can use numpy.searchsorted to find the positions of the start and end points, and then slice by position.

%%time
df = pd.read_csv('./data/2020-Jan.csv')

Note that we use '2020-01-09' as the end point, since searchsorted gives an exclusive bound that covers everything up to the end of 2020-01-08.

%%time
idxs = np.searchsorted(df['event_time'], '2020-01-03'), np.searchsorted(df['event_time'], '2020-01-09')
df.iloc[idxs[0]: idxs[1]]

The result:

# read
CPU times: user 3.24 s, sys: 184 ms, total: 3.42 s
Wall time: 3.43 s
# slice
CPU times: user 413 µs, sys: 0 ns, total: 413 µs
Wall time: 397 µs

You don’t have to use pd.DatetimeIndex every time!

This is a lot faster than DatetimeIndex-based slicing, for both the reading and the slicing operations.

Update (2020-07-23): slicing with loc on a DatetimeIndex performs especially poorly on the first call. From the second call onward it is much faster, but still slower than finding integer positions with np.searchsorted.
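If you use this pattern often, it may be convenient to wrap it in a small helper (a sketch that assumes the column holds sorted, ISO-formatted timestamp strings; the slice_by_date name is ours):

import numpy as np

def slice_by_date(df, start, end, col='event_time'):
    # df[col] must be sorted ascending; start/end are ISO-format date strings
    # and the end point is exclusive (e.g. '2020-01-09' stops before Jan 9).
    lo = np.searchsorted(df[col], start)
    hi = np.searchsorted(df[col], end)
    return df.iloc[lo:hi]

sliced = slice_by_date(df, '2020-01-03', '2020-01-09')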

Using numba and Cython to speed up iterative operations (fold/scan/rolling_apply)

One of the most frequently used operations on financial time series is the iterative (sequential) kind, which applies a function while iterating over rows from past to present.

Example of EMA (smoothing, denoising)

EMA (exponential moving average) is a well-known iterative operation that smooths (denoises) the value of the current row, and it cannot be computed in parallel.
(We picked it for ease of explanation; pandas already has an optimized implementation in pandas.DataFrame.ewm, whose equivalent call is shown below for reference.)

from kirin import Kirin
api = Kirin()
api.reset_cache()
api.fred.earliest_realtime_start = '1970-01-31'
data = pd.concat([api.fred.aaa_corporate(frequency='D', ffill=True), api.fred.baa_corporate(frequency='D', ffill=True)], axis=1).dropna()
print(data)

(Kirin is Qraft's research platform, which manages financial data from various sources.)

DAAA   DBAA
date                   
1986-01-06  9.91  11.38
1986-01-07  9.92  11.35
1986-01-08  9.92  11.35
1986-01-09  9.92  11.35
1986-01-10  9.92  11.36
...          ...    ...
2020-07-15  2.13   3.34
2020-07-16  2.15   3.35
2020-07-17  2.14   3.31
2020-07-18  2.14   3.31
2020-07-19  2.14   3.31

[12614 rows x 2 columns]

In this example, we use the 12,614 rows of AAA and BAA corporate bond data shown above to calculate each column's EMA. (We forward-fill and drop NaN values simply to ignore missing data, which is not how you would usually treat them in a real case where accuracy matters.)
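As noted above, pandas already ships an optimized implementation; for reference, the recurrence used by all of the hand-rolled versions below, y[i] = 0.3*y[i-1] + 0.7*x[i], corresponds roughly to this one-liner (a sketch on the DataFrame loaded above):

# pandas' built-in EWM with alpha=0.7 and adjust=False reproduces
# y[i] = (1 - 0.7)*y[i-1] + 0.7*x[i], i.e. the recurrence used below
ema_builtin = data.ewm(alpha=0.7, adjust=False).mean()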

The most naive implementation iterates over the pd.DataFrame rows.

%%timeit
data_copy = data.copy()
for i in range(1, len(data_copy)):
    data_copy.iloc[i] = data_copy.iloc[i-1] * 0.3 + data_copy.iloc[i] * 0.7
data_copy.values
-----------------------------------------------------
13.7 s ± 329 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If you use a numpy array instead of a DataFrame, you can save a lot more time.

%%timeit
arr = data.values.copy()
for i in range(1, len(arr)):
    arr[i] = arr[i-1]  * 0.3 + arr[i] * 0.7
arr
-----------------------------------------------------
50.2 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

np.frompyfunc (a cousin of np.vectorize) is a great tool here: it turns a Python function into a numpy ufunc, and the ufunc's accumulate method applies it cumulatively along each column (in this case, to aaa and baa separately).

%%timeit
arr = data.values.copy()
func = np.frompyfunc(lambda x, y: x*0.3 + y*0.7, 2, 1)
ret = func.accumulate(arr, dtype=object).astype(np.float64)
ret
-----------------------------------------------------
5.48 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Another option is numba, which compiles the function just in time (JIT) for efficiency.

import numba

@numba.jit
def ema(a):
    for i in range(1, len(a)):
        a[i] = a[i-1] * 0.3 + a[i] *0.7
    return a
%time ema(data.values.copy())
---------------------------------------
CPU times: user 447 ms, sys: 12 ms, total: 459 ms
Wall time: 457 ms

One thing to watch out for is that numba compiles just in time, so the first run can take much longer because it includes the compilation time. Numba is worth considering when the same function is called repeatedly.

%%timeit
ema(data.values.copy())
-----------------------------------------------------
1.86 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
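If the first-call compile time is a problem, numba can also compile eagerly when you supply an explicit type signature, and cache=True persists the compiled code between runs. A minimal sketch, assuming the input is a 2-D float64 array (the ema_eager name is ours):

import numba

# An explicit signature makes numba compile at decoration time instead of
# on the first call; cache=True stores the compiled code on disk.
@numba.njit("float64[:,:](float64[:,:])", cache=True)
def ema_eager(a):
    for i in range(1, len(a)):
        a[i] = a[i-1] * 0.3 + a[i] * 0.7
    return a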

Time needed to calculate EMA (ms, log scale)

Below are the results of applying the same functions to a larger case (4,264,752 rows).

# pandas
CPU times: user 28min 52s, sys: 3.27 s, total: 28min 55s
Wall time: 28min 52s
# numpy
CPU times: user 12.9 s, sys: 3.99 ms, total: 12.9 s
Wall time: 12.9 s
# numpy vectorize
CPU times: user 916 ms, sys: 164 ms, total: 1.08 s
Wall time: 1.08 s
# numba jit
CPU times: user 211 ms, sys: 4.01 ms, total: 215 ms
Wall time: 214 ms
# numba jit after compiled
CPU times: user 2.1 ms, sys: 143 µs, total: 2.24 ms
Wall time: 1.74 ms

Time needed to calculate EMA on the larger dataset (ms, log scale)

The point is that even an iterative function, which cannot be optimized with threading or multiprocessing, can be sped up remarkably by using a compiled function.

A low-level compiled language for faster & safer implementations

If you need higher speed and more error-resistant behavior for some operation, a low-level compiled implementation is worth considering.

Consider a case that uses patent data to evaluate the 'freshness' of an invention.

import glob

patent_text_paths = glob.glob('./data/patents/*_text.txt')
print(open(patent_text_paths[10]).read())

Let’s see one of the patent texts.

RELATED APPLICATIONS
This application claims priority to U.S. Provisional Patent Application No. 62/580,476, filed Nov. 2, 2017, and incorporates the same herein by reference.


TECHNICAL FIELD
This invention relates to a firearm cartridge and barrel chamber for it. More particularly, it relates to a cartridge and cartridge load combination for efficient, high-velocity, long-range, precision shooting.
BACKGROUND
Ballistics, the science and study of projectiles and firearms, may be divided into three categories......
...

The scoring rule is pretty simple: add a positive score for each word that reflects the 'freshness' of the patent, and a negative score for words that suggest the opposite.

signals = {
    "useful":1,
    "new":1,
    "important":2,
    'complex':2,
    "impressive":3,
    "powerful":3,
    "special":4,
    "innovation":5,
    "same": -1,
    "trivial": -2,
    "modify":-3,
}
def _search_file(fname):
    # sum the scores of all signal words found in the file
    ret = 0
    with open(fname, 'r') as f:
        text = f.read()
    for line in text.split('\n'):
        for word in line.split(' '):
            if word in signals:
                ret += signals[word]
    return ret
%%timeit 
scores_python = []
for fname in patent_text_paths:
    scores_python.append(_search_file(fname))

The _search_file function is applied to 9,221 text files collected from the USPTO between 2020-04-23 and 2020-04-30.

9.58 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Because the GIL prevents Python threads from running CPU-bound code in parallel, multiprocessing is the practical way to parallelize this.

from multiprocessing import Pool
%%timeit -r 30
with Pool(32) as pool:
    scores_mp = pool.map(_search_file, patent_text_paths)
-----------------------------------------------------
930ms ± 163 ms per loop (mean ± std. dev. of 30 runs, 1 loop each)

Or you can use another library, like Ray:

import ray
ray.init()
@ray.remote
def _search_file_remote(fname):
    return _search_file(fname)
%%timeit -r 30
scores_ray = []
scores_ray = ray.get([_search_file_remote.remote(fname) for fname in patent_text_paths])
-----------------------------------------------------
965ms ± 17.6 ms per loop (mean ± std. dev. of 30 runs, 1 loop each)

Rust is a relatively new compiled language that combines high performance with strict type checking and annotations, which makes it a good fit for fast and safe functions.

use pyo3::prelude::*;
use pyo3::wrap_pyfunction;
use rayon::prelude::*;
use std::fs::read_to_string;
use std::collections::HashMap;


#[pyfunction]
fn search_from_files(fpaths: Vec<&str>, score_information: HashMap<&str, i32>) -> Vec<i32> {
    fpaths.par_iter().map(|&fpath| search_from_file(fpath, score_information.clone())).collect::<Vec<_>>()
}

#[pyfunction]
fn search_from_file(fpath: &str, score_information: HashMap<&str, i32>) -> i32 {
    let str = read_to_string(fpath).unwrap().to_lowercase();
    let contents = str.as_str();
    search(contents, score_information)
}

#[pyfunction]
fn search(contents: &str, score_information: HashMap<&str, i32>) -> i32 {
    contents
        .par_lines()
        .map(|line| score(line, score_information.clone()))
        .sum()
}

fn score(line: &str, score_information: HashMap<&str, i32>) -> i32 {
    let mut total = 0;
    for word in line.split(' ') {
        // add the word's score if it is one of the signal words
        match score_information.get(&word) {
            Some(score) => { total += *score; }
            None => {}
        }
    }
    total
}

#[pymodule]
fn word_count(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_wrapped(wrap_pyfunction!(search))?;
    m.add_wrapped(wrap_pyfunction!(search_from_file))?;
    m.add_wrapped(wrap_pyfunction!(search_from_files))?;
    Ok(())
}

You can bind the project as a Python module using pyo3
(see https://github.com/PyO3/pyo3/tree/master/examples/word-count)
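For reference, the build in the linked example is driven by a small setup.py using setuptools-rust; here is a sketch (the version number and package layout are assumptions, matching the crate name used above):

from setuptools import setup
from setuptools_rust import RustExtension

setup(
    name="word_count",
    version="0.1.0",
    packages=["word_count"],
    # build the Rust crate in this directory and install it as the
    # extension module word_count.word_count
    rust_extensions=[RustExtension("word_count.word_count")],
    zip_safe=False,
)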

python setup.py develop && yes | cp word_count/word_count.cpython-37m-x86_64-linux-gnu.so ..

Then you can import the Python-wrapped Rust functions (or classes):

from word_count import search_from_files
%%timeit -r 30
scores_rust_parallel = search_from_files(patent_text_paths, signals)
-----------------------------------------------------
648 ms ± 13.3 ms per loop (mean ± std. dev. of 30 runs, 1 loop each)

Time consumed on patent scoring (ms)

This style of dropping down to a low-level compiled language becomes more attractive as the project grows and error-resistant execution (e.g. in production) becomes a requirement.

 

Hyungsik Kim