decimal bias

never design a trading system without proper prior analysis. post your market statistics here


michal.kreslik
rank: 1000+ posts
Posts: 1047
Joined: Sat May 13, 2006 2:40 am
Reputation: 36
Location: Monte Carlo, Monaco
Real name: Michal Kreslik
Gender: Male

decimal bias

Postby michal.kreslik » Thu Aug 03, 2006 12:56 pm

Hello, friends,

when observing the tick price action in Forex, I was wondering whether the prices have any bias towards certain repetitive decimal levels ("00", for example).


General considerations

In plain language, I was asking myself: "Are there more price ticks ending with the digit 0 than with the digit 5?" Or: "Are there more price ticks ending with the digits 50 than with the digits 75?"

Since all securities' prices are expressed in the decimal numeral system, the basic piece of information we are after is what percentage of price ticks in a data set end with a particular decimal digit or set of digits. The research below leaves little doubt that the repetition of the decimal symbol set (digits 0 to 9) in the numerical representation of price psychologically distorts the free price flow and causes the decimal bias.

The decimal bias simply means that the probability of the price ending with a certain decimal digit (or a certain set of decimal digits) is not the same for every digit (or set of digits). Also, according to the results, every security seems to feature a different distribution of its decimal levels' probabilities, creating a unique fingerprint of the security.

Now, the numerical system symbol set iteration boundary psychological bias, as I call it in a more universal way (don't be scared, I won't use the term more than once :) ), emerges in real life every time we deal with some representation of quantity in some predetermined numerical system (see Numerical systems in Wikipedia). For the purposes of these statistics, I will be talking solely about the decimal symbol set psychological bias, or decimal bias for short. Decimal bias is the most common variety, although one can develop other numerical system-based biases (I can imagine a computer programmer with a severe hexadecimal bias ordering "1A" pizzas instead of the decimal "26" ahead of his weekend coding spree, but that's another story :) ).


The decimal system

The most widely used numerical system for ordinary life purposes is the decimal system, the same one used for expressing all securities' prices. The decimal system uses 10 symbols in a certain sequence to express any required quantity.

The popularity of the decimal system is usually attributed to the fact that people have 10 fingers, so they are better equipped to imagine multiples of 10 than multiples of some other quantity. By the way, the English word "digit", meaning a number, is derived from the Latin word "digitus", meaning a finger.

Other numerical systems used heavily in computer science today are the binary system (2 symbols), the hexadecimal system (16 symbols) and the octal system (8 symbols). We call the decimal symbols "digits" just as we call the alphabet symbols "letters". A "number" is then a set of one or more digits that expresses a quantity, just as a "word" is a set of one or more letters that expresses a certain meaning in language.

The above-mentioned "iteration boundary psychological bias" occurs when you run out of symbols and thus reach the boundary symbol of the numerical system. Remember, we only have 10 symbols in the decimal system - hence the name "decimal", meaning "ten". The boundary symbols for the decimal system are "0" and "9". I'll explain below how these two boundary symbols play an important role in how we perceive quantity subjectively.

Before running the statistics, I thought that the major distortion of the probability distribution would be seen around these boundary symbols. For the one-digit statistics, the boundaries are 0 and 9, while for the two-digit statistics, the boundaries are 00 and 99. In fact, the results haven't confirmed this hypothesis - the various decimal levels show various distortions, regardless of their proximity to a boundary symbol. Since the best-known decimal biases in real life occur around the boundary symbols (see below), I'll use the boundary symbol bias as an example of how the decimal bias may develop in the first place, although the term decimal bias means "bias towards any decimal digit".

What we humans do internally when using the decimal (or any other) numerical system is convert the written symbols (digits) to the actual quantity in our head.

We all learned how to deal with the decimal system and how to quickly convert between a quantity and its decimal representation in primary school. Being able to count is so important for a modern homo sapiens that we are bombarded with the decimal numerical system from early childhood, so that we don't even realize this system is artificial and has nothing to do with the quantity itself.

To illustrate, let's use a similarly artificial numerical system and say I told you to buy CMXLIV shares of something. We are not trained to convert "CMXLIV" to a quantity as fluently as we are with decimal symbols. It would certainly take you several seconds to realize I'm talking about Roman numerals and that the representation in decimal symbols is "944". Only then, upon converting the "CMXLIV" symbol string to the "944" symbol string, would you be able to finally convert the "944" symbol string to the actual quantity in your head.

Now a million dollar question: which one of the two symbol strings represents the quantity better?
  • CMXLIV
  • 944
Did you answer "944"? Then you are so used to the decimal system that you can't see that "CMXLIV" and "944" represent the quantity in exactly the same way.
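
For illustration, here is a tiny C# snippet (mine, not part of the original analysis code) that prints the very same quantity in several numeral systems - only the symbol strings differ:

using System;

class NumeralSystemsDemo
{
    static void Main()
    {
        int quantity = 944;                                  // the CMXLIV from above
        Console.WriteLine(Convert.ToString(quantity, 10));   // "944"        (decimal)
        Console.WriteLine(Convert.ToString(quantity, 16));   // "3b0"        (hexadecimal)
        Console.WriteLine(Convert.ToString(quantity, 8));    // "1660"       (octal)
        Console.WriteLine(Convert.ToString(quantity, 2));    // "1110110000" (binary)
    }
}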


Limited number of symbols

People strive to find ways to simplify things, so the development of numerical systems was inevitable since the dawn of civilization. Instead of saying "I want this, this, this, this apple", it's more comfortable to say "I want four apples". The trouble is that for a numerical system to be simple enough to be broadly usable, it must use only a limited number of symbols, plus some rules for shuffling those symbols around to arrive at the required quantity.

The major trouble with using a limited set of symbols is that you inevitably run out of symbols whenever there are fewer symbols than values you want to represent. You can't have a symbol for every individual value - imagine having one million symbols for all the numbers from 1 to 1 million. Thus, numerical systems always use a limited set of symbols and shuffle them around in some particular way to be able to describe all values.


We are all doomed - the ubiquitous decimal bias

To make my point clearer, before we get down to the decimal bias in price action statistics, I'll show you several examples of how the decimal symbols psychologically distort people's decisions every single day.

The basic trouble is that we are so used to the decimal system that we treat the symbol boundaries (the symbol switch from "9" to "0") as natural boundaries in the object's quantity, almost as if this artificial boundary had some magical tie to the inner quantitative structure of all objects.

Thus, we are foolishly:
  • celebrating anniversaries that are multiples of the decimal symbol boundary, like 30, 50, etc. Why don't we celebrate anniversaries that are multiples of 8? Is eight a worse quantity than ten?
  • drawing borders between salaries at a decimal symbol boundary, like "up to $100k" and "over $100k" (why not "$98,723"?), as if the boundary symbols "zero zero" possessed some magical power
  • treating a quantity (for example, a price) that is close to the decimal symbol boundary differently. "$99.99" seems much lower than "$100" just because there's no symbol added to the left of the leftmost "9" and our decimal-trained brain is telling us "this quantity is an order of magnitude lower!" (actually, prices ending with "95" or "99" as a psychological weapon were invented in 1922 by my Czech countryman, Tomas Bata, who established the biggest shoemaking company in the world to this day)
  • getting quantity discounts for buying 10, 50 or 100 items. You never get a quantity discount for buying 11, 47 or 83 items. In fact, this is the Bata price ($99.99) phenomenon reversed: in the eye of the vendor, an order of 100 items is an order of magnitude bigger than an order of 99 items
  • ... I'm sure you can come up with many other examples of decimal bias from your own experience
As you may have already noticed, traders are mere human beings :) They, too, are influenced by this nonsensical inclination towards decimal boundaries and certain decimal digits in general. The obsession with decimal boundaries especially works like a self-fulfilling prophecy: if enough people believe a certain decimal boundary (say "000") is a strong support level, then prices ending with this set of symbols will become strong support levels. Our quest here is not to judge whether this behavior has any rational grounds; we want to research what influence the decimal system has on the price flow and take advantage of the findings.


Mining for the decimal bias data with C#

I wrote data-mining code in C#.NET that went through the tick database for these 15 major FOREX symbols over a sample period of about 8 months ending June 2006: AUDJPY, AUDUSD, CHFJPY, EURAUD, EURCHF, EURGBP, EURJPY, EURUSD, GBPCHF, GBPJPY, GBPUSD, NZDUSD, USDCAD, USDCHF, USDJPY.

Thus, the total amount of data used for conducting the research was huge: about 22 million price ticks were examined (uh, you see? I didn't say "about 21,827,645 ticks" - as stated above, the decimal bias is ubiquitous, we are all doomed :) ). This means that the results are statistically significant and not a victim of non-representative sample selection.

I performed the statistics for two decimal orders:
  • all prices ending with one decimal digit
  • all prices ending with a combination of two decimal digits
This means the data-mining routine only looked at the last one or the last two digits of each price tick (a rough sketch of this step follows the examples below):

Example - one decimal digit:
  • 1.2746 = returns value of 6
  • 1.2349 = returns value of 9
  • 1.7523 = returns value of 3
Example - two decimal digits:
  • 1.2438 = returns value of 38
  • 1.7324 = returns value of 24
  • 1.9467 = returns value of 67
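
As a sketch of that extraction step (illustrative code of mine, not the original data-mining routine; it assumes prices quoted to a fixed number of decimal places, e.g. 4):

using System;

class EndingDigits
{
    // Returns the last 'count' decimal digits of a price quoted to 'decimals' places.
    static int LastDigits(decimal price, int decimals, int count)
    {
        // Scale the price to an integer number of points, e.g. 1.2746 -> 12746.
        long scaled = (long)Math.Round(price * (decimal)Math.Pow(10, decimals));
        long modulus = (long)Math.Pow(10, count);
        return (int)(scaled % modulus);   // e.g. 12746 % 10 = 6, 12438 % 100 = 38
    }

    static void Main()
    {
        Console.WriteLine(LastDigits(1.2746m, 4, 1));  // 6
        Console.WriteLine(LastDigits(1.2438m, 4, 2));  // 38
        Console.WriteLine(LastDigits(1.9467m, 4, 2));  // 67
    }
}
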
For every FOREX symbol, the total number of occurrences of each ending decimal digit (or pair of digits) was then determined. For example,
  • for the one-digit stats: if xx_number of price ticks in the sample were found to end with "9", then the value xx_number was assigned to the "ending decimal digit: 9" row in the statistics
  • for the two-digit stats: if yy_number of price ticks in the sample were found to end with "26", then the value yy_number was assigned to the "ending decimal digits: 26" row in the statistics
All the individual rows in the statistics (10 in total for one digit and 100 in total for the two-digit combinations) were then divided, one by one, by the total number of price ticks in the sample to arrive at the percentage values.
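
A minimal sketch of that tallying step (again just an illustration under the same assumptions; the author's actual C#.NET application is attached later in this thread):

using System;

class DecimalBiasStats
{
    static void Main()
    {
        // A tiny made-up sample of ticks quoted to 4 decimal places.
        decimal[] ticks = { 1.2746m, 1.2748m, 1.2731m, 1.2762m, 1.2762m };

        var counts = new int[100];                             // rows "00" .. "99"
        foreach (var price in ticks)
        {
            int ending = (int)(Math.Round(price * 10000m) % 100m);
            counts[ending]++;
        }

        for (int level = 0; level < 100; level++)
        {
            if (counts[level] == 0) continue;
            double percent = 100.0 * counts[level] / ticks.Length;
            Console.WriteLine($"ending digits {level:00}: {percent:F2}%");
        }
    }
}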

If I were to perform the stats for some higher order, like the last three digits (actually I did), certain combinations of ending decimal digits would be so thinly represented in the sample that it would distort the statistical validity of that particular test, so I stayed with the one- and two-ending-digit statistics only.


How to make a living trading decimal bias

Now, what can we conclude from the results for practical trading? Clearly, the distribution of ending decimal digits in the examined price tick history is not even.

Let's have a look at this EURGBP two digits statistic example:

[chart: EURGBP two-ending-digit distribution]

You can see that the "48" decimal level (all price ticks that ended with "48") occurred in only 0.74% of all cases (price ticks). Moreover, this "48" level lies in a valley, surrounded by higher values. The decimal level "31" (all price ticks that ended with "31") was present in 1.36% of cases, which is almost double the value for the "48" level. And the "62" decimal level shows that as much as 1.57% of all price ticks ended with the digits "62".

Now let's ask: what drives the prices to stay at or leave a certain decimal level? Is it the weather? The political climate? The time of day?

No. The answer is simple: the prices are driven by human traders. And, in turn, one of the forces that drive traders is the decimal bias. The above chart shows that traders don't feel comfortable trading EURGBP at or around market prices that end with "48". But they love to trade EURGBP at prices ending with "31" and around it. Better still, the absolutely most cherished two-digit ending level for traders is "62": they feel comfortable trading EURGBP at "62" and around it. Why is the "48" level so uncomfortable for traders? I don't really know, but I know that it is, and as a FOREX trader I can easily take advantage of it:
  • the hot "48" level in EURGBP means that no one wants to trade at that level
  • if the price goes from our leisurely "31" level up to the hot "48", I know that it will either:
    • turn back (no one wants to stay at the hot level!)
    • or very probably go all the way up to the super-relaxed "62" level. Then I can take my profits at 62 (I'm not talking age here :) ), since the majority of traders are very probably going to take a nap at that price level, too
  • the same applies to price action in the opposite direction
  • when the price is at the "uneasy" level - in this case the hot "48" - I may place a straddle with a buy stop at 52 and a sell stop at 44, with limits of, say, 8 pips each, knowing that the price will go either one way or the other, all the way up/down to the nearest comfort level (a rough sketch of this idea follows below)
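
Here is that straddle idea as a short C# sketch (my illustration only, with made-up numbers for the entry band - not the author's trading code):

using System;

class DecimalBiasStraddle
{
    static void Main()
    {
        decimal price      = 1.2748m;   // market sitting at the "uneasy" 48 level
        decimal pip        = 0.0001m;
        decimal entryBand  = 4m;        // stops placed 4 pips above/below the level
        decimal targetPips = 8m;        // limit target, roughly toward the nearest comfort level

        decimal buyStop   = price + entryBand * pip;       // 1.2752
        decimal sellStop  = price - entryBand * pip;       // 1.2744
        decimal buyLimit  = buyStop + targetPips * pip;    // 1.2760, up toward the "62" area
        decimal sellLimit = sellStop - targetPips * pip;   // 1.2736, down toward the "31" area

        Console.WriteLine($"buy stop  {buyStop}  -> take profit at {buyLimit}");
        Console.WriteLine($"sell stop {sellStop} -> take profit at {sellLimit}");
    }
}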


Decimal bias statistics - one ending digit

All the statistics charts for one ending digit are set to the same vertical scale so that the absolute percentage distributions can be compared across the symbols (charts). It seems the statistics for only one ending digit are not very useful for practical trading and are more of an academic nature:

[charts: one-ending-digit distribution for each of the 15 symbols]

Decimal bias statistics - two ending digits

All the statistics charts for two ending digits are set to the same vertical scale so that the absolute percentage distributions can be compared across the symbols (charts):

[charts: two-ending-digit distribution for each of the 15 symbols]

I'd love to hear your comments

This was just an example analysis. Many other interesting conclusions can be drawn from the statistics (including ones I've overlooked :) ). Below I am attaching a zip file with all the statistics in Excel.

So tell me, will you join me in taking advantage of other traders' decimal bias? :)

Have a nice day all,
Michal
Attachments
decimal_bias_stats.zip
(20.76 KiB) Downloaded 447 times


JPT
rank: <50 posts
Posts: 49
Joined: Mon May 22, 2006 3:26 pm
Reputation: 0
Location: Monterey, CA USA
Gender: None specified

Postby JPT » Thu Aug 03, 2006 6:38 pm

Michal:

To assist my learning of C#, could you also attach the C# code that you used for the analysis? Very interesting.

Thanks

michal.kreslik
rank: 1000+ posts
Posts: 1047
Joined: Sat May 13, 2006 2:40 am
Reputation: 36
Location: Monte Carlo, Monaco
Real name: Michal Kreslik
Gender: Male

Postby michal.kreslik » Fri Aug 04, 2006 5:01 am

Jim,

the decimal bias stats application is actually very simple and self-explanatory:

[screenshot: the decimal bias stats application]

In the attached zip file, you will find:
  • C# project with source code
  • three data sample history-ticks files in the format required by the app
  • application installation in case you want to run the app without Visual Studio


Michal
Attachments
decimal_bias_app_and_data_samples.zip
(515.12 KiB) Downloaded 434 times

michal.kreslik
rank: 1000+ posts
Posts: 1047
Joined: Sat May 13, 2006 2:40 am
Reputation: 36
Location: Monte Carlo, Monaco
Real name: Michal Kreslik
Gender: Male

Postby michal.kreslik » Fri Aug 04, 2006 2:15 pm

drthsolr@aol.com wrote:Hello Michal,

I am surprised to see you're using C# instead of Borland's Delphi compiler, which according to Wouter at Grail is 14-21% faster than any other compiler on the market, much easier to debug and far superior in every way in comparison to C… I have been using both languages since the beginning of their development and also prefer Delphi for the above reasons whenever I can use it… thanks for the great work!

Have a great day!

T

BTW: Wouter’s Grail was also designed with the use of Delphi compiler in case you didn’t know


Hello, Tony,

C#.NET is far superior to Delphi. C# is a true strictly object-oriented language (Delphi is not) and is quite different from C/C++.

In comparison to C#, Delphi is lacking:
- type safety and custom type conversion
- static classes
- object indexers
- operator overloading

.NET framework is Microsoft’s flagship product and they are treating it as such.

Visual Studio Express .NET 2005 including the MS SQL Express 2005 server is completely free, compared to Delphi.

Michal

JPT
rank: <50 posts
Posts: 49
Joined: Mon May 22, 2006 3:26 pm
Reputation: 0
Location: Monterey, CA USA
Gender: None specified

Postby JPT » Fri Aug 04, 2006 4:18 pm

Michal:

Thank you very much. This is certainly going to accelerate my learning curve on C#. I will be back with questions this weekend I am sure. Have a great weekend.

Regards


djfort
rank: <50 posts
Posts: 2
Joined: Sat May 27, 2006 9:33 pm
Reputation: 0
Gender: None specified

Postby djfort » Fri Aug 04, 2006 11:29 pm

Hello Michal,

I see that your mind is all made up about the type of compiler you're going to be using and there is probably nothing that will change that, but you have been greatly misinformed about Delphi's specifics.

I am running a little short on time and didn't get a chance to find a good contrary reference to your posted arguments, but the following links give a brief overview of the topic in case you're interested.

Also, have you looked at the "NeuroShell DayTrader Professional Platform" before utilizing NeoTicker? In case you did, what are your thoughts?

http://www.delphinperu.com/custom/downl ... 0compiler'

http://en.wikipedia.org/wiki/Borland_De ... s_and_cons

Thanks for all of the good work you’re doing here and good luck !

Tony

michal.kreslik
rank: 1000+ posts
Posts: 1047
Joined: Sat May 13, 2006 2:40 am
Reputation: 36
Location: Monte Carlo, Monaco
Real name: Michal Kreslik
Gender: Male

Postby michal.kreslik » Tue Aug 08, 2006 12:42 pm

djfort wrote:I see that your mind is all made up about the type of compiler you’re going to be using and there is probably nothing to change that


C# is far better than Delphi. There are tons of articles on the theme all around the internet, so I guess it's no good repeating the various links here and starting a flamewar :). I'd like to mention just two indisputable facts here:
  • Anders Hejlsberg, the lead author of Delphi, now works for Microsoft. Guess what he's doing now. Correct, he's the lead author of C# :)
  • Microsoft now internally uses C#.NET for all software development. What else can be said?

djfort wrote:Also, have you looked the “NeuroShell DayTrader Professional Platform” before utilizing NeoTicker?


I've looked at several dozen platforms before going for NeoTicker. For sure, there is no ideal platform, but Neo is the most open platform of today that looks into the future, too.

Michal

djfort
rank: <50 posts
Posts: 2
Joined: Sat May 27, 2006 9:33 pm
Reputation: 0
Gender: None specified

Postby djfort » Tue Aug 08, 2006 5:56 pm

How can I argue with you Michal when you’re on top of it all – :shock:

Came across the following text and would be interested to know what you think of it… thanks

Tony

United States Patent Application 20020010663
Kind Code A1
Muller, Ulrich A. January 24, 2002
________________________________________
Filtering of high frequency time series data
Abstract
The present invention is a method and apparatus for filtering high frequency time series data using a variety of techniques implemented on a computer. The techniques are directed to detecting and eliminating data errors such as the decimal error, monotonic series of quotes, long series of repeated quotes, scaling changes, and domain errors. Further, by means of comparison with nearby quotes in the time series, the techniques are also able to evaluate the credibility of the quotes.
________________________________________
Inventors: Muller, Ulrich A.; (Zurich, CH)
Correspondence Name and Address: PENNIE AND EDMONDS
1155 AVENUE OF THE AMERICAS
NEW YORK
NY
100362711
Serial No.: 842440
Series Code: 09
Filed: April 26, 2001
U.S. Current Class: 705/30
U.S. Class at Publication: 705/30
Intern'l Class: G06F 017/60
________________________________________
Claims
________________________________________


What is claimed is:

1. A method of filtering time series data comprising the steps of: testing said data for decimal error, testing said data for scaling error, testing said data for domain error, testing for credibility of said data that passes the tests for decimal error, scaling error and domain error by comparing nearby data in the time series.
________________________________________
Description
________________________________________


CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Application No. 60/200,742, filed May 1, 2000; U.S. Provisional Application No. 60/200,743, filed May 1, 2000; U.S. Provisional Application No. 60/200,744, filed May 1, 2000; and U.S. Provisional Application No. 60/274,174, filed Mar. 8, 2001. The contents of the above applications are incorporated herein in their entirety by reference.

FIELD OF THE INVENTION

[0002] This invention relates to filtering of high frequency time series data. It is particularly applicable to the filtering of time series financial data and, for convenience, will be described in that context.

BACKGROUND OF THE INVENTION

[0003] A time series is a time-ordered sequence of data. A financial time series is a time-ordered sequence of financial data, typically a sequence of quotes for a financial instrument. These data should be homogeneous. Homogeneity means that the quotes of a series are of the same type and the same market; they may differ in the origins of the contributors, but should not differ in important parameters such as the maturity (of interest rates, . . . ) or the moneyness (of options or their implied volatilities). The filter user is responsible for ensuring data homogeneity.

[0004] All quotes have the following general structure:

[0005] 1. A time stamp. This is the time of the data collection, the arrival time of the real-time quote in the collector's environment. Time stamps are monotonically increasing over the time series. We might also have other time stamps (e.g. the reported time of the original production of the quote). Such a secondary time stamp would however be considered as side information (see item 4 immediately below) rather than a primary time stamp.

[0006] 2. Information on the quote level. There are different types of level information as markets and sources are different in nature and also differently organized. Some level information can be termed "price", some other information such as transaction volume figures cannot. Some non-prices such as implied volatility quotes can be treated as prices with bid and ask quotes. A neutral term such as "level" or "filtered variable" is therefore preferred to "price". In the case of options, the price might be first converted to an implied volatility which is then the filtered variable. Different quoting types require different filtering approaches. This is discussed below.

[0007] 3. Information on the origin of the quote: information provider, name of exchange or bank, city, country, time zone, . . . . In the filtering algorithm, we only need one function to compare two origins. This will be used when judging the independence and credibility of quotes as explained in further sections. A further analysis of bank names or other IDs is not really needed.

[0008] 4. Side information: everything that does not fall into one of the three aforementioned categories, e.g. a second time stamp. This is ignored by filtering.

[0009] The information on the quote levels is organized in different structures depending on the market and the source. Some important cases are listed here:

[0010] single-valued quotes: each quote has only one value describing its level. Example: stock indices.

[0011] bid-ask quotes: each quote has a bid value and an ask value. Example: foreign exchange (FX) spot rates.

[0012] bid or ask quotes: each quote has a bid or an ask value, often in unpredictable sequence. This can be regarded as two different single-valued time series. Example: quotes on some exchanges.

[0013] bid or ask or transaction quotes: each quote has a bid value or an ask value or a transaction value. Again, this can be regarded as three different single-valued time series. Example: the data stream from the major short-term interest rate futures exchanges also includes transaction data.

[0014] middle quotes: in certain cases, we only obtain a time series of middle quotes, which are treated as single-valued quotes. The case of getting only transaction quotes (no bid, no ask) is technically identical. Transaction volume figures are also treated as single-valued quotes, for example.

[0015] OHLC quotes: open/high/low/close. An OHLC filter can be made in analogy to the bid-ask filter, with some tests of the whole quote followed by quote splitting as to be explained.

[0016] We recognize a data error as being present if a piece of quoted data does not conform to the real situation of the market. We have to identify a price quote as being a data error if it is neither a correctly reported transaction price nor a possible transaction price at the reported time. In the case of indicative prices, however, we have to tolerate a certain transmission time delay.

[0017] There are many causes for data errors. The errors can be separated in two classes:

[0018] 1. human errors: errors directly caused by human data contributors, for different reasons:

[0019] (a) unintentional errors, e.g. typing errors;

[0020] (b) intentional errors, e.g. dummy quotes produced just for technical testing;

[0021] 2. system errors: errors caused by computer systems, their interactions and their failures.

[0022] Strictly speaking, system errors are also human errors because human operators have the ultimate responsibility for the correct operation of computer systems. However, the distance between the data error and the responsible person is much larger for system errors.

[0023] In many cases, it is impossible to find the exact reason for the data error even if the quote is very aberrant. The task of the filter is to identify such outliers, whatever the reason.

[0024] Sometimes the cause of the error can be guessed from the particular behavior of the bad quotes. This knowledge of the error mechanism can help to improve filtering and, in some cases, correct the bad quotes.

[0025] Examples of some of the errors to be expected are as follows:

[0026] 1. Decimal errors: Failure to change a "big" decimal digit of the quote. Example: a bid price of 1.3498 is followed by a true quote 1.3505, but the published, bad quote is 1.3405. This error is most damaging if the quoting software is using a cache memory somewhere. The wrong decimal digit may stay in the cache and cause a long series of bad quotes. For Reuters page data, this was a dominant error type around 1988! Nowadays, this error type seems to be rare.

[0027] 2. "Test" quotes: Some data contributors sometimes send test quotes to the system, usually at times when the market is not liquid. These test quotes can cause a lot of damage because they may look plausible to the filter, at least initially. Two important examples:

[0028] "Early morning test": A contributor sends a bad quote very early in the morning, in order to test whether the connection to the data distributor (e.g. Reuters) is operational. If the market is inactive overnight, no trader would take this test quote seriously. For the filter, such a quote may be a major challenge. The filter has to be very critical to first quotes after a data gap.

[0029] Monotonic series: Some contributors test the performance and the time delay of their data connection by sending a long series of linearly increasing quotes at inactive times such as overnight or during a weekend. For the filter, this is hard to detect because quote-to-quote changes look plausible. Only the monotonic behavior in the long run can be used to identify the fake nature of this data.

[0030] 3. Repeated quotes: Some contributors let their computers repeat the last quote in more or less regular time intervals. This is harmless if it happens in a moderate way. In some markets with high granularity of quoting (such as Eurofutures), repeated quote values are quite natural. However, there are contributors that repeat old quotes thousands of times with high frequency, thereby obstructing the filtering of the few good quotes produced by other, more reasonable contributors.

[0031] 4. Quote copying: Some contributors employ computers to copy and re-send the quotes of other contributors, just to show a strong presence on the data feed. Thus, they decrease the data quality, but there is no reason for a filter to remove copied quotes that are on a correct level. Some contributors run programs to produce slightly modified copied quotes by adding a small random correction to the quote. Such slightly varying copied quotes are damaging because they obstruct the clear identification of fake monotonic or repeated series made by other contributors.

[0032] 5. Scaling problem: Quoting conventions may differ or be officially redefined in some markets. Some contributors may quote the value of 100 units, others the value of 1 unit. The filter may run into this problem "by surprise" unless a very active filter user anticipates all scale changes in advance and preprocesses the data accordingly.

[0033] Filtering of high-frequency time-series data is a demanding, often underestimated task. It is complicated because of

[0034] the variety of possible errors and their causes;

[0035] the variety of statistical properties of the filtered variables (distribution functions, conditional behavior, non-stationarity and structural breaks);

[0036] the variety of data sources and contributions of different reliability;

[0037] the irregularity of time intervals (sparse/dense data, sometimes long data gaps over time);

[0038] the complexity and variety of the quoted information: transaction prices, indicative prices, FX forward premia (where negative values are allowed), interest rates, prices and other variables from derivative markets, transaction volumes, . . . ; bid/ask quotes vs. single-valued quotes;

[0039] the necessity of real-time filtering: producing instant filter results before seeing any successor quote.

[0040] There are different possible approaches to filtering. Some guidelines determine our approach:

[0041] Plausibility: we do not know the real cause of data errors with rare exceptions (e.g. the decimal error). Therefore we judge the validity or credibility of a quote according to its plausibility, given the statistical properties of the series.

[0042] We need a whole neighborhood of quotes for judging the credibility of a quote: a filtering window. A comparison to only the "last valid" quote of the series is not enough. The filtering window can grow and shrink with data quality and the requirements for arriving at a good filtering decision.

[0043] The statistical properties of the series needed to measure the plausibility of a quote are determined inside the filtering algorithm rather than being hand-configured. The filter is thus adaptive.

[0044] Quotes with complex structures (i.e. bid/ask or open/high/low/close) are split into scalar variables to be filtered separately. These filtered variables may be derived from the raw variables, e.g. the logarithm of a bid price or the bid-ask spread. Quote splitting is motivated by keeping the algorithm modular and overseeable. Some special error types may also be analyzed for full quotes before splitting.

[0045] Numerical methods with convergence problems (such as non-linear minimization) are not used. Such methods would probably lead to problems as the filter is exposed to very different situations. The chosen algorithm produces unambiguous results.

[0046] The filter needs a high execution speed; computing all filtering results from scratch with every new quote would not be efficient. The chosen algorithm is iterative: when a new quote is considered, the filtering information obtained from the previous quotes is re-used; only a minimal number of computations concerning the new quote is added.

[0047] The filter has two modes: real-time and historical. Thanks to the filtering window technique, both modes can be supported by the same filter run. In historical filtering, the final validation of a quote is delayed to a time after having seen some successor quotes.

SUMMARY OF THE INVENTION

[0048] The present invention is a method and apparatus for filtering high frequency time series data using a variety of techniques implemented on a computer. The techniques are directed to detecting and eliminating data errors such as the decimal error, monotonic series of quotes, long series of repeated quotes, scaling changes, and domain errors. Further, by means of comparison with nearby quotes in the time series, the techniques are also able to evaluate the credibility of the quotes.

BRIEF DESCRIPTION OF THE DRAWINGS

[0049] These and other objects, features and advantages of the invention will be more readily apparent from the following detailed description of the invention in which:

[0050] FIG. 1 is a schematic illustration of a conventional computer used in the practice of the invention;

[0051] FIG. 2 is an illustration of the format of the time series data processed by the invention;

[0052] FIG. 3 depicts illustrative time series data;

[0053] FIG. 4 is a UML diagram of an illustrative implementation of the filter of the present invention; and

[0054] FIG. 5 is an illustration of a scalar filtering window used in the practice of the invention.

DETAILED DESCRIPTION OF THE INVENTION

[0055] Illustrative apparatus for practicing the invention is a general purpose computer 10 such as that schematically illustrated in FIG. 1. As shown therein, the computer comprises a processor 12, a memory 14, input-output means 16 and a communications bus 18 interconnecting the processor, the memory and the input-output means. The memory typically includes both high-speed semiconductor memory for temporary storage of programs and data as well as magnetic disk or tape storage media for long-term storage of programs and data. Illustrative computers used in the practice of the invention include both main frame computers and high-speed personal computers such as those supplied by IBM Corporation. Illustrative long-term storage systems include large disk arrays such as those supplied by EMC Corporation. Illustrative input-output means comprise one or more data feeds from communication links such as telecommunication lines and/or the Internet, a display, a printer, and a keyboard.

[0056] Time series data is supplied to input-output means 16 of the computer over one or more communication links such as telecommunication lines or the Internet. The data has the general format illustrated in FIG. 2 comprising a time stamp 32, a level 34, and a source 36.

[0057] A sample of the time series data is illustrated in FIG. 3. As indicated, time stamp 32 includes both date information of 08.01.97 and time information related to events a little after 10 o'clock. The time stamp increases monotonically. The level information 34 is bid-ask price data. The source information 36 is a variety of banks identified by a four character code.

[0058] The received time series data is processed by the filter of the present invention to determine if the data is good or bad. The result of this process can simply be an output stream of only the good data in the same order as the data was received. Alternatively, as shown in FIG. 3, the output can be the received time series data along with information 38 that indicates the filter's evaluation of each item of time series data.

[0059] Information 38 can be a number (e.g., on a scale of 0 to 1) that assesses the credibility of an item of data or simply, as shown in FIG. 3, an indication that the item of data is good or bad.

[0060] Before explaining the many details of the filter, we provide a description of the functionality of the filter from the user's perspective as well as its basic internal structure.

[0061] The filter has the capability of performing the following operations:

[0062] It is fed by financial quotes in the ordered sequence of their time stamps.

[0063] It delivers the filtering results of the same quotes in the same ordered sequence; for each quote:

[0064] the credibility of the quote between 0 (totally invalid) and 1 (totally valid), also for individual elements such as bid or ask prices or the bid-ask spread;

[0065] the value(s) of the quote, whose errors can possibly be corrected in some cases where the error mechanism is well known;

[0066] the filtering reason, explaining why the filter has rejected (or corrected) the quote.

[0067] Special filter users may want to use all these filtering results, e.g. for filter testing purposes. Normal users may use only those (possibly corrected) quotes with a credibility exceeding a threshold value (which is often chosen to be 0.5) and ignore all invalid quotes and all side results of the filter such as the filtering reason.

[0068] The timing of the filter operations is non-trivial. In real-time operation, a per-quote result is produced right after the corresponding quote has entered the filter. In historical operation, the user can see a per-quote result only after the filter has seen a few newer quotes and adapted the credibility of older quotes accordingly.

[0069] The filter needs a build-up period as specified by section 6.1. This is natural for an adaptive filter. If the filtering session starts at the first available quote (database start), the build-up means to run the filter for a few weeks from this start, storing a set of statistical variables in preparation for restarting the filter from the first available quote. The filter will then be well adapted because it can use the previously stored statistical variables. If the filtering session starts at some later point in the time series, the natural build-up period is the period immediately preceding the first quote of the session.

[0070] The filtering algorithm can be seen as one whole block that can be used several times in a data flow, also in series. Examples:

[0071] Mixing already filtered data streams from several sources where the mixing result is again filtered. The danger is that the combined filters reject too many quotes, especially in the real-time filtering of fast moves (or price jumps).

[0072] Filtering combined with computational blocks: raw data → filter → computational block → filter → application. Some computational blocks such as cross rate or yield curve computations require filtered input and produce an output that the user may again want to filter.

[0073] Repeated filtering in series is rather dangerous because it may lead to too many rejections of quotes. If it cannot be avoided, only one of the filters in the chain should be of the standard type. The other filter(s) should be configured to be weak, i.e. they should eliminate not more than the totally aberrant outliers.

[0074] a. Overview of the Filtering Algorithm and its Structure

[0075] The filtering algorithm is structured in a hierarchical scheme of sub-algorithms. Table 1 gives an overview of this structure. The filter is univariate: it treats only one financial instrument at a time. Of course, we can create many filter objects for many instruments, but these filters do not interact with each other.

[0076] However, we can add a higher hierarchy level at the top of Table 1 for multivariate filtering. A multivariate filter could coordinate several univariate filters and enhance the filtering of sparse time series by using information from well-covered instruments. This is discussed in section 5.4.

[0077] Details of the different algorithmic levels are explained in the next sections. The sequence of these sections follows Table 1, from bottom to top.

TABLE 1. The basic structure of the filtering algorithm.

Hierarchy level 1 - univariate filter: the complete filtering of one time series: passing incoming quotes to the analysis of the lower hierarchy levels; managing the filter results of the lower hierarchy levels and packaging these results into the right output format of the filter; supporting real-time and historical filtering; supporting one or more filtering hypotheses, each with its own full-quote filtering window.

Hierarchy level 2 - full-quote filtering window: a sequence of recent full quotes, some of them possibly corrected according to a general filtering hypothesis. Tasks: quote splitting (the most important task): splitting full quotes (such as bid/ask) into scalar quotes to be filtered individually in their own scalar filtering windows; a basic validity test (e.g. whether prices are in the positive domain); a possible mathematical transformation (e.g. logarithm); all those filtering steps that require full quotes (not just bid or ask quotes alone) are done here.

Hierarchy level 3 - scalar filtering window: a sequence of recent scalar quotes whose credibilities are still in the process of being modified. Tasks: testing new, incoming scalar quotes; comparing a new scalar quote to all older quotes of the window (using a special business time scale and a dependence analysis of quote origins); computing a first (real-time) credibility of the new scalar quote; modifying the credibilities of older quotes based on the information gained from the new quote; dismissing the oldest scalar quote when its credibility is finally settled; updating the statistics with sufficiently credible scalar quotes when they are dismissed from the window.

[0078] In FIG. 4, the structure of the filter is also shown in the form of a UML class diagram. UML diagrams (the standard in object-oriented software development) are explained in (Fowler and Scott, 1997), for example. The same filter technology might also be implemented slightly differently from FIG. 4.

[0079] The three hierarchy levels of Table 1 can be found again in FIG. 4: (1) the univariate filter (UnivarFilter), (2) the full-quote filtering window (FullQuoteWindow) and (3) the scalar filtering window (ScalarQuoteWindow). Note that the word "tick" is just a synonym of the term "quote". The filter as explained in this specification is much richer than the class structure shown in FIG. 4. The filter also contains some special elements such as filters for monotonic fake quotes or scaled quotes. The description of these special filter elements may be found in section 5, following the main filter description. However, everything fits into the main structure given by Table 1 and FIG. 4. We recommend that the reader repeatedly consult this table and this figure in order to gain an overview of the whole algorithm while reading the next sections.
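
To make the three-level structure easier to picture, here is a bare-bones C# skeleton using the class names from FIG. 4 (my own sketch of the quoted description, not code from the patent; all method bodies are placeholders):

class ScalarQuote { public double Time; public double Value; public string Origin; }

class ScalarQuoteWindow            // level 3: scalar filtering window
{
    public void AddAndTest(ScalarQuote quote) { /* test the new quote, update credibilities */ }
}

class FullQuoteWindow              // level 2: full-quote filtering window
{
    ScalarQuoteWindow bidWindow = new ScalarQuoteWindow();
    ScalarQuoteWindow askWindow = new ScalarQuoteWindow();

    public void Add(double time, double bid, double ask, string origin)
    {
        // quote splitting: one full bid/ask quote becomes separate scalar quotes
        bidWindow.AddAndTest(new ScalarQuote { Time = time, Value = bid, Origin = origin });
        askWindow.AddAndTest(new ScalarQuote { Time = time, Value = ask, Origin = origin });
    }
}

class UnivarFilter                 // level 1: univariate filter for one instrument
{
    FullQuoteWindow window = new FullQuoteWindow();
    public void OnQuote(double time, double bid, double ask, string origin)
        => window.Add(time, bid, ask, origin);
}

class FilterDemo
{
    static void Main()
    {
        var filter = new UnivarFilter();
        filter.OnQuote(0.0, 1.3498, 1.3501, "BANK");   // feed one quote into the hierarchy
    }
}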

[0080] 1. Basic Filtering Elements and Operations

[0081] The first element to be discussed in a bottom-to-top specification is the scalar filtering window. Its position in the algorithm is shown in FIG. 4 (class ScalarQuoteWindow). Window filtering relies on a number of concepts and operations that are presented even before discussing the management of the window.

[0082] The basic filtering operations see the quotes in the simplified form of scalar quotes consisting of:

[0083] 1. the time stamp,

[0084] 2. one scalar variable value to be filtered (e.g. the logarithm of a bid price), here denoted by x,

[0085] 3. the origin of the quote.

[0086] The basic operations can be divided into two types:

[0087] 1. Filtering of single scalar quotes: considering the credibility of one scalar quote alone. An important part is the level filter where the level of the filtered variable is the criterion.

[0088] 2. Pair filtering: comparing two scalar quotes. The most important part is the change filter that considers the change of the filtered variable from one quote to another one. Filtering depends on the time interval between the two quotes and the time scale on which this is measured. Pair filtering also includes a comparison of quote origins.

[0089] The basic filtering operations and another basic concept of filtering, credibility, are presented in the following sections. Their actual application in the larger algorithm is explained later, starting from section 2.

[0090] a. Credibility and Trust Capital

[0091] Credibility is a central concept of the filtering algorithm. It is expressed by a variable C taking values between 0 and 1, where 1 indicates certain validity and 0 certain invalidity. This number can be interpreted as the probability of a quote being valid according to a certain arbitrary criterion. For two reasons, we avoid the formal introduction of the term "probability". First, the validity of a quote is a fuzzy concept; e.g. slightly deviating quotes of an over-the-counter spot market can perhaps be termed valid even if they are very unlikely to lead to a real transaction. Second, we have no model of probability even if validity could be exactly defined. Credibility can be understood as a "possibility" in the sense of fuzzy logic (Zimmermann, 1985).

[0092] Credibility is not additive: the credibility of a scalar quote gained from two tests is not the sum of the credibilities gained from the individual tests. This follows from the definition of credibility between 0 and 1. The sum of two credibilities of, say, 0.75 would be outside the allowed domain.

[0093] For internal credibility computations, it is useful to define an additive variable, the trust capital T, which is unlimited in value. There is no theoretical limit for gathering evidence in favor of accepting or rejecting the validity hypothesis. Full validity corresponds to a trust capital of T = +∞, full invalidity to T = -∞. We impose a fixed, monotonic relation between the credibility C and the trust capital T of a certain object:

C(T) = 1/2 + T / (2 · sqrt(1 + T^2))    (4.1)

[0094] and the inverse relation

T(C) = (C - 1/2) / sqrt(C · (1 - C))    (4.2)

[0095] There are possible alternatives to this functional relationship. The chosen solution has some advantages in the formulation of the algorithm that will be shown later.

[0096] The additivity of trust capitals and eqs. 4.1 and 4.2 imply the definition of an addition operator for credibilities. Table 2 shows the total credibility resulting from two independent credibility values.

TABLE 2. The total credibility C_total resulting from two independent credibility values C_1 and C_2. The function C_total = C[T(C_1) + T(C_2)] defines an addition operator for credibilities. Eqs. 4.1 and 4.2 are applied. The values in brackets, (0.5), are in fact indefinite limit values; C_total may converge to any value between 0 and 1.

C_total        C_1 = 0    0.25     0.5     0.75     1
C_2 = 1        (0.5)      1        1       1        1
C_2 = 0.75     0          0.5      0.75    0.878    1
C_2 = 0.5      0          0.25     0.5     0.75     1
C_2 = 0.25     0          0.122    0.25    0.5      1
C_2 = 0        0          0        0       0        (0.5)
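
To make the C/T mapping and Table 2 concrete, here is a small C# sketch of eqs. 4.1-4.2 and the addition operator (my own illustration of the quoted formulas, not code from the patent):

using System;

class Credibility
{
    static double C(double t) => 0.5 + t / (2.0 * Math.Sqrt(1.0 + t * t));   // eq. 4.1
    static double T(double c) => (c - 0.5) / Math.Sqrt(c * (1.0 - c));       // eq. 4.2

    // C_total = C[T(C_1) + T(C_2)], the addition operator behind Table 2.
    static double Add(double c1, double c2) => C(T(c1) + T(c2));

    static void Main()
    {
        Console.WriteLine(Add(0.75, 0.75));  // ~0.878, as in Table 2
        Console.WriteLine(Add(0.75, 0.25));  // ~0.5: the two pieces of evidence cancel out
    }
}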

[0097] b. Filtering of Single Scalar Quotes: the Level Filter

[0098] In the O & A filter, there is only one analysis of a single quote, the level filter. Comparisons between quotes (done for a pair of quotes, treated in section 1.c immediately below) are often more important in filtering than the analysis of a single quote.

[0099] The level filter computes a first credibility of the value of the filtered variable. This is useful only for those volatile but mean-reverting time series where the levels as such have a certain credibility in the absolute sense--not only the level changes. Moreover, the timing of the mean reversion should be relatively fast. Interest rates or IR futures prices, for example, are mean-reverting only after time intervals of years; they appear to be freely floating within smaller intervals. For those rates and for other prices, level filtering is not suitable.

[0100] The obvious example for fast mean reversion and thus for using a level filter is the bid-ask spread which can be rather volatile from quote to quote but tends to stay within a fixed range of values that varies only very slowly over time. For spreads, an adaptive level filter is at least as important as a pair filter that considers the spread change between two quotes.

[0101] The level filter first puts the filtered variable value x (possibly transformed as described in section 3.c) into the perspective of its own statistical mean and standard deviation. Following the notation of (Zumbach and Muller), the standardized variable x̂ is defined:

x̂ = (x - x̄) / MSD[ΔΘ_r, 2; x] = (x - x̄) / sqrt(EMA[ΔΘ_r; (x - x̄)^2])    (4.3)

[0102] where the mean value of x is also a moving average:

x̄ = EMA[ΔΘ_r; x]    (4.4)

[0103] The Θ-time scale (Dacorogna et al., 1993) to be used is discussed in section 1.d. The variable ΔΘ_r denotes the configurable range of the kernel of the moving averages and should cover the time frame of the mean reversion of the filtered variable; a reasonable value for bid-ask spreads has to be chosen. The iterative computation of moving averages is explained in (Zumbach and Muller). Here and for all the moving averages of the filtering algorithm, a simple exponentially weighted moving average (EMA) is used for efficiency reasons.
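
For readers who want to picture such an iterative EMA on an irregularly spaced tick series, here is one common update scheme in C# (a sketch under my own assumptions; the patent's actual operator, including the Θ-time scale and build-up handling, is richer than this):

using System;

class IrregularEma
{
    double tau;          // range (characteristic time) of the EMA kernel
    double ema;          // current EMA value
    double lastTime;
    bool initialized;

    public IrregularEma(double tau) { this.tau = tau; }

    // One common iterative EMA update for irregularly spaced data:
    // ema_new = mu * ema_old + (1 - mu) * x, with mu = exp(-dt / tau).
    public double Update(double time, double x)
    {
        if (!initialized) { ema = x; lastTime = time; initialized = true; return ema; }
        double mu = Math.Exp(-(time - lastTime) / tau);
        ema = mu * ema + (1.0 - mu) * x;
        lastTime = time;
        return ema;
    }
}

class EmaDemo
{
    static void Main()
    {
        var mean = new IrregularEma(tau: 600.0);   // e.g. a 10-minute range, with time in seconds
        Console.WriteLine(mean.Update(0.0, 1.3498));
        Console.WriteLine(mean.Update(12.0, 1.3501));
        Console.WriteLine(mean.Update(75.0, 1.3499));
    }
}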

[0104] A small |x̂| value deserves high trust; an extreme |x̂| value indicates an outlier with low credibility and negative trust capital. Before arriving at a formula for the trust capital as a function of x̂, the distribution function of x̂ has to be discussed. A symmetric form of the distribution function is assumed at least in coarse approximation. Positive definite variables such as the bid-ask spread are quite asymmetrically distributed; this is why they must be mathematically transformed. This means that x is already a transformed variable, e.g. the logarithm of the spread as explained in section 3.c.

[0105] The amount of negative trust capital for outliers depends on the tails of the distribution at extreme (positive and negative) x̂ values. A reasonable assumption is that the credibility of outliers is approximately the probability of exceeding the outlier value, given the distribution function. This probability is proportional to x̂^(-α), where α is called the tail index of the fat-tailed distribution. We know that distribution functions of level-filtered variables such as bid-ask spreads are indeed fat-tailed. Determining the distribution function and α in a moving sample would be a considerable task, certainly too heavy for filtering software. Therefore, we choose an approximate assumption on α that was found acceptable across many rates, filtered variable types and financial instruments: α = 4. This value is also used in the analogous pair filtering tool, e.g. for price changes, and discussed in section 1.c.

[0106] For extreme events, the relation between credibility and trust capital, eq. 4.1, can be asymptotically expanded as follows:

C ≈ 1 / (4 T^2)   for T << -1    (4.5)

[0107] Terms of order higher than (1/T)^2 are neglected here. Defining a credibility proportional to x̂^(-α) is thus identical to defining a trust capital proportional to x̂^(α/2). Assuming α = 4, we obtain a trust capital proportional to x̂^2. For outliers, this trust capital is negative, but for small x̂, the trust capital is positive up to a maximum value we define to be 1.

[0108] Now, we have the ingredients to come up with a formula that gives the resulting trust capital of the ith quote according to the level filter:

T_i0 = 1 - ξ_i^2    (4.6)

[0109] where the index 0 of T_i0 indicates that this is a result of the level filter only. The variable ξ_i is x in a scaled and standardized form:

ξ_i = x̂_i / ξ_0    (4.7)

[0110] with a constant ξ_0. Eq. 4.6 together with eq. 4.7 is the simplest possible way to obtain the desired maximum and asymptotic behavior. For certain rapidly mean-reverting variables such as hourly or daily trading volumes, this may be enough.

[0111] However, the actual implementation in the filter is for bid-ask spreads which have some special properties. Filter tests have shown that these properties have to be taken into account in order to attain satisfactory spread filter results:

[0112] Quoted bid-ask spreads tend to cluster at "even" values, e.g. 10 basis points, while the real spread may be an odd value oscillating in a range below the quoted value. A series of formal, constant spreads can therefore hide some substantial volatility that is not covered by the statistically determined denominator of eq. 4.3. We need an offset Δx_min^2 to account for the typical hidden volatility in that denominator. A suitable choice is Δx_min^2 = [constant_1 · (x̂ + constant_2)]^2.

[0113] High values of bid-ask spreads are worse in usability and plausibility than low spreads, by nature. Thus the quote deviations from the mean as defined by eq. 4.3 are judged in a biased way. Deviations to the high side (x̂_i > 0) are penalized by a factor p_high, whereas no such penalty is applied against low spreads.

[0114] For some (minor) financial instruments, many quotes are posted with zero spreads, i.e. bid quote = ask quote. This is discussed in section 6.1 (and its subsections). In some cases, zero spreads have to be accepted, but we set a penalty against them as in the case of positive x̂_i.

[0115] We obtain the following refined definition of ξ_i:

ξ_i = x̂_i / ξ_0             if x̂_i <= 0 and no zero-spread case
ξ_i = p_high · x̂_i / ξ_0    if x̂_i > 0 or in a zero-spread case    (4.8)

[0116] where x̂_i comes from a modified version of eq. 4.3,

x̂ = (x - x̄) / sqrt(EMA[ΔΘ_r; (x - x̄)^2] + Δx_min^2)    (4.9)

[0117] The constant ξ_0 determines the size of an x̂ that is just large enough to neither increase nor decrease the credibility.

[0118] Eq. 4.8 is general enough for all mean-reverting filterable variables. The filter of the present invention has a level filter only for bid-ask spreads. If we introduced other mean-reverting variables, a good value for Δx_min^2 would probably be much smaller or even 0, p_high around one and ξ_0 larger (to tolerate volatility increases in the absence of a basic volatility level Δx_min^2).
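
Putting eqs. 4.6-4.9 together, a compact C# sketch of the spread level filter might look like this (my reading of the quoted text; the constants and sample numbers are placeholders, not the patent's values):

using System;

class SpreadLevelFilter
{
    // Trust capital of a quote's (transformed) spread x, given the running EMA mean,
    // the EMA variance of (x - mean), and the offset dxMinSq of eq. 4.9.
    static double TrustCapital(double x, double mean, double emaVariance,
                               double dxMinSq, double xi0, double pHigh, bool zeroSpread)
    {
        double xHat = (x - mean) / Math.Sqrt(emaVariance + dxMinSq);   // eq. 4.9
        double xi = xHat / xi0;                                        // eq. 4.7
        if (xHat > 0.0 || zeroSpread) xi *= pHigh;                     // eq. 4.8: penalize high or zero spreads
        return 1.0 - xi * xi;                                          // eq. 4.6
    }

    static void Main()
    {
        // Placeholder numbers: a spread near its mean earns positive trust,
        // an unusually wide spread earns strongly negative trust.
        Console.WriteLine(TrustCapital(0.0003, 0.0003, 1e-8, 1e-9, 2.0, 1.5, false));
        Console.WriteLine(TrustCapital(0.0030, 0.0003, 1e-8, 1e-9, 2.0, 1.5, false));
    }
}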

[0119] c. Pair Filtering

[0120] The pairwise comparison of scalar quotes is a central basic filtering operation. Simple filtering algorithms indeed consisted of a simple sequence of pair filtering operations: each new quote was judged only in relation to the "last valid" quote. The current filter makes more pairwise comparisons also for quotes that are not neighbors in the series as explained in section 2.

[0121] Pair filtering contains several ingredients, the most important one being the filter of variable changes. The time difference between the two quotes plays a role, so the time scale on which it is measured has to be specified. The criterion is adaptive to the statistically expected volatility estimate and therefore uses some results from a moving statistical analysis.

[0122] Another basic pair filtering operation is the comparison of the origins of the two quotes. Some data sources provide rich information about contributors, some other sources hide this information or have few contributors (or just one). The comparison of quote origins has to be seen in the perspective of the observed diversity of quote origins. Measuring this diversity (which may change over time) adds another aspect of adaptivity to the filter.

[0123] i. The Change Filter

[0124] The change filter is a very important filtering element. Its task is to judge the credibility of a variable change according to experience, which implies the use of on-line statistics and thus adaptivity. The change of the filtered variable from the jth to the ith quote is

Δx_ij = x_i − x_j    (4.10)

[0125] The variable x may be the result of a transformation in the sense of section 3.c. The time difference of the quotes is Δϑ_ij, measured on a time scale to be discussed in section 1.d.

[0126] The variable change Δx is put into a relation to a volatility measure: the expected variance V(Δϑ) about zero. V is determined by the on-line statistics as described in section 1.c. The relative change is defined as follows:

ξ_ij = Δx_ij / ( ξ_0 √V(Δϑ_ij) )    (4.11)

[0127] with a positive constant ξ_0, which has a value of 5.5 in the present filter and is further discussed below. Low |ξ| values deserve high trust; extreme |ξ| values indicate low credibility and negative trust capital: at least one of the two compared quotes must be an outlier.

[0128] The further algorithm is similar to that of the level filter as described in section 4.2, using the relative change ξ_ij instead of the scaled standardized variable ξ_i.

[0129] The amount of negative trust capital for outliers depends on the distribution function of changes Δx, especially the tail of the distribution at extreme Δx or ξ values. A reasonable assumption is that the credibility of outliers is approximately the probability of exceeding the outlier value, given the distribution function. This probability is proportional to ξ^(−α), where α is the tail index of a fat-tailed distribution. We know that distribution functions of high-frequency price changes are indeed fat-tailed. Determining the distribution function and α in a moving sample would be a considerable task beyond the scope of filtering software. Therefore, we make a rough assumption on α that is good enough across many rates, filtered variable types and financial instruments. For many price changes, a good value is around α ≈ 3.5, according to (Muller et al., 1998). As in section 4.2, we generally use α = 4 as a realistic, general approximation.

[0130] As in section 4.2 and with the help of eq. 4.5, we argue that the trust capital should asymptotically be proportional to −ξ² and arrive at a formula that gives the trust capital as a function of ξ:

U_ij = U(ξ_ij²) = 1 − ξ_ij²    (4.12)

[0131] which is analogous to eq. 4.6. This trust capital, which depends only on ξ, is called U to distinguish it from the final trust capital T that is based on more criteria. At ξ = 1, eq. 4.12 yields a zero trust capital, neither increasing nor decreasing the credibility. Intuitively, a variable change of a few standard deviations might correspond to this undecided situation; smaller variable changes lead to positive trust capital, larger ones to negative trust capital. In fact, the parameter ξ_0 of eq. 4.11 should be configured to a high value, leading to rather tolerant behavior even if the volatility V is slightly underestimated.
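
A minimal Python sketch of eqs. 4.10 to 4.12, assuming the expected variance V(Δϑ_ij) is delivered by the on-line statistics described later; the function names are hypothetical.

import math

def relative_change(x_i, x_j, expected_variance, xi_0=5.5):
    # eqs. 4.10 and 4.11; xi_0 = 5.5 as stated in the text
    return (x_i - x_j) / (xi_0 * math.sqrt(expected_variance))

def preliminary_trust_capital(xi):
    # eq. 4.12: positive for small |xi|, zero at |xi| = 1, negative for outliers
    return 1.0 - xi * xi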

[0132] The trust capital U_ij from eq. 4.12 is a sufficient concept under the best circumstances: independent quotes separated by a small time interval. In the general case, a modified formula is needed to solve the following three special pair filtering problems.

[0133] 1. Filtering should stay a local concept on the time axis. However, a quote has few close neighbors and many more distant neighbors. When the additive trust capital of a quote is determined by pairwise comparisons to other quotes as explained in section 3.b, the results from distant quotes must not dominate those from the close neighbors; the interaction range should be limited. This is achieved by defining the trust capital proportional to (Δϑ)⁻³ (assuming a constant ξ) for asymptotically large quote intervals Δϑ.

[0134] 2. For large Δϑ, even moderately aberrant quotes would be too easily accepted by eq. 4.12. Therefore, the aforementioned decline of trust capital with growing Δϑ is particularly important in the case of positive trust capital. Negative trust capital, on the other hand, should stay strongly negative even if Δϑ is rather large. The new filter needs a selective decline of trust capital with increasing Δϑ: fast for small ξ (positive trust capital), slow for large ξ (negative trust capital). This treatment is essential for data holes or gaps, where there are no (or few) close neighbor quotes.

[0135] 3. Dependent quotes: if two quotes originate from the same source, their comparison can hardly increase the credibility (but it can reinforce negative trust in the case of a large ξ). In section 1.c, we introduce an independence variable I_ij between 0 (totally dependent) and 1 (totally independent).

[0136] The last two points imply a certain asymmetry in the trust capital: gathering evidence in favor of accepting a quote is more delicate than gathering evidence in favor of rejecting it.

[0137] All these concerns can be taken into account in an extended version of eq. 4.12. This is the final formula for the trust capital from a change filter:

T_ij = T(ξ_ij², Δϑ_ij, I_ij) = I*_ij · (1 − ξ_ij⁴) / ( 1 + ξ_ij² + (d Δϑ_ij / v)³ )    (4.13)

[0138] where

I*_ij = { I_ij   if ξ_ij² < 1
        { 1      if ξ_ij² ≥ 1        (4.14)

[0139] The independence I_ij is always between 0 and 1 and is computed by eq. 4.23. The variable d is a quote density explained in section 1.c. The configurable constant v determines a sort of filtering interaction range in units of the typical quote interval (≈ 1/d).

[0140] Table 3 shows the behavior of the trust capital according to eq. 4.13. The trust capital converges to zero with an increasing quote interval Δϑ much more rapidly for small variable changes |ξ| than for large ones. For small Δϑ_ij and I_ij = 1, eq. 4.13 converges to eq. 4.12.

TABLE 3
The trust capital T resulting from a comparison of two independent (I* = 1) scalar quotes, depending on the relative variable change ξ and the time interval Δϑ between the quotes. ξ is defined by eq. 4.11, and d and v are explained in the text.

              T for d·Δϑ/v =     0        0.5      1        2        4
  |ξ| = 4                      -15.0    -14.9   -14.2    -10.2    -3.2
        2                       -3.0     -2.9    -2.5     -1.2    -0.22
        1                        0        0       0        0       0
        0.5                      0.75     0.68    0.42     0.10    0.014
        0                        1        0.89    0.50     0.11    0.015
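
The following Python sketch implements eqs. 4.13 and 4.14 and reproduces a few entries of Table 3; the function name is made up, and d·Δϑ_ij/v is passed in as a single ready-made ratio.

def pair_trust_capital(xi, d_dtheta_over_v, independence):
    # eq. 4.14: the independence is overridden by 1 for clear outliers
    xi2 = xi * xi
    i_star = 1.0 if xi2 >= 1.0 else independence
    # eq. 4.13: trust capital declines with the cubed (scaled) quote interval
    return i_star * (1.0 - xi2 * xi2) / (1.0 + xi2 + d_dtheta_over_v ** 3)

# spot checks against Table 3 (independent quotes, I* = 1)
assert round(pair_trust_capital(4.0, 0.0, 1.0), 1) == -15.0
assert round(pair_trust_capital(0.5, 1.0, 1.0), 2) == 0.42
assert round(pair_trust_capital(0.0, 1.0, 1.0), 2) == 0.50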

[0141] ii. Computing the Expected Volatility

[0142] The expected volatility is a function of the size of the time interval between the quotes and thus requires a larger computation effort than other statistical variables. Only credible scalar quotes should be used in the computation. The updates of all statistics are therefore managed by another part of the algorithm that knows about final credibilities as explained in section 2.d.ii.

[0143] Choosing an appropriate time scale for measuring the time intervals between quotes is also important. A scale like ϑ-time (Dacorogna et al., 1993) is good because it leads to reasonable volatility estimates without seasonal disturbances. This is further discussed in section 1.d.

[0144] The expected volatility computation can be implemented with more or less sophistication. Here, a rather simple solution is taken. The first required statistical variable is the quote density:

d = EMA[Δϑ_r; c_d / δϑ]    (4.15)

[0145] This is a moving average in the notation of (Zumbach and Muller); δϑ is the time interval between two "valid" (as defined on a higher level) neighbor quotes on the chosen time scale, which is ϑ-time, as in all these computations. Δϑ_r is the configurable range of the kernel of the moving average. The variable c_d is the weight of the quote, which has a value of c_d = 1 or lower in the case of repeated quote values. The iterative computation of moving averages is explained in (Zumbach and Muller). The value 1/δϑ has to be assumed for the whole quote interval, which implies using the "next point" interpolation as explained in the same documents. It can be shown that a zero value of δϑ does not lead to a singularity of the EMA (but it must be handled correctly in a software program).
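
The iterative EMA operator itself is defined in (Zumbach and Muller) and is not reproduced in this text; the Python sketch below uses a common exponential-decay form with the "next point" interpolation and should be read as an assumption, not as the exact operator. The class name and the kernel-range value are illustrative.

import math

class Ema:
    """Exponential moving average, "next point" interpolation (assumed form)."""

    def __init__(self, kernel_range):
        self.kernel_range = kernel_range   # range of the kernel on the chosen time scale
        self.value = 0.0

    def update(self, dt, z):
        # decay over the elapsed interval; the new value is assumed for the whole interval
        mu = math.exp(-dt / self.kernel_range)
        self.value = mu * self.value + (1.0 - mu) * z
        return self.value

    def update_rate(self, dt, numerator):
        # observation of the form numerator/dt (as in eq. 4.15), written so that
        # dt -> 0 stays finite, as the text notes it must in a software program
        mu = math.exp(-dt / self.kernel_range)
        contribution = (1.0 - mu) * (numerator / dt) if dt > 0 else numerator / self.kernel_range
        self.value = mu * self.value + contribution
        return self.value

# quote density d of eq. 4.15 (the kernel range is a made-up value):
quote_density = Ema(kernel_range=30.0)
# at each valid neighbor-quote interval delta_theta with weight c_d:
#     quote_density.update_rate(delta_theta, c_d)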

[0146] An annualized squared "micro"-volatility is defined as a variance, again in the form of a moving average:

ν = EMA[Δϑ_r; (δx)² / (δϑ + δϑ_0)]    (4.16)

[0147] where the notation (including the δ operator) is again defined by (Zumbach and Muller) and the range Δϑ_r is the same as in eq. 4.15. δx is the change of the filtered variable between (sufficiently credible) neighbor quotes. There is a small time interval offset

δϑ_0 = max( d_0 / d, δϑ_min )    (4.17)

[0148] The small positive term δϑ_0 accounts for some known short-term behaviors of markets: (1) certain asynchronicities in the quote transmissions, (2) some temporary market level inconsistencies that need time to be arbitraged out, (3) a negative autocorrelation of many market prices over short time lags (Guillaume et al., 1997). However, δϑ_0 is not needed to avoid singularities of ν; even a zero value of both δϑ and δϑ_0 would not lead to a singularity of the EMA. The "next point" interpolation is again appropriate in the EMA computation.

[0149] Strictly speaking, ν can be called annualized only if ϑ is measured in years, but the choice of this unit does not matter in our algorithm. The exponent of the annualization (here: assuming a Gaussian drift) is not too important because the different values of δϑ share the same order of magnitude.

[0150] Experience shows that the volatility measure of the filter should not rely on only one variance ν as defined above. Preferably, we use three such volatilities: ν_fast, ν and ν_slow. All of them are computed by eq. 4.16, but they differ in their ranges Δϑ_r: ν_fast has a short range, ν a medium-sized range and ν_slow a long range. The expected volatility is assumed to be the maximum of the three:

ν_exp = max(ν_fast, ν, ν_slow)    (4.18)

[0151] This is superior to taking only ν. In the case of a market shock, the rapid growth of ν_fast allows a quick adaptation of the filter, whereas the inertia of ν_slow prevents the filter from forgetting volatile events too rapidly in a quiet market phase.
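
A compact Python sketch of eqs. 4.16 and 4.18; the three kernel ranges are invented for illustration, and the EMA update uses the same assumed "next point" form as above, repeated here so the block stays self-contained.

import math

def make_variance_ema(kernel_range):
    state = {"value": 0.0}
    def update(dt, observation):
        mu = math.exp(-dt / kernel_range)
        state["value"] = mu * state["value"] + (1.0 - mu) * observation
        return state["value"]
    return update

# short, medium and long ranges (made-up values):
nu_fast = make_variance_ema(5.0)
nu_mid = make_variance_ema(30.0)
nu_slow = make_variance_ema(200.0)

def update_expected_volatility(delta_theta, dx, dtheta_0):
    obs = dx * dx / (delta_theta + dtheta_0)        # observation of eq. 4.16
    return max(nu_fast(delta_theta, obs),
               nu_mid(delta_theta, obs),
               nu_slow(delta_theta, obs))           # eq. 4.18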

[0152] From the annualized ν_exp, we obtain the expected squared change as a function of the time interval Δϑ between two quotes. At this point, the filter has a special element that prevents it from easily accepting price changes over large data gaps, i.e. time periods with no quotes. Data gaps are characterized by a large value of Δϑ but very few quotes within this interval. In the case of data gaps, an upper limit of Δϑ is enforced:

Δϑ_corr = min[ 2.5 Q / d, max( 0.1 Q / d, Δϑ ) ]    (4.19)

[0153] where d is taken from eq. 4.15 and Q is the number of valid quotes in the interval between the two quotes; this is explained in section 2.b. Eq. 4.19 also sets a lower limit on Δϑ_corr in the case of a very high frequency of valid quotes; this is important for validating fast trends with many quotes.

[0154] The corrected quote interval Δϑ_corr is now used to compute the expected squared change V:

V = V(Δϑ_corr) = (Δϑ_corr + Δϑ_0) ν_exp + V_0    (4.20)

[0155] This function V(Δϑ_corr) is needed in the trust capital calculation of section 1.c.i and inserted in eq. 4.11. The positive offset V_0 is small and could be omitted in many cases with no loss of filter quality. However, a small V_0 > 0 is useful. Some quotes are quoted with coarse granularity, i.e. the minimum step between two possible quote values is rather large as compared to the volatility.

[0156] This is the case for some interest rate futures and also for bid-ask spreads (in FX markets), which often have a rounded size of 5, 10, or 20 basis points with rarely a value in between. Quotes with coarse granularity have a hidden volatility: a series of identical quotes may hide a movement of a size smaller than the typical granule. The term V_0 thus represents the hidden volatility:

V_0 = 0.25 g² + ε_0²    (4.21)

[0157] where the granule size g is determined by eq. 8.15. The whole granularity analysis is explained in section 5.b, where it plays a more central role. There is yet another term, ε_0², which is extremely small in normal cases. This ε_0² is not related to economics; its purely numerical task is to keep V_0 > 0.
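
A Python sketch of eqs. 4.19 to 4.21; all arguments are assumed to come from the statistics described above (d from eq. 4.15, ν_exp from eq. 4.18, the granule g from eq. 8.15), and the function name is hypothetical.

def expected_squared_change(delta_theta, q_valid, d, dtheta_0, nu_exp, granule, eps_0):
    # eq. 4.19: clamp the quote interval against data gaps (upper limit)
    # and against very dense sequences of valid quotes (lower limit)
    dtheta_corr = min(2.5 * q_valid / d, max(0.1 * q_valid / d, delta_theta))
    # eq. 4.21: hidden volatility of coarsely granular quotes plus a tiny numerical term
    v_0 = 0.25 * granule ** 2 + eps_0 ** 2
    # eq. 4.20
    return (dtheta_corr + dtheta_0) * nu_exp + v_0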

[0158] The term ε_0² however plays a special role if the scalar variable to be filtered is a (mathematically transformed) bid-ask spread. The spread filter is the least important filter, but it leads to the highest number of rejections of FX quotes if it is configured similarly to the filters of other scalars. This fact is not accepted by typical filter users: they want a more tolerant spread filter. A closer look shows that different contributors of bid-ask quotes often have different spread quoting policies. They are often interested only in the bid or the ask side of the quote and tend to push the other side off the real market by choosing too large a spread. This results in so-called bid-ask bouncing and in spreads of different sizes between neighbor quotes even in quiet markets. In some minor FX markets, some contributors even mix retail quotes with very large spreads into the stream of interbank quotes. In order not to reject too many quotes for spread reasons, we have to raise the tolerance for fast spread changes and reject only extreme jumps in spreads. This means raising ε_0². The filter has ε_0 = constant_1 (x̄ + constant_2), where x̄ is defined by eq. 4.4. This choice of ε_0 can be understood and appreciated if the mapping of the bid-ask spread, eq. 6.2, is taken into account.

[0159] In a filter run starting from scratch, we set V_0 = ε_0² and replace this by eq. 4.21 as soon as the granule estimate g is available, based on real statistics from valid quotes (as explained in section 5.b).

[0160] iii. Comparing Quote Origins

[0161] Pair filtering results can add some credibility to the two quotes only if these are independent. Two identical quotes from the same contributor do not add a lot of confidence to the quoted level--the fact that an automated quoting system sends the same quote twice does not make this quote more reliable. Two non-identical quotes from the same contributor may imply that the second quote has been produced to correct a bad first one; another interpretation might be that an automated quoting system has a random generator to send a sequence of slightly varying quotes to mark presence on the information system. (This is why a third quote from one contributor in a rapid sequence should be given much less confidence than a second one, but this subtle rule has not yet been implemented). Different quotes from entirely different contributors are the most reliable case for pair filtering.

[0162] The basic tool is a function to compare the origins of the two quotes, considering the main source (the information provider), the contributor ID (bank name) and the location information. This implies that available information on contributors has a value in filtering and should be collected rather than ignored. An "unknown" origin is treated just like another origin name. The resulting independence measure I'_ij is confined between 0 for identical origins and 1 for clearly different origins. In some cases (e.g. same bank but different subsidiary), an intermediate value between 0 and 1 can be chosen.

[0163] I'_ij is not yet the final result; it has to be put into relation with the general origin diversity of the time series. An analysis of data from only one or very few origins must be different from that of data with a rich variety of origins. The general diversity D can be defined as a moving average of the I'_(i,i-1) of valid neighbor quotes:

D = EMA[tick-time, r; I'_(i,i-1)]    (4.22)

[0164] where the range r (the center of gravity of the kernel) is configured to about 9.5. The "tick-time" is a time scale that is incremented by 1 at each new quote used; the notation of (Zumbach and Muller) is applied. The "next point" interpolation is again appropriate in the EMA computation. Only "valid" quotes are used; this is possible on a higher level of the algorithm, see section 3.d.ii. By doing so, we prevent D from being lowered by bad mass quotes from a single computerized source over a weekend or overnight. Thus we are protected against a tough filtering problem: a high number of bad mass quotes from a single contributor will not force the filter to accept the bad level.

[0165] The use of D makes the independence variable I_ij adaptive through the following formula:

I_ij = I'_ij + f(D) (1 − I'_ij)    (4.23)

[0166] with

f(D) = ( 0.0005 + (1 − D)⁸ ) / 2.001    (4.24)

[0167] If the diversity is very low (e.g., for a single-contributor source), this formula (reluctantly) raises the independence estimate I_ij in order to allow some positive trust capital to build up. For a strictly uniform source (I' = D = 0), I_ij reaches 0.5, that is, one half of the I_ij value of truly independent quotes in a multi-contributor series.
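
In Python, the adjustment of eqs. 4.23 and 4.24 is short; the diversity D is assumed to be maintained elsewhere by the EMA of eq. 4.22, and the function name is hypothetical. The closing check reproduces the 0.5 value quoted above for a strictly uniform source.

def adjusted_independence(i_prime, diversity):
    f = (0.0005 + (1.0 - diversity) ** 8) / 2.001   # eq. 4.24
    return i_prime + f * (1.0 - i_prime)            # eq. 4.23

# strictly uniform source: I' = D = 0 gives one half
assert abs(adjusted_independence(0.0, 0.0) - 0.5) < 1e-6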

[0168] The output variable I_ij resulting from eq. 4.23 is always confined between 0 and 1 and is generally used in eq. 4.14. Some special cases need a special discussion:

[0169] Repeated quotes. Rarely, the raw data contains long series of repeated quotes from the same contributor, and the obtained value of I_ij may still be too high. Solution: in the present filter, this case is handled by a special filtering element described in section 3.b.

[0170] High-quality data. In Olsen & Associates' database, we have completed and merged our collected data with some old, historical, commercially available daily data that was of distinctly higher quality than the data from a single, average-quality contributor. When comparing two quotes from this historical daily data, we forced I_ij = 1 although these quotes came from the same "contributor." This special filtering element is necessary only if there are huge, proven quality differences between contributors; otherwise we can omit it.

[0171] Only in multivariate filtering (which is not included in the present filter, see section 5.d): artificial quotes that might be injected by a multivariate covariance analysis should have I'_ij = 1 when compared to each other or to any other quote.

[0172] d. A Time Scale for Filtering

[0173] Time plays a role in the adaptive elements of the level filter as well as in almost all parts of the change filter. Variable changes are tolerated more easily when the two time stamps are separated by a large time interval. When using the term "time interval," we need to specify the time scale to be used.

[0174] The algorithm works with any time scale, but some are more suitable than others. If we tolerated quote level changes of the same size over weekend hours as over working hours, we would have to accept almost any bad quote from the few weekend contributors. These weekend quotes are sometimes test quotes or other outliers posted in the absence of a liquid market. The same danger arises during holidays (though holidays may be confined to individual countries).

[0175] The choice of the time scale is important. Accounting for the low weekend activity is vital, but the exact treatment of typical volatility patterns during working days is less important. Therefore, we cannot accept using only physical time (=calendar/clock time), but the following solutions are possible:

[0176] 1. A very simple business time with two states: active (working days) and inactive (weekend from Friday 21:00:00 GMT to Sunday 21:00:00 GMT, plus the most important and general holidays); the speed of this business time as compared to physical time would be either 1.4 (in active state) or 0.01 (in inactive state);

[0177] 2. An adaptively weighted mean of three simple, generic business time scales ϑ_k: smoothly varying weights according to built-in statistics. This is the solution recommended for a new filter development independent of complex ϑ-time technology.

[0178] 3. An adaptively weighted mean of three generic business time scales ϑ_k as defined by (Dacorogna et al., 1993); this is the solution of the filter running at Olsen & Associates, which requires a rather complicated implementation of ϑ-time.

[0179] The second solution differs from the third one only in the definition of the basic ϑ-time scales. Their use and the adaptivity mechanism are the same for both solutions.

[0180] Three generic ϑ-times are used, based on typical volatility patterns of three main markets: Asia, Europe and America. In the second solution, these ϑ-times are simply defined as follows:

dϑ_k/dt = { 3.4    if t_start,k ≤ t_d < t_end,k on a working day
          { 0.01   otherwise (inactive times, weekends, holidays)        (4.25)

[0181] where t_d is the daytime in Greenwich Mean Time (GMT) and the generic start and end times of the working-daily activity periods are given by Table 4; they correspond to typical observations in several markets. The active periods of exchange-traded instruments are subsets of the active periods of Table 4. The time scales ϑ_k are time integrals of dϑ_k/dt from eq. 4.25. Thus the time ϑ_k flows either rapidly, in active market times, or very slowly, in inactive times; its long-term average speed is similar to that of physical time. The implementation of eq. 4.25 requires some knowledge of holidays. The database of holidays to be applied may be rudimentary (e.g., just Christmas) or more elaborate, covering all main holidays of the financial centers on the three continents. The effect of daylight saving time is neglected here, as the market activity model is coarse anyway.

TABLE 4
Daytimes limiting the active periods of three generic, continent-wide markets, in Greenwich Mean Time (GMT). The scheme is coarse, modeling just the main structure of worldwide financial markets. The active periods differ according to local time zones and business hours; the Asian market already starts on the day before (from the viewpoint of the GMT time zone).

  market         k    t_start,k    t_end,k
  (East) Asia    1    21:00         7:00
  Europe         2     6:00        16:00
  America        3    11:00        21:00
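
A Python sketch of eq. 4.25 with the active periods of Table 4; the dictionary name and the handling of the Asian period that wraps around midnight GMT are illustrative choices, and holiday handling is reduced to a single flag.

# active periods from Table 4, in GMT hours; the Asian period wraps around midnight
ACTIVE_HOURS = {"Asia": (21, 7), "Europe": (6, 16), "America": (11, 21)}

def theta_speed(market, gmt_hour, working_day, holiday=False):
    # eq. 4.25: d(theta_k)/dt is 3.4 inside the market's active period on a
    # working day and 0.01 otherwise (nights, weekends, holidays)
    start, end = ACTIVE_HOURS[market]
    if working_day and not holiday:
        if start < end:
            in_period = start <= gmt_hour < end
        else:  # period wraps around midnight (Asia)
            in_period = gmt_hour >= start or gmt_hour < end
        if in_period:
            return 3.4
    return 0.01

# theta_k itself is the time integral of theta_speed over physical time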

[0182] In Olsen & Associates' filter, the three ϑ_k-times are defined according to (Dacorogna et al., 1993); effects like daylight saving time and local holidays (i.e., holidays characteristic of one continent) are covered. The activity in the morning of the geographical markets is higher than that in the afternoon -- a typical behavior of FX rates and, even more so, of interest rates, interest rate futures and other exchange-traded markets. In both solutions, the sharply defined opening hours of exchange-traded markets cannot be fully modeled, but the approximation turns out to be satisfactory in filter tests. An improved version of ϑ-time is being tested in which sudden changes of market activity over the day can be modeled.

[0183] Once the three scales ϑ_k are defined (by the integrals of eq. 4.25 in our suggestion), their adaptively weighted mean is constructed and used as the time scale ϑ for filtering. This ϑ-time is able to approximately capture the daily and weekly seasonality and the holiday lows of volatility. Absolute precision is not required, as ϑ is only one among many ingredients of the filtering algorithm, many of which are based on rather coarse approximations. This is the definition of ϑ-time:

ϑ = Σ_(all k) w_k ϑ_k    (4.26)

with

Σ_(all k) w_k = 1    (4.27)

[0184] where "all k" means "all markets"; this is 3 in our case, but the algorithm also works for any other number of generic markets. The weights w_k are adaptive to the actual behavior of the volatility. A high w_k reflects a high fitness of ϑ_k, which implies that the volatility measured in ϑ_k has low seasonal variations.

[0185] The determination of the w_k might be done with complicated methods such as maximum likelihood fitting of a volatility model. However, this would be inappropriate, given the convergence problems of fitting and the modeling limitations that eq. 4.26 has in any case. The heuristic method of the Olsen & Associates' filter always returns an unambiguous solution. The volatility of changes of the filtered variable is measured on all ϑ_k-scales in terms of a variance similar to eq. 4.16:

σ_k = EMA[Δϑ_smooth; (δx)² / (δϑ_k + δϑ_0)]    (4.28)

[0186] where δϑ_k is the interval between validated neighbor quotes in ϑ_k-time, δx is the corresponding change of the filtered variable, δϑ_0 is defined by eq. 4.17 and the time scale of the EMA is ϑ_k-time. The notation of (Zumbach and Muller) is again used. Smoothing with a short range Δϑ_smooth is necessary to diminish the influence of quote-to-quote noise. The EMA computation assumes a constant value of (δx)²/(δϑ_k + δϑ_0) for the whole quote interval; this means the "next point" interpolation (Zumbach and Muller).

[0187] The fluctuations of the variable σ_k indicate the badness of the ϑ_k model. In the case of a bad fit, σ_k is often very low (when the ϑ_k-scale expands time) and sometimes very high (when the ϑ_k-scale compresses time). The fluctuations are quantified in terms of the variance F_k:

F_k = EMA[Δϑ_r; (σ_k − EMA[Δϑ_r; σ_k])²] = MVar[Δϑ_r, 2; σ_k]    (4.29)

[0188] in the notation of (Zumbach and Muller), where the time scale is again ϑ_k-time. The range Δϑ_r has to be suitably chosen. In our heuristic approximation, the fluctuations directly define the weight of the k-th market:

w_k = (1/F_k) / Σ_(all k') (1/F_k')    (4.30)

[0189] which satisfies eq. 4.27 and can be inserted in eq. 4.26.
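
A Python sketch of eq. 4.30 (with eq. 4.27 satisfied by construction); the fluctuation variances F_k of eq. 4.29 are assumed to be available from the moving statistics, and the example values are invented.

def market_weights(fluctuations):
    # eq. 4.30: weight each generic market by the inverse of F_k, normalized to sum to 1
    inverses = [1.0 / f for f in fluctuations]
    total = sum(inverses)
    return [inv / total for inv in inverses]

# invented F_k values: the theta_k with the smoothest volatility gets the largest weight
print(market_weights([0.2, 0.5, 1.0]))   # -> [0.625, 0.25, 0.125]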

[0190] 2. The Scalar Filtering Window

[0191] The filter uses a whole neighborhood of quotes for judging the credibilities of scalar quotes: the scalar filtering window. This window covers the set of all quotes of a time series that are contained in a time interval. In the course of the analysis, new quotes are integrated and old quotes are dismissed at the back end of the window, following a certain rule. Thus the window moves forward in time. This mechanism is illustrated by FIG. 5.

[0192] All the scalar quotes within the window have a provisional credibility value which is modified with new incoming quotes. When the quotes leave

User avatar
michal.kreslik
rank: 1000+ posts
Posts: 1047
Joined: Sat May 13, 2006 2:40 am
Reputation: 36
Location: Monte Carlo, Monaco
Real name: Michal Kreslik
Gender: Male

Postby michal.kreslik » Tue Aug 08, 2006 6:52 pm

djfort wrote: Came across the following text and would be interested to know what you think of it… thanks

Tony


djfort wrote:[0026] 1. Decimal errors: Failure to change a "big" decimal digit of the quote. Example: a bid price of 1.3498 is followed by a true quote 1.3505, but the published, bad quote is 1.3405. This error is most damaging if the quoting software is using a cache memory somewhere. The wrong decimal digit may stay in the cache and cause a long series of bad quotes. For Reuters page data, this was a dominant error type around 1988! Nowadays, this error type seems to be rare.


The above patent application deals with something different. It's a method for filtering out so-called "bad ticks". One of the errors that the method claims to correct is the "decimal error" caused by a higher-order decimal symbol not being flushed from the cache in time.

Inspirational reading, thanks. Should I file a patent application for some of my various observations, too? :))

Have a great day,
Michal

User avatar
TheRumpledOne
rank: 10000+ posts
Posts: 15544
Joined: Sun May 14, 2006 9:31 pm
Reputation: 3035
Location: Oregon
Real name: Avery T. Horton, Jr.
Gender: None specified
Contact:

Postby TheRumpledOne » Thu Aug 10, 2006 2:27 am

I guess I should file a patent or 2 myself!!
IT'S NOT WHAT YOU TRADE, IT'S HOW YOU TRADE IT!

Please do NOT PM me with trading or coding questions, post them in a thread.


