historical tick data SQL database

michal.kreslik · Postby **michal.kreslik** » Tue May 01, 2007 7:58 pm

It's MS SQL Server 2005 ver. 9.1 Developer edition, which is the same in functionality as the Enterprise edition.

Michal

ezTrader · Postby **ezTrader** » Tue May 01, 2007 9:59 pm

I haven't done much db work for a good while, but I found this on MSDN. There's a quote below about reducing the index size. Knowing you,

this is not new information, but in the unlikely chance it is helpful...

Index Build strategy in SQL Server - Part 1: offline, serial, no partitioning
Builder (write data to the in-build index)

|

Sort (order by index key)

|

Scan (read data from source)

In order to build the b-tree for the index we have to first sort the data from source. The flow is to scan the source, sort it (if possible - in memory*), then build the b-tree from the sort.

Why do we need to sort first before building the b-tree? In theory we don?t have to sort, we could use regular DML and directly insert data into the in-build index (no sort), but in this case we would be doing random inserts, random inserts in a b-tree require searching the b-tree for the correct leaf node first and then inserting the data. And while searching a b-tree is fairly fast, doing so before each insert is far from optimal. So for index build operation, we sort the data using the sort order for the new index, so when we insert data into the in-build index, it is not a random insert, it is actually an append operation and this is why the operation can be much faster than random inserts.

While inserting data between sort and index builder we free each extent from the sort table as soon as all of its rows are copied.

In this way we reduce the overall disk space consumption from a possible 3*Index Size (source + sort table + b-tree) to just 2.2*Index Size (approximately).

*We do not guarantee in memory sort; the decision of whether we can do in memory sort or not depends on memory available and actual row count. ?In-memory? sort is, naturally, fast (also disk space requirements will be more relaxed in this case, because we don?t have to allocate space for the sort table on the disk), but it is not required; we can always spill data to disk, although the performance is much slower than in-memory sort.

For an index build operation, we use the user database (the database where we build the index) by default for sort to spill data, but if the user specifies sort_in_tempdb, then we use tempdb for spill.

Each sort table (even when we have very little data to sort) requires at least 40 pages (3200KB) to run (later we will see that in case of parallelism we can have several sort tables at the same time). When calculating memory for sort, we try to allocate enough memory to have an in-memory sort. For large index build operations it is not likely that we will be able to fit the entire sort in memory. If we can?t provide at least 40 pages for the Index Build operation, it will fail.

The last step of index build is to always build full statistics. Good statistics information helps the query optimizer to generate better plan, users can issue ?create? or ?update? stats commands to force SQL Server generate or refresh stats on a certain object. When we are building a new index, since we need to touch every row, we use this opportunity to build full stats at the same time as a side benefit.

Conclusion:
To be able to build a non partitioned Index offline with serial plan we will need free disk space (in user?s or in tempdb database) of approximately 2.2*IndexSize and at least 40 pages of memory available for the query executor to be able to start the process.

Read in next post: Index Build Scenario 2: Offline, Parallel, No Partitioning
Posted by: Lyudmila Fokina
Published Monday, November 20, 2006 1:19 PM by queryproc
Filed under: Indexing

ezTrader · Postby **ezTrader** » Tue May 01, 2007 10:17 pm

Actually, this might be relevant in that it hints at a workaround...but again, you have likely seen this

http://msdn2.microsoft.com/en-us/librar ... ement.aspx

Firefly · Postby **Firefly** » Mon Jun 15, 2009 7:39 pm

Hey, i know this thread has been dead for a while but i was wondering if someone would be able to answer a simple question. The code Michal posted on the previous page collects tick data (my programming skills are still in their infancy) so im assuming thats both ask and bid prices yet different forex brokers will have different spreads at different times so what is the best way to address this when backtesting, especially on the lower time frames.

thanks for your help

michal.kreslik · Postby **michal.kreslik** » Wed Jun 17, 2009 7:50 am

I'm doing the testing on the top of the book stream that I'm collecting directly from my live interbank platform. If you want to trade live, the most precise form of testing is the one that uses the data collected from your live streams.

Firefly · Postby **Firefly** » Wed Jun 17, 2009 7:54 am

Thanks for the reply.

mabshier · Postby **mabshier** » Sat Mar 20, 2010 5:48 pm

I would recommend the community version of InfoBright built on top of the community version of MySql

It is column store and will utilize compression and column store technology to speed up your queries and minimize the disk footprint for your data

historical tick data SQL database

Share this post

Share this post

Share this post

Share this post

Share this post

Share this post

Share this post