Skip to main content

Calgary Data Loads

Data Loads

Data should always be sorted with maximum locality. You would typically sort on multiple fields by descending selectivity. The reason you pick multiple fields is that if any fields are not populated, you still get a decent amount of locality. A typical sort is:

  • State

  • City

  • ZIP

  • Block

  • Point

It's highly recommended that you use the Auto Field tool before you do the load. Maximizing the efficiency of packing in the fields can be huge. Before you run the Auto Field tool, you might need to trim data via a Formula tool. Especially when data comes from flat files, it can be padded with spaces. Removing these spaces can improve compression and indexing.

There is a 2^31 limit on the number of records in a Calgary Database (approximately 2 Billion). It is designed for a sweet spot of around 100-300 million records.

Record Size

The Calgary Database (*.cydb) is tuned for smaller records, at the same time, it has no limit to ultimate record size. Records over 2K bytes incur both a compression and performance penalty compared to smaller records.

There is a record size limit of 2GB in a 64-bit environment.

Compression

The initial implementation of the Calgary Database file (*.cydb) has minimal compression. It's designed for load and read speed as well as random access. Initial test results show the compression achieving factors of between 1:1 and 4:1 (25% of original size) compared to CSV files. Presuming the load was done with a good geolocated sort, between 2:1 and 3:1 can be expected. It's similar, but slightly larger than the YXDB format without the spatial index.

The Calgary spatial index uses 5 decimal places of accuracy for compression and speed. The YXDB spatial index uses 6 decimal places. This adds an additional round-off error of up to a maximum of 1.8 feet to Calgary indexes. In other words, it is possible that a point can be 1.8 feet inside of a polygon and be found as being outside in Calgary.

Loading

When you load Calgary Databases, you can choose to load or not load any set of fields, as well as choose to index or not index any set of fields. Only the fields that you choose to index are available for queries, and only the fields you choose for data are available for retrieval. There is no requirement that fields selected for indexes have to be loaded for retrieval and visa versa. An example of this might be the Spatial Object (Point). You might index this field so you can retrieve records spatially, but not include it in the data because you also have lat/long fields.