Data File Requirements
Page Contents
Created by gh-md-toc
Each data file you create to upload to your dataset must follow requirements for field and header names and for field values. If you use gzip to compress your TSV file, you must use the .tsv.gz
extension.
TSV Uploader supports parsing files that use either the TSV or CSV file format.
CSV Only: If your files use the CSV file format, prepare the files to conform to the default settings for the OpenCSV library. OpenCSV uses \ (backslash) as the escape character.
Helper Script for Uploads
We recommend using the Imhotep Upload Format Helper Script, which is a combination linter/converter (written in Python) that will make sure that your TSV or CSV data is formatted properly for upload to Imhotep.
Sample Data Files
Use these sample data files as models for preparing your data for upload:
NASA Apache web logs
Wikipedia web logs
World Cup 2014 player data
Filenames
Include the shard time range in the filename
Imhotep partitions your data into shards by the time denoted in the filename. When you specify a time range in IQL, Imhotep selects the shards to search based on the time period associated with each shard, not the timestamps in the documents themselves.
You can specify one full day, one full hour, or a range. If you specify a range, the end time is exclusive to the range.
The time denoted in the filename must be expressed in UTC-6. For example, for a file named 20140801.tsv
, the range of expected times in the file would be 2014-07-31 18:00:00 UTC (inclusive) to 2014-08-01 18:00:00 UTC (exclusive). Note that times in the file must be expressed as Unix timestamps (seconds or milliseconds since 1970).
Supported formats | Example | Description |
---|---|---|
yyyyMMdd | 20131201.tsv |
The file contains data for one day. |
yyyyMMdd.HH | 20131201.01.tsv 20131201.02.tsv |
The file contains data for one hour of the day: 1-2 AM. The file contains data for one hour of the day: 2-3 AM. |
yyyyMMdd.HH-yyyyMMdd.HH | 20131201.00-20131201.03.tsv 20140901.00-20140903.00.tsv |
The file contains data for the first 3 hours of one day. The file contains data for two full days. |
Optional prefixes and suffixes in the filename must be strings
You can add a string prefix or suffix, or both, to your filename. The builder ignores prefixes and suffixes that are strings. For example, the builder correctly ignores QA_report
and _combined
in the QA_report_20141013_combined.tsv
filename.
Digits are not supported in a prefix or suffix. For example, the QA_report_20141013_parts1_2.tsv
filename is invalid because it includes digits in the suffix.
Field Headers
The first line of your file represents the header that defines fields in the resulting dataset
Use field names that contain uppercase A-Z
, lowercase a-z
, digits, or _
(underscore). A field name cannot start with a digit.
A document’s timestamp must be in the same range as the filename
If the field name is time
or unixtime
, the builder parses that field’s values as Unix timestamps and uses them as the document’s timestamps in the dataset. A timestamp can be in seconds or milliseconds since Unix epoch time (UTC). If you don’t include a time
or unixtime
field, the builder uses the start of the shard time range as the timestamp for all of the documents in the file.
NOTE: Do not use floating-point values in a time
or unixtime
field, because floating-point values are treated as strings. Read more.
Field names with the * suffix
Adding the *
suffix to the field name in the header also indexes a tokenized version of that field.
Field Name | Value | Indexed Values |
---|---|---|
query* | "project manager" | query:"project manager" querytok:"project" querytok:"manager" |
Field names with the ** suffix
Adding the **
suffix to the field name in the header also indexes bigrams from the field value.
Field Name | Value | Indexed Values |
---|---|---|
query** | "senior project manager" | query:"senior project manager" querytok:"senior" querytok:"project" querytok:"manager" querybigram:"senior project" querybigram:"project manager" |
Field names with the + suffix
Adding the +
suffix to the field name in the header indexes the tokens in the field instead of the entire field.
Field Name | Value | Indexed Values |
---|---|---|
query_words+ | "project manager" | query_words:"project" query_words:"manager" |
Field Values
Prepare the values in your data file
Ensure that you remove tabs and newlines from your values.
CSV Only: Do not use quotations around field values.
Imhotep has two data types: string and integer(long)
For Imhotep to treat a field’s value as an integer, at least 90% of the values must be integers or blanks, and at least 20% of the total values must be valid integers.
Once set, a field’s type must remain consistent. That is, once a field is indexed as an integer, the field must remain an integer. Likewise, if a field is indexed as a string, the field must remain a string. If a field’s type is not the same in every shard for that dataset, data for one or the other type will not be accessible through IQL.
Floating-point values become strings
Floating-point values like 1.0 or 1.5 are not supported as integers and are treated as strings. To use floating-point values as metrics, consider the following methods:
Method | Examples |
---|---|
Round the values. | 1.0 => 1 1.5 => 2 |
Multiply all values by a decimal constant. | 1.0 => 10 1.5 => 15 |
Empty values
An empty value for a string field is indexed as an empty string term and can be queried. For example, location:""
returns the queries with an empty value for the string field location
. For integer fields, all non-integer values including the empty value are not indexed and cannot be queried.