Imhotep Upload Format Helper Script
The Python script imhotep_helper.py is a combination linter/converter that will make sure that your TSV or CSV data is formatted properly for upload to Imhotep.
- Summarizes the type inference (integer vs. string) for your columns
- Handles time conversion so that your times are adjusted to Imhotep’s default time zone
- Cleans up data in a number of ways to conform with TSV or CSV formats
- Rewrites data into a file named properly for the time range of the data it contains, since the file name timestamp ranges determine the Imhotep shards and must correspond properly to the contained data.
For more information, see Data File Requirements.
Usage
usage: imhotep_helper.py [-h] [-l] [-c] [-n NONINT] [-i INDEX]
[--prefix PREFIX] [-f FORMAT] [-o OFFSET]
[datafile]
positional arguments:
datafile filename of data to upload
optional arguments:
-h, --help show this help message and exit
-l, --lint check file for problems. if specified, overrides
--convert.
-c, --convert automatically fix problems and convert
-n NONINT, --nonint NONINT
name or index of column to display non-integer values
(name must be a valid index name)
-i INDEX, --index INDEX
index of timestamp field
--prefix PREFIX prefix of converted filename. default = "converted_"
-f FORMAT, --format FORMAT
format of timestamp field. defaults include
"%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%m/%d/%Y", "%m/%d/%Y
%H:%M:%S", "%a %b %d %H:%M:%S %Y",
"%Y-%m-%dT%H:%M:%S". (see
https://docs.python.org/2/library/datetime.html
#strftime-strptime-behavior for details)
-o OFFSET, --offset OFFSET
GMT offset of timestamps. default is -6.
Example Output (lint mode)
Using GMT offset -6
detected tsv file type
WARN [78751]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/shuttle/missions/sts-69/\www.pic.net"
WARN [182501]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/shuttle/missions/sts-71/images/http:\\www.mca.com"
WARN [336170]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/history/\"
WARN [336177]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/history/\"
WARN [336336]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/history/\"
WARN [535266]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/\\espnet.sportszone.com"
WARN [615192]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/HISTORY/APOLLO/HTTP:\\POPULARMECHANICS.COM"
WARN [615206]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/HISTORY/APOLLO/HTTP:\\POPULARMECHANICS.COM"
WARN [873938]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/\\www.isisnet.com/home/newnetscape.html"
WARN [968770]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/history\apollo"
WARN [968787]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/history\apollo"
WARN [1123563]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/shuttle/missions/sts-71/images/http:\\www.yahoo.com"
WARN [1228814]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/shuttle/technology\images"
WARN [1364399]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/history/apollo/apollo.html\"
WARN [1364538]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/history/apollo/apollo.html\"
WARN [1449354]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/facilities/www.commerce.com/\\www.commerce.com\cmt"
WARN [1520084]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/shuttle/missions/\\www.cinday-cga.com:p09\"
WARN [1520124]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/shuttle/missions/\\www.cinday-cga.com:p09\"
WARN [1749804]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/software\"
WARN [1836424]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/\\www.yahoo.com"
WARN [1836560]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/shuttle/missions/mission.html\"
WARN [1836561]: Found an escape character (\) in entry. TSVs do not support quoting. Entry: "/shuttle/missions/mission.html\"
SUMMARY STATS:
Total records: 1891714
Field "host":
- 0.0% int values
- 100.0% string values
> Will be treated as a string field.
Field "time":
> Will be converted to unixtime
Field "method":
- 0.0% int values
- 100.0% string values
> Will be treated as a string field.
Field "url":
- 0.0% int values
- 100.0% string values
> Will be treated as a string field.
Field "version":
- 0.2% int values
- 99.8% string values
> Will be treated as a string field.
Field "response":
- 99.0% int values
- 0.0% string values
> Will be treated as an int field. Non-int values will be discarded.
Field "bytes":
- 98.8% int values
- 0.0% string values
> Will be treated as an int field. Non-int values will be discarded.