How to read a CSV with correct type (int, str, float)?

:information_source: Attention Topic was automatically imported from the old Question2Answer platform.
:bust_in_silhouette: Asked By jch

The problem with File.get_csv_line() is that all the values are read as type string (4). I haven’t found an automated way of getting the right type (i.e. a number), like JSON.parse() does.

I know the type of each CSV column beforehand, so it is simple to write a nested loop to iterate over the lines and recast. But it does not seem like the most efficient thing to do. Especially since there are these get_ methods included in File. It seems they could be used to read the values with correct type, but I don’t know how to use them.

Any pointers would be appreciated!

:bust_in_silhouette: Reply From: jgodfrey

The problem here is that there’s really not enough context in a CSV file to determine intended data types. Unlike JSON, that specifies at least some basic hints to data types (strings must be in double quotes, numbers must not be quoted, …), the CSV format doesn’t really have any such notions.

Popular programs (like Excel) attempt some pre-scanning trickery in an attempt to guess proper data types for various columns.

IMO, about the best you can do here is to make some attempt to infer the intended data type based on your own set of rules.

If I were you, I’d probably write some sort of detect_data_type() function that takes a CSV field as input and returns its data type (or, the converted value directly) based on rules you define for your data. It could be as simple as something like:

  • if the field could be an int, return int
  • if the field could be a float, return float
  • otherwise, return string

… or, as complex as necessary.

Armed with such a function, you could then just cyclically read your CSV, pass every field through the function, and convert the data as needed.

Thanks for the answer! I think I might not have been clear enough though: I do know what data type each column is, so that’s not the challenge. However, getting all values as strings and then looping over everything in order to convert it, is something I wanted to avoid, or at least I wanted to see if there is a more efficient approach. To rephrase: If I do know that the first column is a float, is there a way to read it as a float, or do I have to read it in as a string and then convert?

jch | 2022-11-07 20:49

With specific knowledge of the data types of each field, you could use the various get_* methods to read each individual value in the file (as an appropriate type). However, you’d then be responsible for parsing the file yourself (and dealing with the comma separators).

That’s likely a lot more work and more prone to errors than using get_csv_line() and then converting the individual strings to the appropriate data type (even though it might be more efficient).

If you’re going to need this for varying sets of data, I’d still recommend you employ something like I mentioned above. The difference being that you no longer need to be aware of the data type of each field - only that it meets the rules of your detect_data_type() method (or similar).

jgodfrey | 2022-11-07 21:01