[Review] Parquet reader multithread #146
wmalpica wants to merge 109 commits into rapidsai:master from BlazingDB:parquet-reader-multithread
include/gdf/cffi/types.h
Outdated
| GDF_JOIN_DTYPE_MISMATCH, /**< Datatype mismatch between corresponding columns in left/right tables in the Join function */ | ||
| GDF_JOIN_TOO_MANY_COLUMNS, /**< Too many columns were passed in for the requested join operation*/ | ||
| GDF_GROUPBY_TOO_MANY_COLUMNS, | ||
| GDF_IO_ERROR, |
I don't think it is good practice to insert into the middle of an enum. Also, can you please provide a docstring. Thanks!
Ok! Thanks for the feedback.
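To illustrate the concern about inserting into the middle of an enum, here is a minimal sketch. The enums below are hypothetical stand-ins, not the actual `gdf_error` definition, and the names `V1_*`, `INS_*`, and `APP_*` are invented for this example. It shows that an enumerator inserted mid-list shifts the implicit numeric value of every enumerator after it, whereas appending leaves existing values untouched. That matters for any caller or binding compiled against the old values.

```cpp
#include <cassert>

// Hypothetical "before" enum: values are assigned implicitly as 0, 1, 2.
enum error_v1 {
    V1_SUCCESS = 0,
    V1_JOIN_TOO_MANY_COLUMNS,     // 1
    V1_GROUPBY_TOO_MANY_COLUMNS   // 2
};

// New enumerator inserted in the middle: the groupby code silently becomes 3.
enum error_inserted {
    INS_SUCCESS = 0,
    INS_JOIN_TOO_MANY_COLUMNS,    // 1
    INS_IO_ERROR,                 // 2 (takes over the old groupby value)
    INS_GROUPBY_TOO_MANY_COLUMNS  // 3 (shifted)
};

// New enumerator appended at the end: every existing value is preserved.
enum error_appended {
    APP_SUCCESS = 0,
    APP_JOIN_TOO_MANY_COLUMNS,    // 1
    APP_GROUPBY_TOO_MANY_COLUMNS, // 2 (unchanged)
    APP_IO_ERROR                  // 3
};

// True when the groupby code kept its original numeric value after insertion.
constexpr bool inserted_keeps_groupby_value() {
    return static_cast<int>(INS_GROUPBY_TOO_MANY_COLUMNS) ==
           static_cast<int>(V1_GROUPBY_TOO_MANY_COLUMNS);
}

// True when the groupby code kept its original numeric value after appending.
constexpr bool appended_keeps_groupby_value() {
    return static_cast<int>(APP_GROUPBY_TOO_MANY_COLUMNS) ==
           static_cast<int>(V1_GROUPBY_TOO_MANY_COLUMNS);
}

static_assert(!inserted_keeps_groupby_value(), "mid-enum insert shifts later values");
static_assert(appended_keeps_groupby_value(), "appending preserves existing values");
```

Assigning explicit values to each enumerator would also avoid the shift, but appending is the smaller change.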
src/parquet/dictionary_decoder.cuh
Outdated
| // std::memcpy(bytes_data + offset, dictionary_[i].ptr, fixed_len); | ||
| // dictionary_[i].ptr = bytes_data + offset; | ||
| // } | ||
| // } |
src/parquet/plain_decoder.cuh
Outdated
| // data_size -= type_length; | ||
| // } | ||
| // return bytes_to_decode; | ||
| // } |
…i so that it ignores string columns.
…erface instead of ReadableFile which was a class
harrism left a comment
This is a ton of code to review. We should probably have an overview meeting like we discussed having for the binary ops PR. Unless you have already done this!
| ## eclipse | ||
| .project | ||
| build2/ |
Is "build2" a common directory we need to gitignore? Perhaps this file was committed accidentally?
| PROJECT(libgdf) | ||
| cmake_minimum_required(VERSION 2.8) # not sure about version required | ||
| cmake_minimum_required(VERSION 3.3) # not sure about version required |
libgdf CMakeLists.txt requires CMake version 3.11. Maybe match that and remove the unsure comment.
| GDF_JOIN_DTYPE_MISMATCH, /**< Datatype mismatch between corresponding columns in left/right tables in the Join function */ | ||
| GDF_JOIN_TOO_MANY_COLUMNS, /**< Too many columns were passed in for the requested join operation*/ | ||
| GDF_IO_ERROR, /**< Error occured in a parquet-reader api which load a parquet file into gdf_columns */ |
Hmm, IO_ERROR seems generic enough of a name to apply to more than the parquet reader. Suggest either narrowing the name or broadening the comment.
| #include <vector> | ||
| #include <arrow/io/file.h> | ||
| namespace gdf { |
Why do you use the BEGIN_NAMESPACE_GDF_PARQUET macro above, but not here?
| /// \param[in] indices of the columns that will be read from the file | ||
| /// \param[out] out_gdf_columns vector of gdf_column pointers. The data read. | ||
| gdf_error | ||
| read_parquet_by_ids(const std::string & filename, |
What does "by_ids" signify? It's not explained in the comment.
| namespace gdf { | ||
| namespace parquet { | ||
| /// \brief Read parquet file from file path into array of gdf columns |
Please follow the comment templates given in https://github.com/rapidsai/libgdf/blob/master/src/example_documentation.cpp
This pull request is on top of the parquet-reader branch pull request. It uses the same APIs, but reads all columns and row groups in parallel, using a number of threads up to hardware_concurrency. This PR also contains some improvements to the file_reader-test unit test and a bug fix.
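For reviewers who want a mental model of the threading scheme described above, here is a minimal sketch. It is not the code in this PR: `run_parallel` and `read_all_chunks` are hypothetical names, and the "decode" step is a placeholder that just writes an integer. It shows one common way to spread (row group, column) tasks over at most `std::thread::hardware_concurrency()` worker threads, using an atomic counter as a shared work queue.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: runs task(i) for every i in [0, n_tasks) on at most
// hardware_concurrency() threads. Tasks must not throw, since an exception
// escaping a std::thread calls std::terminate.
inline void run_parallel(std::size_t n_tasks,
                         const std::function<void(std::size_t)>& task) {
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;  // the standard allows 0 when the count is unknown
    const std::size_t n_threads = std::min<std::size_t>(hw, n_tasks);

    std::atomic<std::size_t> next{0};  // index of the next unclaimed task
    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (std::size_t t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            // Each worker repeatedly claims the next task index until none remain.
            for (std::size_t i = next.fetch_add(1); i < n_tasks;
                 i = next.fetch_add(1)) {
                task(i);
            }
        });
    }
    for (auto& w : workers) w.join();
}

// Placeholder for decoding: one task per (row group, column) pair. Each task
// writes only its own output slot, so no locking is needed on `out`.
inline std::vector<int> read_all_chunks(std::size_t n_row_groups,
                                        std::size_t n_columns) {
    std::vector<int> out(n_row_groups * n_columns, 0);
    run_parallel(out.size(), [&](std::size_t i) {
        const std::size_t row_group = i / n_columns;
        const std::size_t column = i % n_columns;
        out[i] = static_cast<int>(row_group * 100 + column);  // stand-in result
    });
    return out;
}
```

Claiming work through a shared counter, rather than splitting the tasks into fixed ranges per thread, keeps threads busy when chunks differ in size, which is likely for columns of different types.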