[Review] Parquet reader multithread by wmalpica · Pull Request #146 · rapidsai/libgdf

wmalpica · 2018-09-19T20:46:01Z

This pull request is built on top of the parquet-reader branch pull request. It uses the same APIs, but it reads all columns and row groups in parallel, using a number of threads up to hardware_concurrency. This PR also contains some improvements to the file_reader-test unit test and a bug fix.
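As a rough illustration of the approach described above (not the code in this PR), the sketch below spreads one task per (row group, column) pair over a pool of threads capped at `std::thread::hardware_concurrency()`. The names `ColumnChunkTask` and `run_tasks_in_parallel` are hypothetical, and the actual decoding is left to a caller-supplied callback.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical unit of work: one column chunk of one row group.
struct ColumnChunkTask {
    std::size_t row_group;
    std::size_t column;
};

// Runs `work` exactly once per task, using at most hardware_concurrency
// threads. Threads claim tasks through a shared atomic counter, so no task is
// processed twice. `work` must not throw: an exception escaping a thread
// would terminate the program.
inline void run_tasks_in_parallel(
    const std::vector<ColumnChunkTask> &tasks,
    const std::function<void(const ColumnChunkTask &)> &work) {
    assert(work && "work must be callable");
    if (tasks.empty()) { return; }

    // hardware_concurrency() may return 0 when the value is not computable.
    const std::size_t hw =
        std::max(1u, std::thread::hardware_concurrency());
    const std::size_t num_threads = std::min(hw, tasks.size());

    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            // Each fetch_add hands out a distinct task index.
            for (std::size_t i = next.fetch_add(1); i < tasks.size();
                 i = next.fetch_add(1)) {
                work(tasks[i]);
            }
        });
    }
    for (auto &w : workers) { w.join(); }
}
```

A shared counter keeps the threads busy even when column chunks differ a lot in size, which a fixed up-front split of the task list would not.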

scopatz · 2018-09-20T20:58:14Z

include/gdf/cffi/types.h

    GDF_JOIN_DTYPE_MISMATCH,          /**< Datatype mismatch between corresponding columns in  left/right tables in the Join function */   
    GDF_JOIN_TOO_MANY_COLUMNS,        /**< Too many columns were passed in for the requested join operation*/       
    GDF_GROUPBY_TOO_MANY_COLUMNS,
+    GDF_IO_ERROR,


I don't think it is good practice to insert into the middle of an enum. Also, can you please provide a docstring. Thanks!

Ok! Thanks for the feedback.
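To make the concern about inserting into the middle of an enum concrete, here is a small sketch. It uses a shortened, hypothetical version of the error enum (only the enumerator names come from the diff above; the namespaces and the `error_code` type name are made up for illustration). Inserting a new value in the middle shifts the numeric value of every enumerator after it, which can break callers that were compiled against the old header, while appending at the end leaves the existing values untouched.

```cpp
// Shortened, hypothetical stand-ins for the real enum in types.h.
namespace before {
enum error_code {
    GDF_JOIN_TOO_MANY_COLUMNS,     // 0
    GDF_GROUPBY_TOO_MANY_COLUMNS   // 1
};
}

namespace inserted_in_middle {
enum error_code {
    GDF_JOIN_TOO_MANY_COLUMNS,     // 0
    GDF_IO_ERROR,                  // 1 (takes over the old value)
    GDF_GROUPBY_TOO_MANY_COLUMNS   // 2 (shifted)
};
}

namespace appended {
enum error_code {
    GDF_JOIN_TOO_MANY_COLUMNS,     // 0
    GDF_GROUPBY_TOO_MANY_COLUMNS,  // 1 (unchanged)
    GDF_IO_ERROR                   // 2 (new value at the end)
};
}

// Appending keeps the existing numeric values stable.
static_assert(static_cast<int>(before::GDF_GROUPBY_TOO_MANY_COLUMNS) ==
                  static_cast<int>(appended::GDF_GROUPBY_TOO_MANY_COLUMNS),
              "appending must not change existing values");

// Inserting in the middle changes them.
static_assert(static_cast<int>(before::GDF_GROUPBY_TOO_MANY_COLUMNS) !=
                  static_cast<int>(
                      inserted_in_middle::GDF_GROUPBY_TOO_MANY_COLUMNS),
              "inserting shifts later enumerators");
```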

scopatz · 2018-09-20T21:03:05Z

src/parquet/dictionary_decoder.cuh

+//         std::memcpy(bytes_data + offset, dictionary_[i].ptr, fixed_len);
+//         dictionary_[i].ptr = bytes_data + offset;
+//     }
+// }


Please remove dead code

scopatz · 2018-09-20T21:03:43Z

src/parquet/plain_decoder.cuh

+//         data_size -= type_length;
+//     }
+//     return bytes_to_decode;
+// }


harrism

This is a ton of code to review. We should probably have an overview meeting like we discussed having for the binary ops PR. Unless you have already done this!

harrism · 2018-10-24T03:50:26Z

.gitignore

 ## eclipse
 .project
+
+build2/


Is "build2" a common directory we need to gitignore? Perhaps this file was committed accidentally?

harrism · 2018-10-24T03:52:00Z

CMakeLists.txt

 PROJECT(libgdf)

-cmake_minimum_required(VERSION 2.8)  # not sure about version required
+cmake_minimum_required(VERSION 3.3)  # not sure about version required


libgdf CMakeLists.txt requires CMake version 3.11... Maybe match that and remove the unsure comment.

harrism · 2018-10-24T03:53:57Z

include/gdf/cffi/types.h

    GDF_JOIN_DTYPE_MISMATCH,          /**< Datatype mismatch between corresponding columns in  left/right tables in the Join function */   
    GDF_JOIN_TOO_MANY_COLUMNS,        /**< Too many columns were passed in for the requested join operation*/       
+
+    GDF_IO_ERROR,                     /**< Error occured in a parquet-reader api which load a parquet file into gdf_columns */


Hmm, IO_ERROR seems generic enough of a name to apply to more than the parquet reader. Suggest either narrowing the name or broadening the comment.

harrism · 2018-10-24T03:55:09Z

include/gdf/parquet/api.h

+#include <vector>
+#include <arrow/io/file.h>
+
+namespace gdf {


Why do you use the BEGIN_NAMESPACE_GDF_PARQUET macro above, but not here?
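For context, a sketch of what the consistency question is about. The definitions of the macros are not shown in this review, so the ones below are assumptions, and `END_NAMESPACE_GDF_PARQUET` is a hypothetical counterpart name. Under these assumed definitions, both spellings open the same `gdf::parquet` namespace, so mixing them in one code base is a style inconsistency rather than a functional difference.

```cpp
// Assumed definitions; the PR's actual macros are not shown in this review.
#define BEGIN_NAMESPACE_GDF_PARQUET namespace gdf { namespace parquet {
#define END_NAMESPACE_GDF_PARQUET } }

// Style 1: namespace opened through the macro pair.
BEGIN_NAMESPACE_GDF_PARQUET
constexpr int via_macro() { return 1; }
END_NAMESPACE_GDF_PARQUET

// Style 2: namespace spelled out explicitly, as in api.h above.
namespace gdf {
namespace parquet {
constexpr int via_explicit_namespace() { return 2; }
}  // namespace parquet
}  // namespace gdf

// Both functions end up in the same namespace.
static_assert(gdf::parquet::via_macro() +
                      gdf::parquet::via_explicit_namespace() == 3,
              "both styles declare into gdf::parquet");
```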

harrism · 2018-10-24T03:56:56Z

include/gdf/parquet/api.h

+/// \param[in] indices of the columns that will be read from the file
+/// \param[out] out_gdf_columns vector of gdf_column pointers. The data read.
+gdf_error
+read_parquet_by_ids(const std::string &             filename,


What does "by_ids" signify? It's not explained in the comment.

harrism · 2018-10-24T03:57:16Z

include/gdf/parquet/api.h

+namespace gdf {
+namespace parquet {
+
+/// \brief Read parquet file from file path into array of gdf columns


Please follow the comment templates given in https://github.com/rapidsai/libgdf/blob/master/src/example_documentation.cpp

gcca and others added 30 commits July 19, 2018 12:18

[parquet-reader] Add parquet reader wrapper

6cb51df

[parquet-reader] Add column reader

bbe9467

[parquet-reader] Enable read new page call

6ced85b

WIP: add custom decoder

16b40cb

[parquet-reader] Update parquet API to v1.3.1

fc57ccb

[parquet-reader] Read batch as gdf column

3000f89

arrow decoder

a6e7d0e

merge with parquet-reader

7c24364

Merge branch 'parquet-reader' into parquet-decoder

3b9af0e

[parquet-reader] Add gdf column read test

4593968

[parquet-reader] Add file reader by columns benchmark

abe73d3

decoder using host

a384b15

decoder using gpu

79470ea

[parquet-reader] Read spaced batches to gdf column

3ef6ecd

Merge branch 'parquet-reader' into parquet-decoder

4282650

use specific gpu-decoder for int32

819af4e

[parquet-reader] Add API to read a parquet file

5713017

[parquet-reader] Merge from parquet-decoder

7ad9972

[parquet-reader] Fix template definitions for readers

882a296

[parquet-reader] Merger from LibGDF/master

e8068eb

[parquet-reader] Fix testing files

e407912

[parquet-reader] Move tests to src

9ba5d7e

[parquet-reader] Fix access to parquetcpp repository

6aaaa51

[parquet-reader] Fix benchmark test building

13e27c7

[parquet-reader] Fix build moving tests into src

15ff796

[parquet-reader] Update tests building process

d7bed6a

[parquet-reader] Add conda dependencies for Thrift

92d89e9

[parquet-reader] Check gdf dtype from parquet type

f56a978

[parquet-reader] Apply batch spaced reading on tests

9043c7a

[parquet-reader] Add column filter from file

9d2275e

gcca and others added 10 commits September 18, 2018 16:00

[parquet-reader] Downgrade bison and flex

31326fa

[parquet-reader] Add global ParquetCpp include directories

55ab718

[parquet-reader] Fix compiling warnings

c3f2552

fixed bug in guard in bitpacking kernel

07e6e85

[parquet-reader] fix bitpacking decoder and transform_valid

dc76e3d

[parquet-reader]: merge with last fixes

8bf8311

[parquet-reader]: fix warnings

951cbf9

cleaned up code. Using _ReadFileMultiThread where it needs to. All tests pass

ab57c53

made small change to unit test and found more issues

5002683

fixed bug in allocator function

a7ce67a

wmalpica changed the title ~~Parquet reader multithread~~ [Review] Parquet reader multithread Sep 19, 2018

[parquet-reader-multithread] fix warnings

9cd6e16

scopatz reviewed Sep 20, 2018

aocsa and others added 2 commits September 21, 2018 12:36

[parquet-reader-multithread] remove dead code and add comments

52b03f7

added new parquet-multithread-benchmark test. Fixed parquet-reader api so that it ignores string columns.

efcffd4

kkraus14 added the "3 - Ready for Review" label Sep 24, 2018

William Malpica added 6 commits September 25, 2018 12:12

fixed benchmark unit test

dd9a65f

moved parquet benchmarks to bench folder

b342fe4

Merge branch 'master' into parquet-reader-multithread

95b16e3

added a new public API which takes in an file reading interface. Added new unit tests

d1e8ff7

Merge branch 'master' into parquet-reader-multithread

ec54c9a

fixed interface implementation to be RandomAccessFile which is an interface instead of ReadableFile which was a class

b7c2686

nsakharnykh mentioned this pull request Oct 24, 2018

[WIP] Apache Parquet reader #85

Closed

3 tasks

harrism reviewed Oct 24, 2018

mike-wendt added the "4 - Needs Rework" and "migrate to cudf" labels and removed the "3 - Ready for Review" label Oct 24, 2018

mike-wendt mentioned this pull request Dec 20, 2018

[libgdf-PR-146] Parquet reader multithread rapidsai/cudf#596

Closed
