Filtering process

After downloading and displaying the CSV files, we can finally start looking for patterns in each individual file.

The data appears to come from a relational database. Most of the available data is stored in a file called sales_train.csv. Before processing these CSV files, we need to filter out abnormal items. First, I wrote a program (actually, I asked an AI to write it) that finds the maximum and minimum values of the item_price column, as well as the rows with negative values in the item_cnt_day column.

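import pandas as pd

input_file = 'sales_train.csv'
output_file = 'output_filtered.csv'

# Load the sales data
df = pd.read_csv(input_file)

# Report the range of item prices
max_price = df['item_price'].max()
min_price = df['item_price'].min()
print(f"item_price maximum value: {max_price}")
print(f"item_price minimum value: {min_price}")

# Collect rows with a negative daily count; .copy() avoids SettingWithCopyWarning
negative_cnt_rows = df[df['item_cnt_day'] < 0].copy()
# 1-based row number within the data (header line not counted)
negative_cnt_rows['row_number'] = negative_cnt_rows.index + 1

negative_cnt_rows.to_csv(output_file, index=False, columns=['row_number', 'date', 'date_block_num', 'shop_id', 'item_id', 'item_price', 'item_cnt_day'])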

I decided to remove the anomalous rows (those with negative item_cnt_day values) because dropping them has little impact on the final prediction, given the large amount of data recorded at daily precision.
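The removal step itself could look roughly like the sketch below. This is only an illustration, not the exact code used here: the helper name `drop_negative_counts` is hypothetical, and the small `sample` table is made-up data that only borrows the column names from sales_train.csv.

```python
import pandas as pd


def drop_negative_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the rows whose item_cnt_day is negative.

    Note: rows with a missing item_cnt_day would also be dropped,
    because NaN >= 0 evaluates to False.
    """
    return df[df['item_cnt_day'] >= 0].copy()


# Made-up sample for illustration only (not real rows from sales_train.csv)
sample = pd.DataFrame({
    'item_id': [1, 2, 3, 4],
    'item_price': [100.0, 250.0, 250.0, 80.0],
    'item_cnt_day': [1.0, -1.0, 2.0, -3.0],
})

cleaned = drop_negative_counts(sample)
removed = len(sample) - len(cleaned)
print(f"removed {removed} of {len(sample)} rows")
```

On the real data, the same function could be applied to the DataFrame loaded from sales_train.csv, and comparing the row counts before and after is a quick way to check how small the removed share is.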

The output_filtered.csv file exceeds the recommended maximum size, so it has not been uploaded.

Back to Readme.md