Script to anonymize JSON data #8
base: master
463d42d to 8be4a16
# Step 2: Add random time offset to "Date" column
if "Date" in df.columns:
    df["Date"] = pd.to_datetime(df["Date"])
    random_offset = pd.tseries.offsets.DateOffset(
        years=np.random.randint(-1000, 1000)
    )
    df["Date"] += random_offset
This offset should be applied equally to all the dataframes so that the time correspondence between them is preserved.
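A minimal sketch of what this could look like: draw the offset once and apply it to every dataframe. The function name `shift_dates_together`, the `frames` list, and the `max_years=50` default are hypothetical and not part of the PR. The smaller default is an assumption chosen because pandas nanosecond timestamps only cover roughly the years 1677 to 2262, so an offset of up to 1000 years may fall outside that range.

```python
import numpy as np
import pandas as pd


def shift_dates_together(frames, rng, max_years=50, column="Date"):
    """Shift `column` in every frame by one shared whole-year offset.

    Hypothetical helper: the name, arguments, and default are assumptions.
    """
    # Draw the offset a single time so that every frame moves by the same amount.
    years = int(rng.integers(-max_years, max_years))
    offset = pd.DateOffset(years=years)

    shifted = []
    for df in frames:
        out = df.copy()  # leave the caller's data untouched
        if column in out.columns:
            out[column] = pd.to_datetime(out[column]) + offset
        shifted.append(out)
    return shifted, years
```

Returning `years` alongside the frames makes the shared shift easy to check in a test; a real anonymization script would presumably discard it.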
Do all reflections get shifted by a random year offset?
# and apply a random linear transformation to numerical columns
# and replace non-empty strings with a random word
for col in df.columns:
    if df[col].dtype == "object":
Does object mean string? Is there any way to narrow this down further?
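For context, `object` is pandas' catch-all dtype: such a column can hold strings, but also `None`, lists, dicts, or mixed values. One possible way to narrow it down, sketched here with the hypothetical helpers `is_text_column` and `replace_strings` (neither is in the PR), is to inspect the values with `pd.api.types.infer_dtype` and to replace only values that are actual `str` instances.

```python
import pandas as pd


def is_text_column(series):
    """Return True when every non-missing value in the column is a str."""
    # infer_dtype looks at the values themselves, not just the column dtype.
    return pd.api.types.infer_dtype(series, skipna=True) == "string"


def replace_strings(series, make_word):
    """Replace non-empty str values with make_word(); leave everything else."""
    # The isinstance check keeps None, NaN, numbers, and containers untouched.
    return series.map(lambda x: make_word() if isinstance(x, str) and x else x)
```

Here `make_word` stands in for whatever word generator the script uses, such as the `random_word` call in the diff.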
        df[col] = df[col].apply(lambda x: random_word() if x else x)
    elif df[col].dtype in ["int64", "float64"]:
        k = random.uniform(0.5, 10) * random.choice([-1, 1])
        df[col] = df[col] * k
If it's a rating type, won't this be obvious given our current fixed range? What is this protecting against?
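To make the concern concrete, here is a small sketch of how a single multiplier could be undone when the original range is known. It assumes a hypothetical rating scale with a known positive maximum that actually occurs in the column; `recover_scale` is an illustrative name, not something in the PR.

```python
import pandas as pd


def recover_scale(scaled, known_max):
    """Estimate the multiplier k from a scaled column with a known original maximum.

    Assumes the original values were positive and that known_max occurs in the data.
    """
    # The entry with the largest magnitude came from the original maximum,
    # and its sign is the sign of k.
    extreme = scaled.loc[scaled.abs().idxmax()]
    return extreme / known_max
```

Dividing the scaled column by the recovered multiplier would then give back the original ratings, which suggests that a single scale factor offers little protection for columns with a fixed, known range.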
No description provided.