"sliding window" bigtable training mode #713
preprocessing.py (Outdated)

    def get_many_tpu_bt_input_tensors(games, games_nr, batch_size,
        start_at, num_datasets,
nit: indenting is wrong
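For reference, a sketch of the alignment the nit is asking for, with the continuation arguments lined up under the opening parenthesis. The parameter list after num_datasets is cut off in the diff, so everything past that point here is a guess:

```python
# Continuation arguments aligned under the opening parenthesis
# (PEP 8 style). The real signature continues past num_datasets;
# the diff truncates it, so the body here is a placeholder.
def get_many_tpu_bt_input_tensors(games, games_nr, batch_size,
                                  start_at, num_datasets):
    pass
```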
preprocessing.py (Outdated)

    # is proportionally along compared to last_game_number? comparing
    # timestamps?)
    ds = games.moves_from_games(start_at + (i * window_increment),
                                start_at + (i * window_increment) + window_size,
preprocessing.py (Outdated)

        shuffle=True,
        column_family=bigtable_input.TFEXAMPLE,
        column='example')
    ds = ds.repeat(1)
you can probably move the repeat and map out of this loop
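A pure-Python sketch of the suggested refactor (plain lists stand in for tf.data.Dataset objects here; the hypothetical names build_naive/build_hoisted are mine). Since repeat(1) is a no-op per shard and the map is elementwise, applying the map once after concatenation gives the same result as applying it inside the loop:

```python
from functools import reduce

def build_naive(shards, transform):
    # Before: the transform is applied per shard inside the loop,
    # mirroring ds.repeat(1) / ds.map(...) on each window.
    out = None
    for shard in shards:
        ds = [transform(x) for x in shard]
        out = out + ds if out is not None else ds
    return out

def build_hoisted(shards, transform):
    # After: concatenate all shards first, then transform once.
    combined = reduce(lambda a, b: a + b, shards, [])
    return [transform(x) for x in combined]
```

Both produce identical output, but the hoisted version adds one map node to the pipeline instead of one per window.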
        column='example')
    ds = ds.repeat(1)
    ds = ds.map(lambda row_name, s: s)
    dataset = dataset.concatenate(ds) if dataset else ds
Regarding the general approach: if the training loop does multiple scans, I would expect to create a new dataset for each pass, rather than try to create a single enormous dataset, which I imagine would be harder to debug, inspect, etc.
yes, but multiple calls to tpuestimator.train will create new graphs :( I am not sure what a good solution for lazy evaluation of these Datasets would be. As it is, it takes a really long time to build the datasets before training even starts -- I suspect the concatenate is doing something bad, as things get slower and slower.
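One hedged guess at the slowdown: each concatenate wraps the previous dataset, so N windows build a nested chain of depth N. A sketch of flattening in one step instead (plain iterators stand in for tf.data here; in tf.data terms this would be collecting the per-window datasets and combining them with a single flat_map or interleave rather than chained concatenate calls):

```python
import itertools

def flatten_once(shards):
    # One flat pass over all shards, instead of a depth-N chain of
    # pairwise concatenations. Works on any iterable of iterables.
    return itertools.chain.from_iterable(shards)
```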
train.py (Outdated)

    self.before_weights = None

    def train_many(start_at=1000000, num_datasets=3):
can you expose moves here also.
what do you mean? number of steps?
@amj: The following test failed.
Instead of repeatedly calling train, create repeated datasets by incrementing along the dataset window.
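The window arithmetic this describes can be sketched as follows (the function name window_bounds is mine; the bounds mirror the start_at + (i * window_increment) and + window_size expressions passed to moves_from_games in the diff above):

```python
def window_bounds(start_at, window_size, window_increment, num_datasets):
    """Yield (lo, hi) game-number ranges, one per training pass.

    Each window slides forward by window_increment, so successive
    passes train on overlapping slices of the game history.
    """
    for i in range(num_datasets):
        lo = start_at + i * window_increment
        yield lo, lo + window_size
```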
@gitosaurus this is kind of a first cut -- it has to do all the full key retrieval and shuffle before it can start, and it'd be great if I could make that lazy somehow.
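One possible shape for the laziness being asked about: wrap the expensive per-window fetch in a generator, so no keys are retrieved until the consumer actually iterates. This is a sketch only -- fetch_window is a hypothetical stand-in for the Bigtable key retrieval, and a real tf.data integration would need something like Dataset.from_generator:

```python
def lazy_windows(bounds, fetch_window):
    """Yield examples window by window, fetching each window on demand.

    bounds:       iterable of (lo, hi) game-number ranges.
    fetch_window: hypothetical callable doing the expensive retrieval
                  for one range (stands in for the Bigtable scan).
    """
    for lo, hi in bounds:
        # fetch_window runs only when this window is first consumed,
        # not when lazy_windows is constructed.
        yield from fetch_window(lo, hi)
```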