+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| input dataset prefix  | data/tiny_shakespeare                              |
| output log file       | NULL                                               |
| batch size B          | 4                                                  |
| sequence length T     | 1024                                               |
| learning rate         | 0.000300                                           |
| val_loss_every        | 20                                                 |
| val_max_batches       | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA H800                                        |
| TF32                  | enabled                                            |
+-----------------------+----------------------------------------------------+
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124439808                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 74                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
allocated 474 MiB for model parameters
allocated 5706 MiB for activations
val loss 4.506288
allocated 474 MiB for parameter gradients
allocated 252 MiB for activation gradients
allocated 474 MiB for AdamW optimizer state m
allocated 474 MiB for AdamW optimizer state v
step 1/74: train loss 4.367558 (42.291931 ms, 96850 tok/s)
step 2/74: train loss 4.435496 (36.873423 ms, 111082 tok/s)
step 3/74: train loss 4.346745 (36.917661 ms, 110949 tok/s)
step 4/74: train loss 3.916155 (36.930912 ms, 110909 tok/s)
step 5/74: train loss 3.576688 (36.882907 ms, 111054 tok/s)
step 6/74: train loss 3.752822 (36.804162 ms, 111291 tok/s)
step 7/74: train loss 3.543940 (36.823446 ms, 111233 tok/s)
step 8/74: train loss 3.691794 (36.872353 ms, 111085 tok/s)
step 9/74: train loss 3.292197 (36.917963 ms, 110948 tok/s)
step 10/74: train loss 3.420270 (36.912048 ms, 110966 tok/s)
step 11/74: train loss 3.839151 (36.832944 ms, 111204 tok/s)
step 12/74: train loss 3.459940 (36.855074 ms, 111138 tok/s)
step 13/74: train loss 3.617168 (36.834898 ms, 111198 tok/s)
step 14/74: train loss 3.230776 (36.871742 ms, 111087 tok/s)
step 15/74: train loss 3.669937 (36.886197 ms, 111044 tok/s)
step 16/74: train loss 3.859764 (36.910485 ms, 110971 tok/s)
step 17/74: train loss 3.851466 (36.913714 ms, 110961 tok/s)
step 18/74: train loss 3.920131 (36.908567 ms, 110976 tok/s)
step 19/74: train loss 3.639019 (36.922885 ms, 110933 tok/s)
step 20/74: train loss 3.733764 (36.988406 ms, 110737 tok/s)
val loss 3.687801
generating:
---
O, my cousin: that is so.
<|endoftext|>O<|endoftext|>Trussell, thy father's son, son of the Roman king Hardsley, heir<|endoftext|>LUTHER, for whom shall Scotland draw her throne, except for England? In England, consulate marcius: And bind together Egbert
---
step 21/74: train loss 3.719804 (36.868217 ms, 111098 tok/s)
step 22/74: train loss 3.586144 (36.921155 ms, 110939 tok/s)
step 23/74: train loss 3.551655 (36.905093 ms, 110987 tok/s)
step 24/74: train loss 3.351520 (36.939467 ms, 110884 tok/s)
step 25/74: train loss 3.454527 (36.997209 ms, 110711 tok/s)
step 26/74: train loss 3.761025 (36.983948 ms, 110750 tok/s)
step 27/74: train loss 3.779032 (36.952961 ms, 110843 tok/s)
step 28/74: train loss 3.636410 (36.875919 ms, 111075 tok/s)
step 29/74: train loss 3.448576 (36.891939 ms, 111026 tok/s)
step 30/74: train loss 3.574333 (36.936213 ms, 110893 tok/s)
step 31/74: train loss 3.509148 (36.920664 ms, 110940 tok/s)
step 32/74: train loss 3.362097 (36.908129 ms, 110978 tok/s)
step 33/74: train loss 3.421195 (36.975713 ms, 110775 tok/s)
step 34/74: train loss 3.684764 (36.974622 ms, 110778 tok/s)
step 35/74: train loss 3.381419 (36.919138 ms, 110945 tok/s)
step 36/74: train loss 3.401418 (36.891895 ms, 111027 tok/s)
step 37/74: train loss 3.812751 (36.899238 ms, 111005 tok/s)
step 38/74: train loss 3.623131 (36.911724 ms, 110967 tok/s)
step 39/74: train loss 3.489853 (36.926676 ms, 110922 tok/s)
step 40/74: train loss 3.137516 (36.902834 ms, 110994 tok/s)
val loss 3.635424
generating:
---
Diademorns, God, thou wilt be king: It's busy with summer, like day before. Is it sorrows, is it stories, Waters, and naked lamentations? Were they such a office, for marriage? Who wilt party the day chronicling, And
---
step 41/74: train loss 3.476893 (36.845553 ms, 111166 tok/s)
step 42/74: train loss 3.330724 (36.863462 ms, 111112 tok/s)
step 43/74: train loss 3.477123 (36.838678 ms, 111187 tok/s)
step 44/74: train loss 3.366669 (36.889697 ms, 111033 tok/s)
step 45/74: train loss 3.979407 (36.901669 ms, 110997 tok/s)
step 46/74: train loss 3.866721 (36.905043 ms, 110987 tok/s)
step 47/74: train loss 3.774495 (36.912187 ms, 110966 tok/s)
step 48/74: train loss 3.962839 (36.842677 ms, 111175 tok/s)
step 49/74: train loss 4.036259 (36.851610 ms, 111148 tok/s)
step 50/74: train loss 3.857388 (36.853193 ms, 111143 tok/s)
step 51/74: train loss 3.604754 (36.845790 ms, 111166 tok/s)
step 52/74: train loss 3.579455 (36.831378 ms, 111209 tok/s)
step 53/74: train loss 3.824139 (36.857464 ms, 111130 tok/s)
step 54/74: train loss 3.766292 (36.894130 ms, 111020 tok/s)
step 55/74: train loss 3.487747 (36.934630 ms, 110898 tok/s)
step 56/74: train loss 3.151821 (36.968007 ms, 110798 tok/s)
step 57/74: train loss 3.344814 (36.882516 ms, 111055 tok/s)
step 58/74: train loss 3.522471 (36.902438 ms, 110995 tok/s)
step 59/74: train loss 3.373972 (37.128153 ms, 110320 tok/s)
step 60/74: train loss 3.433309 (36.883408 ms, 111052 tok/s)
val loss 3.529820
generating:
---
01:<|endoftext|>Look, I am an o'erthompson by blood, I can live with my blood, saying to the ford, thou liest!
That doeth think, in Wynldota's my noble soul; I'll do it here for York: and though thy lordship do
---
step 61/74: train loss 3.323022 (36.834574 ms, 111199 tok/s)
step 62/74: train loss 3.263776 (36.836349 ms, 111194 tok/s)
step 63/74: train loss 3.288344 (36.908309 ms, 110977 tok/s)
step 64/74: train loss 3.792862 (36.868552 ms, 111097 tok/s)
step 65/74: train loss 3.561376 (36.875281 ms, 111077 tok/s)
step 66/74: train loss 3.339633 (36.894187 ms, 111020 tok/s)
step 67/74: train loss 3.232719 (36.929932 ms, 110912 tok/s)
step 68/74: train loss 3.424346 (36.951805 ms, 110847 tok/s)
step 69/74: train loss 3.259733 (36.961079 ms, 110819 tok/s)
step 70/74: train loss 3.071201 (36.967520 ms, 110799 tok/s)
step 71/74: train loss 3.048391 (36.890885 ms, 111030 tok/s)
step 72/74: train loss 3.058076 (36.883778 ms, 111051 tok/s)
step 73/74: train loss 3.697609 (36.878680 ms, 111066 tok/s)
step 74/74: train loss 3.497812 (36.910939 ms, 110969 tok/s)
val loss 3.515306
generating:
---
A faint noise was heard in the air. Gentlemen, I have fairly said, than should be the report he shall hear, and you'll that have only no prison made for you.
<|endoftext|>RICHARD VINCENTIO: Why, how you sounded.
<|endoftext|>CAM<|endoftext|>K AS
---
total average iteration time: 36.974027 ms