
Deep language model: GPT

OpenAI proposed the GPT (Generative Pre-Training) model in 2018. The model follows a pre-training plus fine-tuning paradigm and can be applied to tasks such as classification, entailment/reasoning, question answering, and similarity.

GPT is built on the Transformer, but differs from it in a few ways:

The input is the embedding vector of each word in the sentence: h_0 = U·W_e + W_p, where W_e is the token embedding matrix and W_p the position embedding matrix.

The model is a single stack of Transformer blocks, and the output of the last layer is h_L.

The last-layer output is multiplied by a projection matrix to produce scores over the vocabulary, and a softmax turns them into the probability of each word; maximizing these probabilities yields the language-modeling loss L1(C). Note that when computing P(u), the token embedding matrix W_e itself is reused as the projection (weight tying), a common trick in language models.
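To make the tied-embedding softmax and the L1(C) loss concrete, here is a minimal PyTorch sketch. It is not the paper's code; the vocabulary size, model width, and the random stand-in for h_L are illustrative assumptions.

```python
# Minimal sketch: project h_L back through the token embedding matrix W_e
# (weight tying) and maximize the log-probability of each next word.
import torch
import torch.nn.functional as F

vocab_size, d_model, seq_len = 10000, 768, 64            # assumed sizes
W_e = torch.nn.Embedding(vocab_size, d_model)            # token embeddings W_e
tokens = torch.randint(0, vocab_size, (1, seq_len))      # input word ids u_1..u_T
h_L = torch.randn(1, seq_len, d_model)                   # stand-in for the last-layer output

logits = h_L @ W_e.weight.T                              # P(u) = softmax(h_L · W_e^T)
# L1(C): cross-entropy between position i's prediction and the next word u_{i+1}
L1 = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                     tokens[:, 1:].reshape(-1))
```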

Given an input such as Text1 [SEP] Text2, GPT keeps only the decoder part of the original Transformer, with masked self-attention, so that every position in the last layer outputs a probability distribution over the next word; the loss is then computed against the actual next word at each position.
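As a sketch of the masked (causal) self-attention this decoder-only setup relies on: position i may attend only to positions up to i, so each position can still be trained to predict its next word. The shapes and projection matrices below are illustrative, not the original implementation.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, W_q, W_k, W_v):
    """x: (batch, seq_len, d_model); W_*: (d_model, d_model) projection matrices."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
    # Upper-triangular mask: future positions are set to -inf before the softmax.
    mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(1, 8, 64)
W_q, W_k, W_v = (torch.randn(64, 64) for _ in range(3))
out = causal_self_attention(x, W_q, W_k, W_v)   # (1, 8, 64)
```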

Fine-tune the model parameters with a small amount of labeled data.

Take the last-layer output h_L at the final word from the previous step as the input to the downstream supervised task.

Compute the loss against the supervised label to obtain L2(C).

The final fine-tuning loss is the weighted sum of L2(C) and L1(C): L3(C) = L2(C) + λ·L1(C).
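A minimal sketch of how the two losses are combined during fine-tuning, assuming the same forward pass produces both the hidden state of the final token and the next-word logits; the task head W_y, lambda_aux, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

d_model, n_classes, vocab_size = 768, 2, 10000
h_last = torch.randn(4, d_model)                      # h_L at the final token, batch of 4
W_y = torch.nn.Linear(d_model, n_classes)             # task-specific output layer

labels = torch.randint(0, n_classes, (4,))            # supervised labels
L2 = F.cross_entropy(W_y(h_last), labels)             # L2(C): supervised loss

lm_logits = torch.randn(4, 63, vocab_size)            # next-word logits from the same pass
lm_targets = torch.randint(0, vocab_size, (4, 63))
L1 = F.cross_entropy(lm_logits.reshape(-1, vocab_size), lm_targets.reshape(-1))

lambda_aux = 0.5                                      # weighting coefficient λ
L3 = L2 + lambda_aux * L1                             # L3(C) = L2(C) + λ·L1(C)
```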

GPT is a one-way (unidirectional) Transformer, so it cannot use the semantics of the words that follow the current word. In a translation or generation scenario this seems harmless, since the later words are unknown anyway. But is that really the case?