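Why encode at all? Can a computer understand natural language directly? No. Encoding is therefore the step that turns natural language into numbers a computer can work with, while preserving the information carried by each word.
Then what information does positional encoding try to preserve? Position. Why does position matter? Because position sometimes determines meaning. Negation is the classic case. In the two sentences below, the dish that contains cucumber changes depending on where the negation "not" is placed.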

The gimbap does not contain cucumber, and the sandwich contains cucumber.  
The gimbap contains cucumber, and the sandwich does not contain cucumber.

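Now that we know position matters, let's look at how to inject position information into words that have already been converted into a form the computer can understand.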
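
Before turning to the PyTorch API, the point above can be sketched in code. The toy vocabulary, the token ids, and the learned position table `pos` are all assumptions made for illustration; adding a learned position vector is only one of several ways to inject position, not necessarily the one a given model uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy vocabulary covering the two example sentences.
vocab = {"gimbap": 0, "sandwich": 1, "cucumber": 2, "not": 3, "contains": 4}

# Sentence 1: gimbap not contains cucumber, sandwich contains cucumber
s1 = torch.tensor([0, 3, 4, 2, 1, 4, 2])
# Sentence 2: gimbap contains cucumber, sandwich not contains cucumber
s2 = torch.tensor([0, 4, 2, 1, 3, 4, 2])

tok = nn.Embedding(len(vocab), 4)   # token embeddings
pos = nn.Embedding(len(s1), 4)      # one learned vector per position (assumption)
positions = torch.arange(len(s1))

# Both sentences use exactly the same tokens, only in a different order.
same_tokens = torch.equal(s1.sort().values, s2.sort().values)

# Without position information, "not" maps to the same vector in both sentences.
same_without_pos = torch.equal(tok(s1)[1], tok(s2)[4])

# Adding a position vector makes the two occurrences of "not" distinguishable.
x1 = tok(s1) + pos(positions)
x2 = tok(s2) + pos(positions)
same_with_pos = torch.allclose(x1[1], x2[4])

print(same_tokens, same_without_pos, same_with_pos)
```

The token embedding alone cannot tell which dish the negation belongs to; only the position-dependent term separates the two sentences.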

Examples: PyTorch

Embedding

Docstring

class Embedding(Module):
    r"""A simple lookup table that stores embeddings of a fixed dictionary and size.

    This module is often used to store word embeddings and retrieve them using indices.
    The input to the module is a list of indices, and the output is the corresponding
    word embeddings.

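Every neural network in PyTorch starts by subclassing nn.Module. The Module in the class definition above is nn.Module, which the official source imports separately. According to the docstring, this module is commonly used to store word embeddings and to retrieve them by index. The input to the module is a list of indices, and the output is the corresponding word embeddings.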
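
A quick check of both statements, as a minimal sketch: the sizes 10 and 3 are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# nn.Embedding is itself a subclass of nn.Module.
is_module = issubclass(nn.Embedding, nn.Module)

embedding = nn.Embedding(10, 3)

# A plain Python list of indices is wrapped in an integer tensor first.
indices = torch.tensor([1, 2, 4, 5])
vectors = embedding(indices)   # one 3-dimensional vector per index

print(is_module, vectors.shape)
```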

Arguments

Parameters to pass when creating an instance.

    Args:
        num_embeddings (int): size of the dictionary of embeddings
        embedding_dim (int): the size of each embedding vector
        padding_idx (int, optional): If specified, the entries at :attr:`padding_idx` do not contribute to the gradient;
                                     therefore, the embedding vector at :attr:`padding_idx` is not updated during training,
                                     i.e. it remains as a fixed "pad". For a newly constructed Embedding,
                                     the embedding vector at :attr:`padding_idx` will default to all zeros,
                                     but can be updated to another value to be used as the padding vector.
        max_norm (float, optional): If given, each embedding vector with norm larger than :attr:`max_norm`
                                    is renormalized to have norm :attr:`max_norm`.
        norm_type (float, optional): The p of the p-norm to compute for the :attr:`max_norm` option. Default ``2``.
        scale_grad_by_freq (bool, optional): If given, this will scale gradients by the inverse of frequency of
                                                the words in the mini-batch. Default ``False``.
        sparse (bool, optional): If ``True``, gradient w.r.t. :attr:`weight` matrix will be a sparse tensor.
                                 See Notes for more details regarding sparse gradients.

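num_embeddings: an int, the size of the embedding dictionary, i.e. the number of distinct indices that can be looked up. Valid indices run from 0 to num_embeddings - 1.
embedding_dim: an int, the length of each embedding vector.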
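
The effect of `padding_idx` described in the docstring can be observed directly. This is a small sketch with arbitrary sizes; summing the output is just a convenient way to get a scalar to backpropagate, not a meaningful loss.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=0)

# Index 0 is the padding index; indices 2 and 5 are ordinary entries.
indices = torch.tensor([[0, 2, 0, 5]])
embedding(indices).sum().backward()

grad = embedding.weight.grad
print(grad[0])  # padding row: receives no gradient
print(grad[2])  # looked-up, non-padding row: receives a gradient
print(grad[7])  # row that was never looked up: no gradient either
```

Because the padding row never receives a gradient, an optimizer step leaves it at its initial all-zero value.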

Why must num_embeddings be larger than the largest value to be embedded?

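It helps to think about how the embedding instance is used. Each input value is an integer index, and it makes no difference whether that index is written as 9 or as len('가나다라마바사') (which is 7). What matters is that nn.Embedding allocates its parameter table when the instance is created, so the table must already contain a row for every index that will be looked up. Since indices start at 0, the largest index has to be strictly smaller than num_embeddings.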

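import torch
import torch.nn as nn

input = torch.LongTensor([[1,2,4,5],[4,3,2,len('가나다라마바사')]])  # the last index is 7
embedding = nn.Embedding(10, 3)  # 10 rows, so valid indices are 0..9

embedding(input[1][-1])  # look up the vector stored for index 7
# tensor([ 0.2074,  0.0673, -0.1462], grad_fn=<EmbeddingBackward0>)  (values differ per run)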

Creating an instance with nn.Embedding allocates a parameter of shape (num_embeddings, embedding_dim). Calling embedding(input) does not compute anything new; for each index in input it returns the corresponding row of that pre-allocated parameter.
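
The lookup behaviour and the index limit can both be checked with a short sketch. The `IndexError` is what CPU tensors are expected to raise for an out-of-range index; on other devices the failure may surface differently, so treat that part as an assumption about the CPU path.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 3)   # valid indices: 0 .. 9

# Looking up index 7 returns exactly row 7 of the weight table.
idx = torch.tensor(7)
same_row = torch.equal(embedding(idx), embedding.weight[7])

# An index equal to num_embeddings (10) has no row in the table.
try:
    embedding(torch.tensor(10))
    out_of_range_error = None
except IndexError as err:
    out_of_range_error = err

print(same_row, type(out_of_range_error).__name__)
```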

reference
- https://discuss.pytorch.kr/t/embedding/942

Attributes

    Attributes:
        weight (Tensor): the learnable weights of the module of shape (num_embeddings, embedding_dim)
                         initialized from :math:`\mathcal{N}(0, 1)`

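As noted earlier, passing num_embeddings and embedding_dim when creating the instance produces a parameter of that shape; weight is that parameter. Its entries are initialized randomly from the standard normal distribution N(0, 1), that is, mean 0 and variance 1, so they are not confined to the range 0 to 1 and can be negative. This Tensor is learned directly during training.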
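
A rough empirical check of the initialization, as a sketch: the table size (1000 x 64) is an arbitrary choice, picked only so that the sample statistics are stable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(1000, 64)
w = embedding.weight.detach()

mean = w.mean().item()   # close to 0
std = w.std().item()     # close to 1

# Fraction of entries that fall outside the interval [0, 1].
frac_outside_unit = ((w < 0) | (w > 1)).float().mean().item()

print(f"mean={mean:.3f} std={std:.3f} outside [0, 1]: {frac_outside_unit:.2%}")
print(isinstance(embedding.weight, nn.Parameter), embedding.weight.requires_grad)
```

More than half of the entries lie outside [0, 1], which is what a standard normal distribution predicts.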

Shape

    Shape:
        - Input: :math:`(*)`, IntTensor or LongTensor of arbitrary shape containing the indices to extract
        - Output: :math:`(*, H)`, where `*` is the input shape and :math:`H=\text{embedding\_dim}`
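
The shape rule can be illustrated with index tensors of several ranks. This is a minimal sketch; the shapes tried here are arbitrary.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 3)   # H = embedding_dim = 3

out_shapes = []
for shape in [(4,), (2, 4), (2, 3, 4)]:
    indices = torch.randint(0, 10, shape)   # int64 indices in [0, 10)
    out = embedding(indices)
    out_shapes.append(tuple(out.shape))
    print(tuple(indices.shape), "->", tuple(out.shape))
```

Whatever the input shape is, the output keeps it and appends one trailing dimension of size embedding_dim.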

    .. note::
        Keep in mind that only a limited number of optimizers support
        sparse gradients: currently it's :class:`optim.SGD` (`CUDA` and `CPU`),
        :class:`optim.SparseAdam` (`CUDA` and `CPU`) and :class:`optim.Adagrad` (`CPU`)
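
A small sketch of the sparse-gradient case on CPU, using `optim.SGD` because the note lists it as supported. The learning rate and sizes are arbitrary.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 3, sparse=True)
optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)

before = embedding.weight.detach().clone()

# Only rows 1 and 2 are looked up, so only they receive a gradient.
embedding(torch.tensor([1, 2])).sum().backward()
grad_is_sparse = embedding.weight.grad.is_sparse

optimizer.step()
after = embedding.weight.detach()

# Rows whose values changed after the optimizer step.
changed_rows = (before != after).any(dim=1).nonzero().flatten().tolist()
print(grad_is_sparse, changed_rows)
```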

    .. note::
        When :attr:`max_norm` is not ``None``, :class:`Embedding`'s forward method will modify the
        :attr:`weight` tensor in-place. Since tensors needed for gradient computations cannot be
        modified in-place, performing a differentiable operation on ``Embedding.weight`` before
        calling :class:`Embedding`'s forward method requires cloning ``Embedding.weight`` when
        :attr:`max_norm` is not ``None``. For example::

            n, d, m = 3, 5, 7
            embedding = nn.Embedding(n, d, max_norm=True)
            W = torch.randn((m, d), requires_grad=True)
            idx = torch.tensor([1, 2])
            a = embedding.weight.clone() @ W.t()  # weight must be cloned for this to be differentiable
            b = embedding(idx) @ W.t()  # modifies weight in-place
            out = (a.unsqueeze(0) + b.unsqueeze(1))
            loss = out.sigmoid().prod()
            loss.backward()
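
The in-place renormalization mentioned in this note can be observed by comparing row norms before and after a forward call. This is a sketch with arbitrary sizes and `max_norm=1.0`; only the rows that are actually looked up are touched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(10, 8, max_norm=1.0)

norms_before = embedding.weight.detach().norm(dim=1)

idx = torch.tensor([1, 2])
out = embedding(idx)   # forward renormalizes the looked-up rows in place

norms_after = embedding.weight.detach().norm(dim=1)

print(norms_before[:4])
print(norms_after[:4])
```

Rows 1 and 2 end up with norm at most max_norm, while rows that were not looked up keep their original norm.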

Examples

    Examples::

        >>> # an Embedding module containing 10 tensors of size 3
        >>> embedding = nn.Embedding(10, 3)
        >>> # a batch of 2 samples of 4 indices each
        >>> input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
        >>> # xdoctest: +IGNORE_WANT("non-deterministic")
        >>> embedding(input)
        tensor([[[-0.0251, -1.6902,  0.7172],
                 [-0.6431,  0.0748,  0.6969],
                 [ 1.4970,  1.3448, -0.9685],
                 [-0.3677, -2.7265, -0.1685]],

                [[ 1.4970,  1.3448, -0.9685],
                 [ 0.4362, -0.4004,  0.9400],
                 [-0.6431,  0.0748,  0.6969],
                 [ 0.9124, -2.3616,  1.1151]]])


        >>> # example with padding_idx
        >>> embedding = nn.Embedding(10, 3, padding_idx=0)
        >>> input = torch.LongTensor([[0,2,0,5]])
        >>> embedding(input)
        tensor([[[ 0.0000,  0.0000,  0.0000],
                 [ 0.1535, -2.0309,  0.9315],
                 [ 0.0000,  0.0000,  0.0000],
                 [-0.1655,  0.9897,  0.0635]]])

        >>> # example of changing `pad` vector
        >>> padding_idx = 0
        >>> embedding = nn.Embedding(3, 3, padding_idx=padding_idx)
        >>> embedding.weight
        Parameter containing:
        tensor([[ 0.0000,  0.0000,  0.0000],
                [-0.7895, -0.7089, -0.0364],
                [ 0.6778,  0.5803,  0.2678]], requires_grad=True)
        >>> with torch.no_grad():
        ...     embedding.weight[padding_idx] = torch.ones(3)
        >>> embedding.weight
        Parameter containing:
        tensor([[ 1.0000,  1.0000,  1.0000],
                [-0.7895, -0.7089, -0.0364],
                [ 0.6778,  0.5803,  0.2678]], requires_grad=True)
    """
    __constants__ = ['num_embeddings', 'embedding_dim', 'padding_idx', 'max_norm',
                     'norm_type', 'scale_grad_by_freq', 'sparse']