
[NLP] 1. Word Embedding (1)

mingyung 2024. 8. 6. 19:00

Word Vectors

์–ด๋–ค ๋ฌธ์žฅ์ด ์ฃผ์–ด์กŒ์„๋•Œ, ๋ฌธ์žฅ์˜ ์–ด๋–ค ๋ถ€๋ถ„์ด ๋ฌธ์žฅ์˜ ์˜๋ฏธ๋ฅผ ๋” ์ค‘์š”ํ•˜๊ฒŒ ํ‘œํ˜„ํ•˜๋Š”์ง€๋ฅผ ์–ด๋–ป๊ฒŒ ์•Œ ์ˆ˜ ์žˆ์„๊นŒ?

์‚ฌ๋žŒ์˜ ์–ธ์–ด๋ผ๋Š” ๊ฒƒ์€ ๋‹ค์–‘ํ•œ ์˜๋ฏธ์™€ ๋‰˜์•™์Šค๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ์–ธ์–ด์˜ ์ •๋ณด๋ฅผ ์ž˜ ํฌํ•จํ•˜๊ฒŒ ํ‘œํ˜„ํ•˜๊ณ , ์ด๋ฅผ ๋‹ค๋ฃจ๋Š”๊ฒƒ์€ ์ •๋ง ์–ด๋ ค์šด ์ผ์ด๋‹ค.

One-Hot Vector

The easiest and most basic way to handle words is to treat each word as a separate entity that does not depend on any other.
That is, a word can be represented as a one-hot vector.

 

For example, each word in the finite set {coffee, cafe, tea} is represented as a 1-hot vector as follows.
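With a vocabulary of size 3, each word gets a 3-dimensional vector that is 1 at its own index and 0 everywhere else (which index goes to which word is arbitrary):

$$ \text{coffee} = [1, 0, 0], \quad \text{cafe} = [0, 1, 0], \quad \text{tea} = [0, 0, 1] $$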

์ „ํ†ต์ ๋”˜ NLP์—์„œ ๋‹จ์–ด๋ฅผ ์ด์™€ ๊ฐ™์€ ํ˜•ํƒœ๋กœ ํ‘œํ˜„ํ•˜์—ฌ ์‚ฌ์šฉํ•œ๋‹ค.

 

With this representation, any two distinct words are orthogonal: their dot product (and hence cosine similarity) is always 0, and distance measures such as L1 or L2 give the same value for every pair of words.

In practice, however, the words we use naturally have some degree of relatedness, and the strength of that relatedness differs from pair to pair.
So we need a better representation than one-hot vectors, which treat every word as an independent word vector.

 

Putting it all together, one-hot vectors have the following problems.

1. Sparsity problem: only a single element is 1 and every other element is 0.

2. Scalability problem: as the vocabulary size grows, the vector dimension grows linearly with it.

3. Curse of dimensionality: high-dimensional data causes computational complexity and overfitting.

4. OOV problem: new words that have no vector representation all have to be lumped together as UNK.

5. Fixed vocabulary problem: adding a new word is very costly.

6. Limited information problem: relationships between words cannot be captured.

 

That is a lot of problems, but the biggest one is limited information. word2vec appeared to solve it.

+WordNet

WordNet์€ ํ”„๋ฆฐ์Šคํ„ด๋Œ€ํ•™์—์„œ ๋‹จ์–ด๋ฅผ syno/antonyms, hypo/hypernyms, + ๋‹ค๋ฅธ ๋‹จ์–ด์™€์˜ ๊ด€๊ณ„๋“ฑ์„ ๋ฏธ๋ฆฌ ์ •์˜ํ•˜์—ฌ ์ œ๊ณตํ•˜๋Š” ์–ธ์–ด ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋‹ค.

๋”ฐ๋ผ์„œ ๋ฏธ๋ฆฌ ์ •์˜ํ•ด๋‘” ์–ธ์–ด, ๋‹จ์–ด์—๋งŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ณ , ์‚ฌ๋žŒ์ด annotationํ•œ ๊ฒƒ ์ด๋ฏ€๋กœ ์ฃผ๊ด€์ ์ธ ํŒ๋‹จ์ด ๊ฐœ์ž…๋œ๋‹ค๋Š” ์ ์ด ์žˆ๋‹ค.


In addition, new words and senses have to be added and revised continuously, and this costs far too much.

๋‹ค๋ฅธ ๊ด€์ ์—์„œ๋Š”, ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์— ํšจ์œจ์ ์ด์ง€ ๋ชปํ•œ ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋งŒ์•ฝ ๋‹จ์–ด์— ๋Œ€ํ•œ ๋ชจ๋“  ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๊ณ  ์‹ถ๋‹ค๋ฉด, ๊ฐœ๋ณ„ ๋‹จ์–ด๋“ค์€ ๋งค์šฐ ๋†’์€ dimension์„ ๊ฐ€์ง€๊ฒŒ ๋œ๋‹ค.(neural model๋“ค์—์„œ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€ ์•Š๋Š” ๊ตฌ์กฐ...์ฆ‰, tradeoff๋ฐœ์ƒ)


Distributional Semantics (Word2Vec)

๊ธฐ์กด์˜ ๋ฐฉ๋ฒ•๋ก ์˜ ํ•œ๊ณ„์ ์€ ํฌ๊ฒŒ ๋‘๊ฐ€์ง€๋กœ ์ƒ๊ฐํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค.

  1. The sparsity problem
  2. The problem of capturing the semantic meanings of words

์ด ๋‘๊ฐ€์ง€์˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋‹จ์–ด์˜ ์˜๋ฏธ๋ฅผ ๋‹จ์–ด๊ฐ€ ์“ฐ๋ฏผ ๋ฌธ๋งฅ์„ ํ†ตํ•ด ์ดํ•ดํ•˜๋ ค๋Š” ์ ‘๊ทผ๋ฒ•์ด ๋‚˜์™”๋‹ค. ํ•œ๊ตญ์–ด๋กœ๋Š” ๋ถ„ํฌ ์˜๋ฏธ๋ก ์ด๋ผ๊ณ  ํ•œ๋‹ค.

Distributional sementics์˜ ์•„์ด๋””์–ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

A word's meaning is given by the words that frequently appear close-by

์ฆ‰, ๋‹จ์–ด์˜ ์˜๋ฏธ๋Š” ๊ทธ ๋‹จ์–ด๊ฐ€ ์ฃผ๋กœ ์–ด๋–ค ๋‹จ์–ด์™€ ํ•จ๊ผ ๋‚˜ํƒ€๋‚˜๋Š”์ง€์— ๋”ฐ๋ผ์„œ ๊ฒฐ์ •๋œ๋‹ค.

Under this approach, Google proposed Word2Vec, a framework that represents words as fixed-size vectors.

 

Word2Vec

Word2Vec ๋ชจ๋ธ์€ probabillistic mode์ด๊ณ , ๊ณ ์ •๋œ vacabulary์—์„œ ๋‹จ์–ด๋ฅผ vocabulary size๋ณด๋‹ค ์ž‘์€ low dimension์˜ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•œ๋‹ค. (์ฃผ๋กœ window size๋กœ 2~4 word๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค)

In other words, what used to be a one-hot vector is now expressed as a small dense vector.

 

๋‘๊ฐ€์ง€์˜ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•œ๋‹ค.

  1. Continuous Bag of Words (CBOW): predict the center word from the surrounding words
  2. Skip-gram: predict the surrounding words from the center word
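As a hedged sketch, both variants are available in the gensim library (the toy corpus below is made up for illustration; in gensim 4.x, sg=1 selects skip-gram and sg=0 selects CBOW):

```python
# A minimal sketch using gensim (4.x); the toy corpus is made up for illustration.
from gensim.models import Word2Vec

sentences = [
    ["i", "drink", "coffee", "every", "morning"],
    ["she", "drinks", "tea", "at", "the", "cafe"],
    ["we", "bought", "coffee", "at", "the", "cafe"],
]

# sg=1 -> skip-gram, sg=0 -> CBOW; window=2 matches the 2~4 range above
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)

print(model.wv["coffee"].shape)        # (50,) dense vector instead of one-hot
print(model.wv.most_similar("coffee")) # nearest words by cosine similarity
```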

 

Skip-Gram Word2Vec

Skip-gram word2vec wants to know, given a center word c, which words o appear around it.

skipgram ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

์—ฌ๊ธฐ์„œ u์˜ ๊ฒฝ์šฐ word๊ฐ€ context(outside)๋กœ ์‚ฌ์šฉ๋˜์—ˆ์„ ๋•Œ์˜ parameter, v์™ ๊ฒฝ์šฐ word๊ฐ€ center๋กœ ์‚ฌ์šฉ๋˜์—ˆ์„ ๋•Œ์˜ parameter์ด๋‹ค

์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ˜•ํƒœ๊ฐ€ ์ƒ๋‹นํžˆ ์ต์ˆ™ํ•œ๋ฐ, softmax์™€ ๊ฐ™์€ ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋Š”๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

์‹์„ ์กฐ๊ธˆ ๋” ์‚ดํŽด๋ณด๋ฉด

  • ๋ถ„์ž: dotproduct๋ฅผ ํ†ตํ•ด์„œ o์™€ c์˜ score๋ฅผ ์–ป๋Š”๋‹ค.
  • ๋ถ„๋ชจ: dot product๋ฅผ ํ†ตํ•ด์„œ vocabulary์˜ ๋ชจ๋“  word์™€ c์˜ ๊ฐœ๋ณ„ score๋ฅผ ์–ป๋Š”๋‹ค
  • softmax๋ฅผ ์ ์šฉํ•œ๋‹ค - ์ด๋ฅผ ํ†ตํ•ด์„œ probability distribution์„ ์–ป๊ฒŒ ๋œ๋‹ค. (์ค‘์‹ฌ ๋‹จ์–ด c์ผ๋•Œ ์ฃผ๋ณ€๋‹จ์–ด๊ฐ€ o๊ฐ€ ๋‚˜์˜ฌ ํ™•๋ฅ ์„ ๊ตฌํ•œ๋‹ค.)

(Softmax amplifies the largest x, but still assigns some probability to smaller x.)
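For reference, the softmax that turns arbitrary scores x_1, ..., x_n into a probability distribution is:

$$ \mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} $$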

That covers the probability; what we want is to obtain an optimized model from it.

๋”ฐ๋ผ์„œ Likelihood๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • For every position t, take the probabilities of the surrounding words within the window size m and multiply them together to get the likelihood.
  • Here θ stands for all the parameters of the model. (Each word has two d-dimensional parameter vectors, u and v.)

The objective function (loss) derived from the likelihood is as follows.
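Taking the average negative log-likelihood:

$$ J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t ; \theta) $$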

Maximizing the likelihood is equivalent to minimizing the objective function.

 

์œ„์—์„œ ์ฒ˜์Œ ์•Œ์•„๋ณธ representation u,v๋ฅผ ์ ์šฉํ•˜๋ฉด gradient๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ณ„์‚ฐ๋œ๋‹ค

  • ๋งˆ๊ด€์ฐฐ๋œ ๋‹จ์–ด์˜ ๋ฒกํ„ฐ์—์„œ ์˜ˆ์ƒ๋˜๋Š” ๋ชจ๋“  context vector์˜ weighted ํ‰๊ท ์„ ๋บ€ ๊ฐ’์ด ๋œ๋‹ค.
  • ๋”ฐ๋ผ์„œ ํ•™์Šต์˜ ๊ณผ์ •์—์„œ ๋ชจ๋ธ์€ ์‹ค์ œ ๊ด€์ฐฐ๋œ ๋‹จ์–ด์™€ ๋” ๋น„์Šทํ•œ ๋‹จ์–ด๋กœ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต๋œ๋‹ค.

๋‚˜๋จธ์ง€ ๊ณ„์‚ฐ์€ ๋‹ค์Œํฌ์ŠคํŒ…์—์„œ ๋” ์‚ดํŽด๋ณด์ž

 

Word2Vec Problems

๋‹ค๋งŒ ์ด Word2Vec์—๋Š” ๋ช‡๊ฐ€์ง€ ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค.

  1. The OOV problem
  2. It ignores global co-occurrence statistics
  3. It cannot capture relationships beyond the window size

OOV Problem

Word2vec์—์„œ๋Š” Top K words๋งŒ mappingํ•˜๊ณ , ๋‹ค๋ฅธ ๋‹จ์–ด๋Š” ๋ชจ๋‘ unk ๋กœ ๋ถ„๋ฅ˜ํ•˜์—ฌ์„œ representationํ•œ๋‹ค๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค. 

๋”ฐ๋ผ์„œ ํ›ˆ๋ จ์ค‘์— ๋“œ๋ฌผ๊ฒŒ ๋“ฑ์žฅํ–ˆ๊ฑฐ๋‚˜, ๋“ฑ์žฅํ•˜์ง€ ์•Š์€ ๋‹จ์–ด์— ๋Œ€ํ•ด์„œ๋Š” ๋‹จ์–ด์˜ ์˜๋ฏธ๋ฅผ ํŒŒ์•…ํ•˜์ง€ ๋ชฉํ•œ๋‹ค.

 

The words that fall outside the top-K are typically of the following types:

- compound words

- derived words

- plurals

- verb conjugations

- new words formed in predictable ways

 

For these cases, subword techniques are used.

๋Œ€ํ‘œ์ ์ธ ๋ชจ๋ธ๋กœ FastText๊ฐ€ ์žˆ๋‹ค.

 

FastText

fasttext๋Š” word2vec์˜ ๋ฐœ์ „๋œ ๋ชจ๋ธ์ด๋‹ค.

Instead of turning the whole word into a single vector, it splits the word into several subword pieces. That is, a word is decomposed into character n-grams and represented as a set of them.
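A small sketch of the idea (the < and > boundary markers follow the FastText paper's convention; the function name is just for illustration):

```python
# A sketch of FastText-style character n-grams (here n = 3).
def char_ngrams(word, n=3):
    wrapped = f"<{word}>"  # mark word boundaries, as in the FastText paper
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    return grams + [wrapped]  # the full word is kept as one extra "gram"

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```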

 

์ด๋ ‡๊ฒŒ ๋˜๋ฉด OOV๋ฌธ์ œ๋ฅผ ์กฐ๊ธˆ ํ•ด๊ฒฐํ•ด๋ณผ ์ˆ˜ ์žˆ๊ณ , ๋ฌธ๋ฒ•์  ๋ณ€ํ˜•๋„ ์•ฝ๊ฐ„ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.

 

Skip-Gram with Negative Sampling (SGNS)

Skipgram์˜ softmax๋ฅผ ๋‹ค์‹œ tkfvu ๋ณด์ž.

๋‹จ์–ด๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ๋ถ„์ž๋ถ€๋ถ„์„ ๊ณ„์‚ฐํ•˜๋Š”๊ฑด ์‚ฌ์‹ค ๋น„์šฉ์ด ๋งŽ์ด ๋“ค์ง€ ์•Š๋Š”๋‹ค.

๋ฌธ์ œ๋Š” ๋ถ„๋ชจ์ธ๋ฐ, ๋ชจ๋“  ๋‹จ์–ด๋“ค์— ๋Œ€ํ•œ score๋ฅผ ๊ตฌํ•ด์„œ ๋”ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ normalizeํ•˜๊ณ  ์žˆ๋‹ค.

์ด ๋ถ€๋ถ„์—์„œ computingํ•˜๋Š”๋ฐ์— ๋น„์šฉ์ด ๋งŽ์ด ๋“ ๋‹ค๋Š” ๋ฌธ์ œ์ ์ด ์žˆ๋‹ค.

 

์ฆ‰, ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‹จ์–ด์™€์˜ ๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด์„œ V์˜ ๋ชจ๋“  ๋‹จ์–ด๋ฅผ ํ™•์ธํ•ด์•ผ ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ๋‹จ์–ด์˜ ํ’€์ด ์ปค์ง€๋ฉด ๊ทธ๋งŒํผ ๊ณ„์‚ฐ๋น„์šฉ๋„ ์ปค์ง€๊ฒŒ ๋œ๋‹ค. 

 

To fix this, we would like to get rid of the denominator altogether.

์ด ๋ถ„๋ชจ๋ถ€๋ถ„์˜ ์—ญํ• ์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

  • ํ™•๋ฅ ๋ก ์ ์ธ ๊ด€์ ์—์„œ, ๋ถ„๋ชจ ๋ถ€๋ถ„์€ ๋ชจ๋“  score๊ฐ’์˜ ํ•ฉ์ด 1์ด ๋˜๋„๋ก normalizeํ•˜๋Š” ์—ญํ• ์„ ํ•œ๋‹ค. ์—ฌ๊ธฐ์„œ exponential ์€ ๊ฐœ๋ณ„ score๊ฐ’์ด ํ•ญ์ƒ 0ํ˜น์€ ์–‘์ˆ˜๊ฐ€ ๋˜๋„๋ก ํ•œ๋‹ค.
  • learning์˜ ๊ด€์ ์—์„œ, ๋ถ„๋ชจ ๋ถ€๋ถ„์€ ๊ด€์ฐฐ๋˜์ง€ ์•Š์€ ๋‹จ์–ด๋“ค์— ๋Œ€ํ•œ ์Šค์ฝ”์–ด๋“ค์„ ๋‚ฎ์ถ”๋Š” ์—ญํ• ์„ ํ•˜๊ฒŒ ๋œ๋‹ค. ์ฆ‰, ๋ถ„์ž๋Š” ๋ชจ๋ธ์ด o์™€ c์˜ ์œ ์‚ฌ์„ฑ์„ ๋†’์ด๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต๋˜๊ณ , ๋ถ„๋ชจ๋Š” ๋‚˜๋จธ์ง€ ๋ชจ๋“  ๋‹จ์–ด๋“ค์˜ ์œ ์‚ฌ์„ฑ์„ ๋‚ฎ์ถ”๊ฒŒ ์••๋ฐ•ํ•˜๋Š” ์—ญํ• ์„ ํ•œ๋‹ค.

๋„ค๊ฑฐํ‹ฐ๋ธŒ ์ƒ˜ํ”Œ๋ง์€ ๋ถ„๋ชจ๋ฅผ ํ†ตํ•ด ํ•ญ์ƒ ๋‹ค๋ฅธ ๋ชจ๋“  ๋‹จ์–ด์— ๋Œ€ํ•œ score๋ฅผ ๋‚ฎ์ถœ ํ•„์š”๊ฐ€ ์—†๋‹ค๋Š” ์ ์— ์ฐฉ์•ˆํ•œ๋‹ค. 

๋‹ค๋งŒ, ์‹ค์ œ์˜ SGNS์˜ objective function์€ ์•ฝ๊ฐ„ ๋‹ค๋ฅธ์ ์ด ์กด์žฌํ•œ๋‹ค.

 

ํ•ต์‹ฌ ์•„์ด๋””์–ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

Train binary logistic regression for a true pair & several noise pairs.
That is, train a binary logistic regression on the actual word pair and a few noise pairs.

SGNS Procedure

1. Extract the center word c together with its actual surrounding words and use them as true pairs.

2. Extract the center word c together with randomly chosen words and use them as negative noise pairs.

3. Train binary logistic regression so that true pairs are pulled together (label 1) and negative pairs are pushed apart (label 0).

 

๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ Objection Function์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

์ด๋ ‡๊ฒŒ ๋˜๋ฉด, ๋ชจ๋“  ๋‹จ์–ด์— ๋Œ€ํ•ด์„œ ์Šค์ฝ”์–ด๋ฅผ ๊ณ„์‚ฐํ•˜์ง€ ์•Š๊ณ , ๋ช‡๊ฐœ์˜ ๋…ธ์ด์ฆˆ ์ƒ˜ํ”Œ๊ณผ true pair์— ๋Œ€ํ•ด์„œ๋งŒ ํ•™์Šตํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ณ„์‚ฐ์˜ ํšจ์œจ์„ฑ์ด ๋†’์•„์ง„๋‹ค๋Š” ์žฅ์ ์ด ์žˆ๋‹ค!