[python] Python for Machine Learning (3)


개발자 두더지 2020. 6. 3. 23:35


1. Data Structure - Collections

- A Python built-in module of extended data structures that build on list, tuple, and dict

- Offers convenience and better runtime efficiency

- It provides the following:

from collections import deque
from collections import Counter
from collections import OrderedDict
from collections import defaultdict
from collections import namedtuple

1) deque

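- A class that supports both stack and queue operations
- Stores data more efficiently than a list for these operations
- Its memory layout makes appends and pops at both ends fast
- Supports linked-list style operations such as reverse and rotate
- Supports most list methods (append, pop, extend, insert, remove, ...), though not slicing or sort()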

from collections import deque

deque_list = deque()
for i in range(5):
    deque_list.append(i)
print(deque_list)

deque_list.appendleft(10)
print(deque_list)

deque_list.rotate(2)
print(deque_list)

deque_list.rotate(2)
print(deque_list)

>> deque([0, 1, 2, 3, 4])
   deque([10, 0, 1, 2, 3, 4])
   deque([3, 4, 10, 0, 1, 2])
   deque([1, 2, 3, 4, 10, 0])
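
As a small sketch of the stack and queue usage mentioned above (the variable names are illustrative and not from the lecture): a deque pops from the right like a stack and from the left like a queue.

```python
from collections import deque

# Stack (LIFO): push and pop at the right end.
stack = deque()
for i in range(3):
    stack.append(i)
last_in = stack.pop()          # removes 2, the most recently added item

# Queue (FIFO): push at the right end, pop from the left end.
queue = deque()
for i in range(3):
    queue.append(i)
first_in = queue.popleft()     # removes 0, the oldest item

# extendleft adds items one by one to the left, so their order is reversed.
queue.extendleft([10, 20])     # deque([20, 10, 1, 2])
queue.reverse()                # in-place reversal: deque([2, 1, 10, 20])

print(last_in, first_in, queue)
```

Note that `extendleft` reverses the order of its argument, which is easy to overlook.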

The timings below are the figures reported in the original post; they will differ by machine and Python version.

deque

from collections import deque
import time

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
deque_list = deque()

# Stack: push then pop at the right end, 10000 x 10000 times
for i in range(10000):
    for j in range(10000):
        deque_list.append(j)
        deque_list.pop()

print(time.perf_counter() - start_time, "seconds")

>> 16.265926500000003 seconds

general list

import time

start_time = time.perf_counter()
just_list = []

# The same stack workload with a plain list
for i in range(10000):
    for j in range(10000):
        just_list.append(j)
        just_list.pop()

print(time.perf_counter() - start_time, "seconds")

>> 40.185325999999996 seconds
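
The difference in cost between the two types is clearest when items are removed from the left: `list.pop(0)` has to shift every remaining element, while `deque.popleft()` does not. A hedged sketch using `timeit` follows. The helper names and the size `n` are assumptions chosen for illustration, and no timings are shown because they vary by machine.

```python
from collections import deque
import timeit

def drain_list_from_left(n):
    """Fill a list with n items, then remove them all from the left."""
    items = list(range(n))
    out = []
    while items:
        out.append(items.pop(0))     # O(n) per pop: remaining items shift left
    return out

def drain_deque_from_left(n):
    """Fill a deque with n items, then remove them all from the left."""
    items = deque(range(n))
    out = []
    while items:
        out.append(items.popleft())  # O(1) per pop
    return out

n = 20000  # hypothetical workload size
list_seconds = timeit.timeit(lambda: drain_list_from_left(n), number=1)
deque_seconds = timeit.timeit(lambda: drain_deque_from_left(n), number=1)
print("list :", list_seconds, "seconds")
print("deque:", deque_seconds, "seconds")
```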

2) OrderedDict

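- Unlike a dict, returns items in the order they were inserted. (This distinction applied to Python versions before 3.7, where a plain dict did not guarantee insertion order; from Python 3.7 on, the built-in dict preserves insertion order as well.)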

Dict

d = {}
d['x'] = 100
d['y'] = 200
d['z'] = 300
d['l'] = 500
for k, v in d.items():
    print(k, v)

OrderedDict

from collections import OrderedDict
d = OrderedDict()
d['x'] = 100
d['y'] = 200
d['z'] = 300
d['l'] = 500
for k, v in d.items():
    print(k, v)

- Useful when you want a dict ordered by key or by value (sorting)

from collections import OrderedDict

d = OrderedDict()

d['x'] = 100
d['y'] = 200
d['z'] = 300
d['l'] = 500

for k,v in OrderedDict(sorted(d.items(),key=lambda t: t[0])).items():
    print(k,v)

>> l 500
   x 100
   y 200
   z 300

for k,v in OrderedDict(sorted(d.items(),key=lambda t: t[1])).items():
    print(k,v)
>> x 100
   y 200
   z 300
   l 500
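
Beyond sorting, OrderedDict offers a few order-aware operations that a plain dict does not. A minimal sketch, reusing keys from the example above:

```python
from collections import OrderedDict

d = OrderedDict()
d['x'] = 100
d['y'] = 200
d['z'] = 300

d.move_to_end('x')                 # move an existing key to the end
print(list(d.keys()))              # ['y', 'z', 'x']

first_key, first_value = d.popitem(last=False)   # pop from the front
print(first_key, first_value)      # y 200

# Equality between two OrderedDicts is order-sensitive; plain dicts ignore order.
a = OrderedDict([('x', 1), ('y', 2)])
b = OrderedDict([('y', 2), ('x', 1)])
print(a == b)                      # False
print(dict(a) == dict(b))          # True
```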

3) defaultdict

- A dict that assigns a default value to missing keys; the default is used whenever a new key is first accessed

# The key "first" has not been set, so this raises a KeyError.
d = dict()
print(d["first"])
>> KeyError: 'first'

# With a defaultdict, a missing key gets the default value (0 here), so 0 is printed.
# This is convenient when you want to use keys without initializing them first.
from collections import defaultdict
d = defaultdict(object)     # create a default dictionary (replaced on the next line)
d = defaultdict(lambda: 0)  # set the default value to 0
print(d["first"])
>> 0
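
The default factory can be any callable that takes no arguments, not only a lambda. As an illustrative sketch (the word list is made up for this example), `list` gives each missing key an empty list, which is handy for grouping:

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# list() returns [], so every missing key starts with an empty list.
by_first_letter = defaultdict(list)
for word in words:
    by_first_letter[word[0]].append(word)
print(dict(by_first_letter))

# int() returns 0, so defaultdict(int) behaves like defaultdict(lambda: 0).
zero_default = defaultdict(int)
print(zero_default["missing"])   # 0
```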

Word counting is also simpler with defaultdict, since there is no need to check whether a key already exists.

from collections import defaultdict
from collections import OrderedDict

d = dict()
# print(d["first"])  # would raise KeyError: 'first' and stop the script

# Reference from
# https://dongyeopblog.wordpress.com/2016/04/08/python-defaultdict-%EC%82%AC%EC%9A%A9%ED%95%98%EA%B8%B0/

d = defaultdict(object)     # create a default dictionary (replaced on the next line)
d = defaultdict(lambda: 0)  # set the default value to 0
print(d["first"])


text = """A press release is the quickest and easiest way to get free publicity. If well written, a press release can result in multiple published articles about your firm and its products. And that can mean new prospects contacting you asking you to sell to them. Talk about low-hanging fruit!
What's more, press releases are cost effective. If the release results in an article that (for instance) appears to recommend your firm or your product, that article is more likely to drive prospects to contact you than a comparable paid advertisement.
However, most press releases never accomplish that. Most press releases are just spray and pray. Nobody reads them, least of all the reporters and editors for whom they're intended. Worst case, a badly-written press release simply makes your firm look clueless and stupid.
For example, a while back I received a press release containing the following sentence: "Release 6.0 doubles the level of functionality available, providing organizations of all sizes with a fast-to-deploy, highly robust, and easy-to-use solution to better acquire, retain, and serve customers."
Translation: "The new release does more stuff." Why the extra verbiage? As I explained in the post "Why Marketers Speak Biz Blab", the BS words are simply a way to try to make something unimportant seem important. And, let's face it, a 6.0 release of a product probably isn't all that important.
As a reporter, my immediate response to that press release was that it's not important because it expended an entire sentence saying absolutely nothing. And I assumed (probably rightly) that the company's marketing team was a bunch of idiots.""".lower().split()

print(text)

word_count = {}
for word in text:
    if word in word_count:
        word_count[word] += 1
    else:
        word_count[word] = 1  # the first occurrence counts as 1, not 0
print(word_count)

word_count = defaultdict(object)     # create a default dictionary (replaced on the next line)
word_count = defaultdict(lambda: 0)  # set the default value to 0
for word in text:
    word_count[word] += 1
for i, v in OrderedDict(sorted(
        word_count.items(), key=lambda t: t[1], reverse=True)).items():
    print(i, v)

4) Counter

- Counts the elements of a sequence type and returns the counts in dict form

from collections import Counter

c = Counter() # a new, empty counter
c = Counter('gallahad') # a new counter from an iterable
print(c)

>> Counter({'a': 3, 'l': 2, 'g': 1, 'h': 1, 'd': 1})

- Also accepts a dict, keyword arguments, and so on

from collections import Counter

c = Counter({'red':4, 'blue':2})
print(c)
print(list(c.elements()))

>> Counter({'red': 4, 'blue': 2})
   ['red', 'red', 'red', 'red', 'blue', 'blue']

from collections import Counter

c = Counter(cats=4, dogs=8)
print(c)
print(list(c.elements()))

>> Counter({'dogs': 8, 'cats': 4})
   ['cats', 'cats', 'cats', 'cats', 'dogs', 'dogs', 'dogs', 'dogs', 'dogs', 'dogs', 'dogs', 'dogs']
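
Counter also covers the word-counting task from the defaultdict section directly. A short sketch, assuming a made-up sample sentence rather than the press-release text:

```python
from collections import Counter

sample = "the quick brown fox jumps over the lazy dog the end".split()
word_count = Counter(sample)
print(word_count.most_common(1))   # [('the', 3)]
print(word_count["cat"])           # 0: missing keys count as zero, no KeyError

# Counters support arithmetic on their counts.
a = Counter(cats=4, dogs=8)
b = Counter(cats=1, dogs=10)
total = a + b        # counts are added per key
difference = a - b   # only positive results are kept
print(total)
print(difference)
```

Subtraction drops any key whose result is zero or negative, so `difference` contains only `cats`.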

5) namedtuple

- A way to store a data structure (record) in tuple form

- The names of the stored fields are declared in advance

from collections import namedtuple

Point = namedtuple('Point',['x','y'])
p = Point(11, y=22)
print(p[0] + p[1])

x, y = p
print(x, y)
print(p.x + p.y)
print(Point(x=11, y=12))

>> 33
   11 22
   33
   Point(x=11, y=12)
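
Because a namedtuple is still a tuple, its fields cannot be reassigned. A brief sketch of the helper methods that work around this, reusing the `Point` type from above:

```python
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p = Point(11, 22)

# Fields are read-only, as with a regular tuple.
try:
    p.x = 100
    assigned = True
except AttributeError:
    assigned = False
print("assignment allowed:", assigned)   # assignment allowed: False

p2 = p._replace(x=100)     # returns a new Point with x replaced
print(p2)                  # Point(x=100, y=22)
print(dict(p._asdict()))   # {'x': 11, 'y': 22}
print(Point._fields)       # ('x', 'y')
```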

2. Pythonic Code - Linear algebra codes

Let's look at how to express some simple linear algebra techniques in Python code.

1) Vector representation of python

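- There are many ways to represent a vector in Python

- No single way is best

- Choose depending on whether the values need to change and whether named attributes are needed

- In this course, vector operations are done with lists by default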

vector_a = [1, 2, 10]  # represented as a list
vector_b = (1, 2, 10)  # represented as a tuple
vector_c = {'x':1, 'y':1, 'z':10}  # represented as a dict

print(vector_a, vector_b, vector_c)

>> [1, 2, 10] (1, 2, 10) {'x': 1, 'y': 1, 'z': 10}

(1) Vector computation : Vector addition

u = [2,2]
v = [2,3]
z = [3,5]

result = [sum(t) for t in zip(u,v,z)]
print(result)

>> [7, 10]

(2) Vector computation : Scalar - Vector product

u = [1,2,3]
v = [4,4,4]
alpha = 2

result = [alpha*sum(t) for t in zip(u,v)] #2([1,2,3]+[4,4,4]) = 2[5, 6, 7]
print(result)

>> [10, 12, 14]

NumPy lets you write this more concisely, but it is enough to know that it can also be done in plain Python.
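
The two comprehensions above can be wrapped into small reusable functions. This is a sketch only; the function names are hypothetical and the length check is an addition, included because `zip` silently truncates to the shortest input.

```python
def vector_add(*vectors):
    """Element-wise sum of any number of equal-length vectors."""
    if len({len(v) for v in vectors}) > 1:
        raise ValueError("all vectors must have the same length")
    return [sum(t) for t in zip(*vectors)]

def scalar_vector_product(alpha, vector):
    """Multiply every element of the vector by the scalar alpha."""
    return [alpha * element for element in vector]

u = [1, 2, 3]
v = [4, 4, 4]
alpha = 2

# alpha * (u + v) should equal alpha * u + alpha * v.
left = scalar_vector_product(alpha, vector_add(u, v))
right = vector_add(scalar_vector_product(alpha, u),
                   scalar_vector_product(alpha, v))
print(left, right)   # [10, 12, 14] [10, 12, 14]
```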

2) Matrix representation of Python

- As with vectors, there are many ways to represent a matrix in Python

matrix_a = [[3,6], [4,5]]  # represented as a list of lists
matrix_b = [(3,6), (4,5)]  # represented as a list of tuples
matrix_c = {(0,0):3, (0,1):6, (1,0):4, (1,1):5 }  # represented as a dict keyed by (row, column)

- The dict form in particular allows countless variations

- In this course, matrices are represented as two-dimensional lists by default

- [[1st row], [2nd row], [3rd row]]
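
To show how the dict form relates to the two-dimensional list form, here is a minimal conversion sketch. The function names are hypothetical, and treating missing entries as 0 is an assumption that suits sparse matrices.

```python
def dict_to_rows(matrix_dict, n_rows, n_cols):
    """Convert a {(row, col): value} dict to a list of rows; missing entries become 0."""
    return [[matrix_dict.get((i, j), 0) for j in range(n_cols)]
            for i in range(n_rows)]

def rows_to_dict(rows):
    """Convert a list of rows to a {(row, col): value} dict."""
    return {(i, j): value
            for i, row in enumerate(rows)
            for j, value in enumerate(row)}

matrix_c = {(0, 0): 3, (0, 1): 6, (1, 0): 4, (1, 1): 5}
rows = dict_to_rows(matrix_c, 2, 2)
print(rows)                             # [[3, 6], [4, 5]]
print(rows_to_dict(rows) == matrix_c)   # True
```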

(1) Matrix computation : Matrix addition

matrix_a = [[3,6], [4,5]]
matrix_b = [[5,8], [6,7]]
result = [[sum(row) for row in zip(*t)]
          for t in zip(matrix_a, matrix_b)]
# The outer zip pairs the rows with the same index into a tuple t,
# and *t unpacks that tuple so the inner zip pairs their elements position by position.

print(result)

>> [[8, 14], [10, 12]]
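
The same addition can be written as a function with an explicit shape check. This is a sketch under the assumption that both matrices are lists of rows; the name `matrix_add` is hypothetical.

```python
def matrix_add(a, b):
    """Element-wise sum of two matrices given as lists of rows."""
    same_shape = len(a) == len(b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(a, b))
    if not same_shape:
        raise ValueError("matrices must have the same shape")
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

matrix_a = [[3, 6], [4, 5]]
matrix_b = [[5, 8], [6, 7]]
print(matrix_add(matrix_a, matrix_b))   # [[8, 14], [10, 12]]
```

Without the check, `zip` would silently drop the extra rows or columns of the larger matrix.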

(2) Matrix computation : Scalar-Matrix Product

matrix_a = [[3,6], [4,5]]
alpha = 4
result = [[alpha*element for element in t] for t in matrix_a]
print(result)

>> [[12, 24], [16, 20]]

(3) Matrix computation : Matrix Transpose

matrix_a = [[1,2,3], [4,5,6]]
result = [[element for element in t] for t in zip(*matrix_a)]
# The asterisk (*) unpacks matrix_a into its two rows, and zip groups their elements column by column
print(result)

>> [[1, 4], [2, 5], [3, 6]]
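
One detail worth seeing is that `zip(*matrix_a)` yields tuples, which is why the comprehension above rebuilds each one as a list. A small sketch that also checks that transposing twice returns the original matrix:

```python
matrix_a = [[1, 2, 3], [4, 5, 6]]

columns = list(zip(*matrix_a))
print(columns)                 # [(1, 4), (2, 5), (3, 6)]

transposed = [list(column) for column in columns]
back = [list(column) for column in zip(*transposed)]
print(back == matrix_a)        # True
```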

(4) Matrix computation : Matrix Product

matrix_a = [[1, 1, 2], [2, 1, 1]]
matrix_b = [[1, 1], [2, 1], [1, 3]]
# From matrix_a we take rows (so the first row_a is [1, 1, 2]).
# matrix_b is unpacked and then zipped, so we take its columns (so the first column_b is (1, 2, 1)).
result = [[sum(a * b for a, b in zip(row_a, column_b))
          for column_b in zip(*matrix_b)] for row_a in matrix_a]
print(result)

>> [[5, 8], [5, 6]]

The code above may look difficult, but the key idea is simple: to extract the columns of matrix_b, unpack it with the asterisk (*) and zip its rows back together.
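
The same idea can be sketched as a function. The name `matrix_product` and the dimension check are additions for illustration, not part of the lecture code.

```python
def matrix_product(a, b):
    """Multiply two matrices given as lists of rows."""
    # Each row of a must have as many elements as b has rows.
    if any(len(row) != len(b) for row in a):
        raise ValueError("columns of a must match rows of b")
    columns_b = list(zip(*b))   # unpack b's rows and zip them into columns
    return [[sum(x * y for x, y in zip(row_a, column_b))
             for column_b in columns_b]
            for row_a in a]

matrix_a = [[1, 1, 2], [2, 1, 1]]
matrix_b = [[1, 1], [2, 1], [1, 3]]
identity = [[1, 0], [0, 1]]

print(matrix_product(matrix_a, matrix_b))               # [[5, 8], [5, 6]]
print(matrix_product(identity, matrix_a) == matrix_a)   # True
```

Multiplying by the identity matrix is a quick sanity check that rows and columns are paired correctly.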

References

https://www.edwith.org/aipython/lecture/23091/

https://www.edwith.org/aipython/lecture/22957/


Attribution, NonCommercial, ShareAlike
