Web scraping in Python
1. Write a program (using the NLTK toolkit in a Python environment) to tokenize:
a) Sentence
b) Multiple sentences
c) A paragraph
d) Information of a complete web page
Python code
import nltk
from nltk.tokenize import word_tokenize
from nltk.text import Text
mystring = "I am currently studying in XYZ UNIVERSITY at DELHI. Cuurently in 4th year studying placements and tyring for the masters too as i was too early to start the programming. Currently tyring to ops as datascientist at any product based copany as a data scientist i need to learn all the python programmming at the professionally and we have to be good at statistics and the machine learnng models which model will suit for which problem"
tokens = word_tokenize(mystring)
tokens
O/p:
['I', 'am', 'currently',
'studying', 'in', 'XYZ', 'UNIVERSITY', 'at', 'DELHI', '.', 'Cuurently', 'in',
'4th', 'year', 'studying', 'placements', 'and', 'tyring', 'for', 'the',
'masters', 'too', 'as', 'i', 'was', 'too', 'early', 'to', 'start', 'the',
'programming', '.', 'Currently', 'tyring', 'to', 'ops', 'as', 'datascientist',
'at', 'any', 'product', 'based', 'copany', 'as', 'a', 'data', 'scientist', 'i',
'need', 'to', 'learn', 'all', 'the', 'python', 'programmming', 'at', 'the',
'professionally', 'and', 'we', 'have', 'to', 'be', 'good', 'at', 'statistics',
'and', 'the', 'machine', 'learnng', 'models', 'which', 'model', 'will', 'suit',
'for', 'which', 'problem']
Python code
nltk.download('punkt')
O/p:
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Viru\AppData\AMK\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
Tokenizing multiple sentences
Python code
print("Enter multiple sentences: ")
lines = []
tok = []
# keep reading lines until an empty line is entered
while True:
    line = input()
    if line:
        lines.append(line)
    else:
        break
# tokenize each entered sentence
for t in lines:
    t = word_tokenize(t)
    tok.append(t)
print("Tokens for multiple sentences are as follows: ",tok)
O/p:
Enter multiple sentences:
Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago
Their long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diverse, second only to Africa in human genetic diversity.
In the early medieval era, Christianity, Islam, Judaism, and Zoroastrianism became established on India's southern and western coasts
Tokens for multiple sentences are as follows: [['Modern', 'humans', 'arrived', 'on', 'the', 'Indian', …
Tokenizing a paragraph
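A paragraph can be tokenized by first splitting it into sentences with sent_tokenize and then tokenizing each sentence with word_tokenize; a minimal sketch follows (the sample paragraph below is only an illustration, not taken from the exercise).
Python code
from nltk.tokenize import sent_tokenize, word_tokenize
# sample paragraph (illustrative text only)
paragraph = ("Web mining extracts useful information from web data. "
             "It is usually divided into web content mining, web structure mining and web usage mining. "
             "NLTK can tokenize such text easily.")
sentences = sent_tokenize(paragraph)                  # split the paragraph into sentences
para_tokens = [word_tokenize(s) for s in sentences]   # tokenize each sentence into words
print("Sentences:", sentences)
print("Tokens of the paragraph:", para_tokens)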
Tokenizing the information of a complete web page
Python code
##Information of a complete web page
#bs4 (BeautifulSoup) is used to parse the downloaded page
from bs4 import BeautifulSoup
import urllib.request
#requesting the Wikipedia article on India
response = urllib.request.urlopen('https://en.wikipedia.org/wiki/India')
html = response.read()
#cleaning grabbed text
soup = BeautifulSoup(html,"html5lib")
text = soup.get_text(strip=True)
tokens_web = word_tokenize(text)
print("Tokens for this web page are: ",tokens_web[:])
#declare a dictionary of term frequencies
word_freq = {}
for tok in tokens_web:
    tok = tok.split()
    for t in tok:
        if t in word_freq:
            word_freq[t] += 1
        else:
            word_freq[t] = 1
O/p:
Tokens for this web page are: ['India', '-', 'Wikipediadocument.documentElement.className=', "''", 'client-js', "''", ';', 'RLCONF=', '{', '``', 'wgBreakFrames', "''", ':', 'false', ',', "''", 'wgSeparatorTransformTable', "''", ':', '[', '``', "''", ',', "''", "''", ']', ',', "''", 'wgDigitTransformTable', …
2. Write a program to do stop word removal and stemming on a paragraph. Prepare a table with two columns, “Word” and “Frequency”. Print the frequency of the words (terms) that start with A, B, S, D or E (you can use the nltk toolkit), and find the term(s) with the maximum frequency. (A sketch of the table and letter-filter steps appears after the code and output below.)
Python code
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import PorterStemmer
import pandas as pd
import re
import nltk
from nltk.corpus import inaugural
nltk.download('inaugural')
nltk.download('stopwords')
##Using Obama's 2009 inaugural speech
Obama = inaugural.words(fileids='2009-Obama.txt')
Obama
O/p:
['My', 'fellow', 'citizens', ':', 'I', 'stand', 'here', ...]
##class used to print headers in bold
class color:
    BOLD = '\033[1m'
    END = '\033[0m'
example_sent = Obama
##stopword removal
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in Obama if not w in stop_words]
##stemming with porter and snowball
processed = []
stemmer = SnowballStemmer("english")
processed = [stemmer.stem(i) for i in filtered_sentence]
processed2 = []
ps = PorterStemmer()
processed2 = [ps.stem(i) for i in filtered_sentence]
##Plotting the frequency distribution of the tokens
fd = nltk.FreqDist(Obama)
fd.plot(50,cumulative=False)
print("\nFiltered sentence after stopword removal is (first 5):",filtered_sentence[1:5])
print("\nWords after Snowball stemming (first 25):",processed[1:25])
print("\nWords after Porter stemming (first 25): ",processed2[1:25])
O/p:
Filtered sentence after stopword removal is (first 5): ['fellow', 'citizens', ':', 'I']
Words after Snowball stemming (first 25): ['fellow', 'citizen', ':', 'i', 'stand', 'today', 'humbl', 'task', 'us', ',', 'grate', 'trust', 'bestow', ',', 'mind', 'sacrific', 'born', 'ancestor', '.', 'i', 'thank', 'presid', 'bush', 'servic']
Words after Porter stemming (first 25): ['fellow', 'citizen', ':', 'I', 'stand', 'today', 'humbl', 'task', 'us', ',', 'grate', 'trust', 'bestow', ',', 'mind', 'sacrific', 'born', 'ancestor', '.', 'I', 'thank', 'presid', 'bush', 'servic']
Python code
filtered_sentence = [w for w in Obama if not w in stop_words]
filtered_sentence
O/p :
['My', 'fellow',
'citizens', ':', 'I', 'stand', 'today', 'humbled', 'task', 'us', ',',
'grateful', 'trust', 'bestowed', ',', 'mindful', 'sacrifices', 'borne',
'ancestors', '.', 'I', 'thank', 'President', 'Bush', 'service', 'nation', ',',
'well', 'generosity', 'cooperation', 'shown', 'throughout', 'transition', '.',
'Forty', '-', 'four', 'Americans', 'taken', 'presidential', 'oath', '.', 'The',
'words', 'spoken', 'rising', 'tides', 'prosperity', 'still', 'waters', 'peace',
'.', 'Yet', ',', 'every', 'often', 'oath', 'taken', 'amidst', 'gathering',
'clouds', 'raging', 'storms', '.', 'At', 'moments', ',', 'America', 'carried',
'simply', 'skill', 'vision', 'high', 'office', ',', 'We', 'People', 'remained',
'faithful', 'ideals', 'forbearers', ',', 'true', 'founding', 'documents', '.',
'So', '.', 'So', 'must', 'generation', 'Americans', '.', 'That', 'midst',
'crisis', 'well', 'understood', '.', 'Our', 'nation', 'war', ',', 'far', '-',
'reaching', 'network', 'violence', 'hatred', '.', 'Our', 'economy', 'badly',
'weakened', ',', 'consequence', 'greed', 'irresponsibility', 'part', ',',
'also', 'collective', 'failure', 'make', 'hard', 'choices', 'prepare',
'nation', 'new', 'age', '.', 'Homes', 'lost', ';', 'jobs', 'shed', ';',
'businesses', 'shuttered', '.', 'Our', 'health', 'care', 'costly', ';',
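The code above handles stop-word removal, stemming and the frequency plot; the remaining parts of the question (the two-column Word/Frequency table, the A/B/S/D/E filter and the maximum-frequency term) could be sketched as follows, assuming the filtered_sentence list from above. The column and variable names here are only illustrative.
Python code
import nltk
import pandas as pd
# two-column Word / Frequency table built from the filtered tokens (lower-cased, punctuation dropped)
freq_dist = nltk.FreqDist(w.lower() for w in filtered_sentence if w.isalpha())
freq_table = pd.DataFrame(list(freq_dist.items()), columns=['Word', 'Frequency'])
print(freq_table.head(10))
# frequency of the terms that start with A, B, S, D or E
selected = freq_table[freq_table['Word'].str[0].isin(['a', 'b', 's', 'd', 'e'])]
print(selected.sort_values('Frequency', ascending=False))
# term(s) with the maximum frequency
max_freq = freq_table['Frequency'].max()
print(freq_table[freq_table['Frequency'] == max_freq])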
Cosine similarity between two documents in Python
doc_trump = "Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"
doc_election = "President Trump says Putin had no political interference is the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election"
documents = [doc_trump, doc_election]
# Scikit Learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
count_vectorizer = CountVectorizer()  # stop_words='english' could also be passed, but all terms are kept here
sparse_matrix = count_vectorizer.fit_transform(documents)
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names(),  # use get_feature_names_out() on scikit-learn >= 1.0
                  index=['doc_trump', 'doc_election'])
df
# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(df, df))
[[1.         0.51480485]
 [0.51480485 1.        ]]
# Program to measure the similarity between
# two sentences using cosine similarity.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
A ="Web mining is an intersting subject, Web mining methods are divided into three\
categories"
B ="We get to know how to mine the web using crawlers. Web mining methods are divided\
into three categories: web content mining, web structure mining and web usage mining."
# tokenization
bagOfWordsA = word_tokenize(A)
bagOfWordsB = word_tokenize(B)
# sw contains the list of stopwords
sw = stopwords.words('english')
l1 =[]
l2 =[]
# remove stop words from both token lists
Asw = []
Bsw = []
for i in bagOfWordsA:
    if i not in sw:
        Asw.append(i)
for i in bagOfWordsB:
    if i not in sw:
        Bsw.append(i)
# combine the keyword lists of both strings (duplicates are kept)
unique = []
unique.extend(Asw)
unique.extend(Bsw)
# build 0/1 presence vectors over the combined list
for w in unique:
    if w in Asw:
        l1.append(1)
    else:
        l1.append(0)
    if w in Bsw:
        l2.append(1)
    else:
        l2.append(0)
c = 0
# cosine formula: dot product divided by the product of the vector norms
for i in range(len(unique)):
    c += l1[i] * l2[i]
# for 0/1 vectors, sum(l1) and sum(l2) equal the squared Euclidean norms
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity: ", cosine)
O/p:
similarity:
0.6154574548966636
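As an optional cross-check (assuming scikit-learn is installed), the same binary vectors l1 and l2 can be passed to sklearn's cosine_similarity, which should match the manually computed value:
Python code
from sklearn.metrics.pairwise import cosine_similarity
# l1 and l2 are the 0/1 vectors built in the loop above
print(cosine_similarity([l1], [l2]))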
PageRank algorithm in Python
import networkx as nx
from networkx.exception import NetworkXError

def pagerank(G, alpha=0.85, personalization=None,
             max_iter=100, tol=1.0e-6, nstart=None, weight='weight',
             dangling=None):
    if len(G) == 0:
        return {}
    if not G.is_directed():  # G.is_directed(): returns True if the graph is directed
        D = G.to_directed()  # new graph with the same nodes; each undirected edge becomes two directed edges
    else:
        D = G  # if G is already directed, simply use it
    # Create a copy in (right) stochastic form
    W = nx.stochastic_graph(D, weight=weight)  # weighted digraph in which the out-edge weights of every node sum to 1
    N = W.number_of_nodes()  # number of nodes in the graph
    # Choose fixed starting vector if not given
    if nstart is None:  # no starting PageRank value supplied for the nodes
        x = dict.fromkeys(W, 1.0 / N)  # dictionary with the nodes of W as keys and 1/N as the value
    else:
        # Normalize the nstart vector
        s = float(sum(nstart.values()))  # sum of all the nstart values
        x = dict((k, v / s) for k, v in nstart.items())  # normalized starting vector
    if personalization is None:
        # uniform personalization vector: 1/N for every node
        p = dict.fromkeys(W, 1.0 / N)
    else:
        missing = set(G) - set(personalization)  # nodes in G that are missing from the personalization dict
        if missing:  # if such nodes exist, raise an error
            raise NetworkXError('Personalization dictionary '
                                'must have a value for every node. '
                                'Missing nodes %s' % missing)
        s = float(sum(personalization.values()))  # sum of all the personalization values
        p = dict((k, v / s) for k, v in personalization.items())  # normalized personalization vector
    if dangling is None:  # dangling nodes are pages with no outgoing links
        # Use the personalization vector if a dangling vector is not specified
        dangling_weights = p
    else:
        missing = set(G) - set(dangling)  # nodes in G that are missing from the dangling dict
        if missing:  # if such nodes exist, raise an error
            raise NetworkXError('Dangling node dictionary '
                                'must have a value for every node. '
                                'Missing nodes %s' % missing)
        s = float(sum(dangling.values()))  # sum of all the dangling values
        dangling_weights = dict((k, v / s) for k, v in dangling.items())  # normalized dangling weights
    dangling_nodes = [n for n in W if W.out_degree(n, weight=weight) == 0.0]
    # power iteration: make up to max_iter iterations
    for _ in range(max_iter):
        xlast = x
        x = dict.fromkeys(xlast.keys(), 0)
        danglesum = alpha * sum(xlast[n] for n in dangling_nodes)
        for n in x:
            # this matrix multiply looks odd because it is
            # doing a left multiply x^T = xlast^T * W
            for nbr in W[n]:
                x[nbr] += alpha * xlast[n] * W[n][nbr][weight]
            x[n] += danglesum * dangling_weights[n] + (1.0 - alpha) * p[n]
        # check convergence, l1 norm
        err = sum([abs(x[n] - xlast[n]) for n in x])
        if err < N * tol:
            return x
    raise NetworkXError('pagerank: power iteration failed to converge '
                        'in %d iterations.' % max_iter)
import matplotlib.pyplot as plt
# n = 40: number of nodes to be created
# m = 15: number of edges to attach from a new node to existing nodes
G = nx.barabasi_albert_graph(40, 15)
"""A graph of n nodes is grown by attaching new nodes, each with m edges that
are preferentially attached to existing nodes with high degree."""
plt.title("Graph Created")  # gives a title to the plot
nx.draw(G, with_labels=True)  # draws the graph with node labels
plt.show()  # display the graph
pr = pagerank(G)  # pass the graph to the pagerank function
print("--------------------page ranks--------------------")
print(pr)
O/p :
{0: 0.0444098285676655, 1:
0.010422160013275913, 2: 0.019251701070745843, 3: 0.017050904349888686, 4:
0.012626984115913564, 5: 0.017010028561449373, 6: 0.019190973832817534, 7:
0.014750175091982343, 8: 0.012677889271151953, 9: 0.021387353853218462, 10:
0.02059139528284336, 11: 0.01585175889879829, 12: 0.016954221490224394, 13:
0.023792224749603037, 14: 0.02046751232091356, 15: 0.019264096594048, 16:
0.03841673735151791, 17: 0.04185270751492763, 18: 0.0428847560523377, 19:
0.0343594341477114, 20: 0.03379156067523404, 21: 0.036930367627221246, 22:
0.03248209208033434, 23: 0.0357102393455868, 24: 0.028730560212666945, 25:
0.031110203747274526, 26: 0.031020264456615355, 27: 0.027405129610082707, 28:
0.028853648934259884, 29: 0.023263184315942462, 30: 0.023261218845960614, 31:
0.02391025398131918, 32: 0.022814027503842232, 33: 0.02537705098346099, 34:
0.023920119424245944, 35: 0.024132046322532307, 36: 0.02170870260142139, 37:
0.021131707965897122, 38: 0.02060067064703447, 39: 0.020634107588032708}
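As a sanity check (not part of the original output), the custom function can be compared against networkx's built-in implementation on the same graph; the two sets of scores should be very close:
Python code
# compare with the library implementation, using the same damping factor
builtin_pr = nx.pagerank(G, alpha=0.85)
print("----------------built-in page ranks----------------")
print(builtin_pr)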