Web scraping in Python
1. Write a program (using the NLTK toolkit in a Python environment) to tokenize:
a) Sentence
b) Multiple sentences
c) A paragraph
d) The text of a complete web page
Python code
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the 'punkt' tokenizer models (downloaded with nltk.download('punkt') below)
mystring = "I am currently studying in XYZ UNIVERSITY at DELHI. Cuurently in 4th year studying placements and tyring for the masters too as i was too early to start the programming. Currently tyring to ops as datascientist at any product based copany as a data scientist i need to learn all the python programmming at the professionally and we have to be good at statistics and the machine learnng models which model will suit for which problem"

# split the text into word and punctuation tokens
tokens = word_tokenize(mystring)
tokens
O/p:
['I', 'am', 'currently',
'studying', 'in', 'XYZ', 'UNIVERSITY', 'at', 'DELHI', '.', 'Cuurently', 'in',
'4th', 'year', 'studying', 'placements', 'and', 'tyring', 'for', 'the',
'masters', 'too', 'as', 'i', 'was', 'too', 'early', 'to', 'start', 'the',
'programming', '.', 'Currently', 'tyring', 'to', 'ops', 'as', 'datascientist',
'at', 'any', 'product', 'based', 'copany', 'as', 'a', 'data', 'scientist', 'i',
'need', 'to', 'learn', 'all', 'the', 'python', 'programmming', 'at', 'the',
'professionally', 'and', 'we', 'have', 'to', 'be', 'good', 'at', 'statistics',
'and', 'the', 'machine', 'learnng', 'models', 'which', 'model', 'will', 'suit',
'for', 'which', 'problem']
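Part (a) asks for a single sentence; the same word_tokenize call covers that case (the sentence below is only an illustrative example, not part of the original program):
Python code
from nltk.tokenize import word_tokenize

sentence = "Natural language processing with NLTK is fun."
print(word_tokenize(sentence))
# ['Natural', 'language', 'processing', 'with', 'NLTK', 'is', 'fun', '.']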
Python code
# download the 'punkt' tokenizer models used by word_tokenize
nltk.download('punkt')
O/p:
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Viru\AppData\AMK\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
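If the download should run only when the models are missing, a small guard can be used (a sketch; nltk.data.find raises LookupError when a resource is not installed):
Python code
import nltk

# download the 'punkt' models only if they are not already present
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')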
Tokenizing multiple sentences
Python code
# read sentences from the user until an empty line is entered
print("Enter multiple sentences: ")
lines = []
tok = []
while True:
    line = input()
    if line:
        lines.append(line)
    else:
        break

# tokenize each entered sentence separately
for t in lines:
    tok.append(word_tokenize(t))

print("Tokens for multiple sentences are as follows: ", tok)
O/p:
Enter multiple sentences:
Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago
Their long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diverse, second only to Africa in human genetic diversity.
In the early medieval era, Christianity, Islam, Judaism, and Zoroastrianism became established on India's southern and western coasts
Tokens for multiple sentences are as follows:  [['Modern', 'humans', 'arrived', 'on', 'the', 'Indian', …
Tokenizing a paragraph
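A straightforward way to handle a paragraph is to split it into sentences with sent_tokenize and then tokenize each sentence into words; the sketch below uses a short placeholder paragraph for illustration:
Python code
from nltk.tokenize import sent_tokenize, word_tokenize

# placeholder paragraph; any multi-sentence text will do
paragraph = ("India is a country in South Asia. It is the seventh-largest country "
             "by area. It is also the most populous democracy in the world.")

# split the paragraph into sentences, then tokenize each sentence into words
sentences = sent_tokenize(paragraph)
para_tokens = [word_tokenize(s) for s in sentences]
print("Sentences: ", sentences)
print("Tokens for the paragraph: ", para_tokens)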
Tokenizing the text of a complete web page
Python code
## Tokenizing the text of a complete web page
# bs4 (BeautifulSoup) is used to parse the downloaded HTML
from bs4 import BeautifulSoup
import urllib.request

# request the Wikipedia article on India
response = urllib.request.urlopen('https://en.wikipedia.org/wiki/India')
html = response.read()

# parse the HTML and pull out its text content
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens_web = word_tokenize(text)
print("Tokens for this web page are: ", tokens_web)

# build a dictionary of word frequencies from the tokens
word_freq = {}
for tok in tokens_web:
    for t in tok.split():
        if t in word_freq:
            word_freq[t] += 1
        else:
            word_freq[t] = 1
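The output shown below comes from the print call in the listing above; it still contains JavaScript fragments (tokens such as 'RLCONF=' and 'wgBreakFrames') because soup.get_text() also returns the contents of <script> and <style> tags. A possible refinement, sketched here on the assumption that only the visible article text is wanted, removes those tags first and builds the frequency table with collections.Counter instead of a manual loop:
Python code
from collections import Counter

# drop <script> and <style> elements so their code does not end up in the tokens
for tag in soup(["script", "style"]):
    tag.decompose()

clean_text = soup.get_text(strip=True)
clean_tokens = word_tokenize(clean_text)

# Counter produces the same kind of word-frequency mapping as the manual dictionary above
word_freq_clean = Counter(clean_tokens)
print("10 most common tokens: ", word_freq_clean.most_common(10))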
O/p:
Tokens for this web page are:  ['India', '-', 'Wikipediadocument.documentElement.className=', "''", 'client-js', "''", ';', 'RLCONF=', '{', '``', 'wgBreakFrames', "''", ':', 'false', ',', "''", 'wgSeparatorTransformTable', "''", ':', '[', '``', "''", ',', "''", "''", ']', ',', "''", 'wgDigitTransformTable', …