Web scraping in Python
1. Write a program (using the NLTK toolkit in a Python environment) to tokenize:
a) Sentence
b) Multiple sentences
c) A paragraph
d) The text of a complete web page
Python code
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the 'punkt' tokenizer models (downloaded with nltk.download('punkt') below)
mystring = "I am currently studying in XYZ UNIVERSITY at DELHI. Cuurently in 4th year studying placements and tyring for the masters too as i was too early to start the programming. Currently tyring to ops as datascientist at any product based copany as a data scientist i need to learn all the python programmming at the professionally and we have to be good at statistics and the machine learnng models which model will suit for which problem"

# split the text into word and punctuation tokens
tokens = word_tokenize(mystring)
tokens
O/p:
['I', 'am', 'currently',
'studying', 'in', 'XYZ', 'UNIVERSITY', 'at', 'DELHI', '.', 'Cuurently', 'in',
'4th', 'year', 'studying', 'placements', 'and', 'tyring', 'for', 'the',
'masters', 'too', 'as', 'i', 'was', 'too', 'early', 'to', 'start', 'the',
'programming', '.', 'Currently', 'tyring', 'to', 'ops', 'as', 'datascientist',
'at', 'any', 'product', 'based', 'copany', 'as', 'a', 'data', 'scientist', 'i',
'need', 'to', 'learn', 'all', 'the', 'python', 'programmming', 'at', 'the',
'professionally', 'and', 'we', 'have', 'to', 'be', 'good', 'at', 'statistics',
'and', 'the', 'machine', 'learnng', 'models', 'which', 'model', 'will', 'suit',
'for', 'which', 'problem']
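Part (a) asks for a single sentence; the same word_tokenize call covers that case (the sentence below is only an illustrative example, not part of the original program):
Python code
from nltk.tokenize import word_tokenize

sentence = "Natural language processing with NLTK is fun."
print(word_tokenize(sentence))
# ['Natural', 'language', 'processing', 'with', 'NLTK', 'is', 'fun', '.']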
Python code
# download the 'punkt' tokenizer models used by word_tokenize
nltk.download('punkt')
O/p:
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Viru\AppData\AMK\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
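If the download should run only when the models are missing, a small guard can be used (a sketch; nltk.data.find raises LookupError when a resource is not installed):
Python code
import nltk

# download the 'punkt' models only if they are not already present
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')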
Tokenizing multiple sentences
Python code
# read sentences from the user until an empty line is entered
print("Enter multiple sentences: ")
lines = []
tok = []
while True:
    line = input()
    if line:
        lines.append(line)
    else:
        break

# tokenize each entered sentence separately
for t in lines:
    tok.append(word_tokenize(t))

print("Tokens for multiple sentences are as follows: ", tok)
O/p:
Enter multiple sentences:
Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago
Their long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diverse, second only to Africa in human genetic diversity.
In the early medieval era, Christianity, Islam, Judaism, and Zoroastrianism became established on India's southern and western coasts
Tokens for multiple sentences are as follows:  [['Modern', 'humans', 'arrived', 'on', 'the', 'Indian', …
Tokenizing a paragraph
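A straightforward way to handle a paragraph is to split it into sentences with sent_tokenize and then tokenize each sentence into words; the sketch below uses a short placeholder paragraph for illustration:
Python code
from nltk.tokenize import sent_tokenize, word_tokenize

# placeholder paragraph; any multi-sentence text will do
paragraph = ("India is a country in South Asia. It is the seventh-largest country "
             "by area. It is also the most populous democracy in the world.")

# split the paragraph into sentences, then tokenize each sentence into words
sentences = sent_tokenize(paragraph)
para_tokens = [word_tokenize(s) for s in sentences]
print("Sentences: ", sentences)
print("Tokens for the paragraph: ", para_tokens)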
Tokenizing the text of a complete web page
Python code
## Tokenizing the text of a complete web page
# bs4 (BeautifulSoup) is used to parse the downloaded HTML
from bs4 import BeautifulSoup
import urllib.request

# request the Wikipedia article on India
response = urllib.request.urlopen('https://en.wikipedia.org/wiki/India')
html = response.read()

# parse the HTML and pull out its text content
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens_web = word_tokenize(text)
print("Tokens for this web page are: ", tokens_web)

# build a dictionary of word frequencies from the tokens
word_freq = {}
for tok in tokens_web:
    for t in tok.split():
        if t in word_freq:
            word_freq[t] += 1
        else:
            word_freq[t] = 1
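The output shown below comes from the print call in the listing above; it still contains JavaScript fragments (tokens such as 'RLCONF=' and 'wgBreakFrames') because soup.get_text() also returns the contents of <script> and <style> tags. A possible refinement, sketched here on the assumption that only the visible article text is wanted, removes those tags first and builds the frequency table with collections.Counter instead of a manual loop:
Python code
from collections import Counter

# drop <script> and <style> elements so their code does not end up in the tokens
for tag in soup(["script", "style"]):
    tag.decompose()

clean_text = soup.get_text(strip=True)
clean_tokens = word_tokenize(clean_text)

# Counter produces the same kind of word-frequency mapping as the manual dictionary above
word_freq_clean = Counter(clean_tokens)
print("10 most common tokens: ", word_freq_clean.most_common(10))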
O/p:
Tokens for this web page are:  ['India', '-', 'Wikipediadocument.documentElement.className=', "''", 'client-js', "''", ';', 'RLCONF=', '{', '``', 'wgBreakFrames', "''", ':', 'false', ',', "''", 'wgSeparatorTransformTable', "''", ':', '[', '``', "''", ',', "''", "''", ']', ',', "''", 'wgDigitTransformTable', …