sparkle tech thoughts: Extracting Web information using WSO2 Data Services

Thursday, September 22, 2011

This post explains how to scrape web data using the web harvesting feature of WSO2 Data Services. In this tutorial I am going to extract the top rated books, along with their authors, from the Top Rated Books - Book Movement web page and expose that data as a data service.

Before I begin, let's look at how scraping works.

When you scrape a web page, you need to identify the HTML/XML pattern that the data follows. If we look at the Top Rated Books - Book Movement page, we can see a list of books along with their details. If we then look at the page source, we can see that several HTML tags are repeated in the same manner.

If you look closely, you can see a wrapper element:

<div class="rgLayoutCenter">

and inside that wrapper element you have something like this:

<div class="rgLayoutTitle">First They Killed My Father: A Daughter of Cambodia Remembers (P.S.)</div>  
<div class="rgLayoutAuthor">by Loung Ung</div>

Our basic requirement is to extract all the books along with their authors. To do that, we need to find the pattern for the book title as well as the author.

<div class="rgLayoutTitle">First They Killed My Father: A Daughter of Cambodia Remembers (P.S.)</div>  
 <div class="rgLayoutAuthor">by Loung Ung</div>
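One way to confirm that these class names really repeat is to count them. The following is a minimal Python sketch, not part of the WSO2 setup: the HTML sample is a hypothetical, shortened stand-in for the real page source, and ClassCounter is a helper name introduced only for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts how often each class attribute value appears on <div> tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        for name, value in attrs:
            if name == "class" and value:
                self.counts[value] += 1


# Hypothetical, shortened sample standing in for the real page source.
SAMPLE_HTML = """
<div class="rgLayoutCenter">
  <div class="rgLayoutTitle">Book One</div>
  <div class="rgLayoutAuthor">by Author One</div>
</div>
<div class="rgLayoutCenter">
  <div class="rgLayoutTitle">Book Two</div>
  <div class="rgLayoutAuthor">by Author Two</div>
</div>
"""

counter = ClassCounter()
counter.feed(SAMPLE_HTML)

# Class names that occur more than once are candidates for the repeating pattern.
repeated = {cls: n for cls, n in counter.counts.items() if n > 1}
print(repeated)
```

Class names that show up once per book are good candidates for the wrapper and field selectors.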

We can easily write an XSLT template to extract this information:


<?xml version="1.0" encoding="ISO-8859-1"?> 
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> 
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/> 
<xsl:template match="/"> 
  <BookInfo> 
   <xsl:for-each select="//div[@class='rgLayoutCenter']"> 
     <Book> 
       <Title><xsl:value-of select="div[@class='rgLayoutTitle']"/></Title> 
       <Author><xsl:value-of select="div[@class='rgLayoutAuthor']"/></Author> 
     </Book> 
   </xsl:for-each> 
  </BookInfo> 
</xsl:template> 
</xsl:stylesheet>

The XSLT template above visits each

<div class="rgLayoutCenter">

element, extracts the values of

<div class="rgLayoutTitle">

and

<div class="rgLayoutAuthor">

and assigns them to the XML elements Title and Author inside the wrapper Book element.
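To make the traversal concrete, here is a rough Python equivalent of what the template does, written with the standard library only. It is a sketch under assumptions: the input is a hypothetical, already well-formed XML sample (the real page would need an HTML-to-XML conversion first), and extract_books is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; the real page would first need
# converting from HTML to XML before it could be parsed this way.
SAMPLE = """
<html><body>
<div class="rgLayoutCenter">
  <div class="rgLayoutTitle">First They Killed My Father: A Daughter of Cambodia Remembers (P.S.)</div>
  <div class="rgLayoutAuthor">by Loung Ung</div>
</div>
</body></html>
"""


def extract_books(xml_text):
    """Mirror the template: visit each wrapper div, read title and author."""
    root = ET.fromstring(xml_text)
    book_info = ET.Element("BookInfo")
    for wrapper in root.iter("div"):
        if wrapper.get("class") != "rgLayoutCenter":
            continue
        book = ET.SubElement(book_info, "Book")
        # A relative path matches direct children only, like the relative
        # select expressions in the template.
        title = wrapper.findtext("div[@class='rgLayoutTitle']", default="")
        author = wrapper.findtext("div[@class='rgLayoutAuthor']", default="")
        ET.SubElement(book, "Title").text = title
        ET.SubElement(book, "Author").text = author
    return book_info


result = extract_books(SAMPLE)
print(ET.tostring(result, encoding="unicode"))
```

The loop plays the role of xsl:for-each, and the two findtext calls play the role of the two xsl:value-of expressions.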

Creating the web harvest data service

Now that we have covered the basic concepts of web scraping (web harvesting), we can create the data service using WSO2 Data Services Server.

To install WSO2 Data Services Server, download and unzip the zip file, go to $DS_HOME/bin, and start the server from the command prompt by running wso2server.bat (Windows) or wso2server.sh (Linux).

When the server startup is complete, open https://localhost:9443/carbon in your browser. Sign in to the server using the default credentials (username=admin, password=admin) in the right-hand corner. You will be redirected to the management console page.

To create the data service, go to Web Services -> Add -> Create in the left-hand menu.

After clicking it, fill in the data service information. Let's name our data service WebHarvestDS.

Click Next to go to the data source creation page, then click Add New Datasource to add our web data source.

For the web data source you need a configuration file along with the URL to scrape, the scraper variable, and the HTTP method used to extract the data.

We also need to provide the location of the XSLT template file we wrote earlier (template.xsl in the configuration below).

<?xml version="1.0" encoding="UTF-8"?>
<config>
 <var-def name='bookInfo'>
  <xslt>
   <xml>
    <html-to-xml> 
     <http method='get' url='http://www.bookmovement.com/app/readingguide/memberRecommendations.php'/>
    </html-to-xml>
   </xml>
   <stylesheet>
    <file path="/media/ntfs/web/template.xsl"/>
   </stylesheet>
  </xslt>
 </var-def>
</config>

Place the above configuration in the inline configuration section (or save it on your local machine and provide it as the web harvest config file path).

Let's set the scraper variable to bookInfo, the HTTP method to get, and the stylesheet path to the location of our template.
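Because the scraper variable must match the var-def name in the configuration, it can help to double-check the values before pasting the configuration into the console. The sketch below is only an illustration in Python; summarize is a hypothetical helper, and the XML declaration is left out of the embedded copy of the configuration to keep the string simple.

```python
import xml.etree.ElementTree as ET

# Copy of the configuration above, without the XML declaration.
CONFIG = """
<config>
 <var-def name='bookInfo'>
  <xslt>
   <xml>
    <html-to-xml>
     <http method='get' url='http://www.bookmovement.com/app/readingguide/memberRecommendations.php'/>
    </html-to-xml>
   </xml>
   <stylesheet>
    <file path="/media/ntfs/web/template.xsl"/>
   </stylesheet>
  </xslt>
 </var-def>
</config>
"""


def summarize(config_text):
    """Pull out the values that also have to be entered in the console."""
    root = ET.fromstring(config_text)
    var_def = root.find("var-def")
    http = var_def.find(".//http")
    stylesheet = var_def.find(".//stylesheet/file")
    return {
        "scraper_variable": var_def.get("name"),
        "http_method": http.get("method"),
        "url": http.get("url"),
        "stylesheet_path": stylesheet.get("path"),
    }


summary = summarize(CONFIG)
for key, value in summary.items():
    print(f"{key}: {value}")
```

If the printed scraper variable differs from the one entered in the data source or query form, the query would most likely have nothing to read.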

Creating the Query

Click Next to add a query. Set the query ID to webQuery and the scraper variable to bookInfo.

To populate the data properly, we need to give the group-by element and the row name. These tell the web service how the result should be structured. We also need to give output mappings to arrange the extracted data.

Query information.

QueryId – webQuery
Data Source – web
Scraper Variable – bookInfo
Grouped by element – Books
Row name – Book

Output mappings

Element – Title
Element – Author
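To picture what the group-by element, row name, and output mappings describe, the following Python sketch builds a result with that shape. It is an assumption-laden illustration, not the server's actual output: build_result is a made-up helper, and the real response may carry namespaces or other details that are not shown here.

```python
import xml.etree.ElementTree as ET


def build_result(rows, group_element="Books", row_name="Book"):
    """Wrap each row in a row element and all rows in the group element."""
    root = ET.Element(group_element)
    for row in rows:
        row_element = ET.SubElement(root, row_name)
        # One child element per output mapping.
        for field in ("Title", "Author"):
            ET.SubElement(row_element, field).text = row[field]
    return root


# Sample row taken from the book shown earlier in this post.
rows = [
    {
        "Title": "First They Killed My Father: A Daughter of Cambodia Remembers (P.S.)",
        "Author": "by Loung Ung",
    }
]

shaped = build_result(rows)
print(ET.tostring(shaped, encoding="unicode"))
```

Each extracted book becomes one Book row inside a single Books wrapper, with Title and Author as its children.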

Click Next to add an operation, select the query, and give the operation a name.

Click Finish to complete the data service. Go to the web services list and you will see that the service has been deployed successfully.

You can invoke the service using the try-it tool.

You can also invoke the data service with a REST call by typing http://localhost:9763/services/WebScraping/getBooks/ in your browser (adjust the service and operation names in the URL if yours differ).
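The same REST call can be made from a script. This is a hedged Python sketch using only the standard library: fetch_books needs the data service to be running, and the sample response, including its namespace, is an assumption used to exercise the parser offline, not captured server output.

```python
import urllib.request
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def parse_books(xml_text):
    """Return (title, author) pairs from a response shaped like the output mappings."""
    books = []
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "Book":
            continue
        fields = {local_name(child.tag): (child.text or "") for child in element}
        books.append((fields.get("Title", ""), fields.get("Author", "")))
    return books


def fetch_books(url="http://localhost:9763/services/WebScraping/getBooks/"):
    """Call the REST endpoint; requires the data service to be running."""
    with urllib.request.urlopen(url) as response:
        return parse_books(response.read().decode("utf-8"))


# Hypothetical response; the namespace is an assumption for illustration.
SAMPLE_RESPONSE = """
<Books xmlns="http://example.org/assumed-namespace">
  <Book>
    <Title>First They Killed My Father: A Daughter of Cambodia Remembers (P.S.)</Title>
    <Author>by Loung Ung</Author>
  </Book>
</Books>
"""

parsed = parse_books(SAMPLE_RESPONSE)
print(parsed)
```

With the server running, calling fetch_books() would request the URL above and parse whatever it returns in the same way.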

Comments:

Unknown, April 25, 2015 at 3:05 PM
Thank you very much. This is a very helpful article. It adds more detail than what is provided by WSO2. I used this as a template to scrape data from an HTML select with DSS.