Breaking into Orkut with Mechanize
Posted by Corban Brook Mon, 26 Feb 2007 16:05:28 GMT
Alternate title: Using Mechanize to Scrape Orkut for Love Letters from Lonely Brazilian Beauties and Translating their Wantings to English
Shout outs to Natai for the request.
In this tutorial we will use Mechanize to log in to Orkut and pull our scrap messages. We will assume each message is written in Portuguese and translate it to English with Google Translate.
FUN!
Installation
Please see my previous Mechanize article, http://schf.uc.org/articles/2007/02/14/scraping-gmail-with-mechanize-and-hpricot, for installation instructions and usage examples.
I apologize for how quick and dirty this post is, but lack of sleep is a factor.
Explanation
Scraping with Mechanize can be a quick and easy way of getting information from websites you subscribe to. Sometimes you do it because the website does not offer an API for getting at the information, or maybe you just find it easier to scrape. Scraping is not without its pitfalls: Mechanize does not interpret JavaScript, so you sometimes have to hold its hand and walk it through the login process for certain sites.
First off, with Orkut the login form is located within an iframe, so we should go directly to that iframe's link instead of http://www.orkut.com. Fill out your login credentials and submit.
The next page authenticates us and then wants to use a simple javascript redirect to another page. The link is static so I have hard coded it into the script below.
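To see why that hard-coded link looks the way it does, here is a minimal sketch that rebuilds it with Ruby's standard `URI.encode_www_form`. The parameter names and values are copied from the URL used in the script further down; `check_cookie_url` is a name made up for this illustration. The point is that the nested `continue` address has to stay percent-encoded, otherwise its own `?` gets mixed up with the outer query string.

```ruby
require 'uri'

# Hypothetical helper (not part of the original script): rebuilds the
# CheckCookie redirect URL, percent-encoding every parameter value.
def check_cookie_url(continue_to)
  base = 'https://www.google.com/accounts/CheckCookie'
  params = [
    ['continue',  continue_to],
    ['service',   'orkut'],
    ['chtml',     'LoginDoneHtml'],
    ['skipvpage', 'true']
  ]
  "#{base}?#{URI.encode_www_form(params)}"
end

puts check_cookie_url('https://www.orkut.com/RedirLogin.aspx?msg')
```

Hard coding the finished string is still the simplest option; the helper only shows where the `%3A`, `%2F` and `%3F` sequences come from.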
Next is a tricky part. We are sent to a bounce page which determines which Google service we are using and redirects us to the matching path.
Here is the JavaScript:
<script language="javascript">
<!--
new_url = "/Home.aspx";
try
{
url = document.location.toString();
i_page = url.indexOf("page=");
if (i_page > 0)
{
i_param = url.indexOf("&", i_page);
if (i_param > 0)
new_url = url.substring(i_page + 5, i_param);
else
new_url = url.substring(i_page + 5);
new_url = unescape( new_url );
}
if (new_url.substring(0, 7) != "http://" && new_url.substring(0, 8) != "https://")
{
if (i_page >= 0)
last_slash = url.lastIndexOf("/", i_page);
else
last_slash = url.lastIndexOf("/");
new_url = url.substring(0, last_slash) + new_url;
}
new_url = new_url.replace("https:", "http:");
}
catch (theException)
{
}
document.location = new_url;
// -->
</script>
Basically this boils down to http://www.orkut.com/Home.aspx, so let's just hard code that.
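If you would rather not take my word for it, the bounce logic can be ported to Ruby. This is only a rough sketch: `resolve_bounce_url` is a made-up name, and I am assuming `URI.decode_www_form_component` is a close enough stand-in for JavaScript's `unescape` (it also turns `+` into a space, which `unescape` does not).

```ruby
require 'uri'

# Hypothetical helper: a line-by-line Ruby port of the bounce page's JavaScript.
def resolve_bounce_url(url)
  new_url = '/Home.aspx'
  i_page = url.index('page=')            # nil where JavaScript would give -1
  if i_page && i_page > 0
    i_param = url.index('&', i_page)
    new_url = i_param ? url[(i_page + 5)...i_param] : url[(i_page + 5)..-1]
    new_url = URI.decode_www_form_component(new_url)
  end
  unless new_url.start_with?('http://', 'https://')
    # Relative target: glue it onto everything before the last slash.
    last_slash = i_page ? url.rindex('/', i_page) : url.rindex('/')
    new_url = url[0...last_slash] + new_url
  end
  new_url.sub('https:', 'http:')
end

puts resolve_bounce_url('https://www.orkut.com/RedirLogin.aspx?msg')
```

With no `page=` parameter in the address, the function falls back to `/Home.aspx` on the same host and downgrades the scheme to `http:`, which is the URL hard coded in the script.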
Now we have successfully broken into Orkut and can swipe any info we want.
The code below demonstrates:
- The login process
- Getting your unread message count
- Reading your most recent scraps and translating them from Portuguese to English with Google Translate. (COOL!)
The code
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
# Logging in to Orkut is slightly different than logging in to other Google services.
# Orkut's login form is included in an iframe, so we need to use the iframe URL instead of http://www.orkut.com
page = agent.get 'https://www.google.com/accounts/ServiceLoginBox?service=orkut&nui=2&uilel=1&skipvpage=true&continue=https%3A%2F%2Fwww.orkut.com%2FRedirLogin.aspx%3Fmsg%3D0%26page%3D%252FHome.aspx%253Fxid%253D9364704888537223147&followup=https%3A%2F%2Fwww.orkut.com%2FGLogin.aspx&hl=en-US'
# The login form isn't named, but it is the first and only form on the page, so let's get it.
form = page.forms.first
# Fill out the form with your credentials and submit
form.Email = '*** Your Account Name or Gmail Email ***'
form.Passwd = '*** Your Password ***'
page = agent.submit form
# This is the tricky part, as we get bounced around with some JavaScript redirects.
# The first page we are sent to has a URL we need to follow, handily accessible in the <noscript> section.
# Simply follow the URL. It is not dynamic (no session ids, etc.), so just hard code it; it should never change.
page = agent.get 'https://www.google.com/accounts/CheckCookie?continue=https%3A%2F%2Fwww.orkut.com%2FRedirLogin.aspx%3Fmsg&service=orkut&chtml=LoginDoneHtml&skipvpage=true'
# The next page we are sent to is a little trickier.
# It is basically a bunch of JavaScript that figures out which Google service we are using and redirects us to it.
# I'll save you the trouble. It is simply http://www.orkut.com/Home.aspx.
page = agent.get 'http://www.orkut.com/Home.aspx'
# Yay! We have now broken into Orkut and are ready to rob it of all its important data. Begin scraping!
# Swipe some info from our account, such as whether there are any unread messages
unread_messages = page.search '//a[@href="/Messages.aspx"]/text()'
puts 'You have ' + unread_messages.to_s + '.' if unread_messages
# How about downloading all your scrap messages and getting Mechanize to query Google Translate to convert all those love letters from lonely Brazilian women to English?
page = agent.click page.links.text(/scrapbook/)
# Earlier selector attempts, left commented out for reference:
#scraps = page.search '/html/body/table/tbody/tr/td/table/tbody/tr/td/form/table/tbody/tr/td'
#scraps = page.search '//*[@id="scrap_body_"+*]'
scraps = page.search '/html/body/table:eq(1)//div[@style="overflow: auto;"][a]'
translate = WWW::Mechanize.new
# Open Google Translate
t = translate.get 'http://www.google.com/translate_t'
# Switch select box to Portuguese
# <select name=langpair>
# <option value="en|pt">English to Portuguese</option>
# ..
# </select>
# Happens to be the 23rd option in the list, or array index 22
(0..3).each do |i|
form = t.forms.first
form.fields.name('langpair').options[22].select
message_start = scraps[i].inner_html.index('<br />') + 7
message_end = scraps[i].inner_html.index('<small>') - 3
message = scraps[i].inner_html[message_start..message_end]
form.text = message
t = translate.submit form
result = t.search '//div[@id="result_box"]'
puts "#{result.inner_html}\n\n"
end
Output
All good! I added you in my list of friends of orkut! I found its profile in community. In marries that it wants new friendship, to either welcome!
Ola good day!
I am Brazilian and alive in Rio De Janeiro. Has here the Canadian woman, but this lost one. It says you have children. I want in such way I help it, but I of not know I can. can you help me? Is so important!
I have no idea what that last one is talking about.
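One fragile spot in the script is the way it slices each scrap's HTML with fixed offsets (`+ 7` and `- 3`), which raises an error as soon as a scrap has no `<br />` or `<small>` tag. Here is a slightly more defensive sketch of the same idea. `extract_message` is a made-up helper and the sample markup is invented, since I am only assuming the message sits between those two tags.

```ruby
# Hypothetical helper: returns the text between the first '<br />' and the
# next '<small>' tag, or nil when either marker is missing.
def extract_message(html)
  start = html.index('<br />')
  return nil unless start
  start += '<br />'.length
  finish = html.index('<small>', start)
  return nil unless finish
  html[start...finish].strip
end

# Made-up sample markup, only loosely modelled on a scrap entry.
sample = "<a href=\"/Profile.aspx\">Someone</a><br />\nOla bom dia!\n  <small>26 Feb</small>"
puts extract_message(sample)
```

Using `strip` instead of counting characters means the helper does not care how much whitespace surrounds the message.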
Well I hope you enjoyed this quick and dirty example.
Adeus amigos








Hello Corbain, you wrote my name wrong…. but that’s all OK, I know it’s not simple..
I really appreciate your code; now I’ll try some hard scraping from Orkut. Later I’ll show you what I’m making.
and
First… our women aren’t lonely, they really aren’t. Second… never say ‘Adeus’ unless you mean it forever; instead say ‘tchau’ = ‘até mais ver’ = ‘see you’ (in English)....
Bye Friend…
... about the code, my mistake was in the step where you say ‘This is the tricky part as we get bounced around with some javascript redirects…..’. You converted only some of the URL escape codes; I converted all of them, and the address became this:
‘https://www.google.com/accounts/CheckCookie?continue=https://www.orkut.com/RedirLogin.aspx?msg&service=orkut&chtml=LoginDoneHtml&skipvpage=true’
Because of that, mechanize can’t handle it.
I met Ruby 15 days ago and I work with .NET (C# and VB); maybe those are the reasons I’m a layperson in this game.
see you…
This doesn’t work….
It always redirects to the Orkut GLogin page.
I tried the example above as well and it did not work. Like Jasna said, it always redirects to the login page.
It doesn’t work.. buddy gimme some rael diamond