Extract Data from a website


We have one government website which shows all the data regarding registered companies in India.
Can we extract data of each company based on displayed details?


If that is not possible, can I just copy the data and have it arranged automatically in the way I want?


I can't access the website, so I can't give you a definitive answer, but it seems to me that this can be done via multiple methods:

  1. Using {urlload}
  2. Using {site: text; selector=}
  3. Copying to clipboard and then using the split function with {clipboard}

Can we connect on Zoom or Google Meet so you can show me how it works using the methods you provided?

However, if you want to access any of the records, you can click on this link. You will see a simple form; in the mandatory field, paste this value: U51909MH2020PTC338144 and submit it. After this, you will be able to see all the data, as shown in my previous screenshot.

Hi @Pratik_Shah, thanks for this.

I already have an idea for a snippet to process this. What I need to know from you is, how do you want to insert this data in your snippet? Do you want to insert everything? Or specific values according to a different choice every time? Or the same values every time?

Let me know and I'll build it for you plus I'll explain how it works so you can learn to use the same concepts in other snippets :slight_smile:

Thank you for all your help.

Obtaining all data will work for all my use cases.

Use 1: I want to copy all the data in the second column into a Google Sheet or Excel sheet as a single row (i.e. transpose).

Use 2: Using clipboard, I can use the copied data, and reuse it in the way I want.

Today, I updated one of my snippets based on the {clipboard} command. I am copying data from a Google Sheet and using it in my snippets. I am really impressed with this function.

My idea is that if I can easily get this data from the given govt. website, I will paste it all into Airtable or a Google Sheet. From there, I can use the {clipboard} function to create my documents.

Hi @Pratik_Shah,

So, here's what I came up with:

It uses two methods:

  1. Pasting the data into the paragraph field
  2. Typing the snippet shortcut inside a text field on the website itself, on the page from which you want to extract the data

There's a note inside the snippet that says:

The snippet is currently using the first method. For illustration purposes, I included sample data from the page you gave me as a default for the {formparagraph} command. However, you can easily change it to:

{formparagraph: name=sampledata; cols=80; rows=20; default={clipboard}}

This will automatically grab whatever is in your computer clipboard.

If you want to use the second method, you need to follow the instruction inside the snippet where it says the following:
"In here, change sampledata to rawdata if you want to use the selector"

Essentially what I'm doing in the snippet is:

  • Grabbing the data using one of the two methods
  • Splitting the data by linebreak (represented by "\n"), which turns it into a list
  • Using the map() function to break each list item into two items by tab-stop (represented by "\t"). This turns my list into a "list of lists"
  • Turning every item pair into a key and a value so that the main list becomes a keyed list
  • Using the keys in that list as values in my {formmenu} and then giving out the corresponding result.
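For readers who want to see the same idea outside of a snippet, here is a minimal Python sketch of the steps listed above: split by linebreak, split each line by tab, then build a keyed list. It is not the snippet itself. The function name `parse_key_value_block` and the field labels in the sample data are assumptions made for illustration; only the CIN value comes from this thread.

```python
def parse_key_value_block(raw):
    """Turn lines of 'label<TAB>value' text into a dict (a keyed list)."""
    records = {}
    for line in raw.split("\n"):       # split by linebreak: one item per row
        if not line.strip():
            continue                   # skip empty lines
        parts = line.split("\t", 1)    # split each row by tab into [label, value]
        if len(parts) != 2:
            continue                   # skip rows that have no tab
        key, value = parts
        records[key.strip()] = value.strip()
    return records


# Hypothetical sample data; the labels and company name are made up for illustration.
sample = (
    "CIN\tU51909MH2020PTC338144\n"
    "Company Name\tEXAMPLE TRADING PRIVATE LIMITED\n"
    "Company Status\tActive"
)
data = parse_key_value_block(sample)
print(data["CIN"])
```

Looking a value up by its label (`data["CIN"]`) plays the same role as picking a key in the {formmenu} and showing the corresponding result.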

There's quite a bit to chew on in there, so don't worry if you get stuck. I'm happy to explain anything you find unclear.

On a final note, the {urlload} command won't work in this scenario, as the url is always the same regardless of the record, so we can't pull a particular record by loading its unique URL.

Thank you @Cedric_Debono_Blaze for this. Wonderful!

I thought urlload wouldn't work.

This snippet somewhat solves my purpose of retrieving the data easily. But I really want to get all the right-hand column data into a Google Sheet in a proper manner. You mentioned that we can't split this data into different cells. But I saw that your code outputs the bare data separated by commas. Instead of a comma, could we insert a semicolon (;) between the values, so that they can then be split into individual values?
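As a rough illustration of the request above, the following Python sketch collects only the right-hand column and joins it with a chosen delimiter. The function name `values_as_row` and the sample labels are assumptions for illustration, and this is not how the snippet is implemented.

```python
def values_as_row(raw, delimiter=";"):
    """Collect the second (right-hand) column of 'label<TAB>value' lines
    and join the values into a single row."""
    values = []
    for line in raw.split("\n"):
        parts = line.split("\t", 1)
        if len(parts) == 2:            # keep only rows that have a label and a value
            values.append(parts[1].strip())
    return delimiter.join(values)


# Hypothetical sample data; the labels are made up for illustration.
sample = "CIN\tU51909MH2020PTC338144\nCompany Status\tActive"
print(values_as_row(sample))           # semicolon-separated row
print(values_as_row(sample, "\t"))     # tab-separated row
```

A semicolon-separated row can be split afterwards, but it breaks if a value itself contains a semicolon. Joining with a tab instead usually makes a spreadsheet place each value in its own cell when the row is pasted, which may suit the transpose use case better.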

(Sorry, I didn't read the full post so I'm not aware of the context, but...) The uniqueness constraint on the URL passed to the urlload command applies only to the domain name of the URL, not to the full URL. So a URL like https://blaze.today/{variable} is allowed, as long as https://blaze.today is not a variable. In fact, this is exactly the technique I have used in all my Airtable snippets. Therefore, as I see it, getting a unique record should be possible as long as the URL is something like: https://api.airtable.com/{base_id}/{record_id}

Sorry if this repeats something you've already mentioned haha, but just wanted to clarify on that :slight_smile:
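To make the "fixed domain, variable path" idea concrete, here is a small Python sketch. It only mirrors the URL shape quoted in the post; the function name `record_url` and the placeholder IDs are assumptions, and the path layout is not meant to describe the real Airtable API.

```python
from urllib.parse import quote, urlsplit

BASE = "https://api.airtable.com"  # the fixed part: the domain never varies


def record_url(base_id, record_id):
    """Build a per-record URL in which only the path segments are variable."""
    return f"{BASE}/{quote(base_id, safe='')}/{quote(record_id, safe='')}"


# Placeholder IDs, made up for illustration.
url = record_url("appEXAMPLE", "recEXAMPLE")
print(url)
print(urlsplit(url).netloc)  # the domain is the same for every record
```

This only helps when each record has its own URL; it does not apply to a site where every record is served from one and the same address.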

@Gaurang_Tandon - the issue is that the url remains the same regardless of the record being accessed on the website.

It's always this: https://www.mca.gov.in/mcafoportal/viewCompanyMasterData.do