An easy-to-implement web data extractor for WordPress. This plugin can display real-time data from any website directly in your posts, pages, or sidebar. It temporarily caches the scraped content on your website. You can use it to include real-time stock quotes, cricket or soccer scores, or any other generic content from public domains.
Web scraping is the practice of extracting data from another website. When doing so, you need to ensure that you have permission to use the content, and you need to give due credit to the original source.
When scraping content, make sure you do not violate any copyright laws.
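The temporary caching mentioned above can be pictured with a small sketch. This is not the plugin's code (the plugin is a WordPress plugin, and its internals are not shown here); it is a minimal Python illustration of a cache whose timeout is given in minutes per entry. The names `TimedCache`, `get_or_fetch`, and `fake_fetch` are hypothetical.

```python
import time


class TimedCache:
    """In-memory cache whose entries expire after a per-entry timeout in minutes."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so expiry can be demonstrated without sleeping.
        self._clock = clock
        self._entries = {}  # key -> (expires_at_seconds, value)

    def get_or_fetch(self, key, fetch, timeout_minutes):
        """Return the cached value for key, calling fetch() only if it is missing or expired."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = fetch()
        self._entries[key] = (now + timeout_minutes * 60, value)
        return value


# Demonstration with a fake clock and a fake fetcher (both hypothetical).
calls = []


def fake_fetch():
    calls.append(1)
    return f"content v{len(calls)}"


now = [0.0]
cache = TimedCache(clock=lambda: now[0])

print(cache.get_or_fetch("http://example.com/", fake_fetch, 5))  # fetched
now[0] = 120.0  # two minutes later: still cached
print(cache.get_or_fetch("http://example.com/", fake_fetch, 5))
now[0] = 301.0  # just past five minutes: fetched again
print(cache.get_or_fetch("http://example.com/", fake_fetch, 5))
```

A short timeout keeps displayed data fresher at the cost of more requests to the source site; a long one does the opposite.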
- Scraped output can be displayed through a custom template tag or a shortcode in pages, posts, and the sidebar (through a text widget).
- Configurable caching of scraped data. The cache timeout can be defined in minutes for each scrape.
- Configurable user agent for your scraper, which can be set for every scrape.
- Configurable default settings such as enabling, user agent, timeout, caching, and error handling.
- Multiple ways to query content – CSS Selector, XPath or Regex.
- A wide range of arguments for parsing content.
- Option to pass POST arguments to a URL to be scraped.
- Dynamic conversion of scraped content to a specified character encoding, so you can scrape data from a site that uses a different charset.
- Create scraped pages on the fly by dynamically generating the URLs to scrape, or the POST arguments, from your page's GET or POST arguments.
- Callback function for advanced parsing of scraped data.
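To make the query options in the list above more concrete, here is a rough sketch of selecting content with a regex and with an XPath-style path. It is written in Python using only the standard library and is not the plugin's own code or argument syntax. The sample markup and the helper names `query_regex` and `query_xpath` are assumptions for illustration. CSS selectors are omitted because the Python standard library has no selector engine.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a fetched page.
SAMPLE = """<html><body>
<div id="quote"><span class="price">101.25</span></div>
<div id="score"><span class="price">3-1</span></div>
</body></html>"""


def query_regex(html, pattern):
    """Return the first capture group of the first match, or None if nothing matches."""
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1) if match else None


def query_xpath(html, path):
    """Return the text of every element matching a path.

    ElementTree supports only a limited XPath subset and requires
    well-formed markup, so this is a simplification of real-world scraping.
    """
    root = ET.fromstring(html)
    return [element.text for element in root.findall(path)]


print(query_regex(SAMPLE, r'<div id="quote"><span class="price">(.*?)</span>'))
print(query_xpath(SAMPLE, ".//div[@id='score']/span"))
```

A query that matches nothing yields `None` or an empty list here, which is the kind of situation a scraper has to report as an empty response.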
Example code for some common use cases of the plugin
Only works on some pages. On other pages it gives an error no matter what you try to select. The author abandoned this plugin a couple years ago.
The plugin works great for me until something like this happens: "Error Parsing: Query returned empty response"
- Bug fix: Post request
- Bug fix: gt, lt arguments
- Added scrape importer
- replace_query and replace_with now accept specially formatted array arguments
- Basehref bug fix
- Documentation website change
- Bug fix: Minor bug fixes.
- Enhancement: Complete code rewrite, uses PHP DOM directly for faster processing
- Enhancement: Sandbox to test and debug
- Deprecation: Dropped
- Changes: Changes in arguments
- Enhancement: Migrated caching to the Transients API.
- Enhancement: Clear and find / replace now supports selectors.
- Enhancement: Cleaner code – faster processing.
- Enhancement: More debugging data including processing time.
- Deprecation: Modules are deprecated in support of callback functions.
- Enhancement: Added callback for flexible as well as advanced parsing.
- Bug fix: Fixed the issue of usage within widget.
- Enhancement: Added removetags to remove certain tags and content from scrape.
- Bug fix: Retains http-cache and modules on upgrade.
- Bug fix: Patched a major security issue related to useragent string settings.
- Bug fix: Added xpathdecode to handle complex xpath queries in shortcode.
- Enhancement: Added support for xpaths.
- Enhancement: Uses built-in WP_HTTP classes instead of raw cURL or fopen.
- Enhancement: Complete overhaul of code, architecture and documentation.
- Enhancement: Reverted to file-based cache instead of MySQL tables.
- Enhancement: Introduction of special variable ___QUERY_STRING___ for dynamic URLs.
- Enhancement: Upgraded the underlying phpQuery library to single file version.
- Enhancement: Option to turn off the debug information displayed as an HTML comment.
- Milestone release: Complete overhaul of code, architecture and documentation.
- Bug fix: Multiple bug fixes addressed.