Reference Index

Model	Description
.halguru-webscraping.yaml:	Represents the configuration settings for a website crawler or scraper.
MaxLevel	The maximum level (depth) of pages to process in the website.
MaxPages	The maximum number of pages to process for the website.
Pages[].ContainsText	Only web pages that contain this text are processed.
Pages[].ContainsXpath	If defined, ContainsText is checked only within the content matched by this XPath.
Pages[].ContentXPath	If not empty, only the HTML tags matched by this XPath are used.
Pages[].Description	Provides details or information about the web page or a specific element within the web page.
Pages[].DisabledTags[]	List of HTML tags that should be disabled or ignored during processing.
Pages[].Features[].IncludeHtml	Determines if the raw HTML representation of a specific web feature is extracted and added to the feature's output during processing.
Pages[].Features[].IncludeText	Controls whether the extracted plain text, processed via relevant scraping logic, is added to the resulting feature output.
Pages[].Features[].Name	The name of the website feature.
Pages[].Features[].NameRegex	The regular expression pattern to identify the name component of a website feature.
Pages[].Features[].NameXpath	The XPath expression used to locate the name of a specific feature within the website content.
Pages[].Features[].NormalizeWhitespaces	When enabled, consecutive whitespace characters are collapsed into a single space, facilitating cleaner and more standardized output after web scraping.
Pages[].Features[].RemoveHtmlAttributes	Determines whether attributes are stripped from HTML elements, giving cleaner extracted output.
Pages[].Features[].RemoveHtmlTags	Determines if the raw HTML content will have tags stripped for plain text processing.
Pages[].Features[].TagName	The tag name used to identify or categorize the website feature.
Pages[].Features[].ValueRegex	The regular expression used to extract specific value matches from the HTML content of a website.
Pages[].Features[].ValueXpath	The XPath expression used to locate and extract the value of a specific feature within a website's HTML content.
Pages[].Features[]	Represents a collection of features extracted or associated with a specific web page, defining key elements or properties of interest within the page.
Pages[].Files[].Name	The name of the website file.
Pages[].Files[].NameRegex	The regular expression pattern to identify the name component of a website file.
Pages[].Files[].NameXpath	The XPath expression used to locate the name of a specific file within the website content.
Pages[].Files[].TagName	The tag name used to identify or categorize the website file.
Pages[].Files[].UrlRegex	The regular expression pattern for matching URLs associated with the website file.
Pages[].Files[].UrlXpath	The XPath expression used to extract the URL from a website file's content.
Pages[].Files[]	Represents a collection of files associated with the webpage for processing or extraction.
Pages[].IncludeHtml	Indicates whether the HTML content of the web page should be included in the output during the web scraping process.
Pages[].IncludeText	Determines whether the textual content of a web page should be included during the web scraping process.
Pages[].Name	The name of the web page or a specific element within the web page.
Pages[].NormalizeWhitespaces	Indicates whether whitespaces should be normalized in the text content extracted from an HTML node.
Pages[].RemoveHtmlAttributes	Indicates whether HTML attributes should be removed during the web scraping process.
Pages[].RemoveHtmlTags	Indicates whether HTML tags should be removed from the content of a web page during processing.
Pages[].TagName	The tag name of an HTML element or feature within a website page.
Pages[].UrlContains	Filter to process only web pages whose URL contains the specified substring.
Pages[]	The collection of website pages configuration.
StartUrl	The starting URL for the website.
UrlsStartWith[]	The collection of URL prefixes used to filter and process relevant URLs.
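
The properties above could be combined as in the following minimal sketch of a .halguru-webscraping.yaml file. The nesting is inferred from the property paths in the index. The URLs, names, XPath expressions, and value types (booleans, integers, strings) are illustrative assumptions, not taken from the reference.

```yaml
# Hypothetical example; structure inferred from the property paths in the index.
# All values below are placeholders chosen for illustration.
StartUrl: "https://example.com/docs/"
UrlsStartWith:
  - "https://example.com/docs/"
MaxLevel: 3        # assumed to be an integer
MaxPages: 200      # assumed to be an integer
Pages:
  - Name: "Article"
    Description: "Documentation article pages"
    UrlContains: "/docs/"
    ContainsText: "Documentation"
    ContainsXpath: "//h1"          # ContainsText is checked only here
    ContentXPath: "//main"         # only tags from this XPath are used
    DisabledTags:
      - "script"
      - "style"
    IncludeText: true              # assumed to be a boolean
    IncludeHtml: false
    NormalizeWhitespaces: true
    Features:
      - Name: "Title"
        ValueXpath: "//h1"
        IncludeText: true
    Files:
      - Name: "Attachment"
        UrlXpath: "//a[contains(@href, '.pdf')]/@href"
```

Properties not shown here (for example NameRegex, ValueRegex, TagName, RemoveHtmlTags) would presumably sit at the same levels as their siblings in the index.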

Last updated:		2026-01-26
Autogenerated:		Yes
AI powered:		Yes
Core version:		1.77.0