design:website_classification
Differences
This shows you the differences between two versions of the page.
| design:website_classification [2025/01/02 17:44] – Figures in float (wrap) karelkubicek | design:website_classification [2025/06/23 08:50] (current) – [Adult Websites, Security, and Privacy Protection] karelkubicek | ||
|---|---|---|---|
| Line 20: | Line 20: | ||
| - Limited granularity in labels, which may not suit detailed marketing or behavioral analysis. | - Limited granularity in labels, which may not suit detailed marketing or behavioral analysis. | ||
| - Documentation includes deprecated categories, leading to potential misinterpretations. | - Documentation includes deprecated categories, leading to potential misinterpretations. | ||
| - | TODO: URL, API | + |  | 
| ==== FortiGuard ==== | ==== FortiGuard ==== | ||
| Line 30: | Line 30: | ||
| - Lower label granularity may restrict its applicability outside security domains. | - Lower label granularity may restrict its applicability outside security domains. | ||
| - Limited documentation transparency in certain sensitive categories. | - Limited documentation transparency in certain sensitive categories. | ||
| - | TODO: URL, API | + |  | 
| ==== Symantec ==== | ==== Symantec ==== | ||
| Line 39: | Line 39: | ||
| - Taxonomy is less diverse compared to marketing-oriented services. | - Taxonomy is less diverse compared to marketing-oriented services. | ||
| - Limited coverage for obscure or long-tail domains. | - Limited coverage for obscure or long-tail domains. | ||
| - | TODO: URL, API | + |  | 
| ==== Trend Micro ==== | ==== Trend Micro ==== | ||
| Line 46: | Line 46: | ||
| - Labels aligned with threat intelligence, | - Labels aligned with threat intelligence, | ||
| * **Disadvantages**: | * **Disadvantages**: | ||
| - | - (TODO: URL, API) | + | - <wrap todo> | 
| + | <wrap todo>TODO: URL, API</ | ||
| ==== Forcepoint ==== | ==== Forcepoint ==== | ||
| Line 55: | Line 56: | ||
| - Limited multi-labeling capabilities restrict nuanced classification. | - Limited multi-labeling capabilities restrict nuanced classification. | ||
| - Challenges in documenting clear and concise taxonomies. | - Challenges in documenting clear and concise taxonomies. | ||
| - | TODO: URL, API | + |  | 
| ==== Dr.Web ==== | ==== Dr.Web ==== | ||
| Line 64: | Line 65: | ||
| - Very low coverage. | - Very low coverage. | ||
| - Lack of nuanced or detailed labeling reduces utility in research or marketing. | - Lack of nuanced or detailed labeling reduces utility in research or marketing. | ||
| - | TODO: URL, API | + |  | 
| ===== Marketing and Content Discovery ===== | ===== Marketing and Content Discovery ===== | ||
| Line 83: | Line 84: | ||
| - Precision and granularity can vary, sometimes complicating results. | - Precision and granularity can vary, sometimes complicating results. | ||
| - Documentation and taxonomy definitions require improvement for research usability. | - Documentation and taxonomy definitions require improvement for research usability. | ||
| - | TODO: URL, API | + |  | 
| ===== General Classification with Human Contributions ===== | ===== General Classification with Human Contributions ===== | ||
| Line 94: | Line 95: | ||
| - Scalability issues due to reliance on human volunteers. | - Scalability issues due to reliance on human volunteers. | ||
| - Low coverage and subjective biases in labeling. | - Low coverage and subjective biases in labeling. | ||
| - | TODO: URL, API | + |  | 
| ==== DMOZ (Curlie) ==== | ==== DMOZ (Curlie) ==== | ||
| Line 103: | Line 104: | ||
| - Extremely limited scalability due to a small number of editors. | - Extremely limited scalability due to a small number of editors. | ||
| - Labels may be outdated due to infrequent updates for many categories. | - Labels may be outdated due to infrequent updates for many categories. | ||
| - | TODO: URL, API | + |  | 
| ===== Aggregated Services ===== | ===== Aggregated Services ===== | ||
| Line 114: | Line 115: | ||
| - Inconsistencies due to integration of outdated or non-standardized data. | - Inconsistencies due to integration of outdated or non-standardized data. | ||
| - Lack of direct control over taxonomies used by aggregated providers. | - Lack of direct control over taxonomies used by aggregated providers. | ||
| - | TODO: URL, API | + |  | 
| ===== Company Datasets ===== | ===== Company Datasets ===== | ||
| Compared to the services listed above, the following datasets are company-oriented rather than website-oriented. Some of them include the company's website, but this matching may be incomplete and may cause the following issues: | Compared to the services listed above, the following datasets are company-oriented rather than website-oriented. Some of them include the company's website, but this matching may be incomplete and may cause the following issues: | ||
| - | * If a company owns multiple websites: | + | |
| - | ** Likely only the main website will be listed. | + |  | 
| - | ** This is especially pronounced with international versions of the website. | + | * Likely only the main website will be listed. | 
| - | * Likewise, the dataset may contain multiple companies for a given website: | + | * This is especially pronounced with international versions of the website. | 
| - | ** Because of sister companies in a corporate group. | + | * Likewise, the dataset may contain multiple companies for a given website: | 
| - | ** Many small businesses list a social media page as their website. Sometimes this link does not include the full path, so a single-person company might indicate facebook.com as its domain. | + | * Because of sister companies in a corporate group. | 
| + | * Many small businesses list a social media page as their website. Sometimes this link does not include the full path, so a single-person company might indicate facebook.com as its domain. | ||
| ==== PeopleDataLabs ==== | ==== PeopleDataLabs ==== | ||
| Line 132: | Line 134: | ||
| - Based on LinkedIn profiles that are self-reported - prone to adversarial data. | - Based on LinkedIn profiles that are self-reported - prone to adversarial data. | ||
| - Only a subset of PeopleDataLabs' | - Only a subset of PeopleDataLabs' | ||
| - | TODO: cite '' | + |  | 
| ==== Crunchbase ==== | ==== Crunchbase ==== | ||
| Line 143: | Line 145: | ||
| - URLs are extremely noisy (they are not the priority) (Source: Karel Kubicek' | - URLs are extremely noisy (they are not the priority) (Source: Karel Kubicek' | ||
| - Focuses mostly on variables useful for investments and market competitiveness. | - Focuses mostly on variables useful for investments and market competitiveness. | ||
| - | TODO: cite '' | + |  | 
| ==== Orbis ==== | ==== Orbis ==== | ||
| Line 182: | Line 184: | ||
| * [[https:// | * [[https:// | ||
| * [[https:// | * [[https:// | ||
| + | * [[https:// | ||
| - | Visit individual privacy-oriented pages for more details regarding classification of [[Privacy: | + | Visit individual privacy-oriented pages for more details regarding classification of [[Privacy: | 
| ==== Marketing Industry ==== | ==== Marketing Industry ==== | ||
| Line 233: | Line 236: | ||
| <bibtex bibliography></ | <bibtex bibliography></ | ||
| - | ====== BibTex ====== | ||
| - | <bibtex database> | ||
| - | @inproceedings{vallina2020_misshapes, | ||
| - | author = {Vallina, Pelayo and Le Pochat, Victor and Feal, \' | ||
| - | title = {Mis-shapes, | ||
| - | year = {2020}, | ||
| - | isbn = {9781450381383}, | ||
| - | publisher = {Association for Computing Machinery}, | ||
| - | address = {New York, NY, USA}, | ||
| - | url = {https:// | ||
| - | doi = {10.1145/ | ||
| - | abstract = {Domain classification services have applications in multiple areas, including cybersecurity, | ||
| - | booktitle = {Proceedings of the ACM Internet Measurement Conference}, | ||
| - | pages = {598–618}, | ||
| - | numpages = {21}, | ||
| - | location = {Virtual Event, USA}, | ||
| - | series = {IMC '20} | ||
| - | } | ||
| - | </ | ||
| + | ~~DISCUSSION~~ | ||
design/website_classification.1735839852.txt.gz · Last modified:  by karelkubicek