
E-Commerce Track

Editors: K.J. Lin • klin@uci.edu; Yan Wang • yanwang@ics.mq.edu.au

Fighting Phishing with Discriminative Keypoint Features

Phishing is a form of online identity theft associated with both social engineering and technical subterfuge and is a major threat to information security and personal privacy. Here, the authors present an effective image-based antiphishing scheme based on discriminative keypoint features in Web pages. Their invariant content descriptor, the Contrast Context Histogram (CCH), computes the similarity degree between suspicious and authentic pages. The results show that the proposed scheme achieves high accuracy and low error rates.

Kuan-Ta Chen, Chun-Rong Huang, and Chu-Song Chen, Institute of Information Science, Academia Sinica
Jau-Yuan Chen, Columbia University

Published by the IEEE Computer Society. IEEE Internet Computing, 1089-7801/09/$25.00 © 2009 IEEE

Phishing is a form of online identity theft associated with both social engineering and technical subterfuge. Specifically, phishers attempt to trick Internet users into revealing sensitive or private information, such as their bank account and credit-card numbers. Unwary users are often lured to browse counterfeit Web sites through spoofed email, and they might easily be convinced that fake pages with hijacked brand names are authentic. When users unwittingly browse phishing pages, phishers can plant crimeware, also known as malware, on the victims’ computers. Then, through this crimeware, phishers can steal users’ private information, redirect them to malicious sites directly, or redirect them to the intended Web sites by way of phisher-controlled proxies.

The Anti-Phishing Working Group (APWG) reported that the number of phishing Web pages has increased by 28 percent a month since July 2004 [1], and 5 percent of users who receive phishing emails respond to such scams, according to the APWG Web site (www.antiphishing.org). More than 55,000 cases of phishing were reported to, or detected by, the APWG in April 2007 [1], and up to 95 percent of phishing targets were related to financial services and Internet retailers. According to a Gartner survey, in the US in 2007, more than $3.2 billion was lost due to phishing attacks on 3.6 million people [2]. Phishing has thus become a serious threat to information security and Internet privacy.

To deceive users into thinking phishing sites are legitimate, fake pages are often designed to look almost the same as the official ones in both layout and content. In addition, phishers might insert an arbitrary advertisement banner that redirects users to another malicious Web site if they click on it. To address the difficulty in distinguishing between legitimate and phishing pages, we’ve developed an invariant content descriptor that measures the degree of similarity between suspicious and authentic pages.

The Proposed Scheme

Phishers can compose visually similar phishing pages in many different ways with nontext HTML elements, such as images and Flash objects (see the “Current Antiphishing Approaches” sidebar for more about such techniques). To combat this problem, we compute the similarity of phishing pages and authentic pages at their presentation level. Specifically, we treat phishing page detection as an image-matching problem. Figure 1 illustrates the flow of our proposed detection scheme, which involves two steps: image-based page matching and page classification.

In the proposed scheme, we first take a snapshot of a suspect Web page and treat it as an image throughout the detection process. We use the Contrast Context Histogram (CCH) descriptors proposed in earlier work [3, 4] to capture invariant information around discriminative keypoints on the suspect page. We then match the descriptors with those of authentic pages that phishers often target; these pages are stored in a database compiled by users and authoritative organizations, such as the APWG. Matching CCH descriptors yields a similarity degree for a suspect page and an authentic page. Finally, we use the similarity degree between the two pages to determine whether the suspect page is a counterfeit. If the similarity degree between a suspect page and an authentic one is greater than a certain threshold, we consider the suspect page to be a phishing page targeting that authentic page. If the similarity degree doesn’t exceed the threshold for any authentic page in the database, we consider the suspect page genuine.
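The two-step flow described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors’ implementation: the function name `judge_page`, the dictionary standing in for the protected-page database, and the Jaccard stand-in for the similarity degree are all assumptions made for the example. Only the decision rule (flag the page if its similarity to some protected page exceeds a threshold) comes from the text.

```python
from typing import Callable, Dict, Optional, Tuple, TypeVar

Features = TypeVar("Features")


def judge_page(
    suspect: Features,
    protected: Dict[str, Features],
    similarity: Callable[[Features, Features], float],
    threshold: float = 0.6,
) -> Tuple[Optional[str], float]:
    """Compare a suspect page with every protected page (hypothetical helper).

    Returns (name of the imitated page, similarity) when the best similarity
    exceeds the threshold, and (None, best similarity) otherwise.
    """
    best_name, best_score = None, 0.0
    for name, features in protected.items():
        score = similarity(features, suspect)
        if score > best_score:
            best_name, best_score = name, score
    if best_score > threshold:
        return best_name, best_score
    return None, best_score


def jaccard(a: set, b: set) -> float:
    """Stand-in similarity for the demo only; NOT the article's similarity degree."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy "database" of protected pages, each described by a set of feature ids.
db = {"bank-login": {1, 2, 3, 4, 5}, "auction-home": {10, 11, 12}}

print(judge_page({1, 2, 3, 4, 9}, db, jaccard))  # flagged as imitating "bank-login"
print(judge_page({20, 21}, db, jaccard))         # (None, 0.0): treated as genuine
```

Passing the similarity function as a parameter keeps the decision rule separate from the image-matching step, so the same skeleton works with any page-level similarity measure.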

Contrast Context Histogram

The computer vision and image-processing fields have used image-matching techniques for a long time. To determine whether two images are similar, a common approach involves extracting a vector of salient features from each image and computing the distance between those vectors. We take this distance as the degree of visual difference between the two images. A color histogram, for example, which represents the distribution of the colors an image uses, is a widely used feature for image matching. However, we consider it unsuitable for computing the similarity between Web pages because such pages usually contain fewer colors than paintings or photographs. Thus, Web pages often have similar color distributions, so a color histogram isn’t a useful discriminative feature.

Figure 1. The flow of the proposed phishing-detection scheme. We first take a snapshot of a suspect page and extract its keypoint feature information. Next, we match the features with the keypoint feature information of protected Web pages. We can then assess the suspect page to determine whether it’s a phishing page.
[Diagram labels. Step 1, image-based page matching: keypoint feature extractor, keypoint feature calculator, prestored protected-page feature database, feature matching. Step 2, phishing judgment: page similarity degree (threshold = 0.6), phishing report.]

We use the CCH descriptor [3, 4] because of its effectiveness and computational efficiency. Originally, it was designed to achieve scale and rotation invariance in image matching; that is, two images are considered similar even if one has undergone various types of scale or rotation transformation. However, such transformations are rarely seen in phishing pages because the pages must be very similar to the corresponding authentic pages to deceive unsuspecting users. Thus, we adapted the CCH descriptor to a more lightweight design for Web page comparisons. We call our design the L-CCH (lightweight CCH) descriptor hereafter.

To construct L-CCH descriptors for an image, we use only gray-level information, which we obtain by averaging the red, green, and blue values of each pixel in the image. We then take the Harris-Laplacian corners [5] as the image’s keypoints. Basically, the corner-detection method finds a certain number of salient points in an image; a point is considered a keypoint if it can still be detected after the image undergoes various changes, such as shifting, lighting variation, color transformation, or format conversion. Figure 2 shows an example of keypoints (green crosses) detected in an image.

Figure 2. Keypoints detected in an image. (a) Keypoints (green crosses) are points in an image that can still be detected after the image undergoes changes, such as lighting variations. (b) The log-polar coordinate system centers on a keypoint. The angle coordinate is divided into eight levels, and the distance coordinate is divided into three. The result is n = 24 subregions.

We use neighboring pixels’ relative brightness to describe a keypoint. By uniformly quantizing the azimuth angle and the distance coordinates of a log-polar coordinate system centered on the keypoint, we can divide each keypoint’s neighbor region into n nonoverlapping subregions, where n = 24 (see Figure 3). The advantage of using a log-polar coordinate system is that this system is more sensitive to the image points near the center than to those points farther away.

For each neighboring pixel of a keypoint, we calculate the contrast value, that is, the difference between the pixel’s and the keypoint’s gray levels. As Figure 3a shows, a subregion can contain some pixels with positive contrast values (pink) and some with negative contrast values (blue). We summarize the information in each subregion by averaging the positive and negative contrast values separately, so we can describe each subregion via a two-tuple contrast vector, as Figure 3b shows. We then concatenate all subregions’ contrast vectors to form a 2n-dimensional vector and define it as the L-CCH descriptor, where n is the number of subregions. Finally, to make the L-CCH descriptor invariant to linear lighting changes, we normalize it to a unit-length vector.

Having obtained the L-CCH descriptor for each keypoint, we can quantify the similarity between two keypoints based on the Euclidean distance between their descriptors. A short Euclidean distance indicates that the keypoints are similar in terms of neighboring information. Based on this property, we find the most similar keypoint on a suspect Web page for each keypoint, K, on the authentic Web page in two steps. First, we search for the two keypoints, A and B, on the suspect page that have the shortest and the second-shortest Euclidean distances from the keypoint, K, on the authentic page. Second, we consider K and A to be a successful match if the ratio between the distance from K to A and the distance from K to B is smaller than a certain threshold (set to 0.6 in our experiments). Otherwise, we consider the keypoint K to have no corresponding keypoints on the suspect page. Figure 4 shows an example of image correspondence that the L-CCH descriptor found; a line connecting two keypoints means that a match exists between the images.
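The descriptor construction and the distance-ratio matching described above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the authors’ code: keypoint locations are taken as given (Harris-Laplacian detection isn’t implemented here), the neighborhood `radius` of 16 pixels and the exact logarithmic mapping of distance to the three distance levels are assumptions, and the function names are hypothetical.

```python
import numpy as np


def to_gray(rgb: np.ndarray) -> np.ndarray:
    """Gray level = average of the red, green, and blue values (H x W x 3 -> H x W)."""
    return rgb.astype(np.float64).mean(axis=2)


def lcch_descriptor(gray, row, col, radius=16, n_angle=8, n_dist=3):
    """Sketch of a 2n-dimensional L-CCH-style descriptor, n = n_angle * n_dist."""
    h, w = gray.shape
    r0, r1 = max(0, row - radius), min(h, row + radius + 1)
    c0, c1 = max(0, col - radius), min(w, col + radius + 1)
    rr, cc = np.mgrid[r0:r1, c0:c1]
    dy, dx = rr - row, cc - col
    dist = np.hypot(dy, dx)
    keep = (dist > 0) & (dist <= radius)  # neighbors inside the disc, keypoint excluded

    # Uniformly quantize the angle and the log of the distance (assumed mapping).
    angle = np.arctan2(dy[keep], dx[keep])
    a_bin = np.floor((angle + np.pi) / (2 * np.pi) * n_angle).astype(int) % n_angle
    log_d = np.log(dist[keep] + 1) / np.log(radius + 1)
    d_bin = np.minimum((log_d * n_dist).astype(int), n_dist - 1)
    region = d_bin * n_angle + a_bin

    # Contrast value: neighbor gray level minus keypoint gray level.
    contrast = gray[r0:r1, c0:c1][keep] - gray[row, col]

    desc = np.zeros(2 * n_angle * n_dist)
    for k in range(n_angle * n_dist):
        values = contrast[region == k]
        pos, neg = values[values >= 0], values[values < 0]
        desc[2 * k] = pos.mean() if pos.size else 0.0      # average positive contrast
        desc[2 * k + 1] = neg.mean() if neg.size else 0.0  # average negative contrast

    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc  # unit length: linear-lighting invariance


def match_keypoints(auth_desc, susp_desc, ratio=0.6):
    """Ratio test: K matches A if dist(K, A) < ratio * dist(K, B)."""
    matches = []
    if len(susp_desc) < 2:
        return matches
    for k, d in enumerate(auth_desc):
        dists = np.linalg.norm(susp_desc - d, axis=1)
        a, b = np.argsort(dists)[:2]  # nearest and second-nearest suspect keypoints
        if dists[a] < ratio * dists[b]:
            matches.append((k, int(a)))
    return matches


rng = np.random.default_rng(0)
page = to_gray(rng.integers(0, 256, size=(64, 64, 3)))
print(lcch_descriptor(page, 32, 32).shape)  # (48,), i.e. 2n with n = 24
```

Writing the ratio test as `dists[a] < ratio * dists[b]` rather than as a division avoids dividing by zero when the second-nearest distance is zero.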

Figure 3. The lightweight Contrast Context Histogram (L-CCH) descriptor shown with the log-polar coordinate system. (a) The gray-value contrast between neighboring pixels and the keypoint (center). (b) The L-CCH descriptor with a two-tuple contrast vector in each subregion.

Page Similarity Degree

To determine whether a suspect Web page is a phishing page, we evaluate its similarity to the potential target based on CCH descriptors. Ideally, the number of successful matches the descriptor finds will indicate the degree of similarity between the two pages. However, this isn’t always true with Web page comparisons. Two pages might have numerous keypoint matches not because they look similar but simply because they contain the same logo, for example, the logo for VeriSign, a well-known identity protection service provider. So, to judge two Web pages’ similarity, we must consider not only the matched keypoints’ number but also their spatial distribution, or locations.

To take matched keypoints’ location into account, we use the k-means algorithm [6] to divide them into coherent groups based on their spatial distributions. The algorithm ensures that the keypoints in a group are always in a neighboring region. Figure 5 shows the clustering result of the official eBay Web page (Figure 5a) and a phishing eBay page (Figure 5b), where k = 4 groups are circled using different colors. Based on the results, we match groups of keypoints between the two Web pages by voting. That is, for a group of keypoints, A, on the authentic page, we consider a group of keypoints, B, on the suspect page as A’s mapping if most keypoints in A match keypoints in B. We then define a keypoint as geographically matched if its group is a mapping of its corresponding keypoint’s group.

In cases in which two pages are dissimilar, the number of matched points will be small, so that the algorithm can’t even perform clustering. Figure 6 shows the matching result of pages from different sites. Although a few keypoints match, none are geographically matched because the algorithm found no clusters.

Given the geographical matching information, we define the similarity degree between two Web pages as the ratio of geographically matched keypoints to all the identified keypoints on the two pages. Because phishing pages are similar to the authentic pages they try to mimic, we use the similarity degree between a suspect page and the authentic one to determine whether the suspect is indeed a counterfeit.
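The clustering, voting, and ratio steps above can be sketched as follows. Several details are assumptions, because the text leaves them open: the matched keypoints on each page are clustered separately, “most keypoints” is read as a strict majority, both keypoints of a geographically matched pair count toward the numerator (so that a perfect copy scores close to one), and the small `kmeans_labels` helper is a plain Lloyd-style stand-in for whatever k-means implementation was actually used.

```python
import numpy as np


def kmeans_labels(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means (illustrative stand-in); returns one label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def similarity_degree(auth_xy, susp_xy, n_auth, n_susp, k=4):
    """auth_xy[i] and susp_xy[i] are the coordinates of the i-th matched pair.

    n_auth and n_susp are the numbers of all keypoints identified on each page.
    """
    auth_xy = np.asarray(auth_xy, dtype=float)
    susp_xy = np.asarray(susp_xy, dtype=float)
    if len(auth_xy) < k:  # too few matches to cluster: nothing is geographically matched
        return 0.0
    labels_a = kmeans_labels(auth_xy, k)
    labels_s = kmeans_labels(susp_xy, k)

    # Voting: authentic group a maps to suspect group b if most of a's matches fall in b.
    mapping = {}
    for a in np.unique(labels_a):
        votes = np.bincount(labels_s[labels_a == a], minlength=k)
        b = int(votes.argmax())
        if 2 * votes[b] > votes.sum():
            mapping[int(a)] = b

    geo_pairs = sum(
        1 for la, ls in zip(labels_a, labels_s) if mapping.get(int(la)) == int(ls)
    )
    return 2.0 * geo_pairs / (n_auth + n_susp)  # assumed reading of the ratio


# Eight matched pairs in four well-separated areas of the page.
auth = np.array([[10, 10], [14, 12], [300, 20], [306, 24],
                 [20, 400], [26, 404], [310, 410], [316, 416]], dtype=float)
susp = auth + np.array([0.0, 120.0])  # e.g. content pushed down by an inserted banner
print(similarity_degree(auth, susp, n_auth=10, n_susp=10))
```

Because each group is matched by majority vote rather than by absolute position, a uniform shift of the page content, such as one caused by an inserted banner, leaves the group mapping intact.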

Performance Evaluation

According to a Secure Computing survey, more than half the phishing attacks in 2007 targeted famous Web sites such as eBay and PayPal [7]. For this reason, we collected several real-life phishing Web pages that targeted the top five phishing targets: eBay, PayPal, Marshall & Ilsley Bank, Charter One Bank, and Bank of America. In addition, we collected 300 Web pages of well-known online bank and auction services, which are often targeted in phishing attacks, to observe the distribution of both the similarity degree between a phishing page and its corresponding authentic page, and the similarity between two unrelated Web pages. We found that the former is normally a large value around one, whereas the latter is normally a small value around zero.

Figure 4. Sample result of image matching using the lightweight Contrast Context Histogram (L-CCH) descriptor. A line connecting two keypoints means that a match exists between the images.

Figure 5. Clustering and matching of eBay’s official page and a phishing page. Different clusters are circled in different colors.

Figure 6. Matching two pages from different sites. In this case, there are too few matched keypoints to perform clustering.

Based on our observations, we empirically set the threshold to 0.6 and determine that a suspect page is a phishing page if its similarity degree is higher than this threshold. The evaluation results listed in Table 1 show that our scheme achieves a high degree of accuracy that ranges between 95 and 98 percent. Moreover, the error rates (that is, the false-positive and false-negative rates) are much lower than 1 percent in most cases.
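For readers who want to reproduce this kind of evaluation on their own data, the threshold rule and the three rates can be computed as in the sketch below. The article doesn’t spell out how its rates are defined, so the conventional definitions used here are an assumption: correct rate over all records, false-negative rate over actual phishing pages, and false-positive rate over actual genuine pages. The scores in the example are made-up toy values, not the article’s data.

```python
from typing import Dict, Sequence


def evaluate(
    scores: Sequence[float],
    is_phishing: Sequence[bool],
    threshold: float = 0.6,
) -> Dict[str, float]:
    """Apply the threshold rule and report rates in percent (assumed definitions)."""
    flagged = [s > threshold for s in scores]  # flagged as phishing if above threshold
    pairs = list(zip(flagged, is_phishing))
    positives = sum(1 for _, y in pairs if y)
    negatives = len(pairs) - positives
    false_neg = sum(1 for f, y in pairs if y and not f)  # phishing page missed
    false_pos = sum(1 for f, y in pairs if f and not y)  # genuine page flagged
    correct = len(pairs) - false_neg - false_pos
    return {
        "correct_rate": 100.0 * correct / len(pairs) if pairs else 0.0,
        "false_negative_rate": 100.0 * false_neg / positives if positives else 0.0,
        "false_positive_rate": 100.0 * false_pos / negatives if negatives else 0.0,
    }


# Toy similarity degrees: three phishing pages, then three genuine pages.
toy_scores = [0.92, 0.81, 0.55, 0.05, 0.10, 0.65]
toy_labels = [True, True, True, False, False, False]
print(evaluate(toy_scores, toy_labels))
```

In this toy run, the page scoring 0.55 is a missed phishing page and the page scoring 0.65 is a false alarm, which shows how the choice of threshold trades one error type against the other.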

Table 1. The top five phishing target sites.*

Site                     Number of records   Correct rate (%)   False-negative rate (%)   False-positive rate (%)
eBay                     701                 96.8               0.0                       0.1
PayPal                   632                 97.7               0.0                       0.1
Marshall & Ilsley Bank   138                 97.7               0.0                       0.1
Charter One Bank         116                 98.0               0.0                       0.1
Bank of America          51                  95.4               2.0                       2.1

*Total number of phishing target pages: 300 pages in 74 sites

Case Studies

Although phishers endeavor to make phishing pages indistinguishable from authentic ones, they usually make some modifications to evade phishing-detection techniques. We present two case studies to explain how our detection scheme works for real-world cases.

In our first case, which is a typical example, the phishers add an advertisement banner to the phishing page to slightly alter the layout. Unwary users might not notice the change, but it could make antiphishing tools less effective. Figure 7 shows the authentic Bank of America login page on the left and a phishing page with an advertisement banner inserted on the right. Because the change is minor, and Internet users are accustomed to advertisements on Web pages, users might not notice the inserted banner. Even so, the banner changes the page’s aspect ratio and adds considerable red to the image, which will reduce the detection ability of antiphishing solutions based on color distributions and page layout. In contrast, our scheme’s effectiveness isn’t degraded because it’s based on local discriminative keypoints. Note that such banners not only help phishers evade antiphishing solutions but also make them money every time a banner is displayed on a user’s computer.

Figure 7. Case study. Differences between the phishing page (right) and the real Bank of America login page (left) include a banner ad and additional fields that have been added to the phishing page.

Our second case demonstrates another common phishing strategy whereby phishers alter input forms by adding or removing fields. For example, in the Bank of America case that Figure 7 shows, the phishers added an extra “Enter Passcode” field to the phishing page. Consequently, unwitting users might provide sensitive information without realizing that the authentic page doesn’t request such information. In other cases, phishers add fields that ask for more private data from users, such as credit-card numbers and Social Security numbers. Most users have trouble detecting that these modifications are fake because they don’t usually remember exactly what fields should appear on an input form. Once again, this case demonstrates our scheme’s efficacy. Even though both the advertisement banner and the additional field alter the page layout and aspect ratio, our L-CCH descriptor still yields a near-perfect matching between the phishing and authentic pages’ keypoints.

These examples demonstrate how phishers can alter an authentic page’s design to deceive unwary users. Nevertheless, to ensure that phishing pages are similar to authentic ones, phishers must preserve most of the original page’s main elements. Our scheme can detect similarities between fake and original pages regardless of the types of changes made.

Because our antiphishing scheme is based purely on passive monitoring of Web pages that users browse, it’s orthogonal to other solutions. Organizations can thus freely integrate it with their existing prevention and detection schemes to combat phishing in multiple ways. Just like the endless competition between computer virus writers and antivirus software developers, phishers will certainly strive to develop countermeasures against antiphishing solutions.


Current Antiphishing Approaches

Researchers have proposed several antiphishing techniques in recent years to counter or prevent the increasing number of attacks. Generally speaking, we can divide phishing detection and prevention techniques into two categories: email-level approaches, including authentication and content filtering, and browser-integrated tools, which usually use URL blacklists or employ Web page content analysis.

The email filtering techniques commonly used to prevent phishing are quite popular in antispam solutions because they try to stop email scams from reaching target users by analyzing email contents. The challenge in designing such techniques lies in how to construct efficient filter rules and simultaneously reduce the probability of false alarms. Phishing messages are usually sent as spoofed emails; therefore, researchers have proposed numerous path-based verification methods. Current mechanisms, such as Microsoft’s Sender ID or Yahoo’s DomainKeys, work by looking up mail sources in DNS tables. However, these solutions aren’t yet widely deployed. Currently, the companies only provide the mechanisms free of charge in their own products and services.

A browser-integrated tool usually relies on a blacklist containing the URLs of malicious sites to determine whether a URL corresponds to a phishing page. In Microsoft Internet Explorer (IE) 7, for example, the address bar turns red when a malicious page loads. A blacklist’s effectiveness is strongly influenced by its coverage, credibility, and update frequency. Currently, the most well-known blacklists are those Google and Microsoft maintain for the popular browsers Mozilla Firefox and IE, respectively. However, experiments show that neither database can achieve a correct detection rate greater than 90 percent, and in the worst case the rate can fall below 60 percent [1, 2].

Some browser-integrated tools, such as SpoofGuard [3], iTrustPage [4], and others [5, 6], adopt approaches other than blacklists. One approach examines a suspect page’s URL to determine if it’s a spoofed address. For example, http://fake.net/www.amazon.com/sign-in might link to a phishing page that mimics http://www.amazon.com/sign-in as the target. Other approaches focus on analyzing a Web page’s content, such as the HTML code, text, input fields, forms, links, and images. In the past, such content-based approaches proved effective in detecting phishing pages; however, phishers responded by compiling pages with non-HTML components, such as images, Flash objects, and Java applets. A phisher might design a fake page composed entirely of images, even if the original page contains only text information. In this case, content-based antiphishing tools can’t analyze the suspect page because its HTML code contains nothing but image elements.

To address this problem, Anthony Fu and his colleagues proposed detecting phishing pages based on the similarity between the phishing and authentic pages at the visual appearance level [7], rather than using text-based analysis. However, this approach is susceptible to significant changes in the Web page’s aspect ratio and in the main colors it uses. The antiphishing scheme we present in the main text addresses this issue.

References

1. P. Robichaux and D.L. Ganger, “Gone Phishing: Evaluating Antiphishing Tools for Windows,” 3Sharp Project Report, Sept. 2006; www.3sharp.com/projects/antiphishing/.
2. C. Ludl et al., “On the Effectiveness of Techniques to Detect Phishing Sites,” Proc. Detection of Intrusions and Malware, and Vulnerability Assessment, LNCS 4579, Springer, 2007, pp. 20–39.
3. N. Chou et al., “Client-Side Defense Against Web-Based Identity Theft,” Proc. Network and IT Security Symposium, Internet Soc., 2004; http://crypto.stanford.edu/SpoofGuard/webspoof.pdf.
4. T. Ronda, S. Saroiu, and A. Wolman, “iTrustPage: A User-Assisted Antiphishing Tool,” Proc. ACM European Conf. Computer Systems (EuroSys 08), ACM Press, 2008, pp. 261–272.
5. W. Liu et al., “An Antiphishing Strategy Based on Visual Similarity Assessment,” IEEE Internet Computing, vol. 10, no. 2, 2006, pp. 58–65.
6. L. Wenyin et al., “Detection of Phishing Webpages Based on Visual Similarity,” Proc. World Wide Web Conf. (special interest tracks and posters), A. Ellis and T. Hagino, eds., ACM Press, 2005, pp. 1060–1061.
7. A.Y. Fu, L. Wenyin, and X. Deng, “Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover’s Distance (EMD),” IEEE Trans. Dependable and Secure Computing, vol. 3, no. 4, 2006, pp. 301–311.

To combat phishers’ further countermeasures, we plan to release a browser plug-in implementing our antiphishing scheme, which not only protects innocent users from phishing attacks, but also monitors phishing activities continuously.

Acknowledgments

This work was supported in part by the Taiwan Information Security Center (TWISC), National Science Council, under the grants NSC97-2219-E-001-001 and NSC97-2219-E-011-006. It was also supported by the Taiwan E-Learning and Digital Archives Programs (TELDAP) sponsored by the National Science Council of Taiwan under grants NSC98-2631-001011 and NSC98-2631-001-013.

References

1. “APWG Phishing Trends Reports,” The Anti-Phishing Working Group, www.antiphishing.org/phishReportsArchive.html.
2. “Gartner Survey Shows Phishing Attacks Escalated in 2007, More than $3 Billion Lost to These Attacks,” Gartner, 2007; www.gartner.com/it/page.jsp?id=565125.
3. C.-R. Huang, C.-S. Chen, and P.-C. Chung, “Contrast Context Histogram: A Discriminating Local Descriptor for Image Matching,” Proc. Int’l Conf. Pattern Recognition (ICPR 06), IEEE CS Press, 2006, pp. 53–56.
4. C.-R. Huang, C.-S. Chen, and P.-C. Chung, “Contrast Context Histogram: An Efficient Discriminating Local Descriptor for Object Recognition and Image Matching,” Pattern Recognition, vol. 41, no. 10, 2008, pp. 3071–3077; http://imp.iis.sinica.edu.tw/CCH/CCH.htm.
5. K. Mikolajczyk and C. Schmid, “Indexing Based on Scale Invariant Interest Points,” Proc. Int’l Conf. Computer Vision, vol. 1, IEEE Press, 2001, pp. 525–531.
6. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
7. “Phishing Statistics,” Secure Computing, 2007, www.ciphertrust.com/resources/statistics/phishing.php.

Jau-Yuan Chen is a graduate student in the computer science department at Columbia University. His research interests include Internet and information security. Chen has an MS in computer science and information engineering from National Taiwan University. Contact him at [email protected].

Kuan-Ta Chen is an assistant research fellow in the Institute of Information Science and the Research Center for Information Technology Innovation at Academia Sinica. His research interests include Internet quality of experience management, network security, and online gaming. Chen has a PhD in electrical engineering from National Taiwan University. Contact him at ktchen@iis.sinica.edu.tw.

Chu-Song Chen is a research fellow in the Institute of Information Science and the Research Center for Information Technology Innovation at Academia Sinica. His research interests include pattern recognition, computer vision, and image processing. Chen has a PhD in computer science and information engineering from National Taiwan University. Contact him at song@iis.sinica.edu.tw.

Chun-Rong Huang is a postdoctoral fellow in the Institute of Information Science at Academia Sinica. His research interests include computer vision, computer graphics, multimedia signal processing, and image processing. Huang has a PhD in electrical engineering from National Cheng Kung University. Contact him at [email protected].


MAY/JUNE 2009
